Message ID | 61d44519a5ffaf2c040198cf8d80d05a09de5de5.1696969994.git.me@ttaylorr.com (mailing list archive) |
---|---|
State | Superseded |
Headers | show |
Series | bloom: changed-path Bloom filters v2 (& sundries) | expand |
On Tue, Oct 10, 2023 at 04:33:49PM -0400, Taylor Blau wrote: > From: Jonathan Tan <jonathantanmy@google.com> > > The murmur3 implementation in bloom.c has a bug when converting series > of 4 bytes into network-order integers when char is signed (which is > controllable by a compiler option, and the default signedness of char is > platform-specific). When a string contains characters with the high bit > set, this bug causes results that, although internally consistent within > Git, does not accord with other implementations of murmur3 (thus, > the changed path filters wouldn't be readable by other off-the-shelf > implementatios of murmur3) and even with Git binaries that were compiled > with different signedness of char. This bug affects both how Git writes > changed path filters to disk and how Git interprets changed path filters > on disk. > > Therefore, introduce a new version (2) of changed path filters that > corrects this problem. The existing version (1) is still supported and > is still the default, but users should migrate away from it as soon > as possible. > > Because this bug only manifests with characters that have the high bit > set, it may be possible that some (or all) commits in a given repo would > have the same changed path filter both before and after this fix is > applied. However, in order to determine whether this is the case, the > changed paths would first have to be computed, at which point it is not > much more expensive to just compute a new changed path filter. > > So this patch does not include any mechanism to "salvage" changed path > filters from repositories. There is also no "mixed" mode - for each > invocation of Git, reading and writing changed path filters are done > with the same version number; this version number may be explicitly > stated (typically if the user knows which version they need) or > automatically determined from the version of the existing changed path > filters in the repository. > > There is a change in write_commit_graph(). graph_read_bloom_data() > makes it possible for chunk_bloom_data to be non-NULL but > bloom_filter_settings to be NULL, which causes a segfault later on. I > produced such a segfault while developing this patch, but couldn't find > a way to reproduce it neither after this complete patch (or before), > but in any case it seemed like a good thing to include that might help > future patch authors. > > The value in t0095 was obtained from another murmur3 implementation > using the following Go source code: > > package main > > import "fmt" > import "github.com/spaolacci/murmur3" > > func main() { > fmt.Printf("%x\n", murmur3.Sum32([]byte("Hello world!"))) > fmt.Printf("%x\n", murmur3.Sum32([]byte{0x99, 0xaa, 0xbb, 0xcc, 0xdd, 0xee, 0xff})) > } > > Signed-off-by: Jonathan Tan <jonathantanmy@google.com> > Signed-off-by: Junio C Hamano <gitster@pobox.com> > Signed-off-by: Taylor Blau <me@ttaylorr.com> > Signed-off-by: Junio C Hamano <gitster@pobox.com> > Signed-off-by: Taylor Blau <me@ttaylorr.com> > --- > Documentation/config/commitgraph.txt | 5 +- > bloom.c | 69 +++++++++++++++++- > bloom.h | 8 +- > commit-graph.c | 32 ++++++-- > t/helper/test-bloom.c | 9 ++- > t/t0095-bloom.sh | 8 ++ > t/t4216-log-bloom.sh | 105 +++++++++++++++++++++++++++ > 7 files changed, 223 insertions(+), 13 deletions(-) > > diff --git a/Documentation/config/commitgraph.txt b/Documentation/config/commitgraph.txt > index 2dc9170622..acc74a2f27 100644 > --- a/Documentation/config/commitgraph.txt > +++ b/Documentation/config/commitgraph.txt > @@ -15,7 +15,7 @@ commitGraph.readChangedPaths:: > > commitGraph.changedPathsVersion:: > Specifies the version of the changed-path Bloom filters that Git will read and > - write. May be -1, 0 or 1. > + write. May be -1, 0, 1, or 2. > + > Defaults to -1. > + > @@ -28,4 +28,7 @@ filters when instructed to write. > If 1, Git will only read version 1 Bloom filters, and will write version 1 > Bloom filters. > + > +If 2, Git will only read version 2 Bloom filters, and will write version 2 > +Bloom filters. > ++ > See linkgit:git-commit-graph[1] for more information. > diff --git a/bloom.c b/bloom.c > index 3e78cfe79d..ebef5cfd2f 100644 > --- a/bloom.c > +++ b/bloom.c > @@ -66,7 +66,64 @@ int load_bloom_filter_from_graph(struct commit_graph *g, > * Not considered to be cryptographically secure. > * Implemented as described in https://en.wikipedia.org/wiki/MurmurHash#Algorithm > */ > -uint32_t murmur3_seeded(uint32_t seed, const char *data, size_t len) > +uint32_t murmur3_seeded_v2(uint32_t seed, const char *data, size_t len) > +{ > + const uint32_t c1 = 0xcc9e2d51; > + const uint32_t c2 = 0x1b873593; > + const uint32_t r1 = 15; > + const uint32_t r2 = 13; > + const uint32_t m = 5; > + const uint32_t n = 0xe6546b64; > + int i; > + uint32_t k1 = 0; > + const char *tail; > + > + int len4 = len / sizeof(uint32_t); > + > + uint32_t k; > + for (i = 0; i < len4; i++) { > + uint32_t byte1 = (uint32_t)(unsigned char)data[4*i]; > + uint32_t byte2 = ((uint32_t)(unsigned char)data[4*i + 1]) << 8; > + uint32_t byte3 = ((uint32_t)(unsigned char)data[4*i + 2]) << 16; > + uint32_t byte4 = ((uint32_t)(unsigned char)data[4*i + 3]) << 24; > + k = byte1 | byte2 | byte3 | byte4; > + k *= c1; > + k = rotate_left(k, r1); > + k *= c2; > + > + seed ^= k; > + seed = rotate_left(seed, r2) * m + n; > + } > + > + tail = (data + len4 * sizeof(uint32_t)); > + > + switch (len & (sizeof(uint32_t) - 1)) { > + case 3: > + k1 ^= ((uint32_t)(unsigned char)tail[2]) << 16; > + /*-fallthrough*/ > + case 2: > + k1 ^= ((uint32_t)(unsigned char)tail[1]) << 8; > + /*-fallthrough*/ > + case 1: > + k1 ^= ((uint32_t)(unsigned char)tail[0]) << 0; > + k1 *= c1; > + k1 = rotate_left(k1, r1); > + k1 *= c2; > + seed ^= k1; > + break; > + } > + > + seed ^= (uint32_t)len; > + seed ^= (seed >> 16); > + seed *= 0x85ebca6b; > + seed ^= (seed >> 13); > + seed *= 0xc2b2ae35; > + seed ^= (seed >> 16); > + > + return seed; > +} > + > +static uint32_t murmur3_seeded_v1(uint32_t seed, const char *data, size_t len) > { > const uint32_t c1 = 0xcc9e2d51; > const uint32_t c2 = 0x1b873593; > @@ -131,8 +188,14 @@ void fill_bloom_key(const char *data, > int i; > const uint32_t seed0 = 0x293ae76f; > const uint32_t seed1 = 0x7e646e2c; > - const uint32_t hash0 = murmur3_seeded(seed0, data, len); > - const uint32_t hash1 = murmur3_seeded(seed1, data, len); > + uint32_t hash0, hash1; > + if (settings->hash_version == 2) { > + hash0 = murmur3_seeded_v2(seed0, data, len); > + hash1 = murmur3_seeded_v2(seed1, data, len); > + } else { > + hash0 = murmur3_seeded_v1(seed0, data, len); > + hash1 = murmur3_seeded_v1(seed1, data, len); > + } > > key->hashes = (uint32_t *)xcalloc(settings->num_hashes, sizeof(uint32_t)); > for (i = 0; i < settings->num_hashes; i++) > diff --git a/bloom.h b/bloom.h > index 1e4f612d2c..138d57a86b 100644 > --- a/bloom.h > +++ b/bloom.h > @@ -8,9 +8,11 @@ struct commit_graph; > struct bloom_filter_settings { > /* > * The version of the hashing technique being used. > - * We currently only support version = 1 which is > + * The newest version is 2, which is > * the seeded murmur3 hashing technique implemented > - * in bloom.c. > + * in bloom.c. Bloom filters of version 1 were created > + * with prior versions of Git, which had a bug in the > + * implementation of the hash function. > */ > uint32_t hash_version; > > @@ -80,7 +82,7 @@ int load_bloom_filter_from_graph(struct commit_graph *g, > * Not considered to be cryptographically secure. > * Implemented as described in https://en.wikipedia.org/wiki/MurmurHash#Algorithm > */ > -uint32_t murmur3_seeded(uint32_t seed, const char *data, size_t len); > +uint32_t murmur3_seeded_v2(uint32_t seed, const char *data, size_t len); > > void fill_bloom_key(const char *data, > size_t len, > diff --git a/commit-graph.c b/commit-graph.c > index ea677c87fb..db623afd09 100644 > --- a/commit-graph.c > +++ b/commit-graph.c > @@ -314,17 +314,26 @@ static int graph_read_oid_lookup(const unsigned char *chunk_start, > return 0; > } > > +struct graph_read_bloom_data_context { > + struct commit_graph *g; > + int *commit_graph_changed_paths_version; > +}; > + > static int graph_read_bloom_data(const unsigned char *chunk_start, > size_t chunk_size, void *data) > { > - struct commit_graph *g = data; > + struct graph_read_bloom_data_context *c = data; > + struct commit_graph *g = c->g; > uint32_t hash_version; > - g->chunk_bloom_data = chunk_start; > hash_version = get_be32(chunk_start); > > - if (hash_version != 1) > + if (*c->commit_graph_changed_paths_version == -1) { > + *c->commit_graph_changed_paths_version = hash_version; > + } else if (hash_version != *c->commit_graph_changed_paths_version) { > return 0; > + } In case we have `c->commit_graph_changed_paths_version == -1` we lose the check that the hash version is something that we know and support, don't we? And while we do start to handle `-1` in the writing path, I think we don't in the reading path unless I missed something. > + g->chunk_bloom_data = chunk_start; > g->bloom_filter_settings = xmalloc(sizeof(struct bloom_filter_settings)); > g->bloom_filter_settings->hash_version = hash_version; > g->bloom_filter_settings->num_hashes = get_be32(chunk_start + 4); > @@ -412,10 +421,14 @@ struct commit_graph *parse_commit_graph(struct repo_settings *s, > } > > if (s->commit_graph_changed_paths_version) { > + struct graph_read_bloom_data_context context = { > + .g = graph, > + .commit_graph_changed_paths_version = &s->commit_graph_changed_paths_version > + }; > pair_chunk(cf, GRAPH_CHUNKID_BLOOMINDEXES, > &graph->chunk_bloom_indexes); > read_chunk(cf, GRAPH_CHUNKID_BLOOMDATA, > - graph_read_bloom_data, graph); > + graph_read_bloom_data, &context); > } > > if (graph->chunk_bloom_indexes && graph->chunk_bloom_data) { > @@ -2441,6 +2454,13 @@ int write_commit_graph(struct object_directory *odb, > } > if (!commit_graph_compatible(r)) > return 0; > + if (r->settings.commit_graph_changed_paths_version < -1 > + || r->settings.commit_graph_changed_paths_version > 2) { > + warning(_("attempting to write a commit-graph, but " > + "'commitgraph.changedPathsVersion' (%d) is not supported"), > + r->settings.commit_graph_changed_paths_version); > + return 0; > + } > > CALLOC_ARRAY(ctx, 1); > ctx->r = r; > @@ -2453,6 +2473,8 @@ int write_commit_graph(struct object_directory *odb, > ctx->write_generation_data = (get_configured_generation_version(r) == 2); > ctx->num_generation_data_overflows = 0; > > + bloom_settings.hash_version = r->settings.commit_graph_changed_paths_version == 2 > + ? 2 : 1; > bloom_settings.bits_per_entry = git_env_ulong("GIT_TEST_BLOOM_SETTINGS_BITS_PER_ENTRY", > bloom_settings.bits_per_entry); > bloom_settings.num_hashes = git_env_ulong("GIT_TEST_BLOOM_SETTINGS_NUM_HASHES", > @@ -2482,7 +2504,7 @@ int write_commit_graph(struct object_directory *odb, > g = ctx->r->objects->commit_graph; > > /* We have changed-paths already. Keep them in the next graph */ > - if (g && g->chunk_bloom_data) { > + if (g && g->bloom_filter_settings) { > ctx->changed_paths = 1; > ctx->bloom_settings = g->bloom_filter_settings; > } > diff --git a/t/helper/test-bloom.c b/t/helper/test-bloom.c > index aabe31d724..3cbc0a5b50 100644 > --- a/t/helper/test-bloom.c > +++ b/t/helper/test-bloom.c > @@ -50,6 +50,7 @@ static void get_bloom_filter_for_commit(const struct object_id *commit_oid) > > static const char *bloom_usage = "\n" > " test-tool bloom get_murmur3 <string>\n" > +" test-tool bloom get_murmur3_seven_highbit\n" > " test-tool bloom generate_filter <string> [<string>...]\n" > " test-tool bloom get_filter_for_commit <commit-hex>\n"; > > @@ -64,7 +65,13 @@ int cmd__bloom(int argc, const char **argv) > uint32_t hashed; > if (argc < 3) > usage(bloom_usage); > - hashed = murmur3_seeded(0, argv[2], strlen(argv[2])); > + hashed = murmur3_seeded_v2(0, argv[2], strlen(argv[2])); > + printf("Murmur3 Hash with seed=0:0x%08x\n", hashed); > + } > + > + if (!strcmp(argv[1], "get_murmur3_seven_highbit")) { > + uint32_t hashed; > + hashed = murmur3_seeded_v2(0, "\x99\xaa\xbb\xcc\xdd\xee\xff", 7); > printf("Murmur3 Hash with seed=0:0x%08x\n", hashed); > } > > diff --git a/t/t0095-bloom.sh b/t/t0095-bloom.sh > index b567383eb8..c8d84ab606 100755 > --- a/t/t0095-bloom.sh > +++ b/t/t0095-bloom.sh > @@ -29,6 +29,14 @@ test_expect_success 'compute unseeded murmur3 hash for test string 2' ' > test_cmp expect actual > ' > > +test_expect_success 'compute unseeded murmur3 hash for test string 3' ' > + cat >expect <<-\EOF && > + Murmur3 Hash with seed=0:0xa183ccfd > + EOF > + test-tool bloom get_murmur3_seven_highbit >actual && > + test_cmp expect actual > +' > + > test_expect_success 'compute bloom key for empty string' ' > cat >expect <<-\EOF && > Hashes:0x5615800c|0x5b966560|0x61174ab4|0x66983008|0x6c19155c|0x7199fab0|0x771ae004| > diff --git a/t/t4216-log-bloom.sh b/t/t4216-log-bloom.sh > index da67c40134..8f8b5d4966 100755 > --- a/t/t4216-log-bloom.sh > +++ b/t/t4216-log-bloom.sh > @@ -536,4 +536,109 @@ test_expect_success 'version 1 changed-path used when version 1 requested' ' > ) > ' > > +test_expect_success 'version 1 changed-path not used when version 2 requested' ' > + ( > + cd highbit1 && > + git config --add commitgraph.changedPathsVersion 2 && > + test_bloom_filters_not_used "-- another$CENT" > + ) > +' > + > +test_expect_success 'version 1 changed-path used when autodetect requested' ' > + ( > + cd highbit1 && > + git config --add commitgraph.changedPathsVersion -1 && > + test_bloom_filters_used "-- another$CENT" > + ) > +' > + > +test_expect_success 'when writing another commit graph, preserve existing version 1 of changed-path' ' > + test_commit -C highbit1 c1double "$CENT$CENT" && > + git -C highbit1 commit-graph write --reachable --changed-paths && > + ( > + cd highbit1 && > + git config --add commitgraph.changedPathsVersion -1 && > + echo "options: bloom(1,10,7) read_generation_data" >expect && > + test-tool read-graph >full && > + grep options full >actual && > + test_cmp expect actual > + ) > +' > + > +test_expect_success 'set up repo with high bit path, version 2 changed-path' ' > + git init highbit2 && > + git -C highbit2 config --add commitgraph.changedPathsVersion 2 && > + test_commit -C highbit2 c2 "$CENT" && > + git -C highbit2 commit-graph write --reachable --changed-paths > +' > + > +test_expect_success 'check value of version 2 changed-path' ' > + ( > + cd highbit2 && > + echo "c01f" >expect && > + get_first_changed_path_filter >actual && > + test_cmp expect actual > + ) > +' > + > +test_expect_success 'setup make another commit' ' > + # "git log" does not use Bloom filters for root commits - see how, in > + # revision.c, rev_compare_tree() (the only code path that eventually calls > + # get_bloom_filter()) is only called by try_to_simplify_commit() when the commit > + # has one parent. Therefore, make another commit so that we perform the tests on > + # a non-root commit. > + test_commit -C highbit2 anotherc2 "another$CENT" > +' > + > +test_expect_success 'version 2 changed-path used when version 2 requested' ' > + ( > + cd highbit2 && > + test_bloom_filters_used "-- another$CENT" > + ) > +' > + > +test_expect_success 'version 2 changed-path not used when version 1 requested' ' > + ( > + cd highbit2 && > + git config --add commitgraph.changedPathsVersion 1 && > + test_bloom_filters_not_used "-- another$CENT" > + ) > +' > + > +test_expect_success 'version 2 changed-path used when autodetect requested' ' > + ( > + cd highbit2 && > + git config --add commitgraph.changedPathsVersion -1 && > + test_bloom_filters_used "-- another$CENT" > + ) > +' > + > +test_expect_success 'when writing another commit graph, preserve existing version 2 of changed-path' ' > + test_commit -C highbit2 c2double "$CENT$CENT" && > + git -C highbit2 commit-graph write --reachable --changed-paths && > + ( > + cd highbit2 && > + git config --add commitgraph.changedPathsVersion -1 && > + echo "options: bloom(2,10,7) read_generation_data" >expect && > + test-tool read-graph >full && > + grep options full >actual && > + test_cmp expect actual > + ) > +' > + > +test_expect_success 'when writing commit graph, do not reuse changed-path of another version' ' > + git init doublewrite && > + test_commit -C doublewrite c "$CENT" && > + git -C doublewrite config --add commitgraph.changedPathsVersion 1 && > + git -C doublewrite commit-graph write --reachable --changed-paths && > + git -C doublewrite config --add commitgraph.changedPathsVersion 2 && > + git -C doublewrite commit-graph write --reachable --changed-paths && > + ( > + cd doublewrite && > + echo "c01f" >expect && > + get_first_changed_path_filter >actual && > + test_cmp expect actual > + ) > +' > + With the supposedly missing check in mind, should we also add tests for currently unknown versions like 3 or -2? Patrick > test_done > -- > 2.42.0.342.g8bb3a896ee >
On Tue, Oct 17, 2023 at 10:45:24AM +0200, Patrick Steinhardt wrote: > > @@ -314,17 +314,26 @@ static int graph_read_oid_lookup(const unsigned char *chunk_start, > > return 0; > > } > > > > +struct graph_read_bloom_data_context { > > + struct commit_graph *g; > > + int *commit_graph_changed_paths_version; > > +}; > > + > > static int graph_read_bloom_data(const unsigned char *chunk_start, > > size_t chunk_size, void *data) > > { > > - struct commit_graph *g = data; > > + struct graph_read_bloom_data_context *c = data; > > + struct commit_graph *g = c->g; > > uint32_t hash_version; > > - g->chunk_bloom_data = chunk_start; > > hash_version = get_be32(chunk_start); > > > > - if (hash_version != 1) > > + if (*c->commit_graph_changed_paths_version == -1) { > > + *c->commit_graph_changed_paths_version = hash_version; > > + } else if (hash_version != *c->commit_graph_changed_paths_version) { > > return 0; > > + } > > In case we have `c->commit_graph_changed_paths_version == -1` we lose > the check that the hash version is something that we know and support, > don't we? And while we do start to handle `-1` in the writing path, I > think we don't in the reading path unless I missed something. We don't have to deal with c->commit_graph_changed_paths_version being -1 here, since we normalize it when reading the BDAT chunk. See commit-graph.c::graph_read_bloom_data(), particularly: if (*c->commit_graph_changed_paths_version == -1) *c->commit_graph_changed_paths_version = hash_version; else if (hash_version != *c->commit_graph_changed_paths_version) return 0; > > +test_expect_success 'when writing commit graph, do not reuse changed-path of another version' ' > > + git init doublewrite && > > + test_commit -C doublewrite c "$CENT" && > > + git -C doublewrite config --add commitgraph.changedPathsVersion 1 && > > + git -C doublewrite commit-graph write --reachable --changed-paths && > > + git -C doublewrite config --add commitgraph.changedPathsVersion 2 && > > + git -C doublewrite commit-graph write --reachable --changed-paths && > > + ( > > + cd doublewrite && > > + echo "c01f" >expect && > > + get_first_changed_path_filter >actual && > > + test_cmp expect actual > > + ) > > +' > > + > > With the supposedly missing check in mind, should we also add tests for > currently unknown versions like 3 or -2? Good idea, I'll update the test to reflect. Thanks, Taylor
diff --git a/Documentation/config/commitgraph.txt b/Documentation/config/commitgraph.txt index 2dc9170622..acc74a2f27 100644 --- a/Documentation/config/commitgraph.txt +++ b/Documentation/config/commitgraph.txt @@ -15,7 +15,7 @@ commitGraph.readChangedPaths:: commitGraph.changedPathsVersion:: Specifies the version of the changed-path Bloom filters that Git will read and - write. May be -1, 0 or 1. + write. May be -1, 0, 1, or 2. + Defaults to -1. + @@ -28,4 +28,7 @@ filters when instructed to write. If 1, Git will only read version 1 Bloom filters, and will write version 1 Bloom filters. + +If 2, Git will only read version 2 Bloom filters, and will write version 2 +Bloom filters. ++ See linkgit:git-commit-graph[1] for more information. diff --git a/bloom.c b/bloom.c index 3e78cfe79d..ebef5cfd2f 100644 --- a/bloom.c +++ b/bloom.c @@ -66,7 +66,64 @@ int load_bloom_filter_from_graph(struct commit_graph *g, * Not considered to be cryptographically secure. * Implemented as described in https://en.wikipedia.org/wiki/MurmurHash#Algorithm */ -uint32_t murmur3_seeded(uint32_t seed, const char *data, size_t len) +uint32_t murmur3_seeded_v2(uint32_t seed, const char *data, size_t len) +{ + const uint32_t c1 = 0xcc9e2d51; + const uint32_t c2 = 0x1b873593; + const uint32_t r1 = 15; + const uint32_t r2 = 13; + const uint32_t m = 5; + const uint32_t n = 0xe6546b64; + int i; + uint32_t k1 = 0; + const char *tail; + + int len4 = len / sizeof(uint32_t); + + uint32_t k; + for (i = 0; i < len4; i++) { + uint32_t byte1 = (uint32_t)(unsigned char)data[4*i]; + uint32_t byte2 = ((uint32_t)(unsigned char)data[4*i + 1]) << 8; + uint32_t byte3 = ((uint32_t)(unsigned char)data[4*i + 2]) << 16; + uint32_t byte4 = ((uint32_t)(unsigned char)data[4*i + 3]) << 24; + k = byte1 | byte2 | byte3 | byte4; + k *= c1; + k = rotate_left(k, r1); + k *= c2; + + seed ^= k; + seed = rotate_left(seed, r2) * m + n; + } + + tail = (data + len4 * sizeof(uint32_t)); + + switch (len & (sizeof(uint32_t) - 1)) { + case 3: + k1 ^= ((uint32_t)(unsigned char)tail[2]) << 16; + /*-fallthrough*/ + case 2: + k1 ^= ((uint32_t)(unsigned char)tail[1]) << 8; + /*-fallthrough*/ + case 1: + k1 ^= ((uint32_t)(unsigned char)tail[0]) << 0; + k1 *= c1; + k1 = rotate_left(k1, r1); + k1 *= c2; + seed ^= k1; + break; + } + + seed ^= (uint32_t)len; + seed ^= (seed >> 16); + seed *= 0x85ebca6b; + seed ^= (seed >> 13); + seed *= 0xc2b2ae35; + seed ^= (seed >> 16); + + return seed; +} + +static uint32_t murmur3_seeded_v1(uint32_t seed, const char *data, size_t len) { const uint32_t c1 = 0xcc9e2d51; const uint32_t c2 = 0x1b873593; @@ -131,8 +188,14 @@ void fill_bloom_key(const char *data, int i; const uint32_t seed0 = 0x293ae76f; const uint32_t seed1 = 0x7e646e2c; - const uint32_t hash0 = murmur3_seeded(seed0, data, len); - const uint32_t hash1 = murmur3_seeded(seed1, data, len); + uint32_t hash0, hash1; + if (settings->hash_version == 2) { + hash0 = murmur3_seeded_v2(seed0, data, len); + hash1 = murmur3_seeded_v2(seed1, data, len); + } else { + hash0 = murmur3_seeded_v1(seed0, data, len); + hash1 = murmur3_seeded_v1(seed1, data, len); + } key->hashes = (uint32_t *)xcalloc(settings->num_hashes, sizeof(uint32_t)); for (i = 0; i < settings->num_hashes; i++) diff --git a/bloom.h b/bloom.h index 1e4f612d2c..138d57a86b 100644 --- a/bloom.h +++ b/bloom.h @@ -8,9 +8,11 @@ struct commit_graph; struct bloom_filter_settings { /* * The version of the hashing technique being used. - * We currently only support version = 1 which is + * The newest version is 2, which is * the seeded murmur3 hashing technique implemented - * in bloom.c. + * in bloom.c. Bloom filters of version 1 were created + * with prior versions of Git, which had a bug in the + * implementation of the hash function. */ uint32_t hash_version; @@ -80,7 +82,7 @@ int load_bloom_filter_from_graph(struct commit_graph *g, * Not considered to be cryptographically secure. * Implemented as described in https://en.wikipedia.org/wiki/MurmurHash#Algorithm */ -uint32_t murmur3_seeded(uint32_t seed, const char *data, size_t len); +uint32_t murmur3_seeded_v2(uint32_t seed, const char *data, size_t len); void fill_bloom_key(const char *data, size_t len, diff --git a/commit-graph.c b/commit-graph.c index ea677c87fb..db623afd09 100644 --- a/commit-graph.c +++ b/commit-graph.c @@ -314,17 +314,26 @@ static int graph_read_oid_lookup(const unsigned char *chunk_start, return 0; } +struct graph_read_bloom_data_context { + struct commit_graph *g; + int *commit_graph_changed_paths_version; +}; + static int graph_read_bloom_data(const unsigned char *chunk_start, size_t chunk_size, void *data) { - struct commit_graph *g = data; + struct graph_read_bloom_data_context *c = data; + struct commit_graph *g = c->g; uint32_t hash_version; - g->chunk_bloom_data = chunk_start; hash_version = get_be32(chunk_start); - if (hash_version != 1) + if (*c->commit_graph_changed_paths_version == -1) { + *c->commit_graph_changed_paths_version = hash_version; + } else if (hash_version != *c->commit_graph_changed_paths_version) { return 0; + } + g->chunk_bloom_data = chunk_start; g->bloom_filter_settings = xmalloc(sizeof(struct bloom_filter_settings)); g->bloom_filter_settings->hash_version = hash_version; g->bloom_filter_settings->num_hashes = get_be32(chunk_start + 4); @@ -412,10 +421,14 @@ struct commit_graph *parse_commit_graph(struct repo_settings *s, } if (s->commit_graph_changed_paths_version) { + struct graph_read_bloom_data_context context = { + .g = graph, + .commit_graph_changed_paths_version = &s->commit_graph_changed_paths_version + }; pair_chunk(cf, GRAPH_CHUNKID_BLOOMINDEXES, &graph->chunk_bloom_indexes); read_chunk(cf, GRAPH_CHUNKID_BLOOMDATA, - graph_read_bloom_data, graph); + graph_read_bloom_data, &context); } if (graph->chunk_bloom_indexes && graph->chunk_bloom_data) { @@ -2441,6 +2454,13 @@ int write_commit_graph(struct object_directory *odb, } if (!commit_graph_compatible(r)) return 0; + if (r->settings.commit_graph_changed_paths_version < -1 + || r->settings.commit_graph_changed_paths_version > 2) { + warning(_("attempting to write a commit-graph, but " + "'commitgraph.changedPathsVersion' (%d) is not supported"), + r->settings.commit_graph_changed_paths_version); + return 0; + } CALLOC_ARRAY(ctx, 1); ctx->r = r; @@ -2453,6 +2473,8 @@ int write_commit_graph(struct object_directory *odb, ctx->write_generation_data = (get_configured_generation_version(r) == 2); ctx->num_generation_data_overflows = 0; + bloom_settings.hash_version = r->settings.commit_graph_changed_paths_version == 2 + ? 2 : 1; bloom_settings.bits_per_entry = git_env_ulong("GIT_TEST_BLOOM_SETTINGS_BITS_PER_ENTRY", bloom_settings.bits_per_entry); bloom_settings.num_hashes = git_env_ulong("GIT_TEST_BLOOM_SETTINGS_NUM_HASHES", @@ -2482,7 +2504,7 @@ int write_commit_graph(struct object_directory *odb, g = ctx->r->objects->commit_graph; /* We have changed-paths already. Keep them in the next graph */ - if (g && g->chunk_bloom_data) { + if (g && g->bloom_filter_settings) { ctx->changed_paths = 1; ctx->bloom_settings = g->bloom_filter_settings; } diff --git a/t/helper/test-bloom.c b/t/helper/test-bloom.c index aabe31d724..3cbc0a5b50 100644 --- a/t/helper/test-bloom.c +++ b/t/helper/test-bloom.c @@ -50,6 +50,7 @@ static void get_bloom_filter_for_commit(const struct object_id *commit_oid) static const char *bloom_usage = "\n" " test-tool bloom get_murmur3 <string>\n" +" test-tool bloom get_murmur3_seven_highbit\n" " test-tool bloom generate_filter <string> [<string>...]\n" " test-tool bloom get_filter_for_commit <commit-hex>\n"; @@ -64,7 +65,13 @@ int cmd__bloom(int argc, const char **argv) uint32_t hashed; if (argc < 3) usage(bloom_usage); - hashed = murmur3_seeded(0, argv[2], strlen(argv[2])); + hashed = murmur3_seeded_v2(0, argv[2], strlen(argv[2])); + printf("Murmur3 Hash with seed=0:0x%08x\n", hashed); + } + + if (!strcmp(argv[1], "get_murmur3_seven_highbit")) { + uint32_t hashed; + hashed = murmur3_seeded_v2(0, "\x99\xaa\xbb\xcc\xdd\xee\xff", 7); printf("Murmur3 Hash with seed=0:0x%08x\n", hashed); } diff --git a/t/t0095-bloom.sh b/t/t0095-bloom.sh index b567383eb8..c8d84ab606 100755 --- a/t/t0095-bloom.sh +++ b/t/t0095-bloom.sh @@ -29,6 +29,14 @@ test_expect_success 'compute unseeded murmur3 hash for test string 2' ' test_cmp expect actual ' +test_expect_success 'compute unseeded murmur3 hash for test string 3' ' + cat >expect <<-\EOF && + Murmur3 Hash with seed=0:0xa183ccfd + EOF + test-tool bloom get_murmur3_seven_highbit >actual && + test_cmp expect actual +' + test_expect_success 'compute bloom key for empty string' ' cat >expect <<-\EOF && Hashes:0x5615800c|0x5b966560|0x61174ab4|0x66983008|0x6c19155c|0x7199fab0|0x771ae004| diff --git a/t/t4216-log-bloom.sh b/t/t4216-log-bloom.sh index da67c40134..8f8b5d4966 100755 --- a/t/t4216-log-bloom.sh +++ b/t/t4216-log-bloom.sh @@ -536,4 +536,109 @@ test_expect_success 'version 1 changed-path used when version 1 requested' ' ) ' +test_expect_success 'version 1 changed-path not used when version 2 requested' ' + ( + cd highbit1 && + git config --add commitgraph.changedPathsVersion 2 && + test_bloom_filters_not_used "-- another$CENT" + ) +' + +test_expect_success 'version 1 changed-path used when autodetect requested' ' + ( + cd highbit1 && + git config --add commitgraph.changedPathsVersion -1 && + test_bloom_filters_used "-- another$CENT" + ) +' + +test_expect_success 'when writing another commit graph, preserve existing version 1 of changed-path' ' + test_commit -C highbit1 c1double "$CENT$CENT" && + git -C highbit1 commit-graph write --reachable --changed-paths && + ( + cd highbit1 && + git config --add commitgraph.changedPathsVersion -1 && + echo "options: bloom(1,10,7) read_generation_data" >expect && + test-tool read-graph >full && + grep options full >actual && + test_cmp expect actual + ) +' + +test_expect_success 'set up repo with high bit path, version 2 changed-path' ' + git init highbit2 && + git -C highbit2 config --add commitgraph.changedPathsVersion 2 && + test_commit -C highbit2 c2 "$CENT" && + git -C highbit2 commit-graph write --reachable --changed-paths +' + +test_expect_success 'check value of version 2 changed-path' ' + ( + cd highbit2 && + echo "c01f" >expect && + get_first_changed_path_filter >actual && + test_cmp expect actual + ) +' + +test_expect_success 'setup make another commit' ' + # "git log" does not use Bloom filters for root commits - see how, in + # revision.c, rev_compare_tree() (the only code path that eventually calls + # get_bloom_filter()) is only called by try_to_simplify_commit() when the commit + # has one parent. Therefore, make another commit so that we perform the tests on + # a non-root commit. + test_commit -C highbit2 anotherc2 "another$CENT" +' + +test_expect_success 'version 2 changed-path used when version 2 requested' ' + ( + cd highbit2 && + test_bloom_filters_used "-- another$CENT" + ) +' + +test_expect_success 'version 2 changed-path not used when version 1 requested' ' + ( + cd highbit2 && + git config --add commitgraph.changedPathsVersion 1 && + test_bloom_filters_not_used "-- another$CENT" + ) +' + +test_expect_success 'version 2 changed-path used when autodetect requested' ' + ( + cd highbit2 && + git config --add commitgraph.changedPathsVersion -1 && + test_bloom_filters_used "-- another$CENT" + ) +' + +test_expect_success 'when writing another commit graph, preserve existing version 2 of changed-path' ' + test_commit -C highbit2 c2double "$CENT$CENT" && + git -C highbit2 commit-graph write --reachable --changed-paths && + ( + cd highbit2 && + git config --add commitgraph.changedPathsVersion -1 && + echo "options: bloom(2,10,7) read_generation_data" >expect && + test-tool read-graph >full && + grep options full >actual && + test_cmp expect actual + ) +' + +test_expect_success 'when writing commit graph, do not reuse changed-path of another version' ' + git init doublewrite && + test_commit -C doublewrite c "$CENT" && + git -C doublewrite config --add commitgraph.changedPathsVersion 1 && + git -C doublewrite commit-graph write --reachable --changed-paths && + git -C doublewrite config --add commitgraph.changedPathsVersion 2 && + git -C doublewrite commit-graph write --reachable --changed-paths && + ( + cd doublewrite && + echo "c01f" >expect && + get_first_changed_path_filter >actual && + test_cmp expect actual + ) +' + test_done