Message ID | 1cb0ee10253ab38b194c6554ecc68a5267e21145.1684790529.git.jonathantanmy@google.com (mailing list archive) |
---|---|
State | Superseded |
Headers | show |
Series | Changed path filter hash fix and version bump | expand |
On 5/22/2023 5:48 PM, Jonathan Tan wrote: > The murmur3 implementation in bloom.c has a bug when converting series > of 4 bytes into network-order integers when char is signed (which is > controllable by a compiler option, and the default signedness of char is > platform-specific). When a string contains characters with the high bit > set, this bug causes results that, although internally consistent within > Git, does not accord with other implementations of murmur3 and even with > Git binaries that were compiled with different signedness of char. This > bug affects both how Git writes changed path filters to disk and how Git > interprets changed path filters on disk. > > Therefore, fix this bug. And because changed path filters on disk might > no longer be compatible, teach Git to write "2" as the version when > writing changed path filters (instead of "1" currently), and only accept > "2" as the version when reading them (instead of "1" currently). I appreciate that you discovered and are presenting a way out of this problem, however the current approach does not preserve compatibility enough. > @@ -82,10 +82,10 @@ uint32_t murmur3_seeded(uint32_t seed, const char *data, size_t len) > > uint32_t k; > for (i = 0; i < len4; i++) { > - uint32_t byte1 = (uint32_t)data[4*i]; > - uint32_t byte2 = ((uint32_t)data[4*i + 1]) << 8; > - uint32_t byte3 = ((uint32_t)data[4*i + 2]) << 16; > - uint32_t byte4 = ((uint32_t)data[4*i + 3]) << 24; > + uint32_t byte1 = (uint32_t)(unsigned char)data[4*i]; > + uint32_t byte2 = ((uint32_t)(unsigned char)data[4*i + 1]) << 8; > + uint32_t byte3 = ((uint32_t)(unsigned char)data[4*i + 2]) << 16; > + uint32_t byte4 = ((uint32_t)(unsigned char)data[4*i + 3]) << 24; > k = byte1 | byte2 | byte3 | byte4; > k *= c1; > k = rotate_left(k, r1); By changing this algorithm directly (instead of making an "unsigned" version, or renaming this one to the "maybe signed" version), you are making it impossible for us to ship a version that can read version 1 Bloom filters, so all read-only history operations will immediately slow down (because they will ignore v1 chunks, better than incorrectly parsing v1 chunks). > @@ -314,7 +314,7 @@ static int graph_read_bloom_data(const unsigned char *chunk_start, > g->chunk_bloom_data = chunk_start; > hash_version = get_be32(chunk_start); > > - if (hash_version != 1) > + if (hash_version != BLOOM_HASH_VERSION) > return 0; Here's where we would ignore v1 filters, instead of continuing to read them (with all the risks involved). In order for this to be something we can ship safely to environments that depend on changed-path Bloom filters, we need to be able to parse v1 filters. It would be even better if we didn't write v2 filters by default, but instead hid it behind a config option that is off by default for at least one major release. I notice that you didn't update the commit-graph format docs, which seems like a valuable place to describe the new version number, as well as any plans to completely deprecate v1. For instance, describing the v1 implementation as having inconsistent application of murmur3 is a valuable thing to have, but then describe the plans for deprecating it as an unsafe format. Here is a potential plan to consider: 1. v2.42.0 includes writing v2 format, off by default. 2. v2.43.0 writes v2 format by default. 3. v2.44.0 no longer parses v1 format (ignored without error). Thanks, -Stolee
Derrick Stolee <derrickstolee@github.com> writes: > I notice that you didn't update the commit-graph format docs, Ah, thanks for the reminder. > Here is a potential plan to consider: > > 1. v2.42.0 includes writing v2 format, off by default. > 2. v2.43.0 writes v2 format by default. > 3. v2.44.0 no longer parses v1 format (ignored without error). First of all, thanks for your comments on the migration process - that is indeed the most complicated part of this. The code change to support 2 versions seems not too hard (duplicate murmur3_seeded() and modify one so that we have one version 1 and one version 2, and teach fill_bloom_key() to call the appropriate one based on the struct bloom_filter_settings) but this requires both the code author and reviewer(s) to check that we don't have cases in which we read or write one version when we're supposed to do it with the other. And the benefit of doing this seems to just be giving server administrators an opportunity to perform the migration at a more relaxed pace, which I think there are other ways to accomplish if we really wanted to do this, so I wanted to avoid having 2 versions in the Git codebase.
Derrick Stolee <derrickstolee@github.com> writes: > I appreciate that you discovered and are presenting a way out of this > problem, however the current approach does not preserve compatibility > enough. > ... > By changing this algorithm directly (instead of making an "unsigned" version, > or renaming this one to the "maybe signed" version), you are making it > impossible for us to ship a version that can read version 1 Bloom filters, > so all read-only history operations will immediately slow down (because they > will ignore v1 chunks, better than incorrectly parsing v1 chunks). > > Here's where we would ignore v1 filters, instead of continuing to read them > (with all the risks involved). I do not know the "all the risks involved" comment. Is the risk something we can mitigate by still reading v1 data but be careful about when not to apply the filters? I may be misremembering the original discussion, but wasn't the conclusion that v1 data is salvageable in the sense that it can still reliably say that, given a pathname without bytes with high-bit set, it cannot possibly belong to the set of changed paths, even though, because the filter is contaminated with "signed" data, its false-positive rate may be higher than using "unsigned" version? And based on that observation, we can still read v1 data but only use the Bloom filters when the queried paths have no byte with high-bit set. Also if we are operating in such an environment then would it be possible to first compute as if we were going to generate v2 data, but write it as v1 after reading all the path and realizing there are no problematic paths? IOW, even if the version of Git is capable of writing and reading v2, it does not have to use v2, right? To put it the other way around, we will have to and can keep supporting data that is labeled as v1, no? > In order for this to be something we can ship safely to environments that depend > on changed-path Bloom filters, we need to be able to parse v1 filters. It would > be even better if we didn't write v2 filters by default, but instead hid it > behind a config option that is off by default for at least one major release. Is the concern that we will double the chunk size because both v1 and v2 will be written? Or is it that we will stop writing v1 if we start writing v2 and switching too early will mean the repositories will become slower for older implementations that haven't died out? Thanks.
Junio C Hamano <gitster@pobox.com> writes: > I may be misremembering the original discussion, but wasn't the > conclusion that v1 data is salvageable in the sense that it can > still reliably say that, given a pathname without bytes with > high-bit set, it cannot possibly belong to the set of changed paths, > even though, because the filter is contaminated with "signed" data, > its false-positive rate may be higher than using "unsigned" version? > And based on that observation, we can still read v1 data but only > use the Bloom filters when the queried paths have no byte with > high-bit set. There are at least 3 ways of salvaging the data that we've discussed: - Enumerating all of a repo's paths and if none of them have a high bit, retain the existing filters. - Walking all of a repo's trees (so that we know which tree corresponds to which commit) and if for a commit, all its trees have no high bit, retain the filter for that tree (otherwise recompute it). - Keep using a version 1 filter but only when the sought-for path has no high bit (as you describe here). (The first 2 is my interpretation of what Taylor described [1].) I'm not sure if we want to keep version 1 filters around at all - this would work with Git as long as it is not compiled with different signedness of char, but would not work with other implementations of Git (unless they replicate the hashing bug). There is also the issue of how we're going to indicate that in a commit graph file, some filters are version 1 and some filters are version 2 (unless the plan is to completely rewrite the filters in this case, but then we'll run into the issue that computing these filters en-masse is expensive, as Taylor describes also in [1]). > Also if we are operating in such an environment then would it be > possible to first compute as if we were going to generate v2 data, > but write it as v1 after reading all the path and realizing there > are no problematic paths? I think in this case, we would want to write it as v2 anyway, because there's no way to distinguish a v1 that has high bits and is written incorrectly versus a v1 that happens to have no high bits and therefore is identical under v2. > IOW, even if the version of Git is > capable of writing and reading v2, it does not have to use v2, > right? To put it the other way around, we will have to and can keep > supporting data that is labeled as v1, no? I think this is the main point - whether we want to continue supporting data labeled as v1. I personally think that we should migrate to v2 as quickly as possible. But if the consensus is that we should support both, at least for a few releases of Git, I'll go with that. [1] https://lore.kernel.org/git/ZF116EDcmAy7XEbC@nand.local/
On 5/24/2023 5:26 PM, Jonathan Tan wrote: > Junio C Hamano <gitster@pobox.com> writes: >> IOW, even if the version of Git is >> capable of writing and reading v2, it does not have to use v2, >> right? To put it the other way around, we will have to and can keep >> supporting data that is labeled as v1, no? > > I think this is the main point - whether we want to continue supporting > data labeled as v1. I personally think that we should migrate to v2 > as quickly as possible. But if the consensus is that we should support > both, at least for a few releases of Git, I'll go with that. I agree on migrating quickly as possible, within basic safety guidelines. Shipping a Git change that suddenly is unable to use on-disk data that it has previously relied upon is not a safe change. And that is the absolute minimum amount of safety required. The other side is to not make a Git change that suddenly changes the on-disk format without a switch to disable it. Think about it this way: if there was a bug in the code, could we safely roll it back? If we are immediately writing v2 filters after the deployment, then rolling it back will cause the previous version to not recognize those filters, leading to a delayed recovery. I'd be willing to modify my suggested steps: >>> 1. v2.42.0 includes writing v2 format, off by default. >>> 2. v2.43.0 writes v2 format by default. >>> 3. v2.44.0 no longer parses v1 format (ignored without error). to something simpler: 1. v2.42.0 writes v2 format by default, but can be disabled by config. 2. v2.43.0 no longer parses or writes v1 format. With this, we could proactively set the config value that disables the v2 format in our production environment, then slowly re-enable that config after the binaries have deployed. This allows us to limit the blast radius if something goes wrong, which is really important. Further, I'm describing an environment where we control all of the Git versions that are interacting with the repositories. Other environments don't have that luxury, such as typical client users. Even the three-version plan is an accelerated deprecation plan, based on previous examples in Git. Thanks, -Stolee
Derrick Stolee <derrickstolee@github.com> writes: > I'd be willing to modify my suggested steps: > > >>> 1. v2.42.0 includes writing v2 format, off by default. > >>> 2. v2.43.0 writes v2 format by default. > >>> 3. v2.44.0 no longer parses v1 format (ignored without error). > > to something simpler: > > 1. v2.42.0 writes v2 format by default, but can be disabled by config. > 2. v2.43.0 no longer parses or writes v1 format. > > With this, we could proactively set the config value that disables the > v2 format in our production environment, then slowly re-enable that > config after the binaries have deployed. This allows us to limit the > blast radius if something goes wrong, which is really important. > > Further, I'm describing an environment where we control all of the Git > versions that are interacting with the repositories. Other environments > don't have that luxury, such as typical client users. > > Even the three-version plan is an accelerated deprecation plan, based > on previous examples in Git. > > Thanks, > -Stolee OK, let me take a look and see what this (having at least a version of Git that supports both versions of hash functions) would look like. If we're going to have this, we might as well roll it out as safely as possible, so I'll aim for your original step 1 of 3 (write v2, off by default).
diff --git a/bloom.c b/bloom.c index aef6b5fea2..fec243b2f1 100644 --- a/bloom.c +++ b/bloom.c @@ -82,10 +82,10 @@ uint32_t murmur3_seeded(uint32_t seed, const char *data, size_t len) uint32_t k; for (i = 0; i < len4; i++) { - uint32_t byte1 = (uint32_t)data[4*i]; - uint32_t byte2 = ((uint32_t)data[4*i + 1]) << 8; - uint32_t byte3 = ((uint32_t)data[4*i + 2]) << 16; - uint32_t byte4 = ((uint32_t)data[4*i + 3]) << 24; + uint32_t byte1 = (uint32_t)(unsigned char)data[4*i]; + uint32_t byte2 = ((uint32_t)(unsigned char)data[4*i + 1]) << 8; + uint32_t byte3 = ((uint32_t)(unsigned char)data[4*i + 2]) << 16; + uint32_t byte4 = ((uint32_t)(unsigned char)data[4*i + 3]) << 24; k = byte1 | byte2 | byte3 | byte4; k *= c1; k = rotate_left(k, r1); @@ -99,13 +99,13 @@ uint32_t murmur3_seeded(uint32_t seed, const char *data, size_t len) switch (len & (sizeof(uint32_t) - 1)) { case 3: - k1 ^= ((uint32_t)tail[2]) << 16; + k1 ^= ((uint32_t)(unsigned char)tail[2]) << 16; /*-fallthrough*/ case 2: - k1 ^= ((uint32_t)tail[1]) << 8; + k1 ^= ((uint32_t)(unsigned char)tail[1]) << 8; /*-fallthrough*/ case 1: - k1 ^= ((uint32_t)tail[0]) << 0; + k1 ^= ((uint32_t)(unsigned char)tail[0]) << 0; k1 *= c1; k1 = rotate_left(k1, r1); k1 *= c2; diff --git a/bloom.h b/bloom.h index adde6dfe21..8526fa948c 100644 --- a/bloom.h +++ b/bloom.h @@ -7,9 +7,11 @@ struct repository; struct bloom_filter_settings { /* * The version of the hashing technique being used. - * We currently only support version = 1 which is + * We currently only support version = 2 which is * the seeded murmur3 hashing technique implemented - * in bloom.c. + * in bloom.c. Bloom filters of version 1 were created + * with prior versions of Git, which had a bug in the + * implementation of the hash function. */ uint32_t hash_version; @@ -38,8 +40,9 @@ struct bloom_filter_settings { uint32_t max_changed_paths; }; +#define BLOOM_HASH_VERSION 2 #define DEFAULT_BLOOM_MAX_CHANGES 512 -#define DEFAULT_BLOOM_FILTER_SETTINGS { 1, 7, 10, DEFAULT_BLOOM_MAX_CHANGES } +#define DEFAULT_BLOOM_FILTER_SETTINGS { BLOOM_HASH_VERSION, 7, 10, DEFAULT_BLOOM_MAX_CHANGES } #define BITS_PER_WORD 8 #define BLOOMDATA_CHUNK_HEADER_SIZE 3 * sizeof(uint32_t) diff --git a/commit-graph.c b/commit-graph.c index 843bdb458d..2eb9b781f4 100644 --- a/commit-graph.c +++ b/commit-graph.c @@ -314,7 +314,7 @@ static int graph_read_bloom_data(const unsigned char *chunk_start, g->chunk_bloom_data = chunk_start; hash_version = get_be32(chunk_start); - if (hash_version != 1) + if (hash_version != BLOOM_HASH_VERSION) return 0; g->bloom_filter_settings = xmalloc(sizeof(struct bloom_filter_settings)); @@ -2402,7 +2402,7 @@ int write_commit_graph(struct object_directory *odb, g = ctx->r->objects->commit_graph; /* We have changed-paths already. Keep them in the next graph */ - if (g && g->chunk_bloom_data) { + if (g && g->bloom_filter_settings) { ctx->changed_paths = 1; ctx->bloom_settings = g->bloom_filter_settings; } diff --git a/t/helper/test-bloom.c b/t/helper/test-bloom.c index aabe31d724..5624e4d2db 100644 --- a/t/helper/test-bloom.c +++ b/t/helper/test-bloom.c @@ -50,6 +50,7 @@ static void get_bloom_filter_for_commit(const struct object_id *commit_oid) static const char *bloom_usage = "\n" " test-tool bloom get_murmur3 <string>\n" +" test-tool bloom get_murmur3_seven_highbit\n" " test-tool bloom generate_filter <string> [<string>...]\n" " test-tool bloom get_filter_for_commit <commit-hex>\n"; @@ -68,6 +69,12 @@ int cmd__bloom(int argc, const char **argv) printf("Murmur3 Hash with seed=0:0x%08x\n", hashed); } + if (!strcmp(argv[1], "get_murmur3_seven_highbit")) { + uint32_t hashed; + hashed = murmur3_seeded(0, "\x99\xaa\xbb\xcc\xdd\xee\xff", 7); + printf("Murmur3 Hash with seed=0:0x%08x\n", hashed); + } + if (!strcmp(argv[1], "generate_filter")) { struct bloom_filter filter; int i = 2; diff --git a/t/t0095-bloom.sh b/t/t0095-bloom.sh index b567383eb8..c8d84ab606 100755 --- a/t/t0095-bloom.sh +++ b/t/t0095-bloom.sh @@ -29,6 +29,14 @@ test_expect_success 'compute unseeded murmur3 hash for test string 2' ' test_cmp expect actual ' +test_expect_success 'compute unseeded murmur3 hash for test string 3' ' + cat >expect <<-\EOF && + Murmur3 Hash with seed=0:0xa183ccfd + EOF + test-tool bloom get_murmur3_seven_highbit >actual && + test_cmp expect actual +' + test_expect_success 'compute bloom key for empty string' ' cat >expect <<-\EOF && Hashes:0x5615800c|0x5b966560|0x61174ab4|0x66983008|0x6c19155c|0x7199fab0|0x771ae004| diff --git a/t/t4216-log-bloom.sh b/t/t4216-log-bloom.sh index f14cc1c1f1..7a193aa143 100755 --- a/t/t4216-log-bloom.sh +++ b/t/t4216-log-bloom.sh @@ -48,7 +48,7 @@ graph_read_expect () { header: 43475048 1 $(test_oid oid_version) $NUM_CHUNKS 0 num_commits: $1 chunks: oid_fanout oid_lookup commit_metadata generation_data bloom_indexes bloom_data - options: bloom(1,10,7) read_generation_data + options: bloom(2,10,7) read_generation_data EOF test-tool read-graph >actual && test_cmp expect actual @@ -108,7 +108,7 @@ test_expect_success 'incompatible bloom filter versions are not used' ' # But the correct version number works cat old-commit-graph >new-commit-graph && - printf "\01" | + printf "\02" | dd of=new-commit-graph bs=1 count=1 \ seek=$((BDAT_OFFSET + 3)) conv=notrunc && mv new-commit-graph .git/objects/info/commit-graph && @@ -209,10 +209,10 @@ test_expect_success 'persist filter settings' ' GIT_TEST_BLOOM_SETTINGS_NUM_HASHES=9 \ GIT_TEST_BLOOM_SETTINGS_BITS_PER_ENTRY=15 \ git commit-graph write --reachable --changed-paths && - grep "{\"hash_version\":1,\"num_hashes\":9,\"bits_per_entry\":15,\"max_changed_paths\":512" trace2.txt && + grep "{\"hash_version\":2,\"num_hashes\":9,\"bits_per_entry\":15,\"max_changed_paths\":512" trace2.txt && GIT_TRACE2_EVENT="$(pwd)/trace2-auto.txt" \ git commit-graph write --reachable --changed-paths && - grep "{\"hash_version\":1,\"num_hashes\":9,\"bits_per_entry\":15,\"max_changed_paths\":512" trace2-auto.txt + grep "{\"hash_version\":2,\"num_hashes\":9,\"bits_per_entry\":15,\"max_changed_paths\":512" trace2-auto.txt ' test_max_changed_paths () {
The murmur3 implementation in bloom.c has a bug when converting series of 4 bytes into network-order integers when char is signed (which is controllable by a compiler option, and the default signedness of char is platform-specific). When a string contains characters with the high bit set, this bug causes results that, although internally consistent within Git, does not accord with other implementations of murmur3 and even with Git binaries that were compiled with different signedness of char. This bug affects both how Git writes changed path filters to disk and how Git interprets changed path filters on disk. Therefore, fix this bug. And because changed path filters on disk might no longer be compatible, teach Git to write "2" as the version when writing changed path filters (instead of "1" currently), and only accept "2" as the version when reading them (instead of "1" currently). Because this bug only manifests with characters that have the high bit set, it may be possible that some (or all) commits in a given repo would have the same changed path filter both before and after this fix is applied. However, in order to determine whether this is the case, the changed paths would first have to be computed, at which point it is not much more expensive to just compute a new changed path filter. So this patch does not include any mechanism to "salvage" changed path filters from repositories. There is a change in write_commit_graph(). graph_read_bloom_data() makes it possible for chunk_bloom_data to be non-NULL but bloom_filter_settings to be NULL, which causes a segfault later on. I produced such a segfault while developing this patch, but couldn't find a way to reproduce it neither after this complete patch (or before), but in any case it seemed like a good thing to include that might help future patch authors. The value in the test was obtained from another murmur3 implementation using the following Go source code: package main import "fmt" import "github.com/spaolacci/murmur3" func main() { fmt.Printf("%x\n", murmur3.Sum32([]byte("Hello world!"))) fmt.Printf("%x\n", murmur3.Sum32([]byte{0x99, 0xaa, 0xbb, 0xcc, 0xdd, 0xee, 0xff})) } Signed-off-by: Jonathan Tan <jonathantanmy@google.com> --- bloom.c | 14 +++++++------- bloom.h | 9 ++++++--- commit-graph.c | 4 ++-- t/helper/test-bloom.c | 7 +++++++ t/t0095-bloom.sh | 8 ++++++++ t/t4216-log-bloom.sh | 8 ++++---- 6 files changed, 34 insertions(+), 16 deletions(-)