Message ID | 9b56496b0809cc8a25af877ea97042e2cb7f2af6.1655246092.git.steadmon@google.com (mailing list archive) |
---|---|
State | New, archived |
Headers | show |
Series | [v3] commit-graph: refactor to avoid prepare_repo_settings | expand |
On Tue, Jun 14, 2022 at 03:37:21PM -0700, Josh Steadmon wrote: > Range-diff against v2: > 1: 2c2bfc7b43 ! 1: 9b56496b08 commit-graph: refactor to avoid prepare_repo_settings > @@ fuzz-commit-graph.c: int LLVMFuzzerTestOneInput(const uint8_t *data, size_t size > initialize_the_repository(); > - g = parse_commit_graph(the_repository, (void *)data, size); > + /* > -+ * Manually initialize commit-graph settings, to avoid the need to run > -+ * in an actual repository. > ++ * Initialize the_repository with commit-graph settings that would > ++ * normally be read from the repository's gitdir. We want to avoid > ++ * touching the disk to keep the individual fuzz-test cases as fast as > ++ * possible. > + */ > + the_repository->settings.commit_graph_generation_version = 2; > + the_repository->settings.commit_graph_read_changed_paths = 1; This version looks good to me. Thanks for working on this, Josh! Thanks, Taylor
Josh Steadmon <steadmon@google.com> writes: > This series of changes broke fuzz-commit-graph, which attempts to parse > arbitrary fuzzing-engine-provided bytes as a commit graph file. > commit-graph.c:parse_commit_graph() calls prepare_repo_settings(), but > since we run the fuzz tests without a valid repository, we are hitting > the BUG() from 44c7e62 for every test case. > > Fix this by moving the majority of the implementaiton of > `parse_commit_graph()` into a new function, > `parse_commit_graph_settings()` that accepts a repo_settings pointer. > This allows fuzz-commit-graph to continue to test the commit-graph > parser implementation without relying on prepare_repo_settings(). It sounds like this is not a "fix" but a workaround to bend the production code so that a non-production test shim can be inserted more easily. I am OK with the idea, but have a huge problem with the new name. Is it just me who thinks parse_commit_graph_settings() is a function that parses some kind of settings that affects the way the commit graph gets used or written? Stepping back a bit, why can't fuzz-commit-graph prepare a repository object that looks sufficiently real? Something along the lines of... struct repository fake_repo; fake_repo.settings.initialized = 1; fake_repo.gitdir = "."; parse_commit_graph(&fake_repo, (void *)data, size); ... Also, I feel somewhat uneasy to see these changes: > - if (get_configured_generation_version(r) >= 2) { > + if (s->commit_graph_generation_version >= 2) { > - if (r->settings.commit_graph_read_changed_paths) { > + if (s->commit_graph_read_changed_paths) { > - ctx->write_generation_data = (get_configured_generation_version(r) == 2); > + ctx->write_generation_data = (r->settings.commit_graph_generation_version == 2); > ctx->num_generation_data_overflows = 0; that makes the production code bend over backwards to _avoid_ referencing 'r', only to cater to the test shim. That's an artificial limitation we are forcing on our developers who works on this code. It might be that what is in the repository settings is sufficient for today's code to work, but I do not think needs for fuzz tests should tie the hand of this production code by forbidding it to look at other things in the repository in the future. After all, tests are to serve the production code, not the other way around. On the other hand, I think a change that is slightly smaller than the posted patch, which justifies itself with a completely different rationale, would be totally acceptable. You can justify this change with NO mention of fuzzers. The parse_commit_graph() function takes a "struct repository *" pointer, but all it cares about is the .settings member of it. Update the function and all its existing callers so that it takes "struct repo_settings *" instead. Now, in the future, some developers _might_ find it necessary to access stuff other than the repository settings to validate the contents of the graph file, and we may need to change it to take a full repository structure again. The test should adjust to such needs of the production code, not the other way around. But until then... Thanks.
On 2022.06.23 14:59, Junio C Hamano wrote: > Josh Steadmon <steadmon@google.com> writes: > > > This series of changes broke fuzz-commit-graph, which attempts to parse > > arbitrary fuzzing-engine-provided bytes as a commit graph file. > > commit-graph.c:parse_commit_graph() calls prepare_repo_settings(), but > > since we run the fuzz tests without a valid repository, we are hitting > > the BUG() from 44c7e62 for every test case. > > > > Fix this by moving the majority of the implementaiton of > > `parse_commit_graph()` into a new function, > > `parse_commit_graph_settings()` that accepts a repo_settings pointer. > > This allows fuzz-commit-graph to continue to test the commit-graph > > parser implementation without relying on prepare_repo_settings(). > > It sounds like this is not a "fix" but a workaround to bend the > production code so that a non-production test shim can be inserted > more easily. > > I am OK with the idea, but have a huge problem with the new name. > > Is it just me who thinks parse_commit_graph_settings() is a function > that parses some kind of settings that affects the way the commit > graph gets used or written? > > Stepping back a bit, why can't fuzz-commit-graph prepare a > repository object that looks sufficiently real? Something along the > lines of... > > struct repository fake_repo; > > fake_repo.settings.initialized = 1; > fake_repo.gitdir = "."; > parse_commit_graph(&fake_repo, (void *)data, size); > ... > > Also, I feel somewhat uneasy to see these changes: > > > - if (get_configured_generation_version(r) >= 2) { > > + if (s->commit_graph_generation_version >= 2) { > > - if (r->settings.commit_graph_read_changed_paths) { > > + if (s->commit_graph_read_changed_paths) { > > - ctx->write_generation_data = (get_configured_generation_version(r) == 2); > > + ctx->write_generation_data = (r->settings.commit_graph_generation_version == 2); > > ctx->num_generation_data_overflows = 0; > > that makes the production code bend over backwards to _avoid_ > referencing 'r', only to cater to the test shim. That's an > artificial limitation we are forcing on our developers who works on > this code. It might be that what is in the repository settings is > sufficient for today's code to work, but I do not think needs for > fuzz tests should tie the hand of this production code by forbidding > it to look at other things in the repository in the future. After > all, tests are to serve the production code, not the other way > around. > > On the other hand, I think a change that is slightly smaller than > the posted patch, which justifies itself with a completely different > rationale, would be totally acceptable. You can justify this change > with NO mention of fuzzers. > > The parse_commit_graph() function takes a "struct repository *" > pointer, but all it cares about is the .settings member of it. > > Update the function and all its existing callers so that it > takes "struct repo_settings *" instead. > > Now, in the future, some developers _might_ find it necessary to > access stuff other than the repository settings to validate the > contents of the graph file, and we may need to change it to take a > full repository structure again. The test should adjust to such > needs of the production code, not the other way around. But until > then... > > Thanks. I trimmed down the changes and reworded the commit message for V4. I'm assuming you don't object to mentioning the fuzzer, just that it shouldn't be the main justifiation for the change. Sorry for the delayed response, and thanks for the feedback.
diff --git a/commit-graph.c b/commit-graph.c index 265c010122..c54a734619 100644 --- a/commit-graph.c +++ b/commit-graph.c @@ -96,13 +96,6 @@ define_commit_slab(commit_graph_data_slab, struct commit_graph_data); static struct commit_graph_data_slab commit_graph_data_slab = COMMIT_SLAB_INIT(1, commit_graph_data_slab); -static int get_configured_generation_version(struct repository *r) -{ - int version = 2; - repo_config_get_int(r, "commitgraph.generationversion", &version); - return version; -} - uint32_t commit_graph_position(const struct commit *c) { struct commit_graph_data *data = @@ -335,6 +328,13 @@ static int graph_read_bloom_data(const unsigned char *chunk_start, struct commit_graph *parse_commit_graph(struct repository *r, void *graph_map, size_t graph_size) +{ + prepare_repo_settings(r); + return parse_commit_graph_settings(&r->settings, graph_map, graph_size); +} + +struct commit_graph *parse_commit_graph_settings(struct repo_settings *s, + void *graph_map, size_t graph_size) { const unsigned char *data; struct commit_graph *graph; @@ -371,8 +371,6 @@ struct commit_graph *parse_commit_graph(struct repository *r, return NULL; } - prepare_repo_settings(r); - graph = alloc_commit_graph(); graph->hash_len = the_hash_algo->rawsz; @@ -402,14 +400,14 @@ struct commit_graph *parse_commit_graph(struct repository *r, pair_chunk(cf, GRAPH_CHUNKID_EXTRAEDGES, &graph->chunk_extra_edges); pair_chunk(cf, GRAPH_CHUNKID_BASE, &graph->chunk_base_graphs); - if (get_configured_generation_version(r) >= 2) { + if (s->commit_graph_generation_version >= 2) { pair_chunk(cf, GRAPH_CHUNKID_GENERATION_DATA, &graph->chunk_generation_data); pair_chunk(cf, GRAPH_CHUNKID_GENERATION_DATA_OVERFLOW, &graph->chunk_generation_data_overflow); } - if (r->settings.commit_graph_read_changed_paths) { + if (s->commit_graph_read_changed_paths) { pair_chunk(cf, GRAPH_CHUNKID_BLOOMINDEXES, &graph->chunk_bloom_indexes); read_chunk(cf, GRAPH_CHUNKID_BLOOMDATA, @@ -2288,7 +2286,7 @@ int write_commit_graph(struct object_directory *odb, ctx->split = flags & COMMIT_GRAPH_WRITE_SPLIT ? 1 : 0; ctx->opts = opts; ctx->total_bloom_filter_data_size = 0; - ctx->write_generation_data = (get_configured_generation_version(r) == 2); + ctx->write_generation_data = (r->settings.commit_graph_generation_version == 2); ctx->num_generation_data_overflows = 0; bloom_settings.bits_per_entry = git_env_ulong("GIT_TEST_BLOOM_SETTINGS_BITS_PER_ENTRY", diff --git a/commit-graph.h b/commit-graph.h index 04a94e1830..0f0d28b129 100644 --- a/commit-graph.h +++ b/commit-graph.h @@ -95,6 +95,8 @@ struct commit_graph *read_commit_graph_one(struct repository *r, struct object_directory *odb); struct commit_graph *parse_commit_graph(struct repository *r, void *graph_map, size_t graph_size); +struct commit_graph *parse_commit_graph_settings(struct repo_settings *s, + void *graph_map, size_t graph_size); /* * Return 1 if and only if the repository has a commit-graph diff --git a/fuzz-commit-graph.c b/fuzz-commit-graph.c index e7cf6d5b0f..810b1a8001 100644 --- a/fuzz-commit-graph.c +++ b/fuzz-commit-graph.c @@ -11,7 +11,15 @@ int LLVMFuzzerTestOneInput(const uint8_t *data, size_t size) struct commit_graph *g; initialize_the_repository(); - g = parse_commit_graph(the_repository, (void *)data, size); + /* + * Initialize the_repository with commit-graph settings that would + * normally be read from the repository's gitdir. We want to avoid + * touching the disk to keep the individual fuzz-test cases as fast as + * possible. + */ + the_repository->settings.commit_graph_generation_version = 2; + the_repository->settings.commit_graph_read_changed_paths = 1; + g = parse_commit_graph_settings(&the_repository->settings, (void *)data, size); repo_clear(the_repository); free_commit_graph(g); diff --git a/repo-settings.c b/repo-settings.c index b4fbd16cdc..26241c1c2c 100644 --- a/repo-settings.c +++ b/repo-settings.c @@ -10,6 +10,13 @@ static void repo_cfg_bool(struct repository *r, const char *key, int *dest, *dest = def; } +static void repo_cfg_int(struct repository *r, const char *key, int *dest, + int def) +{ + if (repo_config_get_int(r, key, dest)) + *dest = def; +} + void prepare_repo_settings(struct repository *r) { int experimental; @@ -41,11 +48,14 @@ void prepare_repo_settings(struct repository *r) r->settings.core_untracked_cache = UNTRACKED_CACHE_WRITE; } - /* Boolean config or default, does not cascade (simple) */ + /* Commit graph config or default, does not cascade (simple) */ repo_cfg_bool(r, "core.commitgraph", &r->settings.core_commit_graph, 1); + repo_cfg_int(r, "commitgraph.generationversion", &r->settings.commit_graph_generation_version, 2); repo_cfg_bool(r, "commitgraph.readchangedpaths", &r->settings.commit_graph_read_changed_paths, 1); repo_cfg_bool(r, "gc.writecommitgraph", &r->settings.gc_write_commit_graph, 1); repo_cfg_bool(r, "fetch.writecommitgraph", &r->settings.fetch_write_commit_graph, 0); + + /* Boolean config or default, does not cascade (simple) */ repo_cfg_bool(r, "pack.usesparse", &r->settings.pack_use_sparse, 1); repo_cfg_bool(r, "core.multipackindex", &r->settings.core_multi_pack_index, 1); repo_cfg_bool(r, "index.sparse", &r->settings.sparse_index, 0); diff --git a/repository.h b/repository.h index ca837cb9e9..4f8275f97c 100644 --- a/repository.h +++ b/repository.h @@ -29,6 +29,7 @@ struct repo_settings { int initialized; int core_commit_graph; + int commit_graph_generation_version; int commit_graph_read_changed_paths; int gc_write_commit_graph; int fetch_write_commit_graph;