Message ID | 20200612184014.1226972-2-abhishekkumar8222@gmail.com (mailing list archive) |
---|---|
State | New, archived |
Headers | show |
Series | Move generation, graph_pos to a slab | expand |
Abhishek Kumar <abhishekkumar8222@gmail.com> writes: > static unsigned int alloc_commit_index(struct repository *r) > { > - return r->parsed_objects->commit_count++; > + static unsigned int parsed_commits_count = 0; > + r->parsed_objects->commit_count++; > + return parsed_commits_count++; > } Hmph, ugly. The need for this hack, which butchers the function that explicitly takes a repository as its parameter so that parsed commits can be counted per repository to not count commits per repository, makes the whole thing smell fishy. There shouldn't be a reason why a commit in the superproject and another commit in the submodule need to be in the same commit graph in the first place. Instead of breaking the function like this, the right "fix" may be to make the commit slab per repository, because the commit index are taken from the separate namespace per repository.
Junio C Hamano <gitster@pobox.com> writes: > Abhishek Kumar <abhishekkumar8222@gmail.com> writes: > >> static unsigned int alloc_commit_index(struct repository *r) >> { >> - return r->parsed_objects->commit_count++; >> + static unsigned int parsed_commits_count = 0; >> + r->parsed_objects->commit_count++; >> + return parsed_commits_count++; >> } > > Hmph, ugly. > > The need for this hack, which butchers the function that explicitly > takes a repository as its parameter so that parsed commits can be > counted per repository to not count commits per repository, makes > the whole thing smell fishy. > > There shouldn't be a reason why a commit in the superproject and > another commit in the submodule need to be in the same commit graph > in the first place. > > Instead of breaking the function like this, the right "fix" may be > to make the commit slab per repository, because the commit index are > taken from the separate namespace per repository. Given a commit object (i.e. "struct commit *"), currently there is no way to tell from which repository it came from. So from that point of view, we cannot identify which commit slab to use, unless we add a back-pointer that says from which repository each commit object came from, but that makes the commit object even heavier. Back when 14ba97f8 (alloc: allow arbitrary repositories for alloc functions, 2018-05-15) made the count per repository (which you are reverting with this patch), there must have been a reason why it did so. We know that commit slab code wants you to count globally and that is why you wrote this patch, but don't other parts of the code expect and rely on the commits being counted per repository? In other words, with this change, who are you breaking to help the commit slab code? Cc'ing SZEDER who also have touched this and made the function static early last year. Thanks.
Junio C Hamano <gitster@pobox.com> writes: > Back when 14ba97f8 (alloc: allow arbitrary repositories for alloc > functions, 2018-05-15) made the count per repository (which you are > reverting with this patch), there must have been a reason why it did > so. We know that commit slab code wants you to count globally and > that is why you wrote this patch, but don't other parts of the code > expect and rely on the commits being counted per repository? In > other words, with this change, who are you breaking to help the > commit slab code? I did a bit more digging, as I had a bit of time after pushing today's integration result out, and it turns out that in the today's code, c->index is the only thing that uses this counter. It is set in init_commit_node() function, which is called from object.c when an in-core commit object is created. The callchain looks like this: object_as_type(struct repository *, struct object *, enum object_type, int quiet) -> init_commit_node(struct repository *, struct commit *) -> alloc_commit_index(struct repository *) What is interesting is that object_as_type() takes the parameter "struct repository *" ONLY BECAUSE it needs to pass something to init_commit_node(), which in turn takes it ONLY BECAUSE the alloc_commit_index() wants one to be passed. Since the ONLY reason why there needs a monotorically increasing counter of in-core commit objects is because we need to be able to index into the commit slab, and because we cannot make commit slabs per-repository due to the lack of backpointer from an in-core commit object to the repository it belongs to, reverting 14ba97f8 (alloc: allow arbitrary repositories for alloc functions, 2018-05-15) is a reasonable thing to do, but then since the only reason the above three functions in the callchain take "struct repository *" is because the bottom-most alloc_commit_index() wants it WHEN IT DOES NOT USE IT, it would be good to get rid of the "struct repository" parameter from the functions involved in the callchain altogether. Then we won't have to ask the "who are we breaking with this change" question. At that point it would be clear that everybody would be OK with a single counter to ensure uniqueness of in-core commit across all the in-core repository instances. Thanks.
Abhishek Kumar <abhishekkumar8222@gmail.com> writes: > static unsigned int alloc_commit_index(struct repository *r) > { > - return r->parsed_objects->commit_count++; > + static unsigned int parsed_commits_count = 0; > + r->parsed_objects->commit_count++; > + return parsed_commits_count++; > } I'll queue this as-is, together with the rest of the series, but with the following SQUASH??? to document why we are counting globally. I do not think r->parsed_objects->*_count is used by anybody, and we probably can eventually get rid of these stats fields, and when that happens, we may want to lose the "struct repository *" pointer from the functions in the callchain, but not right now. Thanks. alloc.c | 5 +++++ 1 file changed, 5 insertions(+) diff --git a/alloc.c b/alloc.c index 29f0e3aa80..ee92661b71 100644 --- a/alloc.c +++ b/alloc.c @@ -99,6 +99,11 @@ void *alloc_object_node(struct repository *r) return obj; } +/* + * The returned count is to be used as an index into commit slabs, + * that are *NOT* maintained per repository, and that is why a single + * global counter is used. + */ static unsigned int alloc_commit_index(struct repository *r) { static unsigned int parsed_commits_count = 0;
On 12.06.2020, Abhishek Kumar wrote: > Commit slab relies on uniqueness of commit->index to access data. As > submodules are repositories on their own, alloc_commit_index() (which > depends on repository->parsed_objects->commit_count) no longer > returns unique values. > > This would break tests once we move `generation` and `graph_pos` into a > commit slab, as commits of supermodule and submodule can have the same > index but must have different graph positions. First, commits of supermodule and of submodule are in different graphs, so I don't see why they have to be in the same serialized commit-graph file. Second, Git stores many different types of information on slab already. How comes that we have not had any problems till now? There is contains_cache, commit_seen, indegree_slab, author_date_slab, commit_base, commit_pos, bloom_filter_slab, buffer_slab, commit_rev_name, commit_names, commit_name_slab, saved_parents, blame_suspects, commit_todo_item. > > Let's introduce a counter variable, `parsed_commits_count` to keep track > of parsed commits so far. All right, thought it might be worth mentioning that it is a global variable, or rather a static variable in a function. > > > Signed-off-by: Abhishek Kumar <abhishekkumar8222@gmail.com> > --- > > CI Build for the failing tests: > https://travis-ci.com/github/abhishekkumar2718/git/jobs/345413840 The failed tests are, from what I see: - t4060-diff-submodule-option-diff-format.sh - t4041-diff-submodule-option.sh - t4059-diff-submodule-not-initialized.sh > > alloc.c | 4 +++- > 1 file changed, 3 insertions(+), 1 deletion(-) > > diff --git a/alloc.c b/alloc.c > index 1c64c4dd16..29f0e3aa80 100644 > --- a/alloc.c > +++ b/alloc.c > @@ -101,7 +101,9 @@ void *alloc_object_node(struct repository *r) > > static unsigned int alloc_commit_index(struct repository *r) > { > - return r->parsed_objects->commit_count++; > + static unsigned int parsed_commits_count = 0; > + r->parsed_objects->commit_count++; Do we use r->parsed_objects->commit_count anywhere? > + return parsed_commits_count++; Does it matter that it is not thread safe, because it is not atomic? Shouldn't it be + static _Atomic unsigned int parsed_commits_count = 0; or + static _Atomic unsigned int parsed_commits_count = ATOMIC_VAR_INIT(0); (If it is allowed). > } > > void init_commit_node(struct repository *r, struct commit *c) >
On Fri, Jun 12, 2020 at 03:52:26PM -0700, Junio C Hamano wrote: > Abhishek Kumar <abhishekkumar8222@gmail.com> writes: > > > static unsigned int alloc_commit_index(struct repository *r) > > { > > - return r->parsed_objects->commit_count++; > > + static unsigned int parsed_commits_count = 0; > > + r->parsed_objects->commit_count++; > > + return parsed_commits_count++; > > } > > I'll queue this as-is, together with the rest of the series, but > with the following SQUASH??? to document why we are counting > globally. > > I do not think r->parsed_objects->*_count is used by anybody, and we > probably can eventually get rid of these stats fields, and when that > happens, we may want to lose the "struct repository *" pointer from > the functions in the callchain, but not right now. > > Thanks. > > alloc.c | 5 +++++ > 1 file changed, 5 insertions(+) > > diff --git a/alloc.c b/alloc.c > index 29f0e3aa80..ee92661b71 100644 > --- a/alloc.c > +++ b/alloc.c > @@ -99,6 +99,11 @@ void *alloc_object_node(struct repository *r) > return obj; > } > > +/* > + * The returned count is to be used as an index into commit slabs, > + * that are *NOT* maintained per repository, and that is why a single > + * global counter is used. > + */ > static unsigned int alloc_commit_index(struct repository *r) > { > static unsigned int parsed_commits_count = 0; > -- > 2.27.0-90-geebb51ba8c > Thanks for taking a closer look. I have created an issue on gitgitgadget [1] to get rid of commit count and avoid passing repository to the functions anymore. [1]: https://github.com/gitgitgadget/git/issues/657 Abhishek
diff --git a/alloc.c b/alloc.c index 1c64c4dd16..29f0e3aa80 100644 --- a/alloc.c +++ b/alloc.c @@ -101,7 +101,9 @@ void *alloc_object_node(struct repository *r) static unsigned int alloc_commit_index(struct repository *r) { - return r->parsed_objects->commit_count++; + static unsigned int parsed_commits_count = 0; + r->parsed_objects->commit_count++; + return parsed_commits_count++; } void init_commit_node(struct repository *r, struct commit *c)
Commit slab relies on uniqueness of commit->index to access data. As submodules are repositories on their own, alloc_commit_index() (which depends on repository->parsed_objects->commit_count) no longer returns unique values. This would break tests once we move `generation` and `graph_pos` into a commit slab, as commits of supermodule and submodule can have the same index but must have different graph positions. Let's introduce a counter variable, `parsed_commits_count` to keep track of parsed commits so far. Signed-off-by: Abhishek Kumar <abhishekkumar8222@gmail.com> --- CI Build for the failing tests: https://travis-ci.com/github/abhishekkumar2718/git/jobs/345413840 alloc.c | 4 +++- 1 file changed, 3 insertions(+), 1 deletion(-)