Message ID | 694ef1ec08d9dc96a74a2631b2710ad206397dbc.1602079786.git.gitgitgadget@gmail.com (mailing list archive) |
---|---|
State | New, archived |
Headers | show |
Series | Implement Corrected Commit Date | expand |
"Abhishek Kumar via GitGitGadget" <gitgitgadget@gmail.com> writes: > From: Abhishek Kumar <abhishekkumar8222@gmail.com> > > With most of preparations done, let's implement corrected commit date. > > The corrected commit date for a commit is defined as: > > * A commit with no parents (a root commit) has corrected commit date > equal to its committer date. > * A commit with at least one parent has corrected commit date equal to > the maximum of its commit date and one more than the largest corrected > commit date among its parents. All right. We might want to say that it fulfills the same reachability criteria as topological level, but perhaps this level of detail is not necessary here. > As a special case, a root commit with timestamp of zero (01.01.1970 > 00:00:00Z) has corrected commit date of one, to be able to distinguish > from GENERATION_NUMBER_ZERO (that is, an uncomputed corrected commit > date). I'm not sure if this special case is really necessary, but it makes for cleaner reasoning. > To minimize the space required to store corrected commit date, Git > stores corrected commit date offsets into the commit-graph file. The > corrected commit date offset for a commit is defined as the difference > between its corrected commit date and actual commit date. > > Storing corrected commit date requires sizeof(timestamp_t) bytes, which > in most cases is 64 bits (uintmax_t). However, corrected commit date > offsets can be safely stored using only 32-bits. This halves the size > of GDAT chunk, which is a reduction of around 6% in the size of > commit-graph file. > > However, using offsets be problematic if one of commits is malformed but > valid and has committerdate of 0 Unix time, as the offset would be the > same as corrected commit date and thus require 64-bits to be stored > properly. > > While Git does not write out offsets at this stage, Git stores the > corrected commit dates in member generation of struct commit_graph_data. > It will begin writing commit date offsets with the introduction of > generation data chunk. All right. > > Signed-off-by: Abhishek Kumar <abhishekkumar8222@gmail.com> Somewhere in the commit message we should also describe that this commit changes how commit-graph is verified: from checking that the generation number agrees with _topological level definition_, that is that for a given commit it is 1 more than maximum of its parents (with the caveat that we need to handle GENERATION_NUMBER_V1_MAX values correctly), to checking that slightly weaker condition fulfilled by both topological levels (generation number v1) and by corrected commit date (generation number v2) that for a given commit its generation number is 1 more than maximum of its parents or larger. But, as far as I understand it, current code does not handle correctly GENERATION_NUMBER_V1_MAX case (if we use generation number v1). On the other hand we could have simpy use functional check, that generation number used (which can be v1 or v2, or any similar other) fulfills the reachability condition for each edge, which can be simplified to checking that generation(parents) <= generation(commit). If the reachability condition is true for each edge, then it is true for each path, and for each commit. > --- > commit-graph.c | 43 +++++++++++++++++++++++-------------------- > 1 file changed, 23 insertions(+), 20 deletions(-) > > diff --git a/commit-graph.c b/commit-graph.c > index cedd311024..03948adfce 100644 > --- a/commit-graph.c > +++ b/commit-graph.c > @@ -154,11 +154,6 @@ static int commit_gen_cmp(const void *va, const void *vb) > else if (generation_a > generation_b) > return 1; > > - /* use date as a heuristic when generations are equal */ > - if (a->date < b->date) > - return -1; > - else if (a->date > b->date) > - return 1; Why this change? It is not described in the commit message. Note that while this tie-breaking fallback doesn't make much sense for corrected committer date generation number v2, this tie-breaking helps if we have to use topological levels (generation number v2). > return 0; > } > > @@ -1357,10 +1352,14 @@ static void compute_generation_numbers(struct write_commit_graph_context *ctx) > ctx->commits.nr); > for (i = 0; i < ctx->commits.nr; i++) { > timestamp_t level = *topo_level_slab_at(ctx->topo_levels, ctx->commits.list[i]); Sidenote: I haven't noticed it earlier, but here 'uint32_t' might be enough; no need for 'timestamp_t' for 'level' variable. > + timestamp_t corrected_commit_date = commit_graph_data_at(ctx->commits.list[i])->generation; > All right, we compute both generation numbers: topological levels and corrected commit date. I guess we use 'corrected_commit_date' instead of simply 'generation' to make it asier to remember which is which. > display_progress(ctx->progress, i + 1); > if (level != GENERATION_NUMBER_INFINITY && > - level != GENERATION_NUMBER_ZERO) > + level != GENERATION_NUMBER_ZERO && > + corrected_commit_date != GENERATION_NUMBER_INFINITY && > + corrected_commit_date != GENERATION_NUMBER_ZERO Straightforward addition. > + ) Why this closing parenthesis is now in separated line? > continue; > > commit_list_insert(ctx->commits.list[i], &list); > @@ -1369,17 +1368,25 @@ static void compute_generation_numbers(struct write_commit_graph_context *ctx) > struct commit_list *parent; > int all_parents_computed = 1; > uint32_t max_level = 0; > + timestamp_t max_corrected_commit_date = 0; All right, straightforward addition. > > for (parent = current->parents; parent; parent = parent->next) { > level = *topo_level_slab_at(ctx->topo_levels, parent->item); > - Why we have removed this empty line? > + corrected_commit_date = commit_graph_data_at(parent->item)->generation; All right. > if (level == GENERATION_NUMBER_INFINITY || > - level == GENERATION_NUMBER_ZERO) { > + level == GENERATION_NUMBER_ZERO || > + corrected_commit_date == GENERATION_NUMBER_INFINITY || > + corrected_commit_date == GENERATION_NUMBER_ZERO > + ) { All right, same as above. > all_parents_computed = 0; > commit_list_insert(parent->item, &list); > break; > - } else if (level > max_level) { > - max_level = level; > + } else { > + if (level > max_level) > + max_level = level; > + > + if (corrected_commit_date > max_corrected_commit_date) > + max_corrected_commit_date = corrected_commit_date; > } All right, reasonable and straightforward. > } > > @@ -1389,6 +1396,10 @@ static void compute_generation_numbers(struct write_commit_graph_context *ctx) > if (max_level > GENERATION_NUMBER_V1_MAX - 1) > max_level = GENERATION_NUMBER_V1_MAX - 1; > *topo_level_slab_at(ctx->topo_levels, current) = max_level + 1; > + > + if (current->date && current->date > max_corrected_commit_date) > + max_corrected_commit_date = current->date - 1; > + commit_graph_data_at(current)->generation = max_corrected_commit_date + 1; All right. Here we use the same trick as in previous commit (and as above) to avoid any possible overflow, to minimize number of conditionals. The fact that max_corrected_commit_date might store incorrect value doesn't matter, as it is reset at beginning of this loop. > } > } > } > @@ -2485,17 +2496,9 @@ int verify_commit_graph(struct repository *r, struct commit_graph *g, int flags) > if (generation_zero == GENERATION_ZERO_EXISTS) > continue; > > - /* > - * If one of our parents has generation GENERATION_NUMBER_V1_MAX, then > - * our generation is also GENERATION_NUMBER_V1_MAX. Decrement to avoid > - * extra logic in the following condition. > - */ > - if (max_generation == GENERATION_NUMBER_V1_MAX) > - max_generation--; > - Perhaps in the future we should check that both topological levels, and also corrected committer date (if it exists) for correctness according to their definition. Then the above removed part would be restored (but with s/max_generation/max_level/). > generation = commit_graph_generation(graph_commit); > - if (generation != max_generation + 1) > - graph_report(_("commit-graph generation for commit %s is %u != %u"), > + if (generation < max_generation + 1) > + graph_report(_("commit-graph generation for commit %s is %"PRItime" < %"PRItime), All right, so we relaxed the check so that it will be fulfilled by generation number v2 (and also by generation number v1, as it implies the more strict check for v1). What would happen however if generation holds topological levels, and it is GENERATION_NUMBER_V1_MAX for at least one parent, which means it is GENERATION_NUMBER_V1_MAX for a commit? As you can check, the condition would be true: GENERATION_NUMBER_V1_MAX < GENERATION_NUMBER_V1_MAX + 1, so the `git commit-graph verify` would incorrectly say that there is a problem with generation number, while there isn't one (false positive detection of error). Sidenote: I think we don't have to worry about having to introduce GENERATION_NUMBER_V2_MAX, as the in-memory size (of reconstructed from disck representation) corrected commiter date is the same as of commiter date itself, plus some, and I don't see us coming close to 64-bit limit of timestamp_t for commit dates. > oid_to_hex(&cur_oid), > generation, > max_generation + 1); Best,
On Tue, Oct 27, 2020 at 07:53:23PM +0100, Jakub Narębski wrote: > "Abhishek Kumar via GitGitGadget" <gitgitgadget@gmail.com> writes: > > > From: Abhishek Kumar <abhishekkumar8222@gmail.com> > > ... > > Signed-off-by: Abhishek Kumar <abhishekkumar8222@gmail.com> > > Somewhere in the commit message we should also describe that this commit > changes how commit-graph is verified: from checking that the generation > number agrees with _topological level definition_, that is that for a > given commit it is 1 more than maximum of its parents (with the caveat > that we need to handle GENERATION_NUMBER_V1_MAX values correctly), to > checking that slightly weaker condition fulfilled by both topological > levels (generation number v1) and by corrected commit date (generation > number v2) that for a given commit its generation number is 1 more than > maximum of its parents or larger. Sure, that makes sense. Will add. > > But, as far as I understand it, current code does not handle correctly > GENERATION_NUMBER_V1_MAX case (if we use generation number v1). > > On the other hand we could have simpy use functional check, that > generation number used (which can be v1 or v2, or any similar other) > fulfills the reachability condition for each edge, which can be > simplified to checking that generation(parents) <= generation(commit). > If the reachability condition is true for each edge, then it is true for > each path, and for each commit. > > > --- > > commit-graph.c | 43 +++++++++++++++++++++++-------------------- > > 1 file changed, 23 insertions(+), 20 deletions(-) > > > > diff --git a/commit-graph.c b/commit-graph.c > > index cedd311024..03948adfce 100644 > > --- a/commit-graph.c > > +++ b/commit-graph.c > > @@ -154,11 +154,6 @@ static int commit_gen_cmp(const void *va, const void *vb) > > else if (generation_a > generation_b) > > return 1; > > > > - /* use date as a heuristic when generations are equal */ > > - if (a->date < b->date) > > - return -1; > > - else if (a->date > b->date) > > - return 1; > > Why this change? It is not described in the commit message. > > Note that while this tie-breaking fallback doesn't make much sense for > corrected committer date generation number v2, this tie-breaking helps > if we have to use topological levels (generation number v2). > Right, I should have mentioned this change (and it's not something that makes a difference either way). We call commit_gen_cmp() only when we are sorting commits by generation to speed up computation of Bloom filters i.e. while writing a commit graph (either split commit-graph or a simple commit-graph). Since we are always computing and storing corrected commit date when we are writing (whether we write a GDAT chunk or not), using date as heuristic is longer required. > > return 0; > > } > > > > @@ -1357,10 +1352,14 @@ static void compute_generation_numbers(struct write_commit_graph_context *ctx) > > ctx->commits.nr); > > for (i = 0; i < ctx->commits.nr; i++) { > > timestamp_t level = *topo_level_slab_at(ctx->topo_levels, ctx->commits.list[i]); > > Sidenote: I haven't noticed it earlier, but here 'uint32_t' might be > enough; no need for 'timestamp_t' for 'level' variable. > > > + timestamp_t corrected_commit_date = commit_graph_data_at(ctx->commits.list[i])->generation; > > We need the 'timestamp_t' as we are comparing level with the now 64-bits GENERATION_NUMBER_INFINITY. I thought uint32_t would be promoted to timestamp_t. I have a hunch that since we are explicitly using a fixed width data type, compiler is unwilling to type coerce into broader data types. Advice on this appreciated. > > All right, we compute both generation numbers: topological levels and > corrected commit date. > > I guess we use 'corrected_commit_date' instead of simply 'generation' to > make it asier to remember which is which. > > > display_progress(ctx->progress, i + 1); > > if (level != GENERATION_NUMBER_INFINITY && > > - level != GENERATION_NUMBER_ZERO) > > + level != GENERATION_NUMBER_ZERO && > > + corrected_commit_date != GENERATION_NUMBER_INFINITY && > > + corrected_commit_date != GENERATION_NUMBER_ZERO > > Straightforward addition. > > > + ) > > Why this closing parenthesis is now in separated line? > > > continue; > > > > commit_list_insert(ctx->commits.list[i], &list); > > @@ -1369,17 +1368,25 @@ static void compute_generation_numbers(struct write_commit_graph_context *ctx) > > struct commit_list *parent; > > int all_parents_computed = 1; > > uint32_t max_level = 0; > > + timestamp_t max_corrected_commit_date = 0; > > All right, straightforward addition. > > > > > for (parent = current->parents; parent; parent = parent->next) { > > level = *topo_level_slab_at(ctx->topo_levels, parent->item); > > - > > Why we have removed this empty line? > > > + corrected_commit_date = commit_graph_data_at(parent->item)->generation; > > All right. > > > if (level == GENERATION_NUMBER_INFINITY || > > - level == GENERATION_NUMBER_ZERO) { > > + level == GENERATION_NUMBER_ZERO || > > + corrected_commit_date == GENERATION_NUMBER_INFINITY || > > + corrected_commit_date == GENERATION_NUMBER_ZERO > > + ) { > > All right, same as above. > > > all_parents_computed = 0; > > commit_list_insert(parent->item, &list); > > break; > > - } else if (level > max_level) { > > - max_level = level; > > + } else { > > + if (level > max_level) > > + max_level = level; > > + > > + if (corrected_commit_date > max_corrected_commit_date) > > + max_corrected_commit_date = corrected_commit_date; > > } > > All right, reasonable and straightforward. > > > } > > > > @@ -1389,6 +1396,10 @@ static void compute_generation_numbers(struct write_commit_graph_context *ctx) > > if (max_level > GENERATION_NUMBER_V1_MAX - 1) > > max_level = GENERATION_NUMBER_V1_MAX - 1; > > *topo_level_slab_at(ctx->topo_levels, current) = max_level + 1; > > + > > + if (current->date && current->date > max_corrected_commit_date) > > + max_corrected_commit_date = current->date - 1; > > + commit_graph_data_at(current)->generation = max_corrected_commit_date + 1; > > All right. > > Here we use the same trick as in previous commit (and as above) to avoid > any possible overflow, to minimize number of conditionals. The fact > that max_corrected_commit_date might store incorrect value doesn't > matter, as it is reset at beginning of this loop. > > > } > > } > > } > > @@ -2485,17 +2496,9 @@ int verify_commit_graph(struct repository *r, struct commit_graph *g, int flags) > > if (generation_zero == GENERATION_ZERO_EXISTS) > > continue; > > > > - /* > > - * If one of our parents has generation GENERATION_NUMBER_V1_MAX, then > > - * our generation is also GENERATION_NUMBER_V1_MAX. Decrement to avoid > > - * extra logic in the following condition. > > - */ > > - if (max_generation == GENERATION_NUMBER_V1_MAX) > > - max_generation--; > > - > > Perhaps in the future we should check that both topological levels, and > also corrected committer date (if it exists) for correctness according > to their definition. Then the above removed part would be restored (but > with s/max_generation/max_level/). > > > generation = commit_graph_generation(graph_commit); > > - if (generation != max_generation + 1) > > - graph_report(_("commit-graph generation for commit %s is %u != %u"), > > + if (generation < max_generation + 1) > > + graph_report(_("commit-graph generation for commit %s is %"PRItime" < %"PRItime), > > All right, so we relaxed the check so that it will be fulfilled by > generation number v2 (and also by generation number v1, as it implies > the more strict check for v1). > > What would happen however if generation holds topological levels, and it > is GENERATION_NUMBER_V1_MAX for at least one parent, which means it is > GENERATION_NUMBER_V1_MAX for a commit? As you can check, the condition > would be true: GENERATION_NUMBER_V1_MAX < GENERATION_NUMBER_V1_MAX + 1, > so the `git commit-graph verify` would incorrectly say that there is > a problem with generation number, while there isn't one (false positive > detection of error). Alright, so the above block still makes sense if we are working with topological levels but not with corrected commit dates. Instead of removing it, I will modify the condition to check that one of our parents has GENERATION_NUMBER_V1_MAX and the graph uses topological levels. Suprised that no test breaks by this change. I have also moved changes in the verify function to the next patch, as we cannot write or read corrected commit dates yet - so little sense in modifying verify. > > Sidenote: I think we don't have to worry about having to introduce > GENERATION_NUMBER_V2_MAX, as the in-memory size (of reconstructed from > disck representation) corrected commiter date is the same as of commiter > date itself, plus some, and I don't see us coming close to 64-bit limit > of timestamp_t for commit dates. > > > oid_to_hex(&cur_oid), > > generation, > > max_generation + 1); > > Best, > -- > Jakub Narębski Thanks - Abhishek
Hello Abhishek, Abhishek Kumar <abhishekkumar8222@gmail.com> writes: > On Tue, Oct 27, 2020 at 07:53:23PM +0100, Jakub Narębski wrote: >> "Abhishek Kumar via GitGitGadget" <gitgitgadget@gmail.com> writes: >> >>> From: Abhishek Kumar <abhishekkumar8222@gmail.com> >>> ... >>> Signed-off-by: Abhishek Kumar <abhishekkumar8222@gmail.com> >> >> Somewhere in the commit message we should also describe that this commit >> changes how commit-graph is verified: from checking that the generation >> number agrees with _topological level definition_, that is that for a >> given commit it is 1 more than maximum of its parents (with the caveat >> that we need to handle GENERATION_NUMBER_V1_MAX values correctly), to >> checking that slightly weaker condition fulfilled by both topological >> levels (generation number v1) and by corrected commit date (generation >> number v2) that for a given commit its generation number is 1 more than >> maximum of its parents or larger. > > Sure, that makes sense. Will add. Actually this description should match whatever we decide about mechanism for verifying correctness of generation numbers (see below). Because we have to choose one. >> >> But, as far as I understand it, current code does not handle correctly >> GENERATION_NUMBER_V1_MAX case (if we use generation number v1). >> >> On the other hand we could have simpy use functional check, that >> generation number used (which can be v1 or v2, or any similar other) >> fulfills the reachability condition for each edge, which can be >> simplified to checking that generation(parents) <= generation(commit). >> If the reachability condition is true for each edge, then it is true for >> each path, and for each commit. See below. >>> --- >>> commit-graph.c | 43 +++++++++++++++++++++++-------------------- >>> 1 file changed, 23 insertions(+), 20 deletions(-) >>> >>> diff --git a/commit-graph.c b/commit-graph.c >>> index cedd311024..03948adfce 100644 >>> --- a/commit-graph.c >>> +++ b/commit-graph.c >>> @@ -154,11 +154,6 @@ static int commit_gen_cmp(const void *va, const void *vb) >>> else if (generation_a > generation_b) >>> return 1; >>> >>> - /* use date as a heuristic when generations are equal */ >>> - if (a->date < b->date) >>> - return -1; >>> - else if (a->date > b->date) >>> - return 1; >> >> Why this change? It is not described in the commit message. >> >> Note that while this tie-breaking fallback doesn't make much sense for >> corrected committer date generation number v2, this tie-breaking helps >> if we have to use topological levels (generation number v2). >> > > Right, I should have mentioned this change (and it's not something that > makes a difference either way). > > We call commit_gen_cmp() only when we are sorting commits by generation > to speed up computation of Bloom filters i.e. while writing a commit > graph (either split commit-graph or a simple commit-graph). > > Since we are always computing and storing corrected commit date when we > are writing (whether we write a GDAT chunk or not), using date as > heuristic is longer required. Thanks. This description really should be added to the commit message, because (yet again?) I was confused by this change. Sidenote: it is not obvious at least to me that this function is used only for sorting commits to speed up computation of Bloom filters while writing the commit-graph (`git commit-graph write --changed-paths [other options]`). >>> return 0; >>> } >>> >>> @@ -1357,10 +1352,14 @@ static void compute_generation_numbers(struct write_commit_graph_context *ctx) >>> ctx->commits.nr); >>> for (i = 0; i < ctx->commits.nr; i++) { >>> timestamp_t level = *topo_level_slab_at(ctx->topo_levels, ctx->commits.list[i]); >> >> Sidenote: I haven't noticed it earlier, but here 'uint32_t' might be >> enough; no need for 'timestamp_t' for 'level' variable. >> >>> + timestamp_t corrected_commit_date = commit_graph_data_at(ctx->commits.list[i])->generation; >>> > > We need the 'timestamp_t' as we are comparing level with the now 64-bits > GENERATION_NUMBER_INFINITY. I thought uint32_t would be promoted to > timestamp_t. I have a hunch that since we are explicitly using a fixed > width data type, compiler is unwilling to type coerce into broader data > types. > > Advice on this appreciated. All right, so the wider type is used because of comparison with wide-uint GENERATION_NUMBER_INFINITY. I stand corrected. [...] >>> @@ -2485,17 +2496,9 @@ int verify_commit_graph(struct repository *r, struct commit_graph *g, int flags) >>> if (generation_zero == GENERATION_ZERO_EXISTS) >>> continue; >>> >>> - /* >>> - * If one of our parents has generation GENERATION_NUMBER_V1_MAX, then >>> - * our generation is also GENERATION_NUMBER_V1_MAX. Decrement to avoid >>> - * extra logic in the following condition. >>> - */ >>> - if (max_generation == GENERATION_NUMBER_V1_MAX) >>> - max_generation--; >>> - >> >> Perhaps in the future we should check that both topological levels, and >> also corrected committer date (if it exists) for correctness according >> to their definition. Then the above removed part would be restored (but >> with s/max_generation/max_level/). >> >>> generation = commit_graph_generation(graph_commit); >>> - if (generation != max_generation + 1) >>> - graph_report(_("commit-graph generation for commit %s is %u != %u"), >>> + if (generation < max_generation + 1) >>> + graph_report(_("commit-graph generation for commit %s is %"PRItime" < %"PRItime), >> >> All right, so we relaxed the check so that it will be fulfilled by >> generation number v2 (and also by generation number v1, as it implies >> the more strict check for v1). >> >> What would happen however if generation holds topological levels, and it >> is GENERATION_NUMBER_V1_MAX for at least one parent, which means it is >> GENERATION_NUMBER_V1_MAX for a commit? As you can check, the condition >> would be true: GENERATION_NUMBER_V1_MAX < GENERATION_NUMBER_V1_MAX + 1, >> so the `git commit-graph verify` would incorrectly say that there is >> a problem with generation number, while there isn't one (false positive >> detection of error). > > Alright, so the above block still makes sense if we are working with > topological levels but not with corrected commit dates. Instead of > removing it, I will modify the condition to check that one of our parents > has GENERATION_NUMBER_V1_MAX and the graph uses topological levels. That is one of the 3 possible solutions I can think of. I. First solution is to switch from checking that generation number matches its definition to checking that the [weaker] reachability condition for the generation number is true, that is: if (generation < max_generation) graph_report(_("commit-graph generation for commit %s is %"PRItime" < %"PRItime), The [weaker] reachability condition for generation numbers states that A reachable from B => gen(A) <= gen(B) This condition is true even if one or more generation numbers is GENERATION_NUMBER_ZERO (uninitialized or written by old git version), GENERATION_NUMBER_V1_MAX (we hit storage limitations, can happen only for generation number v1), or GENERATION_NUMBER_INFINITY (for commits outside of the serialized commit-graph, doesn't matter and cannot happen during verification of the commit-graph data by definition). This means that if P* is the parent of C with the maximal generation number, and gen(C) < gen(P*) is true (while gen(P*) <= gen(C) should be true), then there is a problem with generation number. This is why I thought you were going for, and what I have proposed. Advantages: - we are testing what actually matters for speeding up reachability queries, namely that the reachability property holds true - the test works for generation number v1, generation number v2, and any possible future use-compatibile generation number (not that I think we would need any) - least complicated solution Disadvantages: - weaker test that we have had for generation number v1 (topological levels), and weaker that possible test for generation number v2 that we could have (see below) II. Verify corrected committed date (generation number v2) if available, and verify topological levels (generation number v1) otherwise, checking that it matches the definition of it -- using version-specific checks. This would probably mean adding a conditional around the code verifying that given generation number is correct, possibly: if (g->read_generation_data) { /* verify corrected commit date */ } else { /* current code for verifying topological levels */ } II.a. For topological levels (generation number v1) we would continue checking that it matches the definition, that is that the following condition holds: gen(C) = max_{P: P ∈ parents(C)} gen(P) + 1 This includes code for handling the case where `max_generation`, holding max_{P: P ∈ parents(C)} gen(P), is GENERATION_NUMBER_V1_MAX. II.b. For corrected commiter dates (generation number v2) we can use the code proposed by this revision of this commit, namely we check if the following condition holds: gen(P) + 1 <= gen(C) for each P \in parents(C) or, in other words: max_{P: P ∈ parents(C)} { gen(P) } + 1 <= gen(C) Which could be checked using the following code (i.e. current state after this revision of this patch): if (generation < max_generation + 1) graph_report(_("commit-graph generation for commit %s is %"PRItime" < %"PRItime), This is what I think you are proposing now. Additionally, theoretically we could also check that the following condition holds for corrected commiter date: committer_date(C) <= gen_v2(C) but this is automatically fufilled because we use non-negative offsets to store corrected committed date info. Alternatively we can check for compliance with the definition of the corrected committer date: if (max_generation + 1 <= graph_commit->date) { /* commit date does not need correction */ if (generation != graph_commit->date) graph_report(_("commit-graph corrected commit date for commit %s " "is %"PRItime" != %"PRItime" commit date"), ...); } else { if (generation != max_generation + 1) graph_report(_("commit-graph generation v2 for commit %s is %"PRItime" != %"PRItime), ...); } Though I think it might be overkill. Advantages: - more strict tests, checking generation numbers (v2 if present, v1 otherwise) against their definition - if there is no GDAT chunk, verify works just like it did before Disadvantages: - more complicated code - possibly measurable performance degradation due to extra conditional III. Like II., but if there is generation numbers chunk (GDAT chunk), we verify *both* topological levels (v1) and corrected commit date (v2) against their definition. If GDAT chunk is not present, it reduces to current code (before this patch series). Advantages: - if there is no GDAT chunk, verify works just like it did before - most strict tests, verifying all the data: both generation number v1 and generation number v2 -- if possible Disadvantages: - most complex code; we need to somehow extract topological levels if the GDAT chunk is present (they are not on graph data slab in this case); I have not even started to think how it could be done - slower verification > Suprised that no test breaks by this change. I don't whink we have any test that created commit graph with topological levels greater than GENERATION_NUMBER_V1_MAX; this would be expensive and have to be of course protected by GIT_TEST_LONG aka EXPENSIVE prerequisite. # GIT_TEST_COMMIT_GRAPH_NO_GDAT=1 is here to force verification of topological levels test_expect_success EXPENSIVE 'verify handles topological levels > GENERATION_NUMBER_V1_MAX' ' rm -rf long_chain && git init long_chain && test_commit_bulk -C long_chain 1073741824 && ( cd long_chain && GIT_TEST_COMMIT_GRAPH_NO_GDAT=1 git commit-graph write && git commit-graph verify ) ' This however lies slightly outside the scope of this patch series, though if you could add this test (in a separate patch), after testing it, it would be very nice. > > I have also moved changes in the verify function to the next patch, as > we cannot write or read corrected commit dates yet - so little sense in > modifying verify. I think putting changes to the verify function in a separate patch, be it before or after this one (depending on the choice of the algorithm for verification, see above) would be a good idea. >> >> Sidenote: I think we don't have to worry about having to introduce >> GENERATION_NUMBER_V2_MAX, as the in-memory size (of reconstructed from >> disck representation) corrected commiter date is the same as of commiter >> date itself, plus some, and I don't see us coming close to 64-bit limit >> of timestamp_t for commit dates. >> >>> oid_to_hex(&cur_oid), >>> generation, >>> max_generation + 1); Best,
Hi Abhishek, On 04/11/2020 16:45, Jakub Narębski wrote: > Hello Abhishek, > > Abhishek Kumar <abhishekkumar8222@gmail.com> writes: >> On Tue, Oct 27, 2020 at 07:53:23PM +0100, Jakub Narębski wrote: >>> "Abhishek Kumar via GitGitGadget" <gitgitgadget@gmail.com> writes: >>> >>>> From: Abhishek Kumar <abhishekkumar8222@gmail.com> >>>> ... >>>> Signed-off-by: Abhishek Kumar <abhishekkumar8222@gmail.com> >>> Somewhere in the commit message we should also describe that this commit >>> changes how commit-graph is verified: from checking that the generation >>> number agrees with _topological level definition_, that is that for a >>> given commit it is 1 more than maximum of its parents (with the caveat >>> that we need to handle GENERATION_NUMBER_V1_MAX values correctly), to >>> checking that slightly weaker condition fulfilled by both topological >>> levels (generation number v1) and by corrected commit date (generation >>> number v2) that for a given commit its generation number is 1 more than >>> maximum of its parents or larger. >> Sure, that makes sense. Will add. > Actually this description should match whatever we decide about > mechanism for verifying correctness of generation numbers (see below). > Because we have to choose one. This may be not part of the the main project, but could you consider, if time permits, also adding some entries into the Git Glossary (`git help glossary`) for the various terms we are using here and elsewhere, e.g. 'topological levels', 'generation number', 'corrected commit date' (and its fancy technical name for the use of date heuristics e.g. the 'chronological ordering';). The glossary can provide a reference, once the issues are resolved. The History Simplification and Commit Ordering section of git-log maybe a useful guide to some of the terms that would link to the glossary. -- Philip >>> But, as far as I understand it, current code does not handle correctly >>> GENERATION_NUMBER_V1_MAX case (if we use generation number v1). >>> >>> On the other hand we could have simpy use functional check, that >>> generation number used (which can be v1 or v2, or any similar other) >>> fulfills the reachability condition for each edge, which can be >>> simplified to checking that generation(parents) <= generation(commit). >>> If the reachability condition is true for each edge, then it is true for >>> each path, and for each commit. > See below. > >>>> --- >>>> commit-graph.c | 43 +++++++++++++++++++++++-------------------- >>>> 1 file changed, 23 insertions(+), 20 deletions(-) >>>> >>>> diff --git a/commit-graph.c b/commit-graph.c >>>> index cedd311024..03948adfce 100644 >>>> --- a/commit-graph.c >>>> +++ b/commit-graph.c >>>> @@ -154,11 +154,6 @@ static int commit_gen_cmp(const void *va, const void *vb) >>>> else if (generation_a > generation_b) >>>> return 1; >>>> >>>> - /* use date as a heuristic when generations are equal */ >>>> - if (a->date < b->date) >>>> - return -1; >>>> - else if (a->date > b->date) >>>> - return 1; >>> Why this change? It is not described in the commit message. >>> >>> Note that while this tie-breaking fallback doesn't make much sense for >>> corrected committer date generation number v2, this tie-breaking helps >>> if we have to use topological levels (generation number v2). >>> >> Right, I should have mentioned this change (and it's not something that >> makes a difference either way). >> >> We call commit_gen_cmp() only when we are sorting commits by generation >> to speed up computation of Bloom filters i.e. while writing a commit >> graph (either split commit-graph or a simple commit-graph). >> >> Since we are always computing and storing corrected commit date when we >> are writing (whether we write a GDAT chunk or not), using date as >> heuristic is longer required. > Thanks. This description really should be added to the commit message, > because (yet again?) I was confused by this change. > > Sidenote: it is not obvious at least to me that this function is used > only for sorting commits to speed up computation of Bloom filters while > writing the commit-graph (`git commit-graph write --changed-paths [other > options]`). > >>>> return 0; >>>> } >>>> >>>> @@ -1357,10 +1352,14 @@ static void compute_generation_numbers(struct write_commit_graph_context *ctx) >>>> ctx->commits.nr); >>>> for (i = 0; i < ctx->commits.nr; i++) { >>>> timestamp_t level = *topo_level_slab_at(ctx->topo_levels, ctx->commits.list[i]); >>> Sidenote: I haven't noticed it earlier, but here 'uint32_t' might be >>> enough; no need for 'timestamp_t' for 'level' variable. >>> >>>> + timestamp_t corrected_commit_date = commit_graph_data_at(ctx->commits.list[i])->generation; >>>> >> We need the 'timestamp_t' as we are comparing level with the now 64-bits >> GENERATION_NUMBER_INFINITY. I thought uint32_t would be promoted to >> timestamp_t. I have a hunch that since we are explicitly using a fixed >> width data type, compiler is unwilling to type coerce into broader data >> types. >> >> Advice on this appreciated. > All right, so the wider type is used because of comparison with > wide-uint GENERATION_NUMBER_INFINITY. I stand corrected. > > [...] >>>> @@ -2485,17 +2496,9 @@ int verify_commit_graph(struct repository *r, struct commit_graph *g, int flags) >>>> if (generation_zero == GENERATION_ZERO_EXISTS) >>>> continue; >>>> >>>> - /* >>>> - * If one of our parents has generation GENERATION_NUMBER_V1_MAX, then >>>> - * our generation is also GENERATION_NUMBER_V1_MAX. Decrement to avoid >>>> - * extra logic in the following condition. >>>> - */ >>>> - if (max_generation == GENERATION_NUMBER_V1_MAX) >>>> - max_generation--; >>>> - >>> Perhaps in the future we should check that both topological levels, and >>> also corrected committer date (if it exists) for correctness according >>> to their definition. Then the above removed part would be restored (but >>> with s/max_generation/max_level/). >>> >>>> generation = commit_graph_generation(graph_commit); >>>> - if (generation != max_generation + 1) >>>> - graph_report(_("commit-graph generation for commit %s is %u != %u"), >>>> + if (generation < max_generation + 1) >>>> + graph_report(_("commit-graph generation for commit %s is %"PRItime" < %"PRItime), >>> All right, so we relaxed the check so that it will be fulfilled by >>> generation number v2 (and also by generation number v1, as it implies >>> the more strict check for v1). >>> >>> What would happen however if generation holds topological levels, and it >>> is GENERATION_NUMBER_V1_MAX for at least one parent, which means it is >>> GENERATION_NUMBER_V1_MAX for a commit? As you can check, the condition >>> would be true: GENERATION_NUMBER_V1_MAX < GENERATION_NUMBER_V1_MAX + 1, >>> so the `git commit-graph verify` would incorrectly say that there is >>> a problem with generation number, while there isn't one (false positive >>> detection of error). >> Alright, so the above block still makes sense if we are working with >> topological levels but not with corrected commit dates. Instead of >> removing it, I will modify the condition to check that one of our parents >> has GENERATION_NUMBER_V1_MAX and the graph uses topological levels. > That is one of the 3 possible solutions I can think of. > > > I. First solution is to switch from checking that generation number > matches its definition to checking that the [weaker] reachability > condition for the generation number is true, that is: > > if (generation < max_generation) > graph_report(_("commit-graph generation for commit %s is %"PRItime" < %"PRItime), > > The [weaker] reachability condition for generation numbers states that > > A reachable from B => gen(A) <= gen(B) > > This condition is true even if one or more generation numbers is > GENERATION_NUMBER_ZERO (uninitialized or written by old git version), > GENERATION_NUMBER_V1_MAX (we hit storage limitations, can happen only > for generation number v1), or GENERATION_NUMBER_INFINITY (for commits > outside of the serialized commit-graph, doesn't matter and cannot happen > during verification of the commit-graph data by definition). > > This means that if P* is the parent of C with the maximal generation > number, and gen(C) < gen(P*) is true (while gen(P*) <= gen(C) should be > true), then there is a problem with generation number. > > This is why I thought you were going for, and what I have proposed. > > Advantages: > - we are testing what actually matters for speeding up reachability > queries, namely that the reachability property holds true > - the test works for generation number v1, generation number v2, > and any possible future use-compatibile generation number > (not that I think we would need any) > - least complicated solution > > Disadvantages: > - weaker test that we have had for generation number v1 (topological > levels), and weaker that possible test for generation number v2 > that we could have (see below) > > > II. Verify corrected committed date (generation number v2) if available, > and verify topological levels (generation number v1) otherwise, checking > that it matches the definition of it -- using version-specific checks. > > This would probably mean adding a conditional around the code verifying > that given generation number is correct, possibly: > > if (g->read_generation_data) { > /* verify corrected commit date */ > } else { > /* current code for verifying topological levels */ > } > > II.a. For topological levels (generation number v1) we would continue > checking that it matches the definition, that is that the following > condition holds: > > gen(C) = max_{P: P ∈ parents(C)} gen(P) + 1 > > This includes code for handling the case where `max_generation`, holding > max_{P: P ∈ parents(C)} gen(P), is GENERATION_NUMBER_V1_MAX. > > II.b. For corrected commiter dates (generation number v2) we can use the > code proposed by this revision of this commit, namely we check if the > following condition holds: > > gen(P) + 1 <= gen(C) for each P \in parents(C) > > or, in other words: > > max_{P: P ∈ parents(C)} { gen(P) } + 1 <= gen(C) > > Which could be checked using the following code (i.e. current state > after this revision of this patch): > > if (generation < max_generation + 1) > graph_report(_("commit-graph generation for commit %s is %"PRItime" < %"PRItime), > > This is what I think you are proposing now. > > Additionally, theoretically we could also check that the following > condition holds for corrected commiter date: > > committer_date(C) <= gen_v2(C) > > but this is automatically fufilled because we use non-negative offsets > to store corrected committed date info. > > Alternatively we can check for compliance with the definition of the > corrected committer date: > > if (max_generation + 1 <= graph_commit->date) { > /* commit date does not need correction */ > if (generation != graph_commit->date) > graph_report(_("commit-graph corrected commit date for commit %s " > "is %"PRItime" != %"PRItime" commit date"), > ...); > } else { > if (generation != max_generation + 1) > graph_report(_("commit-graph generation v2 for commit %s is %"PRItime" != %"PRItime), > ...); > } > > Though I think it might be overkill. > > Advantages: > - more strict tests, checking generation numbers (v2 if present, v1 > otherwise) against their definition > - if there is no GDAT chunk, verify works just like it did before > > Disadvantages: > - more complicated code > - possibly measurable performance degradation due to extra conditional > > > III. Like II., but if there is generation numbers chunk (GDAT chunk), we > verify *both* topological levels (v1) and corrected commit date (v2) > against their definition. If GDAT chunk is not present, it reduces to > current code (before this patch series). > > Advantages: > - if there is no GDAT chunk, verify works just like it did before > - most strict tests, verifying all the data: both generation number v1 > and generation number v2 -- if possible > > Disadvantages: > - most complex code; we need to somehow extract topological levels > if the GDAT chunk is present (they are not on graph data slab in this > case); I have not even started to think how it could be done > - slower verification > >> Suprised that no test breaks by this change. > I don't whink we have any test that created commit graph with > topological levels greater than GENERATION_NUMBER_V1_MAX; this would be > expensive and have to be of course protected by GIT_TEST_LONG aka > EXPENSIVE prerequisite. > > # GIT_TEST_COMMIT_GRAPH_NO_GDAT=1 is here to force verification of topological levels > test_expect_success EXPENSIVE 'verify handles topological levels > GENERATION_NUMBER_V1_MAX' ' > rm -rf long_chain && > git init long_chain && > test_commit_bulk -C long_chain 1073741824 && > ( > cd long_chain && > GIT_TEST_COMMIT_GRAPH_NO_GDAT=1 git commit-graph write && > git commit-graph verify > ) > ' > > This however lies slightly outside the scope of this patch series, > though if you could add this test (in a separate patch), after testing > it, it would be very nice. > >> I have also moved changes in the verify function to the next patch, as >> we cannot write or read corrected commit dates yet - so little sense in >> modifying verify. > I think putting changes to the verify function in a separate patch, be > it before or after this one (depending on the choice of the algorithm > for verification, see above) would be a good idea. > >>> Sidenote: I think we don't have to worry about having to introduce >>> GENERATION_NUMBER_V2_MAX, as the in-memory size (of reconstructed from >>> disck representation) corrected commiter date is the same as of commiter >>> date itself, plus some, and I don't see us coming close to 64-bit limit >>> of timestamp_t for commit dates. >>> >>>> oid_to_hex(&cur_oid), >>>> generation, >>>> max_generation + 1); > Best,
Philip Oakley <philipoakley@iee.email> writes: > This may be not part of the the main project, but could you consider, if > time permits, also adding some entries into the Git Glossary (`git help > glossary`) for the various terms we are using here and elsewhere, e.g. > 'topological levels', 'generation number', 'corrected commit date' (and > its fancy technical name for the use of date heuristics e.g. the > 'chronological ordering';). > > The glossary can provide a reference, once the issues are resolved. The > History Simplification and Commit Ordering section of git-log maybe a > useful guide to some of the terms that would link to the glossary. Ah, I first thought that Documentation/rev-list-options.txt (which is the relevant part of "git log" documentation you mention here) already have references to deep technical terms explained in the glossary and you are suggesting Abhishek to mimic the arrangement by adding new and agreed-upon terms to the glossary and referring to them from the commit-graph documentation updated by this series. But sadly that is not the case. What you are saying is that you noticed that rev-list-options.txt needs a similar "the terms we use to explain these two sections should be defined and explained in the glossary (if they are not) and new references to glossary should be added there" update. In any case, that is a very good suggestion. I agree that updating "git log" doc may be outside the scope of Abhishek's theme, but it would be very good to have such an update by anybody ;-) Thanks
Junio C Hamano <gitster@pobox.com> writes: > Philip Oakley <philipoakley@iee.email> writes: > >> This may be not part of the the main project, but could you consider, if >> time permits, also adding some entries into the Git Glossary (`git help >> glossary`) for the various terms we are using here and elsewhere, e.g. >> 'topological levels', 'generation number', 'corrected commit date' (and >> its fancy technical name for the use of date heuristics e.g. the >> 'chronological ordering';). >> >> The glossary can provide a reference, once the issues are resolved. The >> History Simplification and Commit Ordering section of git-log maybe a >> useful guide to some of the terms that would link to the glossary. > > Ah, I first thought that Documentation/rev-list-options.txt (which > is the relevant part of "git log" documentation you mention here) > already have references to deep technical terms explained in the > glossary and you are suggesting Abhishek to mimic the arrangement by > adding new and agreed-upon terms to the glossary and referring to > them from the commit-graph documentation updated by this series. > > But sadly that is not the case. What you are saying is that you > noticed that rev-list-options.txt needs a similar "the terms we use > to explain these two sections should be defined and explained in the > glossary (if they are not) and new references to glossary should be > added there" update. > > In any case, that is a very good suggestion. I agree that updating > "git log" doc may be outside the scope of Abhishek's theme, but it > would be very good to have such an update by anybody ;-) The only possible problem I see with this suggestion is that some of those terms (like 'topological levels' and 'corrected commit date') are technical terms that should be not of concern for Git user, only for developers working on Git. (However one could encounter the term "generation number" in `git commit-graph verify` output.) I don't think adding technical terms that the user won't encounter in the documentation or among messages that Git outputs would be not a good idea. It could confuse users, rather than help them. Conversely, perhaps we should add Documentation/technical/glossary.txt to help developers. P.S. By the way, when looking at Documentation/glossary-content.txt, I have noticed few obsolescent entries, like "Git archive", few that have description that soon could be or is obsolete and would need updating, like "master" (when default branch switch to "main"), or "object identifier" and "SHA-1" (when Git switches away from SHA-1 as hash function). Best, -- Jakub Narębski
Jakub Narębski <jnareb@gmail.com> writes: > I don't think adding technical terms that the user won't encounter in > the documentation or among messages that Git outputs would be not a good > idea. It could confuse users, rather than help them. > > Conversely, perhaps we should add Documentation/technical/glossary.txt > to help developers. Thanks for a thoughtful suggestion to help the target audience. I agree 100% with the above two paragraphs.
Hi Jakub, On 06/11/2020 18:26, Jakub Narębski wrote: > Junio C Hamano <gitster@pobox.com> writes: >> Philip Oakley <philipoakley@iee.email> writes: >> >>> This may be not part of the the main project, but could you consider, if >>> time permits, also adding some entries into the Git Glossary (`git help >>> glossary`) for the various terms we are using here and elsewhere, e.g. >>> 'topological levels', 'generation number', 'corrected commit date' (and >>> its fancy technical name for the use of date heuristics e.g. the >>> 'chronological ordering';). >>> >>> The glossary can provide a reference, once the issues are resolved. The >>> History Simplification and Commit Ordering section of git-log maybe a >>> useful guide to some of the terms that would link to the glossary. >> Ah, I first thought that Documentation/rev-list-options.txt (which >> is the relevant part of "git log" documentation you mention here) >> already have references to deep technical terms explained in the >> glossary and you are suggesting Abhishek to mimic the arrangement by >> adding new and agreed-upon terms to the glossary and referring to >> them from the commit-graph documentation updated by this series. >> >> But sadly that is not the case. What you are saying is that you >> noticed that rev-list-options.txt needs a similar "the terms we use >> to explain these two sections should be defined and explained in the >> glossary (if they are not) and new references to glossary should be >> added there" update. >> >> In any case, that is a very good suggestion. I agree that updating >> "git log" doc may be outside the scope of Abhishek's theme, but it >> would be very good to have such an update by anybody ;-) > The only possible problem I see with this suggestion is that some of > those terms (like 'topological levels' and 'corrected commit date') are > technical terms that should be not of concern for Git user, only for > developers working on Git. (However one could encounter the term > "generation number" in `git commit-graph verify` output.) However we do mention "topolog*" in a number of the manual pages, and rather less, as yet, in the technical pages. "Lexicographic" and "chronological" are in the same group of fancy technical words ;-) > > I don't think adding technical terms that the user won't encounter in > the documentation or among messages that Git outputs would be not a good > idea. It could confuse users, rather than help them. > > Conversely, perhaps we should add Documentation/technical/glossary.txt > to help developers. I would agree that the Glossary probably ought to be split into the primary, secondary and background terms so that the core concepts are separated from the academic/developer style terms. Git does rip up most of what folks think about version "control", usually based on the imperfect replication of physical artefacts. > > P.S. By the way, when looking at Documentation/glossary-content.txt, I > have noticed few obsolescent entries, like "Git archive", few that have > description that soon could be or is obsolete and would need updating, > like "master" (when default branch switch to "main"), or "object > identifier" and "SHA-1" (when Git switches away from SHA-1 as hash > function). The obsolescent items can be updated. I'm expecting that the 'main' and 'SHA-' changes will eventually be picked up as part of the respective patch series, hopefully as part of the global replacements. -- Philip
Hello Philip, Philip Oakley <philipoakley@iee.email> writes: > On 06/11/2020 18:26, Jakub Narębski wrote: >> Junio C Hamano <gitster@pobox.com> writes: >>> Philip Oakley <philipoakley@iee.email> writes: >>> >>>> This may be not part of the the main project, but could you consider, if >>>> time permits, also adding some entries into the Git Glossary (`git help >>>> glossary`) for the various terms we are using here and elsewhere, e.g. >>>> 'topological levels', 'generation number', 'corrected commit date' (and >>>> its fancy technical name for the use of date heuristics e.g. the >>>> 'chronological ordering';). >>>> >>>> The glossary can provide a reference, once the issues are resolved. The >>>> History Simplification and Commit Ordering section of git-log maybe a >>>> useful guide to some of the terms that would link to the glossary. >>> >>> Ah, I first thought that Documentation/rev-list-options.txt (which >>> is the relevant part of "git log" documentation you mention here) >>> already have references to deep technical terms explained in the >>> glossary and you are suggesting Abhishek to mimic the arrangement by >>> adding new and agreed-upon terms to the glossary and referring to >>> them from the commit-graph documentation updated by this series. >>> >>> But sadly that is not the case. What you are saying is that you >>> noticed that rev-list-options.txt needs a similar "the terms we use >>> to explain these two sections should be defined and explained in the >>> glossary (if they are not) and new references to glossary should be >>> added there" update. What terms you feel need glossary entry? >>> In any case, that is a very good suggestion. I agree that updating >>> "git log" doc may be outside the scope of Abhishek's theme, but it >>> would be very good to have such an update by anybody ;-) >> >> The only possible problem I see with this suggestion is that some of >> those terms (like 'topological levels' and 'corrected commit date') are >> technical terms that should be not of concern for Git user, only for >> developers working on Git. (However one could encounter the term >> "generation number" in `git commit-graph verify` output.) To be more precise, I think that user-facing glossary should include only terms that appear in user-facing documentation and in output messages of Git commands (with the possible exception of maybe output messages of some low-level plumbing). I think that the developer-facing glossary should include terms that appear in technical documentation, and in commit messages in Git history. > However we do mention "topolog*" in a number of the manual pages, and > rather less, as yet, in the technical pages. > > "Lexicographic" and "chronological" are in the same group of fancy > technical words ;-) I think that 'topological level' would appear only in technical documentation; if it would be the case then there is no reason to add it to user-facing glossary (to gitglossary manpage). 'Topological order' or 'topological sort', 'lexicographical order' and 'chronological order' are not Git-specific terms, and there are no Git-specific ambiguities. I am therefore a bit unsure about adding them to *Git* glossary. - In computer science, a _topological sort_ or _topological_ ordering of a directed graph is a linear ordering of its vertices such that for every directed edge uv from vertex u to vertex v, u comes before v in the ordering. For Git it means that top to bottom, commits always appear before their parents. With `--graph` or `--topo-order` Git also avoids showing commits on multiple lines of history intermixed. - In mathematics, the _lexicographic_ or _lexicographical order_ (also known as lexical order, dictionary order, etc.) is a generalization of the alphabetical order. For Git it is simply alphabetical order. - _Chronological order_ is the arrangement of things following one after another in time; or in other words date order. Note that `git log --date-order` commits also always appear before their parents, but otherwise commits are shown in the commit timestamp order (committer date order) >> >> I don't think adding technical terms that the user won't encounter in >> the documentation or among messages that Git outputs would be not a good >> idea. It could confuse users, rather than help them. >> >> Conversely, perhaps we should add Documentation/technical/glossary.txt >> to help developers. > > I would agree that the Glossary probably ought to be split into the > primary, secondary and background terms so that the core concepts are > separated from the academic/developer style terms. I don't thing we need three separate layers; in my opinion separating terms that user of Git might encounter from terms that somebody working on developing Git may encounter would be enough. The technical glossary / dictionary could also help onboarding... > > Git does rip up most of what folks think about version "control", > usually based on the imperfect replication of physical artefacts. I don't quite understand what you wanted to say there. Could you explain in more detail, please? >> P.S. By the way, when looking at Documentation/glossary-content.txt, I >> have noticed few obsolescent entries, like "Git archive", few that have >> description that soon could be or is obsolete and would need updating, >> like "master" (when default branch switch to "main"), or "object >> identifier" and "SHA-1" (when Git switches away from SHA-1 as hash >> function). > > The obsolescent items can be updated. I'm expecting that the 'main' and > 'SHA-' changes will eventually be picked up as part of the respective > patch series, hopefully as part of the global replacements. Here I meant that "Git archive" entry is not important anymore, as I think there are no active users of GNU arch version control system (no "arch people"); arch's last release was in 2006, and its replacement, Bazaar (or 'bzr') doesn't use this term. So I think it can be safely removed in 2020, after 14 years after last release of arch. In most cases "SHA-1" in the descriptions of terms in glossary should be replaced by "object identifier" (to be more generic). This can be safely done before switch to NewHash is ready and announced. Best,
Hi Jakub, On 10/11/2020 01:35, Jakub Narębski wrote: > Hello Philip, > > Philip Oakley <philipoakley@iee.email> writes: >> On 06/11/2020 18:26, Jakub Narębski wrote: >>> Junio C Hamano <gitster@pobox.com> writes: >>>> Philip Oakley <philipoakley@iee.email> writes: >>>> >>>>> This may be not part of the the main project, but could you consider, if >>>>> time permits, also adding some entries into the Git Glossary (`git help >>>>> glossary`) for the various terms we are using here and elsewhere, e.g. >>>>> 'topological levels', 'generation number', 'corrected commit date' (and >>>>> its fancy technical name for the use of date heuristics e.g. the >>>>> 'chronological ordering';). >>>>> >>>>> The glossary can provide a reference, once the issues are resolved. The >>>>> History Simplification and Commit Ordering section of git-log maybe a >>>>> useful guide to some of the terms that would link to the glossary. >>>> Ah, I first thought that Documentation/rev-list-options.txt (which >>>> is the relevant part of "git log" documentation you mention here) >>>> already have references to deep technical terms explained in the >>>> glossary and you are suggesting Abhishek to mimic the arrangement by >>>> adding new and agreed-upon terms to the glossary and referring to >>>> them from the commit-graph documentation updated by this series. >>>> >>>> But sadly that is not the case. What you are saying is that you >>>> noticed that rev-list-options.txt needs a similar "the terms we use >>>> to explain these two sections should be defined and explained in the >>>> glossary (if they are not) and new references to glossary should be >>>> added there" update. > What terms you feel need glossary entry? While it was Junio that made the comment, I'd agree that we should be using the glossary to explain, in a general sense, the terms that are used is a specialist sense. As the user community expands, their natural understanding of some of the terms diminishes. > >>>> In any case, that is a very good suggestion. I agree that updating >>>> "git log" doc may be outside the scope of Abhishek's theme, but it >>>> would be very good to have such an update by anybody ;-) >>> The only possible problem I see with this suggestion is that some of >>> those terms (like 'topological levels' and 'corrected commit date') are >>> technical terms that should be not of concern for Git user, only for >>> developers working on Git. (However one could encounter the term >>> "generation number" in `git commit-graph verify` output.) > To be more precise, I think that user-facing glossary should include > only terms that appear in user-facing documentation and in output > messages of Git commands (with the possible exception of maybe output > messages of some low-level plumbing). And where implied, the underlying concepts when they aren't obvious, or lack general terms (e.g. the 'staging area' discussions) > > I think that the developer-facing glossary should include terms that > appear in technical documentation, and in commit messages in Git > history. > >> However we do mention "topolog*" in a number of the manual pages, and >> rather less, as yet, in the technical pages. >> >> "Lexicographic" and "chronological" are in the same group of fancy >> technical words ;-) > I think that 'topological level' would appear only in technical > documentation; if it would be the case then there is no reason to add it > to user-facing glossary (to gitglossary manpage). > > 'Topological order' or 'topological sort', 'lexicographical order' and > 'chronological order' are not Git-specific terms, and there are no > Git-specific ambiguities. I am therefore a bit unsure about adding them > to *Git* glossary. It is that they aren't terms used in normal speech, so many folks do not comprehend the implied precision that the docs assume, nor the problems they may hide. > > - In computer science, a _topological sort_ or _topological_ ordering of > a directed graph is a linear ordering of its vertices such that for > every directed edge uv from vertex u to vertex v, u comes before v in > the ordering. Does this imply that those who aren't computer scientists shouldn't be using Git? > > For Git it means that top to bottom, commits always appear before > their parents. With `--graph` or `--topo-order` Git also avoids > showing commits on multiple lines of history intermixed. > > - In mathematics, the _lexicographic_ or _lexicographical order_ (also > known as lexical order, dictionary order, etc.) is a generalization of > the alphabetical order. > > For Git it is simply alphabetical order. ASCII order, Case sensitivity, Special characters, etc. > > - _Chronological order_ is the arrangement of things following one after > another in time; or in other words date order. Given that most résumés (the thing most folk see that asks for date order) is latest first, does this clarify which way chronological is? (I see this regularly in my other volunteer work). > > Note that `git log --date-order` commits also always appear before > their parents, but otherwise commits are shown in the commit timestamp > order (committer date order) > >>> I don't think adding technical terms that the user won't encounter in >>> the documentation or among messages that Git outputs would be not a good >>> idea. It could confuse users, rather than help them. >>> >>> Conversely, perhaps we should add Documentation/technical/glossary.txt >>> to help developers. >> I would agree that the Glossary probably ought to be split into the >> primary, secondary and background terms so that the core concepts are >> separated from the academic/developer style terms. > I don't thing we need three separate layers; in my opinion separating > terms that user of Git might encounter from terms that somebody working > on developing Git may encounter would be enough. > > The technical glossary / dictionary could also help onboarding... > >> Git does rip up most of what folks think about version "control", >> usually based on the imperfect replication of physical artefacts. > I don't quite understand what you wanted to say there. Could you > explain in more detail, please? Background, I see Git & Version Control from an engineers view point, rather than developers view. In the "real" world there are no perfect copies, we serialise key items so that we can track their degradation, and replace them when required. We attempt to "Control" what is happening. Our documentation and monitoring systems have layers of control to ensure only suitably qualified persons may access and inspect critical items, can record and access previous status reports, etc. There is only one "Mona Lisa", with critical access controls, even though there are 'copies' https://en.wikipedia.org/wiki/Mona_Lisa#Early_versions_and_copies. Almost all of our terminology for configuration control comes from the 'real' world, i.e. pre-modern computing. Git turns all that on its head. We can make perfect duplicates (they're not copies, not replicas..). The Object name is immutable. It's either right or wrong (exempt the SHAttered sha-1 breakage; were moving to sha-256). Git does *not* provide any access control. It supports the 'software freedoms' by distributing the control to the user. The repository is a version storage system, and the OIDs allow easy authentication between folks that they are looking at the same object, and all its implied descendants. Git has ripped up classical 'real' world version control. In many areas we need new or alternative terms, and documents that explain them to screen writers(*) and the many other non CS-major users of Git (and some engineers;-) (*) there's a diff pattern for them, IIRC, or at least one was proposed. > >>> P.S. By the way, when looking at Documentation/glossary-content.txt, I >>> have noticed few obsolescent entries, like "Git archive", few that have >>> description that soon could be or is obsolete and would need updating, >>> like "master" (when default branch switch to "main"), or "object >>> identifier" and "SHA-1" (when Git switches away from SHA-1 as hash >>> function). >> The obsolescent items can be updated. I'm expecting that the 'main' and >> 'SHA-' changes will eventually be picked up as part of the respective >> patch series, hopefully as part of the global replacements. > Here I meant that "Git archive" entry is not important anymore, as I > think there are no active users of GNU arch version control system (no > "arch people"); arch's last release was in 2006, and its replacement, > Bazaar (or 'bzr') doesn't use this term. So I think it can be safely > removed in 2020, after 14 years after last release of arch. > > In most cases "SHA-1" in the descriptions of terms in glossary should be > replaced by "object identifier" (to be more generic). This can be > safely done before switch to NewHash is ready and announced. > > Best, -- Philip
Hello Philip, Philip Oakley <philipoakley@iee.email> writes: > On 10/11/2020 01:35, Jakub Narębski wrote: >> Philip Oakley <philipoakley@iee.email> writes: >>> On 06/11/2020 18:26, Jakub Narębski wrote: >>>> Junio C Hamano <gitster@pobox.com> writes: >>>>> Philip Oakley <philipoakley@iee.email> writes: >>>>> >>>>>> This may be not part of the the main project, but could you consider, if >>>>>> time permits, also adding some entries into the Git Glossary (`git help >>>>>> glossary`) for the various terms we are using here and elsewhere, e.g. >>>>>> 'topological levels', 'generation number', 'corrected commit date' (and >>>>>> its fancy technical name for the use of date heuristics e.g. the >>>>>> 'chronological ordering';). >>>>>> >>>>>> The glossary can provide a reference, once the issues are resolved. The >>>>>> History Simplification and Commit Ordering section of git-log maybe a >>>>>> useful guide to some of the terms that would link to the glossary. [...] >> What terms you feel need glossary entry? > > While it was Junio that made the comment, I'd agree that we should be > using the glossary to explain, in a general sense, the terms that are > used is a specialist sense. As the user community expands, their natural > understanding of some of the terms diminishes. I was hoping for a list of terms from the abovementioned sections of git-log manpage you feel need entry in gitglosary(7). [...] >> To be more precise, I think that user-facing glossary should include >> only terms that appear in user-facing documentation and in output >> messages of Git commands (with the possible exception of maybe output >> messages of some low-level plumbing). > > And where implied, the underlying concepts when they aren't obvious, or > lack general terms (e.g. the 'staging area' discussions) True, 'staging area' should IMVHO be in glossary (replacing or in addition to older less specific term 'index', previous name for 'staging area' term). >> I think that the developer-facing glossary should include terms that >> appear in technical documentation, and in commit messages in Git >> history. Such as 'topological levels', 'commit slab' / 'on the slab', etc. >>> However we do mention "topolog*" in a number of the manual pages, and >>> rather less, as yet, in the technical pages. >>> >>> "Lexicographic" and "chronological" are in the same group of fancy >>> technical words ;-) >> >> I think that 'topological level' would appear only in technical >> documentation; if it would be the case then there is no reason to add it >> to user-facing glossary (to gitglossary manpage). >> >> 'Topological order' or 'topological sort', 'lexicographical order' and >> 'chronological order' are not Git-specific terms, and there are no >> Git-specific ambiguities. I am therefore a bit unsure about adding them >> to *Git* glossary. > > It is that they aren't terms used in normal speech, so many folks do not > comprehend the implied precision that the docs assume, nor the problems > they may hide. Right. >> - In computer science, a _topological sort_ or _topological_ ordering of >> a directed graph is a linear ordering of its vertices such that for >> every directed edge uv from vertex u to vertex v, u comes before v in >> the ordering. > > Does this imply that those who aren't computer scientists shouldn't be > using Git? I think that in most cases where we refer to topological order in the documentation we describe it there. It might be good idea to add it to the glossary, especially because Git uses it often in a very specific sense. On the other hand, should we define 'topology' or 'graph' as well? Or 'glossary' ;-) ? Those don't have any special meaning in Git, and can be as well found in the dictionary or Wikipedia. >> For Git it means that top to bottom, commits always appear before >> their parents. With `--graph` or `--topo-order` Git also avoids >> showing commits on multiple lines of history intermixed. >> >> - In mathematics, the _lexicographic_ or _lexicographical order_ (also >> known as lexical order, dictionary order, etc.) is a generalization of >> the alphabetical order. >> >> For Git it is simply alphabetical order. > > ASCII order, Case sensitivity, Special characters, etc. Actually I don't know. Let me check: the only place this term appears in the documentation is in git-tag(1) manpage and related documentation. It simplly uses strcmp(), or strcasecmp() when using `--ignore-case` option; so by default case sensitive. It looks like it does not take locale-specific rules. >> - _Chronological order_ is the arrangement of things following one after >> another in time; or in other words date order. > > Given that most résumés (the thing most folk see that asks for date > order) is latest first, does this clarify which way chronological is? (I > see this regularly in my other volunteer work). Right, it might be not obvious at first glance that Git outputs most recent commits first, that is newest commits are on top. Though if you think about it in more detail, it is the only ordering that makes sense, especially for projects with a long history; first, it is newest commits that are most interesting, and second Git always walks the history from child to parent. >> Note that `git log --date-order` commits also always appear before >> their parents, but otherwise commits are shown in the commit timestamp >> order (committer date order) [...] >>> Git does rip up most of what folks think about version "control", >>> usually based on the imperfect replication of physical artefacts. >> >> I don't quite understand what you wanted to say there. Could you >> explain in more detail, please? > > Background, I see Git & Version Control from an engineers view point, > rather than developers view. > > In the "real" world there are no perfect copies, we serialise key items > so that we can track their degradation, and replace them when required. > We attempt to "Control" what is happening. Our documentation and > monitoring systems have layers of control to ensure only suitably > qualified persons may access and inspect critical items, can record and > access previous status reports, etc. There is only one "Mona Lisa", with > critical access controls, even though there are 'copies' > https://en.wikipedia.org/wiki/Mona_Lisa#Early_versions_and_copies. > Almost all of our terminology for configuration control comes from the > 'real' world, i.e. pre-modern computing. > > Git turns all that on its head. We can make perfect duplicates (they're > not copies, not replicas..). The Object name is immutable. It's either > right or wrong (exempt the SHAttered sha-1 breakage; were moving to > sha-256). Git does *not* provide any access control. It supports the > 'software freedoms' by distributing the control to the user. The > repository is a version storage system, and the OIDs allow easy > authentication between folks that they are looking at the same object, > and all its implied descendants. > > Git has ripped up classical 'real' world version control. In many areas > we need new or alternative terms, and documents that explain them to > screen writers(*) and the many other non CS-major users of Git (and some > engineers;-) > > (*) there's a diff pattern for them, IIRC, or at least one was proposed. Right, though for me the concept of 'version control' was by default always about the digital, usually the source code. There are different editions of books, changes to non-digital technical drawings and plans (AFAIK often in the form of physical foil overlays as subsequent layers, if done well; overdrawing on the same layer if not), amendment and changes to laws, etc. Anyway, the question is what level of knowledge can we assume from the average Git user -- this would affect the spread of terms that should be considered for the Git glossary. Best,
diff --git a/commit-graph.c b/commit-graph.c index cedd311024..03948adfce 100644 --- a/commit-graph.c +++ b/commit-graph.c @@ -154,11 +154,6 @@ static int commit_gen_cmp(const void *va, const void *vb) else if (generation_a > generation_b) return 1; - /* use date as a heuristic when generations are equal */ - if (a->date < b->date) - return -1; - else if (a->date > b->date) - return 1; return 0; } @@ -1357,10 +1352,14 @@ static void compute_generation_numbers(struct write_commit_graph_context *ctx) ctx->commits.nr); for (i = 0; i < ctx->commits.nr; i++) { timestamp_t level = *topo_level_slab_at(ctx->topo_levels, ctx->commits.list[i]); + timestamp_t corrected_commit_date = commit_graph_data_at(ctx->commits.list[i])->generation; display_progress(ctx->progress, i + 1); if (level != GENERATION_NUMBER_INFINITY && - level != GENERATION_NUMBER_ZERO) + level != GENERATION_NUMBER_ZERO && + corrected_commit_date != GENERATION_NUMBER_INFINITY && + corrected_commit_date != GENERATION_NUMBER_ZERO + ) continue; commit_list_insert(ctx->commits.list[i], &list); @@ -1369,17 +1368,25 @@ static void compute_generation_numbers(struct write_commit_graph_context *ctx) struct commit_list *parent; int all_parents_computed = 1; uint32_t max_level = 0; + timestamp_t max_corrected_commit_date = 0; for (parent = current->parents; parent; parent = parent->next) { level = *topo_level_slab_at(ctx->topo_levels, parent->item); - + corrected_commit_date = commit_graph_data_at(parent->item)->generation; if (level == GENERATION_NUMBER_INFINITY || - level == GENERATION_NUMBER_ZERO) { + level == GENERATION_NUMBER_ZERO || + corrected_commit_date == GENERATION_NUMBER_INFINITY || + corrected_commit_date == GENERATION_NUMBER_ZERO + ) { all_parents_computed = 0; commit_list_insert(parent->item, &list); break; - } else if (level > max_level) { - max_level = level; + } else { + if (level > max_level) + max_level = level; + + if (corrected_commit_date > max_corrected_commit_date) + max_corrected_commit_date = corrected_commit_date; } } @@ -1389,6 +1396,10 @@ static void compute_generation_numbers(struct write_commit_graph_context *ctx) if (max_level > GENERATION_NUMBER_V1_MAX - 1) max_level = GENERATION_NUMBER_V1_MAX - 1; *topo_level_slab_at(ctx->topo_levels, current) = max_level + 1; + + if (current->date && current->date > max_corrected_commit_date) + max_corrected_commit_date = current->date - 1; + commit_graph_data_at(current)->generation = max_corrected_commit_date + 1; } } } @@ -2485,17 +2496,9 @@ int verify_commit_graph(struct repository *r, struct commit_graph *g, int flags) if (generation_zero == GENERATION_ZERO_EXISTS) continue; - /* - * If one of our parents has generation GENERATION_NUMBER_V1_MAX, then - * our generation is also GENERATION_NUMBER_V1_MAX. Decrement to avoid - * extra logic in the following condition. - */ - if (max_generation == GENERATION_NUMBER_V1_MAX) - max_generation--; - generation = commit_graph_generation(graph_commit); - if (generation != max_generation + 1) - graph_report(_("commit-graph generation for commit %s is %u != %u"), + if (generation < max_generation + 1) + graph_report(_("commit-graph generation for commit %s is %"PRItime" < %"PRItime), oid_to_hex(&cur_oid), generation, max_generation + 1);