diff mbox series

[v4,06/10] commit-graph: implement corrected commit date

Message ID 694ef1ec08d9dc96a74a2631b2710ad206397dbc.1602079786.git.gitgitgadget@gmail.com (mailing list archive)
State New, archived
Headers show
Series Implement Corrected Commit Date | expand

Commit Message

Philippe Blain via GitGitGadget Oct. 7, 2020, 2:09 p.m. UTC
From: Abhishek Kumar <abhishekkumar8222@gmail.com>

With most of preparations done, let's implement corrected commit date.

The corrected commit date for a commit is defined as:

* A commit with no parents (a root commit) has corrected commit date
  equal to its committer date.
* A commit with at least one parent has corrected commit date equal to
  the maximum of its commit date and one more than the largest corrected
  commit date among its parents.

As a special case, a root commit with timestamp of zero (01.01.1970
00:00:00Z) has corrected commit date of one, to be able to distinguish
from GENERATION_NUMBER_ZERO (that is, an uncomputed corrected commit
date).

To minimize the space required to store corrected commit date, Git
stores corrected commit date offsets into the commit-graph file. The
corrected commit date offset for a commit is defined as the difference
between its corrected commit date and actual commit date.

Storing corrected commit date requires sizeof(timestamp_t) bytes, which
in most cases is 64 bits (uintmax_t). However, corrected commit date
offsets can be safely stored using only 32-bits. This halves the size
of GDAT chunk, which is a reduction of around 6% in the size of
commit-graph file.

However, using offsets be problematic if one of commits is malformed but
valid and has committerdate of 0 Unix time, as the offset would be the
same as corrected commit date and thus require 64-bits to be stored
properly.

While Git does not write out offsets at this stage, Git stores the
corrected commit dates in member generation of struct commit_graph_data.
It will begin writing commit date offsets with the introduction of
generation data chunk.

Signed-off-by: Abhishek Kumar <abhishekkumar8222@gmail.com>
---
 commit-graph.c | 43 +++++++++++++++++++++++--------------------
 1 file changed, 23 insertions(+), 20 deletions(-)

Comments

Jakub Narębski Oct. 27, 2020, 6:53 p.m. UTC | #1
"Abhishek Kumar via GitGitGadget" <gitgitgadget@gmail.com> writes:

> From: Abhishek Kumar <abhishekkumar8222@gmail.com>
>
> With most of preparations done, let's implement corrected commit date.
>
> The corrected commit date for a commit is defined as:
>
> * A commit with no parents (a root commit) has corrected commit date
>   equal to its committer date.
> * A commit with at least one parent has corrected commit date equal to
>   the maximum of its commit date and one more than the largest corrected
>   commit date among its parents.

All right.  We might want to say that it fulfills the same reachability
criteria as topological level, but perhaps this level of detail is not
necessary here.

> As a special case, a root commit with timestamp of zero (01.01.1970
> 00:00:00Z) has corrected commit date of one, to be able to distinguish
> from GENERATION_NUMBER_ZERO (that is, an uncomputed corrected commit
> date).

I'm not sure if this special case is really necessary, but it makes for
cleaner reasoning.

> To minimize the space required to store corrected commit date, Git
> stores corrected commit date offsets into the commit-graph file. The
> corrected commit date offset for a commit is defined as the difference
> between its corrected commit date and actual commit date.
>
> Storing corrected commit date requires sizeof(timestamp_t) bytes, which
> in most cases is 64 bits (uintmax_t). However, corrected commit date
> offsets can be safely stored using only 32-bits. This halves the size
> of GDAT chunk, which is a reduction of around 6% in the size of
> commit-graph file.
>
> However, using offsets be problematic if one of commits is malformed but
> valid and has committerdate of 0 Unix time, as the offset would be the
> same as corrected commit date and thus require 64-bits to be stored
> properly.
>
> While Git does not write out offsets at this stage, Git stores the
> corrected commit dates in member generation of struct commit_graph_data.
> It will begin writing commit date offsets with the introduction of
> generation data chunk.

All right.

>
> Signed-off-by: Abhishek Kumar <abhishekkumar8222@gmail.com>

Somewhere in the commit message we should also describe that this commit
changes how commit-graph is verified: from checking that the generation
number agrees with _topological level definition_, that is that for a
given commit it is 1 more than maximum of its parents (with the caveat
that we need to handle GENERATION_NUMBER_V1_MAX values correctly), to
checking that slightly weaker condition fulfilled by both topological
levels (generation number v1) and by corrected commit date (generation
number v2) that for a given commit its generation number is 1 more than
maximum of its parents or larger.

But, as far as I understand it, current code does not handle correctly
GENERATION_NUMBER_V1_MAX case (if we use generation number v1).

On the other hand we could have simpy use functional check, that
generation number used (which can be v1 or v2, or any similar other)
fulfills the reachability condition for each edge, which can be
simplified to checking that generation(parents) <= generation(commit).
If the reachability condition is true for each edge, then it is true for
each path, and for each commit.

> ---
>  commit-graph.c | 43 +++++++++++++++++++++++--------------------
>  1 file changed, 23 insertions(+), 20 deletions(-)
>
> diff --git a/commit-graph.c b/commit-graph.c
> index cedd311024..03948adfce 100644
> --- a/commit-graph.c
> +++ b/commit-graph.c
> @@ -154,11 +154,6 @@ static int commit_gen_cmp(const void *va, const void *vb)
>  	else if (generation_a > generation_b)
>  		return 1;
>  
> -	/* use date as a heuristic when generations are equal */
> -	if (a->date < b->date)
> -		return -1;
> -	else if (a->date > b->date)
> -		return 1;

Why this change?  It is not described in the commit message.

Note that while this tie-breaking fallback doesn't make much sense for
corrected committer date generation number v2, this tie-breaking helps
if we have to use topological levels (generation number v2).

>  	return 0;
>  }
>  
> @@ -1357,10 +1352,14 @@ static void compute_generation_numbers(struct write_commit_graph_context *ctx)
>  					ctx->commits.nr);
>  	for (i = 0; i < ctx->commits.nr; i++) {
>  		timestamp_t level = *topo_level_slab_at(ctx->topo_levels, ctx->commits.list[i]);

Sidenote: I haven't noticed it earlier, but here 'uint32_t' might be
enough; no need for 'timestamp_t' for 'level' variable.

> +		timestamp_t corrected_commit_date = commit_graph_data_at(ctx->commits.list[i])->generation;
>

All right, we compute both generation numbers: topological levels and
corrected commit date.

I guess we use 'corrected_commit_date' instead of simply 'generation' to
make it asier to remember which is which.

>  		display_progress(ctx->progress, i + 1);
>  		if (level != GENERATION_NUMBER_INFINITY &&
> -		    level != GENERATION_NUMBER_ZERO)
> +		    level != GENERATION_NUMBER_ZERO &&
> +		    corrected_commit_date != GENERATION_NUMBER_INFINITY &&
> +		    corrected_commit_date != GENERATION_NUMBER_ZERO

Straightforward addition.

> +		    )

Why this closing parenthesis is now in separated line?

>  			continue;
>  
>  		commit_list_insert(ctx->commits.list[i], &list);
> @@ -1369,17 +1368,25 @@ static void compute_generation_numbers(struct write_commit_graph_context *ctx)
>  			struct commit_list *parent;
>  			int all_parents_computed = 1;
>  			uint32_t max_level = 0;
> +			timestamp_t max_corrected_commit_date = 0;

All right, straightforward addition.

>  
>  			for (parent = current->parents; parent; parent = parent->next) {
>  				level = *topo_level_slab_at(ctx->topo_levels, parent->item);
> -

Why we have removed this empty line?

> +				corrected_commit_date = commit_graph_data_at(parent->item)->generation;

All right.

>  				if (level == GENERATION_NUMBER_INFINITY ||
> -				    level == GENERATION_NUMBER_ZERO) {
> +				    level == GENERATION_NUMBER_ZERO ||
> +				    corrected_commit_date == GENERATION_NUMBER_INFINITY ||
> +				    corrected_commit_date == GENERATION_NUMBER_ZERO
> +				    ) {

All right, same as above.

>  					all_parents_computed = 0;
>  					commit_list_insert(parent->item, &list);
>  					break;
> -				} else if (level > max_level) {
> -					max_level = level;
> +				} else {
> +					if (level > max_level)
> +						max_level = level;
> +
> +					if (corrected_commit_date > max_corrected_commit_date)
> +						max_corrected_commit_date = corrected_commit_date;
>  				}

All right, reasonable and straightforward.

>  			}
>  
> @@ -1389,6 +1396,10 @@ static void compute_generation_numbers(struct write_commit_graph_context *ctx)
>  				if (max_level > GENERATION_NUMBER_V1_MAX - 1)
>  					max_level = GENERATION_NUMBER_V1_MAX - 1;
>  				*topo_level_slab_at(ctx->topo_levels, current) = max_level + 1;
> +
> +				if (current->date && current->date > max_corrected_commit_date)
> +					max_corrected_commit_date = current->date - 1;
> +				commit_graph_data_at(current)->generation = max_corrected_commit_date + 1;

All right.

Here we use the same trick as in previous commit (and as above) to avoid
any possible overflow, to minimize number of conditionals.  The fact
that max_corrected_commit_date might store incorrect value doesn't
matter, as it is reset at beginning of this loop.

>  			}
>  		}
>  	}
> @@ -2485,17 +2496,9 @@ int verify_commit_graph(struct repository *r, struct commit_graph *g, int flags)
>  		if (generation_zero == GENERATION_ZERO_EXISTS)
>  			continue;
>  
> -		/*
> -		 * If one of our parents has generation GENERATION_NUMBER_V1_MAX, then
> -		 * our generation is also GENERATION_NUMBER_V1_MAX. Decrement to avoid
> -		 * extra logic in the following condition.
> -		 */
> -		if (max_generation == GENERATION_NUMBER_V1_MAX)
> -			max_generation--;
> -

Perhaps in the future we should check that both topological levels, and
also corrected committer date (if it exists) for correctness according
to their definition.  Then the above removed part would be restored (but
with s/max_generation/max_level/).

>  		generation = commit_graph_generation(graph_commit);
> -		if (generation != max_generation + 1)
> -			graph_report(_("commit-graph generation for commit %s is %u != %u"),
> +		if (generation < max_generation + 1)
> +			graph_report(_("commit-graph generation for commit %s is %"PRItime" < %"PRItime),

All right, so we relaxed the check so that it will be fulfilled by
generation number v2 (and also by generation number v1, as it implies
the more strict check for v1).

What would happen however if generation holds topological levels, and it
is GENERATION_NUMBER_V1_MAX for at least one parent, which means it is
GENERATION_NUMBER_V1_MAX for a commit?  As you can check, the condition
would be true: GENERATION_NUMBER_V1_MAX < GENERATION_NUMBER_V1_MAX + 1,
so the `git commit-graph verify` would incorrectly say that there is
a problem with generation number, while there isn't one (false positive
detection of error).

Sidenote: I think we don't have to worry about having to introduce
GENERATION_NUMBER_V2_MAX, as the in-memory size (of reconstructed from
disck representation) corrected commiter date is the same as of commiter
date itself, plus some, and I don't see us coming close to 64-bit limit
of timestamp_t for commit dates.

>  				     oid_to_hex(&cur_oid),
>  				     generation,
>  				     max_generation + 1);

Best,
Abhishek Kumar Nov. 3, 2020, 11:44 a.m. UTC | #2
On Tue, Oct 27, 2020 at 07:53:23PM +0100, Jakub Narębski wrote:
> "Abhishek Kumar via GitGitGadget" <gitgitgadget@gmail.com> writes:
> 
> > From: Abhishek Kumar <abhishekkumar8222@gmail.com>
> > ...
> > Signed-off-by: Abhishek Kumar <abhishekkumar8222@gmail.com>
> 
> Somewhere in the commit message we should also describe that this commit
> changes how commit-graph is verified: from checking that the generation
> number agrees with _topological level definition_, that is that for a
> given commit it is 1 more than maximum of its parents (with the caveat
> that we need to handle GENERATION_NUMBER_V1_MAX values correctly), to
> checking that slightly weaker condition fulfilled by both topological
> levels (generation number v1) and by corrected commit date (generation
> number v2) that for a given commit its generation number is 1 more than
> maximum of its parents or larger.

Sure, that makes sense. Will add.

> 
> But, as far as I understand it, current code does not handle correctly
> GENERATION_NUMBER_V1_MAX case (if we use generation number v1).
> 
> On the other hand we could have simpy use functional check, that
> generation number used (which can be v1 or v2, or any similar other)
> fulfills the reachability condition for each edge, which can be
> simplified to checking that generation(parents) <= generation(commit).
> If the reachability condition is true for each edge, then it is true for
> each path, and for each commit.
> 
> > ---
> >  commit-graph.c | 43 +++++++++++++++++++++++--------------------
> >  1 file changed, 23 insertions(+), 20 deletions(-)
> >
> > diff --git a/commit-graph.c b/commit-graph.c
> > index cedd311024..03948adfce 100644
> > --- a/commit-graph.c
> > +++ b/commit-graph.c
> > @@ -154,11 +154,6 @@ static int commit_gen_cmp(const void *va, const void *vb)
> >  	else if (generation_a > generation_b)
> >  		return 1;
> >  
> > -	/* use date as a heuristic when generations are equal */
> > -	if (a->date < b->date)
> > -		return -1;
> > -	else if (a->date > b->date)
> > -		return 1;
> 
> Why this change?  It is not described in the commit message.
> 
> Note that while this tie-breaking fallback doesn't make much sense for
> corrected committer date generation number v2, this tie-breaking helps
> if we have to use topological levels (generation number v2).
> 

Right, I should have mentioned this change (and it's not something that
makes a difference either way).

We call commit_gen_cmp() only when we are sorting commits by generation
to speed up computation of Bloom filters i.e. while writing a commit
graph (either split commit-graph or a simple commit-graph).

Since we are always computing and storing corrected commit date when we
are writing (whether we write a GDAT chunk or not), using date as
heuristic is longer required.

> >  	return 0;
> >  }
> >  
> > @@ -1357,10 +1352,14 @@ static void compute_generation_numbers(struct write_commit_graph_context *ctx)
> >  					ctx->commits.nr);
> >  	for (i = 0; i < ctx->commits.nr; i++) {
> >  		timestamp_t level = *topo_level_slab_at(ctx->topo_levels, ctx->commits.list[i]);
> 
> Sidenote: I haven't noticed it earlier, but here 'uint32_t' might be
> enough; no need for 'timestamp_t' for 'level' variable.
> 
> > +		timestamp_t corrected_commit_date = commit_graph_data_at(ctx->commits.list[i])->generation;
> >

We need the 'timestamp_t' as we are comparing level with the now 64-bits
GENERATION_NUMBER_INFINITY. I thought uint32_t would be promoted to
timestamp_t. I have a hunch that since we are explicitly using a fixed
width data type, compiler is unwilling to type coerce into broader data
types.

Advice on this appreciated.

> 
> All right, we compute both generation numbers: topological levels and
> corrected commit date.
> 
> I guess we use 'corrected_commit_date' instead of simply 'generation' to
> make it asier to remember which is which.
> 
> >  		display_progress(ctx->progress, i + 1);
> >  		if (level != GENERATION_NUMBER_INFINITY &&
> > -		    level != GENERATION_NUMBER_ZERO)
> > +		    level != GENERATION_NUMBER_ZERO &&
> > +		    corrected_commit_date != GENERATION_NUMBER_INFINITY &&
> > +		    corrected_commit_date != GENERATION_NUMBER_ZERO
> 
> Straightforward addition.
> 
> > +		    )
> 
> Why this closing parenthesis is now in separated line?
> 
> >  			continue;
> >  
> >  		commit_list_insert(ctx->commits.list[i], &list);
> > @@ -1369,17 +1368,25 @@ static void compute_generation_numbers(struct write_commit_graph_context *ctx)
> >  			struct commit_list *parent;
> >  			int all_parents_computed = 1;
> >  			uint32_t max_level = 0;
> > +			timestamp_t max_corrected_commit_date = 0;
> 
> All right, straightforward addition.
> 
> >  
> >  			for (parent = current->parents; parent; parent = parent->next) {
> >  				level = *topo_level_slab_at(ctx->topo_levels, parent->item);
> > -
> 
> Why we have removed this empty line?
> 
> > +				corrected_commit_date = commit_graph_data_at(parent->item)->generation;
> 
> All right.
> 
> >  				if (level == GENERATION_NUMBER_INFINITY ||
> > -				    level == GENERATION_NUMBER_ZERO) {
> > +				    level == GENERATION_NUMBER_ZERO ||
> > +				    corrected_commit_date == GENERATION_NUMBER_INFINITY ||
> > +				    corrected_commit_date == GENERATION_NUMBER_ZERO
> > +				    ) {
> 
> All right, same as above.
> 
> >  					all_parents_computed = 0;
> >  					commit_list_insert(parent->item, &list);
> >  					break;
> > -				} else if (level > max_level) {
> > -					max_level = level;
> > +				} else {
> > +					if (level > max_level)
> > +						max_level = level;
> > +
> > +					if (corrected_commit_date > max_corrected_commit_date)
> > +						max_corrected_commit_date = corrected_commit_date;
> >  				}
> 
> All right, reasonable and straightforward.
> 
> >  			}
> >  
> > @@ -1389,6 +1396,10 @@ static void compute_generation_numbers(struct write_commit_graph_context *ctx)
> >  				if (max_level > GENERATION_NUMBER_V1_MAX - 1)
> >  					max_level = GENERATION_NUMBER_V1_MAX - 1;
> >  				*topo_level_slab_at(ctx->topo_levels, current) = max_level + 1;
> > +
> > +				if (current->date && current->date > max_corrected_commit_date)
> > +					max_corrected_commit_date = current->date - 1;
> > +				commit_graph_data_at(current)->generation = max_corrected_commit_date + 1;
> 
> All right.
> 
> Here we use the same trick as in previous commit (and as above) to avoid
> any possible overflow, to minimize number of conditionals.  The fact
> that max_corrected_commit_date might store incorrect value doesn't
> matter, as it is reset at beginning of this loop.
> 
> >  			}
> >  		}
> >  	}
> > @@ -2485,17 +2496,9 @@ int verify_commit_graph(struct repository *r, struct commit_graph *g, int flags)
> >  		if (generation_zero == GENERATION_ZERO_EXISTS)
> >  			continue;
> >  
> > -		/*
> > -		 * If one of our parents has generation GENERATION_NUMBER_V1_MAX, then
> > -		 * our generation is also GENERATION_NUMBER_V1_MAX. Decrement to avoid
> > -		 * extra logic in the following condition.
> > -		 */
> > -		if (max_generation == GENERATION_NUMBER_V1_MAX)
> > -			max_generation--;
> > -
> 
> Perhaps in the future we should check that both topological levels, and
> also corrected committer date (if it exists) for correctness according
> to their definition.  Then the above removed part would be restored (but
> with s/max_generation/max_level/).
> 
> >  		generation = commit_graph_generation(graph_commit);
> > -		if (generation != max_generation + 1)
> > -			graph_report(_("commit-graph generation for commit %s is %u != %u"),
> > +		if (generation < max_generation + 1)
> > +			graph_report(_("commit-graph generation for commit %s is %"PRItime" < %"PRItime),
> 
> All right, so we relaxed the check so that it will be fulfilled by
> generation number v2 (and also by generation number v1, as it implies
> the more strict check for v1).
> 
> What would happen however if generation holds topological levels, and it
> is GENERATION_NUMBER_V1_MAX for at least one parent, which means it is
> GENERATION_NUMBER_V1_MAX for a commit?  As you can check, the condition
> would be true: GENERATION_NUMBER_V1_MAX < GENERATION_NUMBER_V1_MAX + 1,
> so the `git commit-graph verify` would incorrectly say that there is
> a problem with generation number, while there isn't one (false positive
> detection of error).

Alright, so the above block still makes sense if we are working with
topological levels but not with corrected commit dates. Instead of
removing it, I will modify the condition to check that one of our parents
has GENERATION_NUMBER_V1_MAX and the graph uses topological levels.

Suprised that no test breaks by this change.

I have also moved changes in the verify function to the next patch, as
we cannot write or read corrected commit dates yet - so little sense in
modifying verify.

> 
> Sidenote: I think we don't have to worry about having to introduce
> GENERATION_NUMBER_V2_MAX, as the in-memory size (of reconstructed from
> disck representation) corrected commiter date is the same as of commiter
> date itself, plus some, and I don't see us coming close to 64-bit limit
> of timestamp_t for commit dates.
> 
> >  				     oid_to_hex(&cur_oid),
> >  				     generation,
> >  				     max_generation + 1);
> 
> Best,
> -- 
> Jakub Narębski

Thanks
- Abhishek
Jakub Narębski Nov. 4, 2020, 4:45 p.m. UTC | #3
Hello Abhishek,

Abhishek Kumar <abhishekkumar8222@gmail.com> writes:
> On Tue, Oct 27, 2020 at 07:53:23PM +0100, Jakub Narębski wrote:
>> "Abhishek Kumar via GitGitGadget" <gitgitgadget@gmail.com> writes:
>> 
>>> From: Abhishek Kumar <abhishekkumar8222@gmail.com>
>>> ...
>>> Signed-off-by: Abhishek Kumar <abhishekkumar8222@gmail.com>
>> 
>> Somewhere in the commit message we should also describe that this commit
>> changes how commit-graph is verified: from checking that the generation
>> number agrees with _topological level definition_, that is that for a
>> given commit it is 1 more than maximum of its parents (with the caveat
>> that we need to handle GENERATION_NUMBER_V1_MAX values correctly), to
>> checking that slightly weaker condition fulfilled by both topological
>> levels (generation number v1) and by corrected commit date (generation
>> number v2) that for a given commit its generation number is 1 more than
>> maximum of its parents or larger.
>
> Sure, that makes sense. Will add.

Actually this description should match whatever we decide about
mechanism for verifying correctness of generation numbers (see below).
Because we have to choose one.

>> 
>> But, as far as I understand it, current code does not handle correctly
>> GENERATION_NUMBER_V1_MAX case (if we use generation number v1).
>> 
>> On the other hand we could have simpy use functional check, that
>> generation number used (which can be v1 or v2, or any similar other)
>> fulfills the reachability condition for each edge, which can be
>> simplified to checking that generation(parents) <= generation(commit).
>> If the reachability condition is true for each edge, then it is true for
>> each path, and for each commit.

See below.

>>> ---
>>>  commit-graph.c | 43 +++++++++++++++++++++++--------------------
>>>  1 file changed, 23 insertions(+), 20 deletions(-)
>>>
>>> diff --git a/commit-graph.c b/commit-graph.c
>>> index cedd311024..03948adfce 100644
>>> --- a/commit-graph.c
>>> +++ b/commit-graph.c
>>> @@ -154,11 +154,6 @@ static int commit_gen_cmp(const void *va, const void *vb)
>>>  	else if (generation_a > generation_b)
>>>  		return 1;
>>>  
>>> -	/* use date as a heuristic when generations are equal */
>>> -	if (a->date < b->date)
>>> -		return -1;
>>> -	else if (a->date > b->date)
>>> -		return 1;
>> 
>> Why this change?  It is not described in the commit message.
>> 
>> Note that while this tie-breaking fallback doesn't make much sense for
>> corrected committer date generation number v2, this tie-breaking helps
>> if we have to use topological levels (generation number v2).
>> 
>
> Right, I should have mentioned this change (and it's not something that
> makes a difference either way).
>
> We call commit_gen_cmp() only when we are sorting commits by generation
> to speed up computation of Bloom filters i.e. while writing a commit
> graph (either split commit-graph or a simple commit-graph).
>
> Since we are always computing and storing corrected commit date when we
> are writing (whether we write a GDAT chunk or not), using date as
> heuristic is longer required.

Thanks.  This description really should be added to the commit message,
because (yet again?) I was confused by this change.

Sidenote: it is not obvious at least to me that this function is used
only for sorting commits to speed up computation of Bloom filters while
writing the commit-graph (`git commit-graph write --changed-paths [other
options]`).

>>>  	return 0;
>>>  }
>>>  
>>> @@ -1357,10 +1352,14 @@ static void compute_generation_numbers(struct write_commit_graph_context *ctx)
>>>  					ctx->commits.nr);
>>>  	for (i = 0; i < ctx->commits.nr; i++) {
>>>  		timestamp_t level = *topo_level_slab_at(ctx->topo_levels, ctx->commits.list[i]);
>> 
>> Sidenote: I haven't noticed it earlier, but here 'uint32_t' might be
>> enough; no need for 'timestamp_t' for 'level' variable.
>> 
>>> +		timestamp_t corrected_commit_date = commit_graph_data_at(ctx->commits.list[i])->generation;
>>>
>
> We need the 'timestamp_t' as we are comparing level with the now 64-bits
> GENERATION_NUMBER_INFINITY. I thought uint32_t would be promoted to
> timestamp_t. I have a hunch that since we are explicitly using a fixed
> width data type, compiler is unwilling to type coerce into broader data
> types.
>
> Advice on this appreciated.

All right, so the wider type is used because of comparison with
wide-uint GENERATION_NUMBER_INFINITY.  I stand corrected.

[...]
>>> @@ -2485,17 +2496,9 @@ int verify_commit_graph(struct repository *r, struct commit_graph *g, int flags)
>>>  		if (generation_zero == GENERATION_ZERO_EXISTS)
>>>  			continue;
>>>  
>>> -		/*
>>> -		 * If one of our parents has generation GENERATION_NUMBER_V1_MAX, then
>>> -		 * our generation is also GENERATION_NUMBER_V1_MAX. Decrement to avoid
>>> -		 * extra logic in the following condition.
>>> -		 */
>>> -		if (max_generation == GENERATION_NUMBER_V1_MAX)
>>> -			max_generation--;
>>> -
>> 
>> Perhaps in the future we should check that both topological levels, and
>> also corrected committer date (if it exists) for correctness according
>> to their definition.  Then the above removed part would be restored (but
>> with s/max_generation/max_level/).
>> 
>>>  		generation = commit_graph_generation(graph_commit);
>>> -		if (generation != max_generation + 1)
>>> -			graph_report(_("commit-graph generation for commit %s is %u != %u"),
>>> +		if (generation < max_generation + 1)
>>> +			graph_report(_("commit-graph generation for commit %s is %"PRItime" < %"PRItime),
>> 
>> All right, so we relaxed the check so that it will be fulfilled by
>> generation number v2 (and also by generation number v1, as it implies
>> the more strict check for v1).
>> 
>> What would happen however if generation holds topological levels, and it
>> is GENERATION_NUMBER_V1_MAX for at least one parent, which means it is
>> GENERATION_NUMBER_V1_MAX for a commit?  As you can check, the condition
>> would be true: GENERATION_NUMBER_V1_MAX < GENERATION_NUMBER_V1_MAX + 1,
>> so the `git commit-graph verify` would incorrectly say that there is
>> a problem with generation number, while there isn't one (false positive
>> detection of error).
>
> Alright, so the above block still makes sense if we are working with
> topological levels but not with corrected commit dates. Instead of
> removing it, I will modify the condition to check that one of our parents
> has GENERATION_NUMBER_V1_MAX and the graph uses topological levels.

That is one of the 3 possible solutions I can think of.


I. First solution is to switch from checking that generation number
matches its definition to checking that the [weaker] reachability
condition for the generation number is true, that is:

 	if (generation < max_generation)
 		graph_report(_("commit-graph generation for commit %s is %"PRItime" < %"PRItime),

The [weaker] reachability condition for generation numbers states that

   A reachable from B    =>    gen(A) <= gen(B)

This condition is true even if one or more generation numbers is
GENERATION_NUMBER_ZERO (uninitialized or written by old git version),
GENERATION_NUMBER_V1_MAX (we hit storage limitations, can happen only
for generation number v1), or GENERATION_NUMBER_INFINITY (for commits
outside of the serialized commit-graph, doesn't matter and cannot happen
during verification of the commit-graph data by definition).

This means that if P* is the parent of C with the maximal generation
number, and gen(C) < gen(P*) is true (while gen(P*) <= gen(C) should be
true), then there is a problem with generation number.

This is why I thought you were going for, and what I have proposed.

Advantages:
- we are testing what actually matters for speeding up reachability
  queries, namely that the reachability property holds true
- the test works for generation number v1, generation number v2,
  and any possible future use-compatibile generation number
  (not that I think we would need any)
- least complicated solution

Disadvantages:
- weaker test that we have had for generation number v1 (topological
  levels), and weaker that possible test for generation number v2
  that we could have (see below)


II. Verify corrected committed date (generation number v2) if available,
and verify topological levels (generation number v1) otherwise, checking
that it matches the definition of it -- using version-specific checks.

This would probably mean adding a conditional around the code verifying
that given generation number is correct, possibly:

  if (g->read_generation_data) {
  	/* verify corrected commit date */
  } else {
  	/* current code for verifying topological levels */
  }

II.a. For topological levels (generation number v1) we would continue
checking that it matches the definition, that is that the following
condition holds:

  gen(C) = max_{P: P ∈ parents(C)} gen(P) + 1

This includes code for handling the case where `max_generation`, holding 
max_{P: P ∈ parents(C)} gen(P), is GENERATION_NUMBER_V1_MAX.

II.b. For corrected commiter dates (generation number v2) we can use the
code proposed by this revision of this commit, namely we check if the
following condition holds:

  gen(P) + 1 <= gen(C)   for each P \in parents(C)

or, in other words:

  max_{P: P ∈ parents(C)} { gen(P) } + 1  <=  gen(C)

Which could be checked using the following code (i.e. current state
after this revision of this patch):

	if (generation < max_generation + 1)
		graph_report(_("commit-graph generation for commit %s is %"PRItime" < %"PRItime),

This is what I think you are proposing now.

Additionally, theoretically we could also check that the following
condition holds for corrected commiter date:

   committer_date(C) <= gen_v2(C)

but this is automatically fufilled because we use non-negative offsets
to store corrected committed date info.

Alternatively we can check for compliance with the definition of the
corrected committer date:

  if (max_generation + 1 <= graph_commit->date) {
  	/* commit date does not need correction */
  	if (generation != graph_commit->date)
    	graph_report(_("commit-graph corrected commit date for commit %s "
  		               "is %"PRItime" != %"PRItime" commit date"),
                     ...);
  } else {
  	if (generation != max_generation + 1)
  		graph_report(_("commit-graph generation v2 for commit %s is %"PRItime" != %"PRItime),
                     ...);
  }

Though I think it might be overkill.

Advantages:
- more strict tests, checking generation numbers (v2 if present, v1
  otherwise) against their definition
- if there is no GDAT chunk, verify works just like it did before

Disadvantages:
- more complicated code
- possibly measurable performance degradation due to extra conditional


III. Like II., but if there is generation numbers chunk (GDAT chunk), we
verify *both* topological levels (v1) and corrected commit date (v2)
against their definition.  If GDAT chunk is not present, it reduces to
current code (before this patch series).

Advantages:
- if there is no GDAT chunk, verify works just like it did before
- most strict tests, verifying all the data: both generation number v1
  and generation number v2 -- if possible

Disadvantages:
- most complex code; we need to somehow extract topological levels
  if the GDAT chunk is present (they are not on graph data slab in this
  case); I have not even started to think how it could be done
- slower verification

> Suprised that no test breaks by this change.

I don't whink we have any test that created commit graph with
topological levels greater than GENERATION_NUMBER_V1_MAX; this would be
expensive and have to be of course protected by GIT_TEST_LONG aka
EXPENSIVE prerequisite.

  # GIT_TEST_COMMIT_GRAPH_NO_GDAT=1 is here to force verification of topological levels
  test_expect_success EXPENSIVE 'verify handles topological levels > GENERATION_NUMBER_V1_MAX' '
  	rm -rf long_chain &&
  	git init long_chain &&
  	test_commit_bulk -C long_chain 1073741824 &&
    (
  		cd long_chain &&
  		GIT_TEST_COMMIT_GRAPH_NO_GDAT=1 git commit-graph write &&
  		git commit-graph verify
    )
  '

This however lies slightly outside the scope of this patch series,
though if you could add this test (in a separate patch), after testing
it, it would be very nice.

>
> I have also moved changes in the verify function to the next patch, as
> we cannot write or read corrected commit dates yet - so little sense in
> modifying verify.

I think putting changes to the verify function in a separate patch, be
it before or after this one (depending on the choice of the algorithm
for verification, see above) would be a good idea.

>> 
>> Sidenote: I think we don't have to worry about having to introduce
>> GENERATION_NUMBER_V2_MAX, as the in-memory size (of reconstructed from
>> disck representation) corrected commiter date is the same as of commiter
>> date itself, plus some, and I don't see us coming close to 64-bit limit
>> of timestamp_t for commit dates.
>> 
>>>  				     oid_to_hex(&cur_oid),
>>>  				     generation,
>>>  				     max_generation + 1);

Best,
Philip Oakley Nov. 5, 2020, 2:05 p.m. UTC | #4
Hi Abhishek,

On 04/11/2020 16:45, Jakub Narębski wrote:
> Hello Abhishek,
>
> Abhishek Kumar <abhishekkumar8222@gmail.com> writes:
>> On Tue, Oct 27, 2020 at 07:53:23PM +0100, Jakub Narębski wrote:
>>> "Abhishek Kumar via GitGitGadget" <gitgitgadget@gmail.com> writes:
>>>
>>>> From: Abhishek Kumar <abhishekkumar8222@gmail.com>
>>>> ...
>>>> Signed-off-by: Abhishek Kumar <abhishekkumar8222@gmail.com>
>>> Somewhere in the commit message we should also describe that this commit
>>> changes how commit-graph is verified: from checking that the generation
>>> number agrees with _topological level definition_, that is that for a
>>> given commit it is 1 more than maximum of its parents (with the caveat
>>> that we need to handle GENERATION_NUMBER_V1_MAX values correctly), to
>>> checking that slightly weaker condition fulfilled by both topological
>>> levels (generation number v1) and by corrected commit date (generation
>>> number v2) that for a given commit its generation number is 1 more than
>>> maximum of its parents or larger.
>> Sure, that makes sense. Will add.
> Actually this description should match whatever we decide about
> mechanism for verifying correctness of generation numbers (see below).
> Because we have to choose one.

This may be not part of the the main project, but could you consider, if
time permits, also adding some entries into the Git Glossary (`git help
glossary`) for the various terms we are using here and elsewhere, e.g.
'topological levels', 'generation number', 'corrected commit date' (and
its fancy technical name for the use of date heuristics e.g. the
'chronological ordering';).

The glossary can provide a reference, once the issues are resolved. The
History Simplification and Commit Ordering section of git-log maybe a
useful guide to some of the terms that would link to the glossary.
--
Philip

>>> But, as far as I understand it, current code does not handle correctly
>>> GENERATION_NUMBER_V1_MAX case (if we use generation number v1).
>>>
>>> On the other hand we could have simpy use functional check, that
>>> generation number used (which can be v1 or v2, or any similar other)
>>> fulfills the reachability condition for each edge, which can be
>>> simplified to checking that generation(parents) <= generation(commit).
>>> If the reachability condition is true for each edge, then it is true for
>>> each path, and for each commit.
> See below.
>
>>>> ---
>>>>  commit-graph.c | 43 +++++++++++++++++++++++--------------------
>>>>  1 file changed, 23 insertions(+), 20 deletions(-)
>>>>
>>>> diff --git a/commit-graph.c b/commit-graph.c
>>>> index cedd311024..03948adfce 100644
>>>> --- a/commit-graph.c
>>>> +++ b/commit-graph.c
>>>> @@ -154,11 +154,6 @@ static int commit_gen_cmp(const void *va, const void *vb)
>>>>  	else if (generation_a > generation_b)
>>>>  		return 1;
>>>>  
>>>> -	/* use date as a heuristic when generations are equal */
>>>> -	if (a->date < b->date)
>>>> -		return -1;
>>>> -	else if (a->date > b->date)
>>>> -		return 1;
>>> Why this change?  It is not described in the commit message.
>>>
>>> Note that while this tie-breaking fallback doesn't make much sense for
>>> corrected committer date generation number v2, this tie-breaking helps
>>> if we have to use topological levels (generation number v2).
>>>
>> Right, I should have mentioned this change (and it's not something that
>> makes a difference either way).
>>
>> We call commit_gen_cmp() only when we are sorting commits by generation
>> to speed up computation of Bloom filters i.e. while writing a commit
>> graph (either split commit-graph or a simple commit-graph).
>>
>> Since we are always computing and storing corrected commit date when we
>> are writing (whether we write a GDAT chunk or not), using date as
>> heuristic is longer required.
> Thanks.  This description really should be added to the commit message,
> because (yet again?) I was confused by this change.
>
> Sidenote: it is not obvious at least to me that this function is used
> only for sorting commits to speed up computation of Bloom filters while
> writing the commit-graph (`git commit-graph write --changed-paths [other
> options]`).
>
>>>>  	return 0;
>>>>  }
>>>>  
>>>> @@ -1357,10 +1352,14 @@ static void compute_generation_numbers(struct write_commit_graph_context *ctx)
>>>>  					ctx->commits.nr);
>>>>  	for (i = 0; i < ctx->commits.nr; i++) {
>>>>  		timestamp_t level = *topo_level_slab_at(ctx->topo_levels, ctx->commits.list[i]);
>>> Sidenote: I haven't noticed it earlier, but here 'uint32_t' might be
>>> enough; no need for 'timestamp_t' for 'level' variable.
>>>
>>>> +		timestamp_t corrected_commit_date = commit_graph_data_at(ctx->commits.list[i])->generation;
>>>>
>> We need the 'timestamp_t' as we are comparing level with the now 64-bits
>> GENERATION_NUMBER_INFINITY. I thought uint32_t would be promoted to
>> timestamp_t. I have a hunch that since we are explicitly using a fixed
>> width data type, compiler is unwilling to type coerce into broader data
>> types.
>>
>> Advice on this appreciated.
> All right, so the wider type is used because of comparison with
> wide-uint GENERATION_NUMBER_INFINITY.  I stand corrected.
>
> [...]
>>>> @@ -2485,17 +2496,9 @@ int verify_commit_graph(struct repository *r, struct commit_graph *g, int flags)
>>>>  		if (generation_zero == GENERATION_ZERO_EXISTS)
>>>>  			continue;
>>>>  
>>>> -		/*
>>>> -		 * If one of our parents has generation GENERATION_NUMBER_V1_MAX, then
>>>> -		 * our generation is also GENERATION_NUMBER_V1_MAX. Decrement to avoid
>>>> -		 * extra logic in the following condition.
>>>> -		 */
>>>> -		if (max_generation == GENERATION_NUMBER_V1_MAX)
>>>> -			max_generation--;
>>>> -
>>> Perhaps in the future we should check that both topological levels, and
>>> also corrected committer date (if it exists) for correctness according
>>> to their definition.  Then the above removed part would be restored (but
>>> with s/max_generation/max_level/).
>>>
>>>>  		generation = commit_graph_generation(graph_commit);
>>>> -		if (generation != max_generation + 1)
>>>> -			graph_report(_("commit-graph generation for commit %s is %u != %u"),
>>>> +		if (generation < max_generation + 1)
>>>> +			graph_report(_("commit-graph generation for commit %s is %"PRItime" < %"PRItime),
>>> All right, so we relaxed the check so that it will be fulfilled by
>>> generation number v2 (and also by generation number v1, as it implies
>>> the more strict check for v1).
>>>
>>> What would happen however if generation holds topological levels, and it
>>> is GENERATION_NUMBER_V1_MAX for at least one parent, which means it is
>>> GENERATION_NUMBER_V1_MAX for a commit?  As you can check, the condition
>>> would be true: GENERATION_NUMBER_V1_MAX < GENERATION_NUMBER_V1_MAX + 1,
>>> so the `git commit-graph verify` would incorrectly say that there is
>>> a problem with generation number, while there isn't one (false positive
>>> detection of error).
>> Alright, so the above block still makes sense if we are working with
>> topological levels but not with corrected commit dates. Instead of
>> removing it, I will modify the condition to check that one of our parents
>> has GENERATION_NUMBER_V1_MAX and the graph uses topological levels.
> That is one of the 3 possible solutions I can think of.
>
>
> I. First solution is to switch from checking that generation number
> matches its definition to checking that the [weaker] reachability
> condition for the generation number is true, that is:
>
>  	if (generation < max_generation)
>  		graph_report(_("commit-graph generation for commit %s is %"PRItime" < %"PRItime),
>
> The [weaker] reachability condition for generation numbers states that
>
>    A reachable from B    =>    gen(A) <= gen(B)
>
> This condition is true even if one or more generation numbers is
> GENERATION_NUMBER_ZERO (uninitialized or written by old git version),
> GENERATION_NUMBER_V1_MAX (we hit storage limitations, can happen only
> for generation number v1), or GENERATION_NUMBER_INFINITY (for commits
> outside of the serialized commit-graph, doesn't matter and cannot happen
> during verification of the commit-graph data by definition).
>
> This means that if P* is the parent of C with the maximal generation
> number, and gen(C) < gen(P*) is true (while gen(P*) <= gen(C) should be
> true), then there is a problem with generation number.
>
> This is why I thought you were going for, and what I have proposed.
>
> Advantages:
> - we are testing what actually matters for speeding up reachability
>   queries, namely that the reachability property holds true
> - the test works for generation number v1, generation number v2,
>   and any possible future use-compatibile generation number
>   (not that I think we would need any)
> - least complicated solution
>
> Disadvantages:
> - weaker test that we have had for generation number v1 (topological
>   levels), and weaker that possible test for generation number v2
>   that we could have (see below)
>
>
> II. Verify corrected committed date (generation number v2) if available,
> and verify topological levels (generation number v1) otherwise, checking
> that it matches the definition of it -- using version-specific checks.
>
> This would probably mean adding a conditional around the code verifying
> that given generation number is correct, possibly:
>
>   if (g->read_generation_data) {
>   	/* verify corrected commit date */
>   } else {
>   	/* current code for verifying topological levels */
>   }
>
> II.a. For topological levels (generation number v1) we would continue
> checking that it matches the definition, that is that the following
> condition holds:
>
>   gen(C) = max_{P: P ∈ parents(C)} gen(P) + 1
>
> This includes code for handling the case where `max_generation`, holding 
> max_{P: P ∈ parents(C)} gen(P), is GENERATION_NUMBER_V1_MAX.
>
> II.b. For corrected commiter dates (generation number v2) we can use the
> code proposed by this revision of this commit, namely we check if the
> following condition holds:
>
>   gen(P) + 1 <= gen(C)   for each P \in parents(C)
>
> or, in other words:
>
>   max_{P: P ∈ parents(C)} { gen(P) } + 1  <=  gen(C)
>
> Which could be checked using the following code (i.e. current state
> after this revision of this patch):
>
> 	if (generation < max_generation + 1)
> 		graph_report(_("commit-graph generation for commit %s is %"PRItime" < %"PRItime),
>
> This is what I think you are proposing now.
>
> Additionally, theoretically we could also check that the following
> condition holds for corrected commiter date:
>
>    committer_date(C) <= gen_v2(C)
>
> but this is automatically fufilled because we use non-negative offsets
> to store corrected committed date info.
>
> Alternatively we can check for compliance with the definition of the
> corrected committer date:
>
>   if (max_generation + 1 <= graph_commit->date) {
>   	/* commit date does not need correction */
>   	if (generation != graph_commit->date)
>     	graph_report(_("commit-graph corrected commit date for commit %s "
>   		               "is %"PRItime" != %"PRItime" commit date"),
>                      ...);
>   } else {
>   	if (generation != max_generation + 1)
>   		graph_report(_("commit-graph generation v2 for commit %s is %"PRItime" != %"PRItime),
>                      ...);
>   }
>
> Though I think it might be overkill.
>
> Advantages:
> - more strict tests, checking generation numbers (v2 if present, v1
>   otherwise) against their definition
> - if there is no GDAT chunk, verify works just like it did before
>
> Disadvantages:
> - more complicated code
> - possibly measurable performance degradation due to extra conditional
>
>
> III. Like II., but if there is generation numbers chunk (GDAT chunk), we
> verify *both* topological levels (v1) and corrected commit date (v2)
> against their definition.  If GDAT chunk is not present, it reduces to
> current code (before this patch series).
>
> Advantages:
> - if there is no GDAT chunk, verify works just like it did before
> - most strict tests, verifying all the data: both generation number v1
>   and generation number v2 -- if possible
>
> Disadvantages:
> - most complex code; we need to somehow extract topological levels
>   if the GDAT chunk is present (they are not on graph data slab in this
>   case); I have not even started to think how it could be done
> - slower verification
>
>> Suprised that no test breaks by this change.
> I don't whink we have any test that created commit graph with
> topological levels greater than GENERATION_NUMBER_V1_MAX; this would be
> expensive and have to be of course protected by GIT_TEST_LONG aka
> EXPENSIVE prerequisite.
>
>   # GIT_TEST_COMMIT_GRAPH_NO_GDAT=1 is here to force verification of topological levels
>   test_expect_success EXPENSIVE 'verify handles topological levels > GENERATION_NUMBER_V1_MAX' '
>   	rm -rf long_chain &&
>   	git init long_chain &&
>   	test_commit_bulk -C long_chain 1073741824 &&
>     (
>   		cd long_chain &&
>   		GIT_TEST_COMMIT_GRAPH_NO_GDAT=1 git commit-graph write &&
>   		git commit-graph verify
>     )
>   '
>
> This however lies slightly outside the scope of this patch series,
> though if you could add this test (in a separate patch), after testing
> it, it would be very nice.
>
>> I have also moved changes in the verify function to the next patch, as
>> we cannot write or read corrected commit dates yet - so little sense in
>> modifying verify.
> I think putting changes to the verify function in a separate patch, be
> it before or after this one (depending on the choice of the algorithm
> for verification, see above) would be a good idea.
>
>>> Sidenote: I think we don't have to worry about having to introduce
>>> GENERATION_NUMBER_V2_MAX, as the in-memory size (of reconstructed from
>>> disck representation) corrected commiter date is the same as of commiter
>>> date itself, plus some, and I don't see us coming close to 64-bit limit
>>> of timestamp_t for commit dates.
>>>
>>>>  				     oid_to_hex(&cur_oid),
>>>>  				     generation,
>>>>  				     max_generation + 1);
> Best,
Junio C Hamano Nov. 5, 2020, 6:22 p.m. UTC | #5
Philip Oakley <philipoakley@iee.email> writes:

> This may be not part of the the main project, but could you consider, if
> time permits, also adding some entries into the Git Glossary (`git help
> glossary`) for the various terms we are using here and elsewhere, e.g.
> 'topological levels', 'generation number', 'corrected commit date' (and
> its fancy technical name for the use of date heuristics e.g. the
> 'chronological ordering';).
>
> The glossary can provide a reference, once the issues are resolved. The
> History Simplification and Commit Ordering section of git-log maybe a
> useful guide to some of the terms that would link to the glossary.

Ah, I first thought that Documentation/rev-list-options.txt (which
is the relevant part of "git log" documentation you mention here)
already have references to deep technical terms explained in the
glossary and you are suggesting Abhishek to mimic the arrangement by
adding new and agreed-upon terms to the glossary and referring to
them from the commit-graph documentation updated by this series.

But sadly that is not the case.  What you are saying is that you
noticed that rev-list-options.txt needs a similar "the terms we use
to explain these two sections should be defined and explained in the
glossary (if they are not) and new references to glossary should be
added there" update.

In any case, that is a very good suggestion.  I agree that updating
"git log" doc may be outside the scope of Abhishek's theme, but it
would be very good to have such an update by anybody ;-)

Thanks
Jakub Narębski Nov. 6, 2020, 6:26 p.m. UTC | #6
Junio C Hamano <gitster@pobox.com> writes:
> Philip Oakley <philipoakley@iee.email> writes:
>
>> This may be not part of the the main project, but could you consider, if
>> time permits, also adding some entries into the Git Glossary (`git help
>> glossary`) for the various terms we are using here and elsewhere, e.g.
>> 'topological levels', 'generation number', 'corrected commit date' (and
>> its fancy technical name for the use of date heuristics e.g. the
>> 'chronological ordering';).
>>
>> The glossary can provide a reference, once the issues are resolved. The
>> History Simplification and Commit Ordering section of git-log maybe a
>> useful guide to some of the terms that would link to the glossary.
>
> Ah, I first thought that Documentation/rev-list-options.txt (which
> is the relevant part of "git log" documentation you mention here)
> already have references to deep technical terms explained in the
> glossary and you are suggesting Abhishek to mimic the arrangement by
> adding new and agreed-upon terms to the glossary and referring to
> them from the commit-graph documentation updated by this series.
>
> But sadly that is not the case.  What you are saying is that you
> noticed that rev-list-options.txt needs a similar "the terms we use
> to explain these two sections should be defined and explained in the
> glossary (if they are not) and new references to glossary should be
> added there" update.
>
> In any case, that is a very good suggestion.  I agree that updating
> "git log" doc may be outside the scope of Abhishek's theme, but it
> would be very good to have such an update by anybody ;-)

The only possible problem I see with this suggestion is that some of
those terms (like 'topological levels' and 'corrected commit date') are
technical terms that should be not of concern for Git user, only for
developers working on Git.  (However one could encounter the term
"generation number" in `git commit-graph verify` output.)

I don't think adding technical terms that the user won't encounter in
the documentation or among messages that Git outputs would be not a good
idea.  It could confuse users, rather than help them.

Conversely, perhaps we should add Documentation/technical/glossary.txt
to help developers.


P.S. By the way, when looking at Documentation/glossary-content.txt, I
have noticed few obsolescent entries, like "Git archive", few that have
description that soon could be or is obsolete and would need updating,
like "master" (when default branch switch to "main"), or "object
identifier" and "SHA-1" (when Git switches away from SHA-1 as hash
function).

Best,
--
Jakub Narębski
Junio C Hamano Nov. 6, 2020, 7:33 p.m. UTC | #7
Jakub Narębski <jnareb@gmail.com> writes:

> I don't think adding technical terms that the user won't encounter in
> the documentation or among messages that Git outputs would be not a good
> idea.  It could confuse users, rather than help them.
>
> Conversely, perhaps we should add Documentation/technical/glossary.txt
> to help developers.

Thanks for a thoughtful suggestion to help the target audience.  I
agree 100% with the above two paragraphs.
Philip Oakley Nov. 8, 2020, 5:23 p.m. UTC | #8
Hi Jakub,

On 06/11/2020 18:26, Jakub Narębski wrote:
> Junio C Hamano <gitster@pobox.com> writes:
>> Philip Oakley <philipoakley@iee.email> writes:
>>
>>> This may be not part of the the main project, but could you consider, if
>>> time permits, also adding some entries into the Git Glossary (`git help
>>> glossary`) for the various terms we are using here and elsewhere, e.g.
>>> 'topological levels', 'generation number', 'corrected commit date' (and
>>> its fancy technical name for the use of date heuristics e.g. the
>>> 'chronological ordering';).
>>>
>>> The glossary can provide a reference, once the issues are resolved. The
>>> History Simplification and Commit Ordering section of git-log maybe a
>>> useful guide to some of the terms that would link to the glossary.
>> Ah, I first thought that Documentation/rev-list-options.txt (which
>> is the relevant part of "git log" documentation you mention here)
>> already have references to deep technical terms explained in the
>> glossary and you are suggesting Abhishek to mimic the arrangement by
>> adding new and agreed-upon terms to the glossary and referring to
>> them from the commit-graph documentation updated by this series.
>>
>> But sadly that is not the case.  What you are saying is that you
>> noticed that rev-list-options.txt needs a similar "the terms we use
>> to explain these two sections should be defined and explained in the
>> glossary (if they are not) and new references to glossary should be
>> added there" update.
>>
>> In any case, that is a very good suggestion.  I agree that updating
>> "git log" doc may be outside the scope of Abhishek's theme, but it
>> would be very good to have such an update by anybody ;-)
> The only possible problem I see with this suggestion is that some of
> those terms (like 'topological levels' and 'corrected commit date') are
> technical terms that should be not of concern for Git user, only for
> developers working on Git.  (However one could encounter the term
> "generation number" in `git commit-graph verify` output.)
However we do mention "topolog*"  in a number of the manual pages, and
rather less, as yet, in the technical pages.

"Lexicographic" and "chronological" are in the same group of fancy
technical words ;-)

>
> I don't think adding technical terms that the user won't encounter in
> the documentation or among messages that Git outputs would be not a good
> idea.  It could confuse users, rather than help them.
>
> Conversely, perhaps we should add Documentation/technical/glossary.txt
> to help developers.

I would agree that the Glossary probably ought to be split into the
primary, secondary and background terms so that the core concepts are
separated from the academic/developer style terms.

Git does rip up most of what folks think about version "control",
usually based on the imperfect replication of physical artefacts.
>
> P.S. By the way, when looking at Documentation/glossary-content.txt, I
> have noticed few obsolescent entries, like "Git archive", few that have
> description that soon could be or is obsolete and would need updating,
> like "master" (when default branch switch to "main"), or "object
> identifier" and "SHA-1" (when Git switches away from SHA-1 as hash
> function).
The obsolescent items can be updated. I'm expecting that the 'main' and
'SHA-' changes will eventually be picked up as part of the respective
patch series, hopefully as part of the global replacements.

--
Philip
Jakub Narębski Nov. 10, 2020, 1:35 a.m. UTC | #9
Hello Philip,

Philip Oakley <philipoakley@iee.email> writes:
> On 06/11/2020 18:26, Jakub Narębski wrote:
>> Junio C Hamano <gitster@pobox.com> writes:
>>> Philip Oakley <philipoakley@iee.email> writes:
>>>
>>>> This may be not part of the the main project, but could you consider, if
>>>> time permits, also adding some entries into the Git Glossary (`git help
>>>> glossary`) for the various terms we are using here and elsewhere, e.g.
>>>> 'topological levels', 'generation number', 'corrected commit date' (and
>>>> its fancy technical name for the use of date heuristics e.g. the
>>>> 'chronological ordering';).
>>>>
>>>> The glossary can provide a reference, once the issues are resolved. The
>>>> History Simplification and Commit Ordering section of git-log maybe a
>>>> useful guide to some of the terms that would link to the glossary.
>>>
>>> Ah, I first thought that Documentation/rev-list-options.txt (which
>>> is the relevant part of "git log" documentation you mention here)
>>> already have references to deep technical terms explained in the
>>> glossary and you are suggesting Abhishek to mimic the arrangement by
>>> adding new and agreed-upon terms to the glossary and referring to
>>> them from the commit-graph documentation updated by this series.
>>>
>>> But sadly that is not the case.  What you are saying is that you
>>> noticed that rev-list-options.txt needs a similar "the terms we use
>>> to explain these two sections should be defined and explained in the
>>> glossary (if they are not) and new references to glossary should be
>>> added there" update.

What terms you feel need glossary entry?

>>> In any case, that is a very good suggestion.  I agree that updating
>>> "git log" doc may be outside the scope of Abhishek's theme, but it
>>> would be very good to have such an update by anybody ;-)
>>
>> The only possible problem I see with this suggestion is that some of
>> those terms (like 'topological levels' and 'corrected commit date') are
>> technical terms that should be not of concern for Git user, only for
>> developers working on Git.  (However one could encounter the term
>> "generation number" in `git commit-graph verify` output.)

To be more precise, I think that user-facing glossary should include
only terms that appear in user-facing documentation and in output
messages of Git commands (with the possible exception of maybe output
messages of some low-level plumbing).

I think that the developer-facing glossary should include terms that
appear in technical documentation, and in commit messages in Git
history.

> However we do mention "topolog*"  in a number of the manual pages, and
> rather less, as yet, in the technical pages.
>
> "Lexicographic" and "chronological" are in the same group of fancy
> technical words ;-)

I think that 'topological level' would appear only in technical
documentation; if it would be the case then there is no reason to add it
to user-facing glossary (to gitglossary manpage).

'Topological order' or 'topological sort', 'lexicographical order' and
'chronological order' are not Git-specific terms, and there are no
Git-specific ambiguities.  I am therefore a bit unsure about adding them
to *Git* glossary.

- In computer science, a _topological sort_ or _topological_ ordering of
  a directed graph is a linear ordering of its vertices such that for
  every directed edge uv from vertex u to vertex v, u comes before v in
  the ordering.

  For Git it means that top to bottom, commits always appear before
  their parents. With `--graph` or `--topo-order` Git also avoids
  showing commits on multiple lines of history intermixed.

- In mathematics, the _lexicographic_ or _lexicographical order_ (also
  known as lexical order, dictionary order, etc.) is a generalization of
  the alphabetical order.

  For Git it is simply alphabetical order.

- _Chronological order_ is the arrangement of things following one after
  another in time; or in other words date order.

  Note that `git log --date-order` commits also always appear before
  their parents, but otherwise commits are shown in the commit timestamp
  order (committer date order)

>>
>> I don't think adding technical terms that the user won't encounter in
>> the documentation or among messages that Git outputs would be not a good
>> idea.  It could confuse users, rather than help them.
>>
>> Conversely, perhaps we should add Documentation/technical/glossary.txt
>> to help developers.
>
> I would agree that the Glossary probably ought to be split into the
> primary, secondary and background terms so that the core concepts are
> separated from the academic/developer style terms.

I don't thing we need three separate layers; in my opinion separating
terms that user of Git might encounter from terms that somebody working
on developing Git may encounter would be enough.

The technical glossary / dictionary could also help onboarding...

>
> Git does rip up most of what folks think about version "control",
> usually based on the imperfect replication of physical artefacts.

I don't quite understand what you wanted to say there.  Could you
explain in more detail, please?

>> P.S. By the way, when looking at Documentation/glossary-content.txt, I
>> have noticed few obsolescent entries, like "Git archive", few that have
>> description that soon could be or is obsolete and would need updating,
>> like "master" (when default branch switch to "main"), or "object
>> identifier" and "SHA-1" (when Git switches away from SHA-1 as hash
>> function).
>
> The obsolescent items can be updated. I'm expecting that the 'main' and
> 'SHA-' changes will eventually be picked up as part of the respective
> patch series, hopefully as part of the global replacements.

Here I meant that "Git archive" entry is not important anymore, as I
think there are no active users of GNU arch version control system (no
"arch people"); arch's last release was in 2006, and its replacement,
Bazaar (or 'bzr') doesn't use this term. So I think it can be safely
removed in 2020, after 14 years after last release of arch.

In most cases "SHA-1" in the descriptions of terms in glossary should be
replaced by "object identifier" (to be more generic).  This can be
safely done before switch to NewHash is ready and announced.

Best,
Philip Oakley Nov. 10, 2020, 2:04 p.m. UTC | #10
Hi Jakub,

On 10/11/2020 01:35, Jakub Narębski wrote:
> Hello Philip,
>
> Philip Oakley <philipoakley@iee.email> writes:
>> On 06/11/2020 18:26, Jakub Narębski wrote:
>>> Junio C Hamano <gitster@pobox.com> writes:
>>>> Philip Oakley <philipoakley@iee.email> writes:
>>>>
>>>>> This may be not part of the the main project, but could you consider, if
>>>>> time permits, also adding some entries into the Git Glossary (`git help
>>>>> glossary`) for the various terms we are using here and elsewhere, e.g.
>>>>> 'topological levels', 'generation number', 'corrected commit date' (and
>>>>> its fancy technical name for the use of date heuristics e.g. the
>>>>> 'chronological ordering';).
>>>>>
>>>>> The glossary can provide a reference, once the issues are resolved. The
>>>>> History Simplification and Commit Ordering section of git-log maybe a
>>>>> useful guide to some of the terms that would link to the glossary.
>>>> Ah, I first thought that Documentation/rev-list-options.txt (which
>>>> is the relevant part of "git log" documentation you mention here)
>>>> already have references to deep technical terms explained in the
>>>> glossary and you are suggesting Abhishek to mimic the arrangement by
>>>> adding new and agreed-upon terms to the glossary and referring to
>>>> them from the commit-graph documentation updated by this series.
>>>>
>>>> But sadly that is not the case.  What you are saying is that you
>>>> noticed that rev-list-options.txt needs a similar "the terms we use
>>>> to explain these two sections should be defined and explained in the
>>>> glossary (if they are not) and new references to glossary should be
>>>> added there" update.
> What terms you feel need glossary entry?
While it was Junio that made the comment, I'd agree that we should be
using the glossary to explain, in a general sense, the terms that are
used is a specialist sense. As the user community expands, their natural
understanding of some of the terms diminishes.
>
>>>> In any case, that is a very good suggestion.  I agree that updating
>>>> "git log" doc may be outside the scope of Abhishek's theme, but it
>>>> would be very good to have such an update by anybody ;-)
>>> The only possible problem I see with this suggestion is that some of
>>> those terms (like 'topological levels' and 'corrected commit date') are
>>> technical terms that should be not of concern for Git user, only for
>>> developers working on Git.  (However one could encounter the term
>>> "generation number" in `git commit-graph verify` output.)
> To be more precise, I think that user-facing glossary should include
> only terms that appear in user-facing documentation and in output
> messages of Git commands (with the possible exception of maybe output
> messages of some low-level plumbing).
And where implied, the underlying concepts when they aren't obvious, or
lack general terms (e.g. the 'staging area' discussions)
>
> I think that the developer-facing glossary should include terms that
> appear in technical documentation, and in commit messages in Git
> history.
>
>> However we do mention "topolog*"  in a number of the manual pages, and
>> rather less, as yet, in the technical pages.
>>
>> "Lexicographic" and "chronological" are in the same group of fancy
>> technical words ;-)
> I think that 'topological level' would appear only in technical
> documentation; if it would be the case then there is no reason to add it
> to user-facing glossary (to gitglossary manpage).
>
> 'Topological order' or 'topological sort', 'lexicographical order' and
> 'chronological order' are not Git-specific terms, and there are no
> Git-specific ambiguities.  I am therefore a bit unsure about adding them
> to *Git* glossary.

It is that they aren't terms used in normal speech, so many folks do not
comprehend the implied precision that the docs assume, nor the problems
they may hide.
>
> - In computer science, a _topological sort_ or _topological_ ordering of
>   a directed graph is a linear ordering of its vertices such that for
>   every directed edge uv from vertex u to vertex v, u comes before v in
>   the ordering.
Does this imply that those who aren't computer scientists shouldn't be
using Git?
>
>   For Git it means that top to bottom, commits always appear before
>   their parents. With `--graph` or `--topo-order` Git also avoids
>   showing commits on multiple lines of history intermixed.
>
> - In mathematics, the _lexicographic_ or _lexicographical order_ (also
>   known as lexical order, dictionary order, etc.) is a generalization of
>   the alphabetical order.
>
>   For Git it is simply alphabetical order. 
ASCII order, Case sensitivity, Special characters, etc.
>
> - _Chronological order_ is the arrangement of things following one after
>   another in time; or in other words date order.
Given that most  résumés (the thing most folk see that asks for date
order) is latest first, does this clarify which way chronological is? (I
see this regularly in my other volunteer work).
>
>   Note that `git log --date-order` commits also always appear before
>   their parents, but otherwise commits are shown in the commit timestamp
>   order (committer date order)

>
>>> I don't think adding technical terms that the user won't encounter in
>>> the documentation or among messages that Git outputs would be not a good
>>> idea.  It could confuse users, rather than help them.
>>>
>>> Conversely, perhaps we should add Documentation/technical/glossary.txt
>>> to help developers.
>> I would agree that the Glossary probably ought to be split into the
>> primary, secondary and background terms so that the core concepts are
>> separated from the academic/developer style terms.
> I don't thing we need three separate layers; in my opinion separating
> terms that user of Git might encounter from terms that somebody working
> on developing Git may encounter would be enough.
>
> The technical glossary / dictionary could also help onboarding...
>
>> Git does rip up most of what folks think about version "control",
>> usually based on the imperfect replication of physical artefacts.
> I don't quite understand what you wanted to say there.  Could you
> explain in more detail, please?
Background, I see Git & Version Control from an engineers view point,
rather than developers view.

In the "real" world there are no perfect copies, we serialise key items
so that we can track their degradation, and replace them when required.
We attempt to "Control" what is happening. Our documentation and
monitoring systems have layers of control to ensure only suitably
qualified persons may access and inspect critical items, can record and
access previous status reports, etc. There is only one "Mona Lisa", with
critical access controls, even though there are 'copies'
https://en.wikipedia.org/wiki/Mona_Lisa#Early_versions_and_copies.
Almost all of our terminology for configuration control comes from the
'real' world, i.e. pre-modern computing.

Git turns all that on its head. We can make perfect duplicates (they're
not copies, not replicas..). The Object name is immutable. It's either
right or wrong (exempt the SHAttered sha-1 breakage; were moving to
sha-256). Git does *not* provide any access control. It supports the
'software freedoms' by distributing the control to the user. The
repository is a version storage system, and the OIDs allow easy
authentication between folks that they are looking at the same object,
and all its implied descendants.

Git has ripped up classical 'real' world version control. In many areas
we need new or alternative terms, and documents that explain them to
screen writers(*) and the many other non CS-major users of Git (and some
engineers;-)

(*) there's a diff pattern for them, IIRC, or at least one was proposed.
>
>>> P.S. By the way, when looking at Documentation/glossary-content.txt, I
>>> have noticed few obsolescent entries, like "Git archive", few that have
>>> description that soon could be or is obsolete and would need updating,
>>> like "master" (when default branch switch to "main"), or "object
>>> identifier" and "SHA-1" (when Git switches away from SHA-1 as hash
>>> function).
>> The obsolescent items can be updated. I'm expecting that the 'main' and
>> 'SHA-' changes will eventually be picked up as part of the respective
>> patch series, hopefully as part of the global replacements.
> Here I meant that "Git archive" entry is not important anymore, as I
> think there are no active users of GNU arch version control system (no
> "arch people"); arch's last release was in 2006, and its replacement,
> Bazaar (or 'bzr') doesn't use this term. So I think it can be safely
> removed in 2020, after 14 years after last release of arch.
>
> In most cases "SHA-1" in the descriptions of terms in glossary should be
> replaced by "object identifier" (to be more generic).  This can be
> safely done before switch to NewHash is ready and announced.
>
> Best,
--
Philip
Jakub Narębski Nov. 10, 2020, 11:52 p.m. UTC | #11
Hello Philip,

Philip Oakley <philipoakley@iee.email> writes:
> On 10/11/2020 01:35, Jakub Narębski wrote:
>> Philip Oakley <philipoakley@iee.email> writes:
>>> On 06/11/2020 18:26, Jakub Narębski wrote:
>>>> Junio C Hamano <gitster@pobox.com> writes:
>>>>> Philip Oakley <philipoakley@iee.email> writes:
>>>>>
>>>>>> This may be not part of the the main project, but could you consider, if
>>>>>> time permits, also adding some entries into the Git Glossary (`git help
>>>>>> glossary`) for the various terms we are using here and elsewhere, e.g.
>>>>>> 'topological levels', 'generation number', 'corrected commit date' (and
>>>>>> its fancy technical name for the use of date heuristics e.g. the
>>>>>> 'chronological ordering';).
>>>>>>
>>>>>> The glossary can provide a reference, once the issues are resolved. The
>>>>>> History Simplification and Commit Ordering section of git-log maybe a
>>>>>> useful guide to some of the terms that would link to the glossary.
[...]
>> What terms you feel need glossary entry?
>
> While it was Junio that made the comment, I'd agree that we should be
> using the glossary to explain, in a general sense, the terms that are
> used is a specialist sense. As the user community expands, their natural
> understanding of some of the terms diminishes.

I was hoping for a list of terms from the abovementioned sections of
git-log manpage you feel need entry in gitglosary(7).

[...]
>> To be more precise, I think that user-facing glossary should include
>> only terms that appear in user-facing documentation and in output
>> messages of Git commands (with the possible exception of maybe output
>> messages of some low-level plumbing).
>
> And where implied, the underlying concepts when they aren't obvious, or
> lack general terms (e.g. the 'staging area' discussions)

True, 'staging area' should IMVHO be in glossary (replacing or in
addition to older less specific term 'index', previous name for 'staging
area' term).

>> I think that the developer-facing glossary should include terms that
>> appear in technical documentation, and in commit messages in Git
>> history.

Such as 'topological levels', 'commit slab' / 'on the slab', etc.

>>> However we do mention "topolog*"  in a number of the manual pages, and
>>> rather less, as yet, in the technical pages.
>>>
>>> "Lexicographic" and "chronological" are in the same group of fancy
>>> technical words ;-)
>>
>> I think that 'topological level' would appear only in technical
>> documentation; if it would be the case then there is no reason to add it
>> to user-facing glossary (to gitglossary manpage).
>>
>> 'Topological order' or 'topological sort', 'lexicographical order' and
>> 'chronological order' are not Git-specific terms, and there are no
>> Git-specific ambiguities.  I am therefore a bit unsure about adding them
>> to *Git* glossary.
>
> It is that they aren't terms used in normal speech, so many folks do not
> comprehend the implied precision that the docs assume, nor the problems
> they may hide.

Right.

>> - In computer science, a _topological sort_ or _topological_ ordering of
>>   a directed graph is a linear ordering of its vertices such that for
>>   every directed edge uv from vertex u to vertex v, u comes before v in
>>   the ordering.
>
> Does this imply that those who aren't computer scientists shouldn't be
> using Git?

I think that in most cases where we refer to topological order in the
documentation we describe it there.  It might be good idea to add it to
the glossary, especially because Git uses it often in a very specific
sense.

On the other hand, should we define 'topology' or 'graph' as well? Or
'glossary' ;-) ? Those don't have any special meaning in Git, and can be
as well found in the dictionary or Wikipedia.

>>   For Git it means that top to bottom, commits always appear before
>>   their parents. With `--graph` or `--topo-order` Git also avoids
>>   showing commits on multiple lines of history intermixed.
>>
>> - In mathematics, the _lexicographic_ or _lexicographical order_ (also
>>   known as lexical order, dictionary order, etc.) is a generalization of
>>   the alphabetical order.
>>
>>   For Git it is simply alphabetical order. 
>
> ASCII order, Case sensitivity, Special characters, etc.

Actually I don't know. Let me check: the only place this term appears in
the documentation is in git-tag(1) manpage and related documentation.
It simplly uses strcmp(), or strcasecmp() when using `--ignore-case`
option; so by default case sensitive.

It looks like it does not take locale-specific rules.

>> - _Chronological order_ is the arrangement of things following one after
>>   another in time; or in other words date order.
>
> Given that most résumés (the thing most folk see that asks for date
> order) is latest first, does this clarify which way chronological is? (I
> see this regularly in my other volunteer work).

Right, it might be not obvious at first glance that Git outputs most
recent commits first, that is newest commits are on top. Though if you
think about it in more detail, it is the only ordering that makes sense,
especially for projects with a long history; first, it is newest commits
that are most interesting, and second Git always walks the history from
child to parent.

>>   Note that `git log --date-order` commits also always appear before
>>   their parents, but otherwise commits are shown in the commit timestamp
>>   order (committer date order)

[...]
>>> Git does rip up most of what folks think about version "control",
>>> usually based on the imperfect replication of physical artefacts.
>>
>> I don't quite understand what you wanted to say there.  Could you
>> explain in more detail, please?
>
> Background, I see Git & Version Control from an engineers view point,
> rather than developers view.
>
> In the "real" world there are no perfect copies, we serialise key items
> so that we can track their degradation, and replace them when required.
> We attempt to "Control" what is happening. Our documentation and
> monitoring systems have layers of control to ensure only suitably
> qualified persons may access and inspect critical items, can record and
> access previous status reports, etc. There is only one "Mona Lisa", with
> critical access controls, even though there are 'copies'
> https://en.wikipedia.org/wiki/Mona_Lisa#Early_versions_and_copies.
> Almost all of our terminology for configuration control comes from the
> 'real' world, i.e. pre-modern computing.
>
> Git turns all that on its head. We can make perfect duplicates (they're
> not copies, not replicas..). The Object name is immutable. It's either
> right or wrong (exempt the SHAttered sha-1 breakage; were moving to
> sha-256). Git does *not* provide any access control. It supports the
> 'software freedoms' by distributing the control to the user. The
> repository is a version storage system, and the OIDs allow easy
> authentication between folks that they are looking at the same object,
> and all its implied descendants.
>
> Git has ripped up classical 'real' world version control. In many areas
> we need new or alternative terms, and documents that explain them to
> screen writers(*) and the many other non CS-major users of Git (and some
> engineers;-)
>
> (*) there's a diff pattern for them, IIRC, or at least one was proposed.

Right, though for me the concept of 'version control' was by default
always about the digital, usually the source code.

There are different editions of books, changes to non-digital technical
drawings and plans (AFAIK often in the form of physical foil overlays as
subsequent layers, if done well; overdrawing on the same layer if not),
amendment and changes to laws, etc.


Anyway, the question is what level of knowledge can we assume from the
average Git user -- this would affect the spread of terms that should be
considered for the Git glossary.

Best,
diff mbox series

Patch

diff --git a/commit-graph.c b/commit-graph.c
index cedd311024..03948adfce 100644
--- a/commit-graph.c
+++ b/commit-graph.c
@@ -154,11 +154,6 @@  static int commit_gen_cmp(const void *va, const void *vb)
 	else if (generation_a > generation_b)
 		return 1;
 
-	/* use date as a heuristic when generations are equal */
-	if (a->date < b->date)
-		return -1;
-	else if (a->date > b->date)
-		return 1;
 	return 0;
 }
 
@@ -1357,10 +1352,14 @@  static void compute_generation_numbers(struct write_commit_graph_context *ctx)
 					ctx->commits.nr);
 	for (i = 0; i < ctx->commits.nr; i++) {
 		timestamp_t level = *topo_level_slab_at(ctx->topo_levels, ctx->commits.list[i]);
+		timestamp_t corrected_commit_date = commit_graph_data_at(ctx->commits.list[i])->generation;
 
 		display_progress(ctx->progress, i + 1);
 		if (level != GENERATION_NUMBER_INFINITY &&
-		    level != GENERATION_NUMBER_ZERO)
+		    level != GENERATION_NUMBER_ZERO &&
+		    corrected_commit_date != GENERATION_NUMBER_INFINITY &&
+		    corrected_commit_date != GENERATION_NUMBER_ZERO
+		    )
 			continue;
 
 		commit_list_insert(ctx->commits.list[i], &list);
@@ -1369,17 +1368,25 @@  static void compute_generation_numbers(struct write_commit_graph_context *ctx)
 			struct commit_list *parent;
 			int all_parents_computed = 1;
 			uint32_t max_level = 0;
+			timestamp_t max_corrected_commit_date = 0;
 
 			for (parent = current->parents; parent; parent = parent->next) {
 				level = *topo_level_slab_at(ctx->topo_levels, parent->item);
-
+				corrected_commit_date = commit_graph_data_at(parent->item)->generation;
 				if (level == GENERATION_NUMBER_INFINITY ||
-				    level == GENERATION_NUMBER_ZERO) {
+				    level == GENERATION_NUMBER_ZERO ||
+				    corrected_commit_date == GENERATION_NUMBER_INFINITY ||
+				    corrected_commit_date == GENERATION_NUMBER_ZERO
+				    ) {
 					all_parents_computed = 0;
 					commit_list_insert(parent->item, &list);
 					break;
-				} else if (level > max_level) {
-					max_level = level;
+				} else {
+					if (level > max_level)
+						max_level = level;
+
+					if (corrected_commit_date > max_corrected_commit_date)
+						max_corrected_commit_date = corrected_commit_date;
 				}
 			}
 
@@ -1389,6 +1396,10 @@  static void compute_generation_numbers(struct write_commit_graph_context *ctx)
 				if (max_level > GENERATION_NUMBER_V1_MAX - 1)
 					max_level = GENERATION_NUMBER_V1_MAX - 1;
 				*topo_level_slab_at(ctx->topo_levels, current) = max_level + 1;
+
+				if (current->date && current->date > max_corrected_commit_date)
+					max_corrected_commit_date = current->date - 1;
+				commit_graph_data_at(current)->generation = max_corrected_commit_date + 1;
 			}
 		}
 	}
@@ -2485,17 +2496,9 @@  int verify_commit_graph(struct repository *r, struct commit_graph *g, int flags)
 		if (generation_zero == GENERATION_ZERO_EXISTS)
 			continue;
 
-		/*
-		 * If one of our parents has generation GENERATION_NUMBER_V1_MAX, then
-		 * our generation is also GENERATION_NUMBER_V1_MAX. Decrement to avoid
-		 * extra logic in the following condition.
-		 */
-		if (max_generation == GENERATION_NUMBER_V1_MAX)
-			max_generation--;
-
 		generation = commit_graph_generation(graph_commit);
-		if (generation != max_generation + 1)
-			graph_report(_("commit-graph generation for commit %s is %u != %u"),
+		if (generation < max_generation + 1)
+			graph_report(_("commit-graph generation for commit %s is %"PRItime" < %"PRItime),
 				     oid_to_hex(&cur_oid),
 				     generation,
 				     max_generation + 1);