[v3] commit-graph: ignore duplicates when merging layers

Message ID pull.747.v3.git.1602169479482.gitgitgadget@gmail.com (mailing list archive)
State New, archived

Commit Message

Derrick Stolee via GitGitGadget Oct. 8, 2020, 3:04 p.m. UTC
From: Derrick Stolee <dstolee@microsoft.com>

Thomas reported [1] that a "git fetch" command was failing with an error
saying "unexpected duplicate commit id". The root cause is that they had
fetch.writeCommitGraph enabled which generates commit-graph chains, and
this instance was merging two layers that both contained the same commit
ID.

[1] https://lore.kernel.org/git/55f8f00c-a61c-67d4-889e-a9501c596c39@virtuell-zuhause.de/

The initial assumption is that Git would not write a commit ID into a
commit-graph layer if it already exists in a lower commit-graph layer.
Somehow, this specific case did get into that situation, leading to this
error.

While unexpected, this isn't actually invalid (as long as the two layers
agree on the metadata for the commit). When we parse a commit that does
not have a graph_pos in the commit_graph_data_slab, we use binary search
in the commit-graph layers to find the commit and set graph_pos. That
position is never used again in this case. However, when we parse a
commit from the commit-graph file, we load its parents from the
commit-graph and assign graph_pos at that point. If those parents were
already parsed from the commit-graph, then nothing needs to be done.
Otherwise, this graph_pos is a valid position in the commit-graph so we
can parse the parents, when necessary.
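
As a rough illustration of the lookup described above, the following sketch models a chain of layers as sorted ID arrays and resolves an ID to a single position across the chain. Everything here (struct layer, find_in_layer(), find_position(), the top-down search order and the "entries in lower layers + index" position scheme) is a hypothetical simplification for this note, not code taken from commit-graph.c.

```c
#include <stddef.h>
#include <stdint.h>
#include <string.h>

/* Hypothetical model of one layer: sorted IDs plus a link to the layer below. */
struct layer {
	const char **ids;     /* sorted in ascending strcmp() order */
	uint32_t nr;          /* number of IDs in this layer */
	uint32_t nr_in_base;  /* total number of IDs in all lower layers */
	struct layer *base;   /* next lower layer, or NULL */
};

/* Binary search within a single layer; returns 1 and sets *pos on a hit. */
static int find_in_layer(const struct layer *l, const char *id, uint32_t *pos)
{
	uint32_t lo = 0, hi = l->nr;

	while (lo < hi) {
		uint32_t mid = lo + (hi - lo) / 2;
		int cmp = strcmp(id, l->ids[mid]);

		if (!cmp) {
			*pos = mid;
			return 1;
		}
		if (cmp < 0)
			hi = mid;
		else
			lo = mid + 1;
	}
	return 0;
}

/*
 * Walk the chain from the top layer down (an assumption of this sketch)
 * and report the first hit as a position counted across the whole chain.
 */
static int find_position(const struct layer *top, const char *id,
			 uint32_t *graph_pos)
{
	const struct layer *l;

	for (l = top; l; l = l->base) {
		uint32_t pos;

		if (find_in_layer(l, id, &pos)) {
			*graph_pos = l->nr_in_base + pos;
			return 1;
		}
	}
	return 0;
}
```

In this model, an ID present in two layers simply resolves to whichever layer is searched first, and either position identifies a usable row as long as both rows agree on the metadata.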

Thus, this die() is too aggressive. The easiest thing to do would be to
ignore the duplicates.

If we only ignore the duplicates, then we will produce a commit-graph
that has identical commit IDs listed in adjacent positions. This excess
data will never be removed from the commit-graph, which could cascade
into significantly bloated file sizes.

Thankfully, we can collapse the list to erase the duplicate commit
pointers. This allows us to get the end result we want without extra
memory costs and with minimal CPU time.
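
A minimal standalone sketch of this collapse, assuming a sorted array of pointers: a second index trails the scan index, and each first occurrence is copied down over the slots that duplicates would have used. The struct item type and the item_compare() and sort_and_dedup() helpers are hypothetical stand-ins for this note (plain strings replace object IDs); the actual change is the diff at the end of this message.

```c
#include <stddef.h>
#include <stdlib.h>
#include <string.h>

/* Hypothetical stand-in for a commit; only the ID matters here. */
struct item {
	char id[41];
};

/* qsort() comparator over an array of 'struct item *'. */
static int item_compare(const void *a, const void *b)
{
	const struct item *x = *(const struct item *const *)a;
	const struct item *y = *(const struct item *const *)b;

	return strcmp(x->id, y->id);
}

/*
 * Sort the pointer list by ID, then collapse runs of equal IDs in place.
 * Returns the new entry count; slots at or past that index are stale.
 */
static size_t sort_and_dedup(struct item **list, size_t nr)
{
	size_t i, dedup_i = 0;

	qsort(list, nr, sizeof(*list), item_compare);

	for (i = 0; i < nr; i++) {
		/* list[i - 1] is still the original entry: writes never pass i. */
		if (i && !strcmp(list[i - 1]->id, list[i]->id))
			continue; /* same ID as the previous entry: drop it */
		list[dedup_i++] = list[i];
	}
	return dedup_i;
}
```

No second array is allocated, and when the input has no duplicates the loop only assigns each slot to itself.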

Since the root cause for producing commit-graph layers with these
duplicate commits is currently unknown, it is difficult to create a test
for this scenario. For now, we must rely on testing the example data
graciously provided in [1]. My local test successfully merged layers,
and 'git commit-graph verify' passed.

Reported-by: Thomas Braun <thomas.braun@virtuell-zuhause.de>
Helped-by: Taylor Blau <me@ttaylorr.com>
Co-authored-by: Jeff King <peff@peff.net>
Signed-off-by: Derrick Stolee <dstolee@microsoft.com>
---
    commit-graph: ignore duplicates when merging layers
    
    This wasn't quite as simple as what Peff had posted, since we really
    don't want to keep duplicate commits around in the new merged layer.
    
    I still don't have a grasp on how this happened in the first place, but
    will keep looking.
    
    Thanks, -Stolee
    
    APOLOGIES: v2 accidentally only changed the commit message, not the
    patch contents. Please ignore v2 and go straight to v3.

Published-As: https://github.com/gitgitgadget/git/releases/tag/pr-747%2Fderrickstolee%2Fcommit-graph-dup-commits-v3
Fetch-It-Via: git fetch https://github.com/gitgitgadget/git pr-747/derrickstolee/commit-graph-dup-commits-v3
Pull-Request: https://github.com/gitgitgadget/git/pull/747

Range-diff vs v2:

 1:  85f4e578b8 ! 1:  9e760f07ac commit-graph: ignore duplicates when merging layers
     @@ Commit message
      
       ## commit-graph.c ##
      @@ commit-graph.c: static int commit_compare(const void *_a, const void *_b)
     + 
       static void sort_and_scan_merged_commits(struct write_commit_graph_context *ctx)
       {
     - 	uint32_t i;
     -+	struct packed_commit_list deduped_commits = { NULL, 0, 0 };
     +-	uint32_t i;
     ++	uint32_t i, dedup_i = 0;
       
       	if (ctx->report_progress)
       		ctx->progress = start_delayed_progress(
      @@ commit-graph.c: static void sort_and_scan_merged_commits(struct write_commit_graph_context *ctx)
     - 					ctx->commits.nr);
     - 
     - 	QSORT(ctx->commits.list, ctx->commits.nr, commit_compare);
     -+	deduped_commits.alloc = ctx->commits.nr;
     -+	ALLOC_ARRAY(deduped_commits.list, deduped_commits.alloc);
     - 
     - 	ctx->num_extra_edges = 0;
     - 	for (i = 0; i < ctx->commits.nr; i++) {
     -@@ commit-graph.c: static void sort_and_scan_merged_commits(struct write_commit_graph_context *ctx)
       
       		if (i && oideq(&ctx->commits.list[i - 1]->object.oid,
       			  &ctx->commits.list[i]->object.oid)) {
     @@ commit-graph.c: static void sort_and_scan_merged_commits(struct write_commit_gra
       		} else {
       			unsigned int num_parents;
       
     -+			deduped_commits.list[deduped_commits.nr] = ctx->commits.list[i];
     -+			deduped_commits.nr++;
     ++			ctx->commits.list[dedup_i] = ctx->commits.list[i];
     ++			dedup_i++;
      +
       			num_parents = commit_list_count(ctx->commits.list[i]->parents);
       			if (num_parents > 2)
     @@ commit-graph.c: static void sort_and_scan_merged_commits(struct write_commit_gra
       		}
       	}
       
     -+	free(ctx->commits.list);
     -+	ctx->commits.list = deduped_commits.list;
     -+	ctx->commits.nr = deduped_commits.nr;
     -+	ctx->commits.alloc = deduped_commits.alloc;
     ++	ctx->commits.nr = dedup_i;
      +
       	stop_progress(&ctx->progress);
       }


 commit-graph.c | 16 +++++++++++++---
 1 file changed, 13 insertions(+), 3 deletions(-)


base-commit: d98273ba77e1ab9ec755576bc86c716a97bf59d7

Comments

Jeff King Oct. 8, 2020, 3:53 p.m. UTC | #1
On Thu, Oct 08, 2020 at 03:04:39PM +0000, Derrick Stolee via GitGitGadget wrote:

> Since the root cause for producing commit-graph layers with these
> duplicate commits is currently unknown, it is difficult to create a test
> for this scenario. For now, we must rely on testing the example data
> graciously provided in [1]. My local test successfully merged layers,
> and 'git commit-graph verify' passed.

Yeah, that is unfortunate. We could synthetically create such a graph
file, but I'm not sure if it's worth the trouble.

> @@ -2023,17 +2023,27 @@ static void sort_and_scan_merged_commits(struct write_commit_graph_context *ctx)
>  
>  		if (i && oideq(&ctx->commits.list[i - 1]->object.oid,
>  			  &ctx->commits.list[i]->object.oid)) {
> -			die(_("unexpected duplicate commit id %s"),
> -			    oid_to_hex(&ctx->commits.list[i]->object.oid));
> +			/*
> +			 * Silently ignore duplicates. These were likely
> +			 * created due to a commit appearing in multiple
> +			 * layers of the chain, which is unexpected but
> +			 * not invalid. We should make sure there is a
> +			 * unique copy in the new layer.
> +			 */

You mentioned earlier checking that the metadata for the duplicates was
identical. How hard would that be to do here?

>  		} else {
>  			unsigned int num_parents;
>  
> +			ctx->commits.list[dedup_i] = ctx->commits.list[i];
> +			dedup_i++;
> +

This in-place de-duping is much nicer than what was in v1. There's still
a slight cost to the common case when we have no duplicates, but it's
minor (just an extra noop self-assignment of each index).

-Peff

Derrick Stolee Oct. 8, 2020, 4:26 p.m. UTC | #2
On 10/8/2020 11:53 AM, Jeff King wrote:
> On Thu, Oct 08, 2020 at 03:04:39PM +0000, Derrick Stolee via GitGitGadget wrote:
>> @@ -2023,17 +2023,27 @@ static void sort_and_scan_merged_commits(struct write_commit_graph_context *ctx)
>>  
>>  		if (i && oideq(&ctx->commits.list[i - 1]->object.oid,
>>  			  &ctx->commits.list[i]->object.oid)) {
>> -			die(_("unexpected duplicate commit id %s"),
>> -			    oid_to_hex(&ctx->commits.list[i]->object.oid));
>> +			/*
>> +			 * Silently ignore duplicates. These were likely
>> +			 * created due to a commit appearing in multiple
>> +			 * layers of the chain, which is unexpected but
>> +			 * not invalid. We should make sure there is a
>> +			 * unique copy in the new layer.
>> +			 */
> 
> You mentioned earlier checking that the metadata for the duplicates was
> identical. How hard would that be to do here?

I do think it is a bit tricky, since we would need to identify
from these duplicates which commit-graph layers they live in,
then compare the binary data in each row (for tree, date, generation)
and also logical data (convert parent int-ids into oids). One way
to do this would be to create distinct 'struct commit' objects (do
not use lookup_commit()) but finding the two positions within the
layers is the hard part.

At this point, any disagreement between rows would be corrupt data
in one or the other, and it should be caught by the 'verify'
subcommand. It definitely would be caught by 'verify' in the merged
layer after the 'write' completes.

At this point, we don't have any evidence that whatever causes the
duplicate rows could possibly write the wrong data to the duplicate
rows. I'll keep it in mind as we look for that root cause.

Thanks,
-Stolee

Taylor Blau Oct. 8, 2020, 4:42 p.m. UTC | #3
On Thu, Oct 08, 2020 at 12:26:29PM -0400, Derrick Stolee wrote:
> At this point, any disagreement between rows would be corrupt data
> in one or the other, and it should be caught by the 'verify'
> subcommand. It definitely would be caught by 'verify' in the merged
> layer after the 'write' completes.

Yeah, I'm fine with assuming that this data is correct here, since we
would have already "checked" it after we wrote it.

Of course, that means that if we find another commit-graph bug that
writes bad data and fix it in a future version, old commit-graphs with
duplicate objects have a chance to persist their data.

But, we again have 'git commit-graph verify' as a last resort there, so
I think it's OK.

> At this point, we don't have any evidence that whatever causes the
> duplicate rows could possibly write the wrong data to the duplicate
> rows. I'll keep it in mind as we look for that root cause.

Thanks.

Taylor

Jeff King Oct. 8, 2020, 4:43 p.m. UTC | #4
On Thu, Oct 08, 2020 at 12:26:29PM -0400, Derrick Stolee wrote:

> >> +			/*
> >> +			 * Silently ignore duplicates. These were likely
> >> +			 * created due to a commit appearing in multiple
> >> +			 * layers of the chain, which is unexpected but
> >> +			 * not invalid. We should make sure there is a
> >> +			 * unique copy in the new layer.
> >> +			 */
> > 
> > You mentioned earlier checking that the metadata for the duplicates was
> > identical. How hard would that be to do here?
> 
> I do think it is a bit tricky, since we would need to identify
> from these duplicates which commit-graph layers they live in,
> then compare the binary data in each row (for tree, date, generation)
> and also logical data (convert parent int-ids into oids). One way
> to do this would be to create distinct 'struct commit' objects (do
> not use lookup_commit()) but finding the two positions within the
> layers is the hard part.

OK, that sounds sufficiently hard that it isn't worth doing. I wondered
if there was easy access since we had the commit_graph handles here. But
I guess it really depends on which chunks are even available.

> At this point, any disagreement between rows would be corrupt data
> in one or the other, and it should be caught by the 'verify'
> subcommand. It definitely would be caught by 'verify' in the merged
> layer after the 'write' completes.
> 
> At this point, we don't have any evidence that whatever causes the
> duplicate rows could possibly write the wrong data to the duplicate
> rows. I'll keep it in mind as we look for that root cause.

That makes sense. I wonder if it is worth tipping the user off that
something funny is going on, and they may want to run "verify". I.e.,
should we be downgrading the die() to a warning(), rather than silently
skipping the duplicate?

I guess it depends on how often we expect this to happen. If the root
cause turns out to be some race that's unusual but may come up from time
to time, then the warning would unnecessarily alarm people, and/or be
annoying. But we don't know yet if that's the case.

-Peff

Patch

diff --git a/commit-graph.c b/commit-graph.c
index cb042bdba8..0280dcb2ce 100644
--- a/commit-graph.c
+++ b/commit-graph.c
@@ -2008,7 +2008,7 @@  static int commit_compare(const void *_a, const void *_b)
 
 static void sort_and_scan_merged_commits(struct write_commit_graph_context *ctx)
 {
-	uint32_t i;
+	uint32_t i, dedup_i = 0;
 
 	if (ctx->report_progress)
 		ctx->progress = start_delayed_progress(
@@ -2023,17 +2023,27 @@  static void sort_and_scan_merged_commits(struct write_commit_graph_context *ctx)
 
 		if (i && oideq(&ctx->commits.list[i - 1]->object.oid,
 			  &ctx->commits.list[i]->object.oid)) {
-			die(_("unexpected duplicate commit id %s"),
-			    oid_to_hex(&ctx->commits.list[i]->object.oid));
+			/*
+			 * Silently ignore duplicates. These were likely
+			 * created due to a commit appearing in multiple
+			 * layers of the chain, which is unexpected but
+			 * not invalid. We should make sure there is a
+			 * unique copy in the new layer.
+			 */
 		} else {
 			unsigned int num_parents;
 
+			ctx->commits.list[dedup_i] = ctx->commits.list[i];
+			dedup_i++;
+
 			num_parents = commit_list_count(ctx->commits.list[i]->parents);
 			if (num_parents > 2)
 				ctx->num_extra_edges += num_parents - 1;
 		}
 	}
 
+	ctx->commits.nr = dedup_i;
+
 	stop_progress(&ctx->progress);
 }