diff mbox series

[v2,11/17] mktree: overwrite duplicate entries

Message ID fb555658057f834d94f232f1d8b380a6304a3671.1718834285.git.gitgitgadget@gmail.com (mailing list archive)
State New, archived
Headers show
Series mktree: support more flexible usage | expand

Commit Message

Victoria Dye June 19, 2024, 9:57 p.m. UTC
From: Victoria Dye <vdye@github.com>

If multiple tree entries with the same name are provided as input to
'mktree', only write the last one to the tree. Entries are considered
duplicates if they have identical names (*not* considering mode); if a blob
and a tree with the same name are provided, only the last one will be
written to the tree. A tree with duplicate entries is invalid (per 'git
fsck'), so that condition should be avoided wherever possible.

Signed-off-by: Victoria Dye <vdye@github.com>
---
 Documentation/git-mktree.txt |  3 ++-
 builtin/mktree.c             | 45 ++++++++++++++++++++++++++++++++----
 t/t1010-mktree.sh            | 36 +++++++++++++++++++++++++++--
 3 files changed, 77 insertions(+), 7 deletions(-)

Comments

Junio C Hamano June 20, 2024, 10:05 p.m. UTC | #1
"Victoria Dye via GitGitGadget" <gitgitgadget@gmail.com> writes:

> From: Victoria Dye <vdye@github.com>
>
> If multiple tree entries with the same name are provided as input to
> 'mktree', only write the last one to the tree. Entries are considered
> duplicates if they have identical names (*not* considering mode); if a blob
> and a tree with the same name are provided, only the last one will be
> written to the tree. A tree with duplicate entries is invalid (per 'git
> fsck'), so that condition should be avoided wherever possible.

The "should be avoided" in the last sentence can be satisified
either by the callers being extra careful, or the callee ignoring
earlier entries with the same path.  I do not have a strong
objection against allowing looser callers, but if that is what is
going on, perhaps

	By teaching "mktree" to ignore the earlier entries for the
        same path in the input, the callers can be more casual about
        sending duplicate entries in order to avoid creating an
        invalid tree objects.

is a more honest justification for this setp?

> diff --git a/Documentation/git-mktree.txt b/Documentation/git-mktree.txt
> index 5f3a6dfe38e..cf1fd82f754 100644
> --- a/Documentation/git-mktree.txt
> +++ b/Documentation/git-mktree.txt
> @@ -54,7 +54,8 @@ cannot be represented in a tree object. The command will fail without
>  writing the tree if a higher order stage is specified for any entry.
>  
>  The order of the tree entries is normalized by `mktree` so pre-sorting the
> -input by path is not required.
> +input by path is not required. Multiple entries provided with the same path
> +are deduplicated, with only the last one specified added to the tree.

OK.

>  struct tree_entry {
> +	/* Internal */
> +	size_t order;
> +
>  	unsigned mode;
>  	struct object_id oid;
>  	int len;
> @@ -74,15 +77,49 @@ static void append_to_tree(unsigned mode, struct object_id *oid, const char *pat
>  	ent->len = len;
>  	oidcpy(&ent->oid, oid);
>  
> +	ent->order = arr->nr;
>  	tree_entry_array_push(arr, ent);
>  }
>  
> -static int ent_compare(const void *a_, const void *b_)
> +static int ent_compare(const void *a_, const void *b_, void *ctx)
>  {
> +	int cmp;
>  	struct tree_entry *a = *(struct tree_entry **)a_;
>  	struct tree_entry *b = *(struct tree_entry **)b_;
> -	return base_name_compare(a->name, a->len, a->mode,
> -				 b->name, b->len, b->mode);
> +	int ignore_mode = *((int *)ctx);
> +
> +	if (ignore_mode)
> +		cmp = name_compare(a->name, a->len, b->name, b->len);
> +	else
> +		cmp = base_name_compare(a->name, a->len, a->mode,
> +					b->name, b->len, b->mode);
> +	return cmp ? cmp : b->order - a->order;
> +}

Having two similar functions that could go out of sync has bothered
me somewhat.  We could instead do

	int a_mode = ignore_mode ? 0 : a->mode;
	int b_mode = ignore_mode ? 0 : b->mode;
	cmp = base_name_compare(a->name, a->len, a_mode,
				b->name, b->len, b_mode);

but that should be done by rewriting name_compare() in terms of
base_name_compare(), which will help more callers, not just this
one.

> +static void sort_and_dedup_tree_entry_array(struct tree_entry_array *arr)
> +{
> +	size_t count = arr->nr;
> +	struct tree_entry *prev = NULL;
> +
> +	int ignore_mode = 1;
> +	QSORT_S(arr->entries, arr->nr, ent_compare, &ignore_mode);

Swap the decl for ignore_mode and the blank line above it?

If the callback context only needs a single bit, ent_compare() could
just use the NULL-ness of ctx as "do we want to ignore mode?" bit.

> +	arr->nr = 0;
> +	for (size_t i = 0; i < count; i++) {
> +		struct tree_entry *curr = arr->entries[i];
> +		if (prev &&
> +		    !name_compare(prev->name, prev->len,
> +				  curr->name, curr->len)) {
> +			FREE_AND_NULL(curr);
> +		} else {
> +			arr->entries[arr->nr++] = curr;
> +			prev = curr;
> +		}
> +	}

As long as this is done for a single tree (i.e. the paths do not
have any slashes in them), this "sort them all and keep the last
one" is a good strategy.

> +	/* Sort again to order the entries for tree insertion */
> +	ignore_mode = 0;
> +	QSORT_S(arr->entries, arr->nr, ent_compare, &ignore_mode);

OK.  We from time to time find need to do this, and I always regret
that we didn't design the sort order of paths in a tree (and in the
index) like so [*].  But that is almost 20 years too late ;-).

Looking good.


[Footnote]

 * A directory entry $T should have sorted after a non-directory
   entry $T but before any non-directory entry whose path has $T
   as its prefix (e.g. even a blob whose path is $T + "\001" should
   sort after a tree $T).  That way we didn't have to worry about a
   blob at ($T + '-') sorting before a tree at $T but a blob at ($T
   + '0') sorting after that tree.
diff mbox series

Patch

diff --git a/Documentation/git-mktree.txt b/Documentation/git-mktree.txt
index 5f3a6dfe38e..cf1fd82f754 100644
--- a/Documentation/git-mktree.txt
+++ b/Documentation/git-mktree.txt
@@ -54,7 +54,8 @@  cannot be represented in a tree object. The command will fail without
 writing the tree if a higher order stage is specified for any entry.
 
 The order of the tree entries is normalized by `mktree` so pre-sorting the
-input by path is not required.
+input by path is not required. Multiple entries provided with the same path
+are deduplicated, with only the last one specified added to the tree.
 
 GIT
 ---
diff --git a/builtin/mktree.c b/builtin/mktree.c
index 8f0af24b6b1..a91d3a7b028 100644
--- a/builtin/mktree.c
+++ b/builtin/mktree.c
@@ -15,6 +15,9 @@ 
 #include "object-store-ll.h"
 
 struct tree_entry {
+	/* Internal */
+	size_t order;
+
 	unsigned mode;
 	struct object_id oid;
 	int len;
@@ -74,15 +77,49 @@  static void append_to_tree(unsigned mode, struct object_id *oid, const char *pat
 	ent->len = len;
 	oidcpy(&ent->oid, oid);
 
+	ent->order = arr->nr;
 	tree_entry_array_push(arr, ent);
 }
 
-static int ent_compare(const void *a_, const void *b_)
+static int ent_compare(const void *a_, const void *b_, void *ctx)
 {
+	int cmp;
 	struct tree_entry *a = *(struct tree_entry **)a_;
 	struct tree_entry *b = *(struct tree_entry **)b_;
-	return base_name_compare(a->name, a->len, a->mode,
-				 b->name, b->len, b->mode);
+	int ignore_mode = *((int *)ctx);
+
+	if (ignore_mode)
+		cmp = name_compare(a->name, a->len, b->name, b->len);
+	else
+		cmp = base_name_compare(a->name, a->len, a->mode,
+					b->name, b->len, b->mode);
+	return cmp ? cmp : b->order - a->order;
+}
+
+static void sort_and_dedup_tree_entry_array(struct tree_entry_array *arr)
+{
+	size_t count = arr->nr;
+	struct tree_entry *prev = NULL;
+
+	int ignore_mode = 1;
+	QSORT_S(arr->entries, arr->nr, ent_compare, &ignore_mode);
+
+	arr->nr = 0;
+	for (size_t i = 0; i < count; i++) {
+		struct tree_entry *curr = arr->entries[i];
+		if (prev &&
+		    !name_compare(prev->name, prev->len,
+				  curr->name, curr->len)) {
+			FREE_AND_NULL(curr);
+		} else {
+			arr->entries[arr->nr++] = curr;
+			prev = curr;
+		}
+	}
+
+	/* Sort again to order the entries for tree insertion */
+	ignore_mode = 0;
+	QSORT_S(arr->entries, arr->nr, ent_compare, &ignore_mode);
 }
 
 static void write_tree(struct tree_entry_array *arr, struct object_id *oid)
@@ -90,7 +127,7 @@  static void write_tree(struct tree_entry_array *arr, struct object_id *oid)
 	struct strbuf buf;
 	size_t size = 0;
 
-	QSORT(arr->entries, arr->nr, ent_compare);
+	sort_and_dedup_tree_entry_array(arr);
 	for (size_t i = 0; i < arr->nr; i++)
 		size += 32 + arr->entries[i]->len;
 
diff --git a/t/t1010-mktree.sh b/t/t1010-mktree.sh
index 7e750530455..08760141d6f 100755
--- a/t/t1010-mktree.sh
+++ b/t/t1010-mktree.sh
@@ -6,11 +6,16 @@  TEST_PASSES_SANITIZE_LEAK=true
 . ./test-lib.sh
 
 test_expect_success setup '
-	for d in a a- a0
+	for d in folder folder- folder0
 	do
 		mkdir "$d" && echo "$d/one" >"$d/one" &&
 		git add "$d" || return 1
 	done &&
+	for f in before folder.txt later
+	do
+		echo "$f" >"$f" &&
+		git add "$f" || return 1
+	done &&
 	echo zero >one &&
 	git update-index --add --info-only one &&
 	git write-tree --missing-ok >tree.missing &&
@@ -171,7 +176,7 @@  test_expect_success '--literally can create invalid trees' '
 
 test_expect_success 'mktree validates path' '
 	tree_oid="$(cat tree)" &&
-	blob_oid="$(git rev-parse $tree_oid:a/one)" &&
+	blob_oid="$(git rev-parse $tree_oid:folder.txt)" &&
 	head_oid="$(git rev-parse HEAD)" &&
 
 	# Valid: tree with or without trailing slash, blob without trailing slash
@@ -202,4 +207,31 @@  test_expect_success 'mktree validates path' '
 	test_grep "invalid path ${SQ}.git/${SQ}" err
 '
 
+test_expect_success 'mktree with duplicate entries' '
+	tree_oid=$(cat tree) &&
+	folder_oid=$(git rev-parse ${tree_oid}:folder) &&
+	before_oid=$(git rev-parse ${tree_oid}:before) &&
+	head_oid=$(git rev-parse HEAD) &&
+
+	{
+		printf "100755 blob $before_oid\ttest\n" &&
+		printf "040000 tree $folder_oid\ttest-\n" &&
+		printf "160000 commit $head_oid\ttest.txt\n" &&
+		printf "040000 tree $folder_oid\ttest\n" &&
+		printf "100644 blob $before_oid\ttest0\n" &&
+		printf "160000 commit $head_oid\ttest-\n"
+	} >top.dup &&
+	git mktree <top.dup >tree.actual &&
+
+	{
+		printf "160000 commit $head_oid\ttest-\n" &&
+		printf "160000 commit $head_oid\ttest.txt\n" &&
+		printf "040000 tree $folder_oid\ttest\n" &&
+		printf "100644 blob $before_oid\ttest0\n"
+	} >expect &&
+	git ls-tree $(cat tree.actual) >actual &&
+
+	test_cmp expect actual
+'
+
 test_done