diff mbox series

[v2,15/20] merge-ort: step 3 of tree writing -- handling subdirectories as we go

Message ID 20201102204344.342633-16-newren@gmail.com (mailing list archive)
State New, archived
Headers show
Series fundamentals of merge-ort implementation | expand

Commit Message

Elijah Newren Nov. 2, 2020, 8:43 p.m. UTC
Our order for processing of entries means that if we have a tree of
files that looks like
   Makefile
   src/moduleA/foo.c
   src/moduleA/bar.c
   src/moduleB/baz.c
   src/moduleB/umm.c
   tokens.txt

Then we will process paths in the order of the leftmost column below.  I
have added two additional columns that help explain the algorithm that
follows; the 2nd column is there to remind us we have oid & mode info we
are tracking for each of these paths (which differs between the paths
which I'm not representing well here), and the third column annotates
the parent directory of the entry:
   tokens.txt               <version_info>    ""
   src/moduleB/umm.c        <version_info>    src/moduleB
   src/moduleB/baz.c        <version_info>    src/moduleB
   src/moduleB              <version_info>    src
   src/moduleA/foo.c        <version_info>    src/moduleA
   src/moduleA/bar.c        <version_info>    src/moduleA
   src/moduleA              <version_info>    src
   src                      <version_info>    ""
   Makefile                 <version_info>    ""

When the parent directory changes, if it's a subdirectory of the previous
parent directory (e.g. "" -> src/moduleB) then we can just keep appending.
If the parent directory differs from the previous parent directory and is
not a subdirectory, then we should process that directory.

So, for example, when we get to this point:
   tokens.txt               <version_info>    ""
   src/moduleB/umm.c        <version_info>    src/moduleB
   src/moduleB/baz.c        <version_info>    src/moduleB

and note that the next entry (src/moduleB) has a different parent than
the last one that isn't a subdirectory, we should write out a tree for it
   100644 blob <HASH> umm.c
   100644 blob <HASH> baz.c

then pop all the entries under that directory while recording the new
hash for that directory, leaving us with
   tokens.txt               <version_info>        ""
   src/moduleB              <new version_info>    src

This process repeats until at the end we get to
   tokens.txt               <version_info>        ""
   src                      <new version_info>    ""
   Makefile                 <version_info>        ""

and then we can write out the toplevel tree.  Since we potentially have
entries in our string_list corresponding to multiple different toplevel
directories, e.g. a slightly different repository might have:
   whizbang.txt             <version_info>        ""
   tokens.txt               <version_info>        ""
   src/moduleD              <new version_info>    src
   src/moduleC              <new version_info>    src
   src/moduleB              <new version_info>    src
   src/moduleA/foo.c        <version_info>        src/moduleA
   src/moduleA/bar.c        <version_info>        src/moduleA

When src/moduleA is popped off, we need to know that the "last
directory" reverts back to src, and how many entries in our string_list
are associated with that parent directory.  So I use an auxiliary
offsets string_list which would have (parent_directory,offset)
information of the form
   ""             0
   src            2
   src/moduleA    5

Whenever I write out a tree for a subdirectory, I set versions.nr to
the final offset value and then decrement offsets.nr...and then add
an entry to versions with a hash for the new directory.

The idea is relatively simple, there's just a lot of accounting to
implement this.

Signed-off-by: Elijah Newren <newren@gmail.com>
---
 merge-ort.c | 113 ++++++++++++++++++++++++++++++++++++++++++++++++----
 1 file changed, 106 insertions(+), 7 deletions(-)

Comments

Jonathan Tan Nov. 12, 2020, 8:15 p.m. UTC | #1
Firstly, from [1]:

> Thanks for the reviews!  I was hoping to see some comments on patch
> 15, as it's possibly the gnarliest.  It's a relatively straightforward
> algorithm, just lots of bookkeeping.

I was planning to send this out yesterday, but couldn't finish it. :-P
Indeed, a lot of things to think about.

[1] https://lore.kernel.org/git/CABPp-BFgQX6Ash03x7z+RfE3ytbw3x0DzDSBrGddgMr_soODoA@mail.gmail.com/

[snip commit message]

Thanks for the thorough explanation.

> @@ -353,6 +353,9 @@ static int string_list_df_name_compare(const char *one, const char *two)
>  
>  struct directory_versions {
>  	struct string_list versions;
> +	struct string_list offsets;

Looking below (and at the explanation in the commit message), this is a
mapping from full paths to integers casted to the pointer type.

> +	const char *last_directory;
> +	unsigned last_directory_len;

Is this just the last entry in "versions"?

>  static void write_tree(struct object_id *result_oid,
> @@ -409,12 +412,100 @@ static void record_entry_for_tree(struct directory_versions *dir_metadata,
>  		/* nothing to record */
>  		return;
>  
> +	/*
> +	 * Note: write_completed_directories() already added
> +	 * entries for directories to dir_metadata->versions,
> +	 * so no need to handle ci->filemask == 0 again.
> +	 */
> +	if (!ci->merged.clean && !ci->filemask)
> +		return;
> +
>  	basename = path + ci->merged.basename_offset;
>  	assert(strchr(basename, '/') == NULL);
>  	string_list_append(&dir_metadata->versions,
>  			   basename)->util = &ci->merged.result;
>  }

Conceptually, I can see how the algorithm below inserts directories, but
I don't understand the significance of "!ci->merged.clean" in the change
above.

> +static void write_completed_directories(struct merge_options *opt,
> +					const char *new_directory_name,
> +					struct directory_versions *info)
> +{
> +	const char *prev_dir;
> +	struct merged_info *dir_info = NULL;
> +	unsigned int offset;
> +	int wrote_a_new_tree = 0;
> +
> +	if (new_directory_name == info->last_directory)
> +		return;

Pointer equality is OK here presumably because of the string interning
of directory names.

I'm starting to think that it might be too difficult to keep track of
where strings are interned. Maybe there should be a data structure
containing all interned strings, and make the path a struct or something
like that (to clearly indicate that the string inside comes from the
interned string data structure).

> +	/*
> +	 * If we are just starting (last_directory is NULL), or last_directory
> +	 * is a prefix of the current directory, then we can just update
> +	 * last_directory and record the offset where we started this directory.
> +	 */
> +	if (info->last_directory == NULL ||
> +	    !strncmp(new_directory_name, info->last_directory,
> +		     info->last_directory_len)) {

Git has starts_with() for prefix checking. (May not be as optimized as
this one, though.)

> +		uintptr_t offset = info->versions.nr;
> +
> +		info->last_directory = new_directory_name;
> +		info->last_directory_len = strlen(info->last_directory);
> +		string_list_append(&info->offsets,
> +				   info->last_directory)->util = (void*)offset;
> +		return;
> +	}

Due to the way this is sorted, there might be a jump of 2 or more
directories. (The commit message also provides such an example - from ""
to "src/moduleB", without going through "src".)

> +	/*
> +	 * At this point, ne (next entry) is within a different directory
> +	 * than the last entry, so we need to create a tree object for all
> +	 * the entries in info->versions that are under info->last_directory.
> +	 */

There's no "ne" below.

> +	dir_info = strmap_get(&opt->priv->paths, info->last_directory);
> +	assert(dir_info);
> +	offset = (uintptr_t)info->offsets.items[info->offsets.nr-1].util;
> +	if (offset == info->versions.nr) {
> +		dir_info->is_null = 1;
> +	} else {
> +		dir_info->result.mode = S_IFDIR;
> +		write_tree(&dir_info->result.oid, &info->versions, offset);
> +		wrote_a_new_tree = 1;
> +	}

I was trying to figure out the cases in which offset would be
info->versions.nr - if such a case existed, and if yes, would it be
incorrect to skip creating such a tree because presumably this offset
exists in info->offsets for a reason. Do you know in which situation
offset would equal info->versions.nr?

> +	/*
> +	 * We've now used several entries from info->versions and one entry
> +	 * from info->offsets, so we get rid of those values.
> +	 */
> +	info->offsets.nr--;
> +	info->versions.nr = offset;

OK.

> +	/*
> +	 * Now we've got an OID for last_directory in dir_info.  We need to
> +	 * add it to info->versions for it to be part of the computation of
> +	 * its parent directories' OID.  But first, we have to find out what
> +	 * its' parent name was and whether that matches the previous
> +	 * info->offsets or we need to set up a new one.
> +	 */
> +	prev_dir = info->offsets.nr == 0 ? NULL :
> +		   info->offsets.items[info->offsets.nr-1].string;
> +	if (new_directory_name != prev_dir) {
> +		uintptr_t c = info->versions.nr;
> +		string_list_append(&info->offsets,
> +				   new_directory_name)->util = (void*)c;
> +	}

Because of the possible jump of 2 or more directories that I mentioned
earlier, there may be gaps in the offsets. So it makes sense that we
sometimes need to insert an intermediate one.

I wonder if the code would be clearer if we had explicit "begin tree"
and "end tree" steps just like in list-objects-filter.c (LOFS_BEGIN_TREE
and LOFS_END_TREE). Here we have "end tree" (because of the way the
entries were sorted) but not "begin tree". If we had "begin tree", we
probably would be able to create the necessary offsets in a loop at that
stage, and the reasoning about the contents of the offsets would not be
so complicated.

If we really only want one side (i.e. you don't want to introduce a
synthetic entry just to mark the end or the beginning), then my personal
experience is that having the "begin" side is easier to understand, as
the state is more natural and easier to reason about. (Unlike here,
where there could be gaps in the offsets and the reader has to
understand that the gaps will be filled just in time.) But that may just
be my own experience.

> +	/*
> +	 * Okay, finally record OID for last_directory in info->versions,
> +	 * and update last_directory.
> +	 */
> +	if (wrote_a_new_tree) {
> +		const char *dir_name = strrchr(info->last_directory, '/');
> +		dir_name = dir_name ? dir_name+1 : info->last_directory;
> +		string_list_append(&info->versions, dir_name)->util = dir_info;
> +	}
> +	info->last_directory = new_directory_name;
> +	info->last_directory_len = strlen(info->last_directory);
> +}

OK - several entries in info->versions were deleted earlier (through
info->versions.nr = offset), and we add one here to represent the tree
containing all those deleted versions.

> @@ -541,22 +635,27 @@ static void process_entries(struct merge_options *opt,
>  		 */
>  		struct conflict_info *ci = entry->util;
>  
> +		write_completed_directories(opt, ci->merged.directory_name,
> +					    &dir_metadata);
>  		if (ci->merged.clean)
>  			record_entry_for_tree(&dir_metadata, path, ci);
>  		else
>  			process_entry(opt, path, ci, &dir_metadata);
>  	}

Trying to make sense of this: we pass in the directory name of the
current entry so that if the last directory is *not* a prefix of the
current directory (so we either went up a directory or went sideways),
then we write a tree (unless offset == info->versions.nr, which as I
stated above, I still don't fully understand - I thought we would always
have to write a tree). So maybe the name of the function should be
"write_completed_directory" (and document it as "write a tree if
???"), since we write at most one.

In this kind of algorithm (where intermediate accumulated results are
being written), there needs to be a last write after the loop that
writes whatever's left in the accumulation buffer. I do see it below
("write_tree"), so that's good.

> -	/*
> -	 * TODO: We can't actually write a tree yet, because dir_metadata just
> -	 * contains all basenames of all files throughout the tree with their
> -	 * mode and hash.  Not only is that a nonsensical tree, it will have
> -	 * lots of duplicates for paths such as "Makefile" or ".gitignore".
> -	 */
> -	die("Not yet implemented; need to process subtrees separately");
> +	if (dir_metadata.offsets.nr != 1 ||
> +	    (uintptr_t)dir_metadata.offsets.items[0].util != 0) {
> +		printf("dir_metadata.offsets.nr = %d (should be 1)\n",
> +		       dir_metadata.offsets.nr);
> +		printf("dir_metadata.offsets.items[0].util = %u (should be 0)\n",
> +		       (unsigned)(uintptr_t)dir_metadata.offsets.items[0].util);
> +		fflush(stdout);
> +		BUG("dir_metadata accounting completely off; shouldn't happen");
> +	}

Sanity check, OK.

[snip rest]

In summary, I think that the code would be easier to understand (for
everyone) if there were both BEGIN_TREE and END_TREE entries. And for me
personally, once the offset == info->versions.nr part is clarified
(perhaps there is something obvious that I'm missing).
Elijah Newren Nov. 12, 2020, 10:30 p.m. UTC | #2
On Thu, Nov 12, 2020 at 12:15 PM Jonathan Tan <jonathantanmy@google.com> wrote:
>
> Firstly, from [1]:
>
> > Thanks for the reviews!  I was hoping to see some comments on patch
> > 15, as it's possibly the gnarliest.  It's a relatively straightforward
> > algorithm, just lots of bookkeeping.
>
> I was planning to send this out yesterday, but couldn't finish it. :-P
> Indeed, a lot of things to think about.
>
> [1] https://lore.kernel.org/git/CABPp-BFgQX6Ash03x7z+RfE3ytbw3x0DzDSBrGddgMr_soODoA@mail.gmail.com/
>
> [snip commit message]
>
> Thanks for the thorough explanation.
>
> > @@ -353,6 +353,9 @@ static int string_list_df_name_compare(const char *one, const char *two)
> >
> >  struct directory_versions {
> >       struct string_list versions;
> > +     struct string_list offsets;
>
> Looking below (and at the explanation in the commit message), this is a
> mapping from full paths to integers casted to the pointer type.
>
> > +     const char *last_directory;
> > +     unsigned last_directory_len;
>
> Is this just the last entry in "versions"?

No, it's a simple cache of strlen(info->last_directory), so I don't
have to recompute that length multiple times.  Perhaps I should add a
comment to that effect.

> >  static void write_tree(struct object_id *result_oid,
> > @@ -409,12 +412,100 @@ static void record_entry_for_tree(struct directory_versions *dir_metadata,
> >               /* nothing to record */
> >               return;
> >
> > +     /*
> > +      * Note: write_completed_directories() already added
> > +      * entries for directories to dir_metadata->versions,
> > +      * so no need to handle ci->filemask == 0 again.
> > +      */
> > +     if (!ci->merged.clean && !ci->filemask)
> > +             return;
> > +
> >       basename = path + ci->merged.basename_offset;
> >       assert(strchr(basename, '/') == NULL);
> >       string_list_append(&dir_metadata->versions,
> >                          basename)->util = &ci->merged.result;
> >  }
>
> Conceptually, I can see how the algorithm below inserts directories, but
> I don't understand the significance of "!ci->merged.clean" in the change
> above.

Checking ci->filemask is likely an out-of-bounds memory read if
ci->merged.clean is true.  (ci may point to something that was
allocated with the size of a merge_info or a conflict_info.)  Perhaps
I could extend the comment to say that conflicted directories, i.e.
paths that are unclean with ci->filemask == 0 can be skipped because
they were already handled.

> > +static void write_completed_directories(struct merge_options *opt,
> > +                                     const char *new_directory_name,
> > +                                     struct directory_versions *info)
> > +{
> > +     const char *prev_dir;
> > +     struct merged_info *dir_info = NULL;
> > +     unsigned int offset;
> > +     int wrote_a_new_tree = 0;
> > +
> > +     if (new_directory_name == info->last_directory)
> > +             return;
>
> Pointer equality is OK here presumably because of the string interning
> of directory names.

Yes, precisely.

> I'm starting to think that it might be too difficult to keep track of
> where strings are interned. Maybe there should be a data structure
> containing all interned strings, and make the path a struct or something
> like that (to clearly indicate that the string inside comes from the
> interned string data structure).

Good news: Interned strings already are stored as the keys of our
primary data structure -- the strmap known as opt->priv->paths.  All
relative paths from the root of the repository that are relevant to
the merge at all -- both for files and directories -- are interned by
collect_merge_info() inside that "paths" strmap.

I guess I should note that it eventually gets _slightly_ more
complicated.  Due to renames and directory renames, I might need to
remove a path from opt->priv->paths.  In such a case there will be an
auxiliary string_list named "paths_to_free" that stores the interned
strings which are no longer part of opt->priv->paths.

> > +     /*
> > +      * If we are just starting (last_directory is NULL), or last_directory
> > +      * is a prefix of the current directory, then we can just update
> > +      * last_directory and record the offset where we started this directory.
> > +      */
> > +     if (info->last_directory == NULL ||
> > +         !strncmp(new_directory_name, info->last_directory,
> > +                  info->last_directory_len)) {
>
> Git has starts_with() for prefix checking. (May not be as optimized as
> this one, though.)
>
> > +             uintptr_t offset = info->versions.nr;
> > +
> > +             info->last_directory = new_directory_name;
> > +             info->last_directory_len = strlen(info->last_directory);
> > +             string_list_append(&info->offsets,
> > +                                info->last_directory)->util = (void*)offset;
> > +             return;
> > +     }
>
> Due to the way this is sorted, there might be a jump of 2 or more
> directories. (The commit message also provides such an example - from ""
> to "src/moduleB", without going through "src".)
>
> > +     /*
> > +      * At this point, ne (next entry) is within a different directory
> > +      * than the last entry, so we need to create a tree object for all
> > +      * the entries in info->versions that are under info->last_directory.
> > +      */
>
> There's no "ne" below.

Oops, that code has been heavily refactored since that comment.
Something like this would be more up-to-date:
    /*
     * The next entry will be within new_directory_name.  Since at this
     * point we know that new_directory_name is within a different
     * directory than info->last_directory, we have all entries for
     * info->last_directory in info->versions and we need to create a
     * tree object for them.
     */

> > +     dir_info = strmap_get(&opt->priv->paths, info->last_directory);
> > +     assert(dir_info);
> > +     offset = (uintptr_t)info->offsets.items[info->offsets.nr-1].util;
> > +     if (offset == info->versions.nr) {
> > +             dir_info->is_null = 1;
> > +     } else {
> > +             dir_info->result.mode = S_IFDIR;
> > +             write_tree(&dir_info->result.oid, &info->versions, offset);
> > +             wrote_a_new_tree = 1;
> > +     }
>
> I was trying to figure out the cases in which offset would be
> info->versions.nr - if such a case existed, and if yes, would it be
> incorrect to skip creating such a tree because presumably this offset
> exists in info->offsets for a reason. Do you know in which situation
> offset would equal info->versions.nr?

Yes, it is possible that all files within the directory become empty
as a result of merging[1], and in such cases this line of logic will
trigger (note that record_entry_for_tree(), which is what adds more
items to info->versions, returns early if ci->merged.is_null).  We do
not want to write out an empty tree for the directory or record the
tree's hash for its parent directory, we simply want to omit it
entirely.  Omitting it entirely is handled by the line
"dir_info->is_null = 1".

[1] The simplest example is when one side doesn't touch anything
within a directory but the other side deletes the whole directory.
Files can also disappear in a merge for other reasons, such as being
deleted on both sides, or being renamed.  If _all_ files within the
directory are removed by the merge logic, the directory has no
entries.

>
> > +     /*
> > +      * We've now used several entries from info->versions and one entry
> > +      * from info->offsets, so we get rid of those values.
> > +      */
> > +     info->offsets.nr--;
> > +     info->versions.nr = offset;
>
> OK.
>
> > +     /*
> > +      * Now we've got an OID for last_directory in dir_info.  We need to
> > +      * add it to info->versions for it to be part of the computation of
> > +      * its parent directories' OID.  But first, we have to find out what
> > +      * its' parent name was and whether that matches the previous
> > +      * info->offsets or we need to set up a new one.
> > +      */
> > +     prev_dir = info->offsets.nr == 0 ? NULL :
> > +                info->offsets.items[info->offsets.nr-1].string;
> > +     if (new_directory_name != prev_dir) {
> > +             uintptr_t c = info->versions.nr;
> > +             string_list_append(&info->offsets,
> > +                                new_directory_name)->util = (void*)c;
> > +     }
>
> Because of the possible jump of 2 or more directories that I mentioned
> earlier, there may be gaps in the offsets. So it makes sense that we
> sometimes need to insert an intermediate one.
>
> I wonder if the code would be clearer if we had explicit "begin tree"
> and "end tree" steps just like in list-objects-filter.c (LOFS_BEGIN_TREE
> and LOFS_END_TREE). Here we have "end tree" (because of the way the
> entries were sorted) but not "begin tree". If we had "begin tree", we
> probably would be able to create the necessary offsets in a loop at that
> stage, and the reasoning about the contents of the offsets would not be
> so complicated.
>
> If we really only want one side (i.e. you don't want to introduce a
> synthetic entry just to mark the end or the beginning), then my personal
> experience is that having the "begin" side is easier to understand, as
> the state is more natural and easier to reason about. (Unlike here,
> where there could be gaps in the offsets and the reader has to
> understand that the gaps will be filled just in time.) But that may just
> be my own experience.

Interesting, I'll take a look into it.

>
> > +     /*
> > +      * Okay, finally record OID for last_directory in info->versions,
> > +      * and update last_directory.
> > +      */
> > +     if (wrote_a_new_tree) {
> > +             const char *dir_name = strrchr(info->last_directory, '/');
> > +             dir_name = dir_name ? dir_name+1 : info->last_directory;
> > +             string_list_append(&info->versions, dir_name)->util = dir_info;
> > +     }
> > +     info->last_directory = new_directory_name;
> > +     info->last_directory_len = strlen(info->last_directory);
> > +}
>
> OK - several entries in info->versions were deleted earlier (through
> info->versions.nr = offset), and we add one here to represent the tree
> containing all those deleted versions.
>
> > @@ -541,22 +635,27 @@ static void process_entries(struct merge_options *opt,
> >                */
> >               struct conflict_info *ci = entry->util;
> >
> > +             write_completed_directories(opt, ci->merged.directory_name,
> > +                                         &dir_metadata);
> >               if (ci->merged.clean)
> >                       record_entry_for_tree(&dir_metadata, path, ci);
> >               else
> >                       process_entry(opt, path, ci, &dir_metadata);
> >       }
>
> Trying to make sense of this: we pass in the directory name of the
> current entry so that if the last directory is *not* a prefix of the
> current directory (so we either went up a directory or went sideways),
> then we write a tree (unless offset == info->versions.nr, which as I
> stated above, I still don't fully understand - I thought we would always
> have to write a tree). So maybe the name of the function should be
> "write_completed_directory" (and document it as "write a tree if
> ???"), since we write at most one.

Yeah, write_completed_directory() would be better.  And just to
reiterate on the offset == info->versions.nr thing, we do not want to
write a tree if it turns out that the merged result of all files
within the directory is to delete them all.

> In this kind of algorithm (where intermediate accumulated results are
> being written), there needs to be a last write after the loop that
> writes whatever's left in the accumulation buffer. I do see it below
> ("write_tree"), so that's good.
>
> > -     /*
> > -      * TODO: We can't actually write a tree yet, because dir_metadata just
> > -      * contains all basenames of all files throughout the tree with their
> > -      * mode and hash.  Not only is that a nonsensical tree, it will have
> > -      * lots of duplicates for paths such as "Makefile" or ".gitignore".
> > -      */
> > -     die("Not yet implemented; need to process subtrees separately");
> > +     if (dir_metadata.offsets.nr != 1 ||
> > +         (uintptr_t)dir_metadata.offsets.items[0].util != 0) {
> > +             printf("dir_metadata.offsets.nr = %d (should be 1)\n",
> > +                    dir_metadata.offsets.nr);
> > +             printf("dir_metadata.offsets.items[0].util = %u (should be 0)\n",
> > +                    (unsigned)(uintptr_t)dir_metadata.offsets.items[0].util);
> > +             fflush(stdout);
> > +             BUG("dir_metadata accounting completely off; shouldn't happen");
> > +     }
>
> Sanity check, OK.
>
> [snip rest]
>
> In summary, I think that the code would be easier to understand (for
> everyone) if there were both BEGIN_TREE and END_TREE entries. And for me
> personally, once the offset == info->versions.nr part is clarified
> (perhaps there is something obvious that I'm missing).

I'm not sure how the BEGIN_TREE/END_TREE entries would look, but I'll
investigate.
Elijah Newren Nov. 24, 2020, 8:19 p.m. UTC | #3
On Thu, Nov 12, 2020 at 2:30 PM Elijah Newren <newren@gmail.com> wrote:
>
> On Thu, Nov 12, 2020 at 12:15 PM Jonathan Tan <jonathantanmy@google.com> wrote:
> >
> > > +     /*
> > > +      * Now we've got an OID for last_directory in dir_info.  We need to
> > > +      * add it to info->versions for it to be part of the computation of
> > > +      * its parent directories' OID.  But first, we have to find out what
> > > +      * its' parent name was and whether that matches the previous
> > > +      * info->offsets or we need to set up a new one.
> > > +      */
> > > +     prev_dir = info->offsets.nr == 0 ? NULL :
> > > +                info->offsets.items[info->offsets.nr-1].string;
> > > +     if (new_directory_name != prev_dir) {
> > > +             uintptr_t c = info->versions.nr;
> > > +             string_list_append(&info->offsets,
> > > +                                new_directory_name)->util = (void*)c;
> > > +     }
> >
> > Because of the possible jump of 2 or more directories that I mentioned
> > earlier, there may be gaps in the offsets. So it makes sense that we
> > sometimes need to insert an intermediate one.
> >
> > I wonder if the code would be clearer if we had explicit "begin tree"
> > and "end tree" steps just like in list-objects-filter.c (LOFS_BEGIN_TREE
> > and LOFS_END_TREE). Here we have "end tree" (because of the way the
> > entries were sorted) but not "begin tree". If we had "begin tree", we
> > probably would be able to create the necessary offsets in a loop at that
> > stage, and the reasoning about the contents of the offsets would not be
> > so complicated.
> >
> > If we really only want one side (i.e. you don't want to introduce a
> > synthetic entry just to mark the end or the beginning), then my personal
> > experience is that having the "begin" side is easier to understand, as
> > the state is more natural and easier to reason about. (Unlike here,
> > where there could be gaps in the offsets and the reader has to
> > understand that the gaps will be filled just in time.) But that may just
> > be my own experience.
>
> Interesting, I'll take a look into it.
>

So, I've been going through making all the changes you and Derrick
suggested or highlighted...but I don't see how to tackle this one.
Perhaps I'm missing something.

Using your example of LOFS_BEGIN_TREE and LOFS_END_TREE from
list-objects-filter.c, I note that you handle it as part of
traverse_trees(), and thus you have a very natural "I'm going to
process this tree" point and "I'm done processing this tree" point.
There is no equivalent mapping to merge-ort that I can figure out.

merge-ort does use traverse_trees() in collect_merge_info(), and fills
opt->priv->paths with all full pathnames (both files and directories)
found in any of the three trees.  But I cannot process
files/directories at that time; rename detection needs
traverse_trees() to be finished to have all paths so far.  Further,
the list of pathnames from traverse_trees is not necessarily complete;
additional paths could be added by any of
  * Directory/file conflicts (need to move the file to a different location)
  * Directory/submodule conflicts (need to move something to a
different location)
  * Add/add conflicts of files of different types (e.g.
symlink/regular file; can't just content merge them with conflict
markers)
  * Directory rename detection (can move new files or even directories
on one side of history into a new directory on other side)

Thus, after traverse_trees() ends, my rename detection stuff can add
paths (including new directories), then process_entries() can add
paths -- and remove some when the resolution is to delete.  And the
code here in question runs as part of the process_entries() loop.

Now, we'd still be able to create synthetic BEGIN_TREE markers if we
operated in lexicographic ordering, but process_entries() *must*
operate in _reverse_ lexicographic ordering because:
  1) subtrees need to be written out before trees are; hashes of those
subtrees are used in the parent tree
  2) it's the only sane way to handle directory/file conflicts; I need
to know if all entries under the directory resolve to nothing; if not,
the directory is still in the way when it comes time to process the
file.

Granted, I could do some tricky calculations based on the reverse
lexicographic ordering of fullpaths (and their containing directories)
to figure out where trees begin and end -- but that takes us to
exactly what I *did* do.  It was precisely this step that you thought
should be made simpler, but I'm not seeing how to avoid it.

For now, I'll keep the code as-is, but add more comments to both the
data structure and the code.  If I've missed something about how I
could make use of your BEGIN_TREE idea, let me know and I'll look at
it again.
Jonathan Tan Nov. 25, 2020, 2:07 a.m. UTC | #4
> For now, I'll keep the code as-is, but add more comments to both the
> data structure and the code.  If I've missed something about how I
> could make use of your BEGIN_TREE idea, let me know and I'll look at
> it again.

In collect_merge_info_callback(), you call setup_path_info() to add to
opt->priv->paths, then call traverse_trees() (which recursively calls
collect_merge_info_callback()). I was thinking that in addition to doing
that, you could call setup_path_info() a second time, but teach it to
add a synthetic path (maybe have a special bit in struct conflict_info
or something like that) that indicates "this is the end of the tree".
Subsequent code can notice that bit and not do the normal processing,
but instead do end-of-tree processing.

Having said that, maybe it will turn out that your additional comments
in v3 will be clearer, and we wouldn't need the synthetic entry.
Elijah Newren Nov. 26, 2020, 6:13 p.m. UTC | #5
On Tue, Nov 24, 2020 at 6:07 PM Jonathan Tan <jonathantanmy@google.com> wrote:
>
> > For now, I'll keep the code as-is, but add more comments to both the
> > data structure and the code.  If I've missed something about how I
> > could make use of your BEGIN_TREE idea, let me know and I'll look at
> > it again.
>
> In collect_merge_info_callback(), you call setup_path_info() to add to
> opt->priv->paths, then call traverse_trees() (which recursively calls
> collect_merge_info_callback()). I was thinking that in addition to doing
> that, you could call setup_path_info() a second time, but teach it to
> add a synthetic path (maybe have a special bit in struct conflict_info
> or something like that) that indicates "this is the end of the tree".
> Subsequent code can notice that bit and not do the normal processing,
> but instead do end-of-tree processing.

So, I realized that I already had end-of-tree markers -- the
directories themselves.  But due to some other weirdness in how I had
built up the processing, the existence of those markers was both
obscured, and deliberately side-stepped.  So, I did a little
restructuring so we can use these as actual end-of-tree markers more
directly.

> Having said that, maybe it will turn out that your additional comments
> in v3 will be clearer, and we wouldn't need the synthetic entry.

Hopefully it's clearer now, but the entries aren't synthetic.  My big
opt->priv->paths strmap with all full relative paths contained all
files _and_ directories already, and now I just use the directory
markers more directly.  Hopefully the extra comments help too.
Jonathan Tan Nov. 30, 2020, 6:41 p.m. UTC | #6
> On Tue, Nov 24, 2020 at 6:07 PM Jonathan Tan <jonathantanmy@google.com> wrote:
> >
> > > For now, I'll keep the code as-is, but add more comments to both the
> > > data structure and the code.  If I've missed something about how I
> > > could make use of your BEGIN_TREE idea, let me know and I'll look at
> > > it again.
> >
> > In collect_merge_info_callback(), you call setup_path_info() to add to
> > opt->priv->paths, then call traverse_trees() (which recursively calls
> > collect_merge_info_callback()). I was thinking that in addition to doing
> > that, you could call setup_path_info() a second time, but teach it to
> > add a synthetic path (maybe have a special bit in struct conflict_info
> > or something like that) that indicates "this is the end of the tree".
> > Subsequent code can notice that bit and not do the normal processing,
> > but instead do end-of-tree processing.
> 
> So, I realized that I already had end-of-tree markers -- the
> directories themselves.  But due to some other weirdness in how I had
> built up the processing, the existence of those markers was both
> obscured, and deliberately side-stepped.  So, I did a little
> restructuring so we can use these as actual end-of-tree markers more
> directly.

Ah sorry...what I meant was to have both begin-of-tree and end-of-tree
elements in the path list, so one of them is real and the other
synthetic. Right now you have an end-of-tree real path in the list of
paths, yes.

> > Having said that, maybe it will turn out that your additional comments
> > in v3 will be clearer, and we wouldn't need the synthetic entry.
> 
> Hopefully it's clearer now, but the entries aren't synthetic.  My big
> opt->priv->paths strmap with all full relative paths contained all
> files _and_ directories already, and now I just use the directory
> markers more directly.  Hopefully the extra comments help too.

OK - I see that you have a new version [1] and hopefully I'll be able to
take a look soon.

[1] https://lore.kernel.org/git/pull.923.git.git.1606635803.gitgitgadget@gmail.com/
diff mbox series

Patch

diff --git a/merge-ort.c b/merge-ort.c
index c560dd1634..20b7c0d8b0 100644
--- a/merge-ort.c
+++ b/merge-ort.c
@@ -353,6 +353,9 @@  static int string_list_df_name_compare(const char *one, const char *two)
 
 struct directory_versions {
 	struct string_list versions;
+	struct string_list offsets;
+	const char *last_directory;
+	unsigned last_directory_len;
 };
 
 static void write_tree(struct object_id *result_oid,
@@ -409,12 +412,100 @@  static void record_entry_for_tree(struct directory_versions *dir_metadata,
 		/* nothing to record */
 		return;
 
+	/*
+	 * Note: write_completed_directories() already added
+	 * entries for directories to dir_metadata->versions,
+	 * so no need to handle ci->filemask == 0 again.
+	 */
+	if (!ci->merged.clean && !ci->filemask)
+		return;
+
 	basename = path + ci->merged.basename_offset;
 	assert(strchr(basename, '/') == NULL);
 	string_list_append(&dir_metadata->versions,
 			   basename)->util = &ci->merged.result;
 }
 
+static void write_completed_directories(struct merge_options *opt,
+					const char *new_directory_name,
+					struct directory_versions *info)
+{
+	const char *prev_dir;
+	struct merged_info *dir_info = NULL;
+	unsigned int offset;
+	int wrote_a_new_tree = 0;
+
+	if (new_directory_name == info->last_directory)
+		return;
+
+	/*
+	 * If we are just starting (last_directory is NULL), or last_directory
+	 * is a prefix of the current directory, then we can just update
+	 * last_directory and record the offset where we started this directory.
+	 */
+	if (info->last_directory == NULL ||
+	    !strncmp(new_directory_name, info->last_directory,
+		     info->last_directory_len)) {
+		uintptr_t offset = info->versions.nr;
+
+		info->last_directory = new_directory_name;
+		info->last_directory_len = strlen(info->last_directory);
+		string_list_append(&info->offsets,
+				   info->last_directory)->util = (void*)offset;
+		return;
+	}
+
+	/*
+	 * At this point, ne (next entry) is within a different directory
+	 * than the last entry, so we need to create a tree object for all
+	 * the entries in info->versions that are under info->last_directory.
+	 */
+	dir_info = strmap_get(&opt->priv->paths, info->last_directory);
+	assert(dir_info);
+	offset = (uintptr_t)info->offsets.items[info->offsets.nr-1].util;
+	if (offset == info->versions.nr) {
+		dir_info->is_null = 1;
+	} else {
+		dir_info->result.mode = S_IFDIR;
+		write_tree(&dir_info->result.oid, &info->versions, offset);
+		wrote_a_new_tree = 1;
+	}
+
+	/*
+	 * We've now used several entries from info->versions and one entry
+	 * from info->offsets, so we get rid of those values.
+	 */
+	info->offsets.nr--;
+	info->versions.nr = offset;
+
+	/*
+	 * Now we've got an OID for last_directory in dir_info.  We need to
+	 * add it to info->versions for it to be part of the computation of
+	 * its parent directories' OID.  But first, we have to find out what
+	 * its' parent name was and whether that matches the previous
+	 * info->offsets or we need to set up a new one.
+	 */
+	prev_dir = info->offsets.nr == 0 ? NULL :
+		   info->offsets.items[info->offsets.nr-1].string;
+	if (new_directory_name != prev_dir) {
+		uintptr_t c = info->versions.nr;
+		string_list_append(&info->offsets,
+				   new_directory_name)->util = (void*)c;
+	}
+
+	/*
+	 * Okay, finally record OID for last_directory in info->versions,
+	 * and update last_directory.
+	 */
+	if (wrote_a_new_tree) {
+		const char *dir_name = strrchr(info->last_directory, '/');
+		dir_name = dir_name ? dir_name+1 : info->last_directory;
+		string_list_append(&info->versions, dir_name)->util = dir_info;
+	}
+	info->last_directory = new_directory_name;
+	info->last_directory_len = strlen(info->last_directory);
+}
+
 /* Per entry merge function */
 static void process_entry(struct merge_options *opt,
 			  const char *path,
@@ -528,6 +619,9 @@  static void process_entries(struct merge_options *opt,
 
 	/* other setup */
 	string_list_init(&dir_metadata.versions, 0);
+	string_list_init(&dir_metadata.offsets, 0);
+	dir_metadata.last_directory = NULL;
+	dir_metadata.last_directory_len = 0;
 
 	/*
 	 * Iterate over the items in reverse order, so we can handle paths
@@ -541,22 +635,27 @@  static void process_entries(struct merge_options *opt,
 		 */
 		struct conflict_info *ci = entry->util;
 
+		write_completed_directories(opt, ci->merged.directory_name,
+					    &dir_metadata);
 		if (ci->merged.clean)
 			record_entry_for_tree(&dir_metadata, path, ci);
 		else
 			process_entry(opt, path, ci, &dir_metadata);
 	}
 
-	/*
-	 * TODO: We can't actually write a tree yet, because dir_metadata just
-	 * contains all basenames of all files throughout the tree with their
-	 * mode and hash.  Not only is that a nonsensical tree, it will have
-	 * lots of duplicates for paths such as "Makefile" or ".gitignore".
-	 */
-	die("Not yet implemented; need to process subtrees separately");
+	if (dir_metadata.offsets.nr != 1 ||
+	    (uintptr_t)dir_metadata.offsets.items[0].util != 0) {
+		printf("dir_metadata.offsets.nr = %d (should be 1)\n",
+		       dir_metadata.offsets.nr);
+		printf("dir_metadata.offsets.items[0].util = %u (should be 0)\n",
+		       (unsigned)(uintptr_t)dir_metadata.offsets.items[0].util);
+		fflush(stdout);
+		BUG("dir_metadata accounting completely off; shouldn't happen");
+	}
 	write_tree(result_oid, &dir_metadata.versions, 0);
 	string_list_clear(&plist, 0);
 	string_list_clear(&dir_metadata.versions, 0);
+	string_list_clear(&dir_metadata.offsets, 0);
 }
 
 void merge_switch_to_result(struct merge_options *opt,