Message ID | 01364bb020ee2836016ec0e8eafa2261fb7800ab.1641403655.git.gitgitgadget@gmail.com (mailing list archive) |
---|---|
State | New, archived |
Headers | show |
Series | RFC: Server side merges (no ref updating, no commit creating, no touching worktree or index) | expand |
On 05/01/2022 17:27, Elijah Newren via GitGitGadget wrote: > From: Elijah Newren <newren@gmail.com> > > Callers of `git merge-tree --real` might want an easy way to determine > which files conflicted. While they could potentially use the --messages > option and parse the resulting messages written to that file, those > messages are not meant to be machine readable. Provide a simpler > mechanism of having the user specify --unmerged-list=$FILENAME, and then s/unmerged-list/conflicted-list/ ATB, Ramsay Jones
On Wed, Jan 5, 2022 at 11:09 AM Ramsay Jones <ramsay@ramsayjones.plus.com> wrote: > > On 05/01/2022 17:27, Elijah Newren via GitGitGadget wrote: > > From: Elijah Newren <newren@gmail.com> > > > > Callers of `git merge-tree --real` might want an easy way to determine > > which files conflicted. While they could potentially use the --messages > > option and parse the resulting messages written to that file, those > > messages are not meant to be machine readable. Provide a simpler > > mechanism of having the user specify --unmerged-list=$FILENAME, and then > > s/unmerged-list/conflicted-list/ Indeed. I had noticed after v1 that I had a mixture of using both "unmerged" and "conflicted" to refer to the same thing, and tried to fix it by always using the latter term. Unfortunately, I missed updating this commit message. Thanks for pointing it out; will fix.
Hi Elijah, On Wed, 5 Jan 2022, Elijah Newren via GitGitGadget wrote: > From: Elijah Newren <newren@gmail.com> > > Callers of `git merge-tree --real` might want an easy way to determine > which files conflicted. While they could potentially use the --messages > option and parse the resulting messages written to that file, those > messages are not meant to be machine readable. Provide a simpler > mechanism of having the user specify --unmerged-list=$FILENAME, and then > write a NUL-separated list of unmerged filenames to the specified file. This patch does what the commit message says, and it looks quite plausible. However, in practice it seems that you need either a tree (if the merge succeeded) or the list of conflicted files (if the merge succeeded). So while it looks relatively clean from the implementation's point of view, the design itself could probably withstand a bit of consideration. As I hinted earlier (to be precise, in https://lore.kernel.org/git/nycvar.QRO.7.76.6.2201071602110.339@tvgsbejvaqbjf.bet/), I had the chance in December to work on the server-side, using `merge-ort` for a bit. In the following, I will talk about this a bit more than about this particular patch, but I think it is highly relevant (not a tangent). One of the things that became clear to me is that we really have an either/or situation here. Either the merge succeeds, and we _need_ that tree, or it fails, and we could not care less about the tree at all. In fact, if the merge fails, we completely ignore the tree, and it would be better if we would not even write out any Git objects in that case at all: even just writing the objects would be quite costly at the server-side scale. So my (somewhat hacky) patches for a proof-of-concept produced _either_ the hash of the tree on `stdout`, _or_ a header saying that there were conflicts followed by a NUL-separated list of file names. Mind you, I did not even get to the point of analyzing things even more deeply. My partner in crime and I only got to comparing the `merge-ort` way to the libgit2-based way, trying to get them to compare as much apples-to-apples as possible [*1*], and we found that even the time to spawn the Git process (~1-3ms, with all overhead counted in) is _quite_ noticeable, at server-side scale. Of course, the `merge-ort` performance was _really_ nice when doing anything remotely complex, then `merge-ort` really blew the libgit2-based merge out of the water. But that's not the common case. The common case are merges that involve very few modified files, a single merge base, and they don't conflict. And those can be processed by the libgit2-based method within a fraction of the time it takes to even only so much as spawn `git` (libgit2-based merges can complete in less than a fifth millisecond, that's at most a fifth of the time it takes to merely run `git merge-tree`). The difference between 0.2-0.5ms for libgit2-based merges on the one hand, and 1-3ms for `merge-ort`-based merges on the other hand, might not seem like much, but you have to multiply it by the times such a merge is performed on the server. Which is a _lot_. Way more often than I thought. In this particular instance, there is a silver-lining: the libgit2-based merge is not actually recursive. It is a three-way merge. Which means that we first have to determine a merge base. In our case, this is done by spawning a Git process anyway, so one of my ideas to move forward is to fold that merge-base logic into `git merge-tree`, too. Anyway, the short short is: whenever we can avoid unnecessary work, we should do so. In the context of this patch, I would say that we should avoid writing out a tree (and avoid printing its hash to `stdout`) if there are merge conflicts. And we should avoid writing (and later reading) a file, if we can get away with avoiding it. At least in the default case, that is. We still might need a flag to produce some more information about those merge conflicts. But even in that case, it would be better to have a list of file names with the three associated stages than to output the hash of a tree that contains conflicts (and tons of files _without_ conflicts). The UI needs to re-generate those conflicts anyway. And remember: a tree can contain millions of files even if there is but a single conflict. It makes more sense for `merge-tree --real` to output a concrete list of files that conflicted, rather than expecting the caller to discern between conflicts and non-conflicts by processing a tree object. Maybe you agree with this rationale and re-design the `--real` mode to try to avoid writing out files in the common case? About the form of the patch itself: I was tempted to go with the nitpicking spirit I see on the Git mailing list these days, especially about the shell script code in the test scripts. But then I realized that I find such nitpicking pretty unhelpful, myself. The code is good as-is, even if I would write it differently. It is clear, and it does exactly what it is supposed to do. Thank you, Dscho Footnote *1*: I did not _quite_ get to the point of comparing the `merge-ort` merges to the libgit2 ones, unfortunately. I was on my way to add code to respect `merge.renames = false` so that we could _truly_ compare the `merge-ort` merges to the libgit2 merges (we really will want to verify that the output is identical, before even considering to enable recursive merges on the server side, and then only after studying the time difference), and then had to take off due to the holidays. If you already have that need to be able to turn off rename-detection on your radar, even if only for a transitional period, I would be _so_ delighted.
On Fri, Jan 7, 2022 at 11:36 AM Johannes Schindelin <Johannes.Schindelin@gmx.de> wrote: > > Hi Elijah, > > On Wed, 5 Jan 2022, Elijah Newren via GitGitGadget wrote: > > > From: Elijah Newren <newren@gmail.com> > > > > Callers of `git merge-tree --real` might want an easy way to determine > > which files conflicted. While they could potentially use the --messages > > option and parse the resulting messages written to that file, those > > messages are not meant to be machine readable. Provide a simpler > > mechanism of having the user specify --unmerged-list=$FILENAME, and then > > write a NUL-separated list of unmerged filenames to the specified file. > > This patch does what the commit message says, and it looks quite > plausible. However, in practice it seems that you need either a tree (if > the merge succeeded) or the list of conflicted files (if the merge > succeeded). > > So while it looks relatively clean from the implementation's point of > view, the design itself could probably withstand a bit of consideration. > > As I hinted earlier (to be precise, in > https://lore.kernel.org/git/nycvar.QRO.7.76.6.2201071602110.339@tvgsbejvaqbjf.bet/), > I had the chance in December to work on the server-side, using `merge-ort` > for a bit. In the following, I will talk about this a bit more than about > this particular patch, but I think it is highly relevant (not a tangent). > > One of the things that became clear to me is that we really have an > either/or situation here. Either the merge succeeds, and we _need_ that > tree, or it fails, and we could not care less about the tree at all. > > In fact, if the merge fails, we completely ignore the tree, and it would > be better if we would not even write out any Git objects in that case at > all: even just writing the objects would be quite costly at the > server-side scale. > > So my (somewhat hacky) patches for a proof-of-concept produced _either_ > the hash of the tree on `stdout`, _or_ a header saying that there were > conflicts followed by a NUL-separated list of file names. Did you really check that it only produced one of these? If you were using ort, you wrote blob and tree objects to disk, even if you didn't print their hash on stdout. > Mind you, I did not even get to the point of analyzing things even more > deeply. My partner in crime and I only got to comparing the `merge-ort` > way to the libgit2-based way, trying to get them to compare as much > apples-to-apples as possible [*1*], and we found that even the time to > spawn the Git process (~1-3ms, with all overhead counted in) is _quite_ > noticeable, at server-side scale. 1-3ms? I thought it was a good bit more than that. > Of course, the `merge-ort` performance was _really_ nice when doing > anything remotely complex, then `merge-ort` really blew the libgit2-based > merge out of the water. But that's not the common case. The common case > are merges that involve very few modified files, a single merge base, and > they don't conflict. And those can be processed by the libgit2-based > method within a fraction of the time it takes to even only so much as > spawn `git` (libgit2-based merges can complete in less than a fifth > millisecond, that's at most a fifth of the time it takes to merely run > `git merge-tree`). > > The difference between 0.2-0.5ms for libgit2-based merges on the one hand, > and 1-3ms for `merge-ort`-based merges on the other hand, might not seem > like much, but you have to multiply it by the times such a merge is > performed on the server. Which is a _lot_. Way more often than I thought. Ah, I had been wondering a bit about process overhead. Having something that avoids that is definitely helpful. I tried to design ort such that its API is relatively clean and easy-to-use, and I carefully tested and fixed it until it ran memory-leak free. If process execution overhead is such a big problem, perhaps you could use these new functions via libgit.a instead of invoking a git process and get the best of both worlds? Unfortunately, in-process would run into problems with finding merge bases, because the revision walking machinery is definitely not leak-free -- an annoyance I had to deal with while attempting to clean up merge-ort since the code I was using always did the merge-base finding as well as the calls into merge-ort. (I hear Ævar has some not-yet-submitted patches that might help with memory leaks in the revision walking machinery.) > In this particular instance, there is a silver-lining: the libgit2-based > merge is not actually recursive. It is a three-way merge. Which means that > we first have to determine a merge base. In our case, this is done by > spawning a Git process anyway, so one of my ideas to move forward is to fold > that merge-base logic into `git merge-tree`, too. The merge-base logic is already part of merge-tree in my patches. What exactly did you do with merge-tree? Were you using the existing one and feeding it with a merge-base as an input, or did you write your own that was more like Christian's that expected a merge base? > Anyway, the short short is: whenever we can avoid unnecessary work, we > should do so. In the context of this patch, I would say that we should > avoid writing out a tree (and avoid printing its hash to `stdout`) if > there are merge conflicts. We can avoid printing its hash to `stdout`, but it'd take significant work to avoid writing the tree to an object store, and it cannot be done at the merge-tree level, it'd require replumbing some bits of merge-ort (and making some already complex codepaths a bit more complex, but that's a price I'm willing to pay for significant performance wins). Can I first suggest a simpler alternative that may give some performance wins despite keeping the object writes: We could make use of the tmp_objdir API that recently merged (see b3cecf49ea ("tmp-objdir: new API for creating temporary writable databases", 2021-12-06)). We could put that tmp-objdir on /dev/shm or other ramdisk, and if the merge is clean, migrate the contents into the real object store. Perhaps we could even pack those objects first if there are a large number of them, but If it's not clean, we can just discard the tmp-objdir. Also, as a further variant on this alternative... packing these objects before migrating if there are a sufficient number of them. Now, this is rather unlikely to be needed in general by merge-tree, because you only need to write new objects (thus representing files modified on both sides, or whatever leading trees are needed to reference the updated paths). However, it might matter for big enough repos with large enough numbers of changes on both sides. And it'd align nicely with my idea for server-side rebases (where implementing this is on my TODO list), because server-side rebases are much more likely to generate a large number of objects. But if you really want to learn about avoiding object writes... If you really want to only write tree and blob objects when the merge is clean, then as far as I can tell you have two options in regards to the blobs: (1) you'll need to keep all files from three-way content merges simultaneously in memory until you've determined if the result is clean, so that you can then write the merged contents out as blobs at the end. Or (2) doing all the three-way content merges and keeping track of whether the result for each is clean, and if they all turn out to be clean, then redo every single one of those three-way content merges afterwards so that you can actually write out the merged-result to disk that time. I think (2) would cost you a lot more work than you'd save, and I worry that (1) might risk using large amounts of memory in the big repositories if there are lots of changes on both sides. While that may be uncommon, I've seen folks try to merge things with lots of changes on both sides, and you do have the server side to worry about after all. There are similar issues with the fact that trees are written as they are processed as well. Those would also require re-running afterwards to re-generate the trees from the list of relevant-files-and-trees we operate on. However, if you are really curious about trying this out despite the fact that I think you might be causing more work than you're avoiding (or potentially requiring a lot more RAM), look for calls to write_tree() (there are precisely two in merge-ort.c, one for intermediate trees and one for the toplevel tree) and write_object_file() (there are precisely two in merge-ort.c, one within write_tree() for writing tree objects, and one in handle_content_merge() for writing blob objects). > And we should avoid writing (and later reading) > a file, if we can get away with avoiding it. Sure, we can do that. Christian had a suggestion that if the --conflicted-list didn't have an associated filename, then we just print those to stdout. Would that be to your liking? And would you prefer NUL-separated, or ls-tree style escaping of filenames? > At least in the default case, that is. We still might need a flag to > produce some more information about those merge conflicts. But even in > that case, it would be better to have a list of file names with the three > associated stages than to output the hash of a tree that contains > conflicts (and tons of files _without_ conflicts). The UI needs to > re-generate those conflicts anyway. And remember: a tree can contain > millions of files even if there is but a single conflict. It makes more > sense for `merge-tree --real` to output a concrete list of files that > conflicted, rather than expecting the caller to discern between conflicts > and non-conflicts by processing a tree object. ??? Where did I suggest that we discern between conflicts and non-conflicts by processing a tree object? That's not merely something with atrocious performance, it's also utterly crazy from a UI perspective, and is downright *impossible* to achieve in general anyway. (Failure to merge binary files. modify/delete conflicts. mode conflicts. file/directory conflicts. Various rename permutatations. There's all kinds of non-content conflicts that are not representable in a tree, and which folks will miss if they attempt to parse a tree and surmise what conflicts there were.) Whatever I wrote that might have suggested such a course of action needs some serious rewording or clarification; I consider attempting that to be a horrible idea. The tree exists so that if people want to get extra information (and presumably in a format similar to what they would find in their working directory if they had asked `git merge` to merge those same two branches on their laptop), then they can do so. For example, perhaps in the list of --conflicted-files they notice a file of interest and want to ask, "What's found in this particular file?". They can use the tree together with the filename to get that kind of info. > Maybe you agree with this rationale and re-design the `--real` mode to try > to avoid writing out files in the common case? I'm totally open to something like having --conflicted-list be given without a filename (or maybe a '-') and write to stdout instead. I'm amenable to various tradeoffs to improve performance, but I'm worried that "not-writing-tree-and-blob object files" sounds like a goal that might hurt performance rather than help it (either worse performance in general due to the need to do things twice when the merge is clean, or else a peak memory usage that is unmanageable and causes problems). Perhaps there's a smarter solution I'm not seeing, or my worries about maximum memory usage are overblown. However, if you still want to pursue such core changes, I would also like to see more performance measuring work to justify them -- especially since in the common case merge-ort won't recurse into most directories and will end up only writing a few object files (since only new blob or tree objects need to be written). In particular, I think we'd need to do the following first: * Do a real comparison; libgit2 + the separate find-merge-base git process, vs. the single `git merge-tree --real` call that handles both. Excluding the merge-base computation makes it apples to oranges in my opinion. * While measuring the performance of the above, specifically use trace2 to measure time spent in write_loose_object() in object-file.c. See how much time object writing actually takes vs. everything else. * Try the tmp-objdir changes I suggested (examples of using the API can be found by searching for "tmp_objdir" at https://lore.kernel.org/git/d57ae218cf9eaee0b66db299ee1bba9b488b69b1.1640907369.git.gitgitgadget@gmail.com/ and https://lore.kernel.org/git/e1747ce00af7ab3170a69955b07d995d5321d6f3.1637020263.git.gitgitgadget@gmail.com/), and then re-measure the performance particular in respect to write_loose_object() as a percentage of overall time. * Get some kind of measure of the maximum number of three-way content merges needed in merges and the overall size of holding all the results in memory simultaneously. If repositories have a million files and there are merges involving a high enough percentage of files changes on both sides, then that could certainly theoretically be quite large. I also think that doing something in-process (via linking against libgit.a?) rather than forking `git merge-tree --real` would probably net you _much_ bigger wins than avoiding writing these object files, though it'd be unsafe without memory leak fixes for all the merge-base computations. That, of course, might not be the only reason you're leery of such an approach. > About the form of the patch itself: I was tempted to go with the > nitpicking spirit I see on the Git mailing list these days, especially > about the shell script code in the test scripts. But then I realized that > I find such nitpicking pretty unhelpful, myself. The code is good as-is, > even if I would write it differently. It is clear, and it does exactly > what it is supposed to do. > > Thank you, > Dscho > > Footnote *1*: I did not _quite_ get to the point of comparing the > `merge-ort` merges to the libgit2 ones, unfortunately. I was on my way to > add code to respect `merge.renames = false` so that we could _truly_ > compare the `merge-ort` merges to the libgit2 merges (we really will want > to verify that the output is identical, before even considering to enable > recursive merges on the server side, and then only after studying the time > difference), and then had to take off due to the holidays. If you already > have that need to be able to turn off rename-detection on your radar, even > if only for a transitional period, I would be _so_ delighted. Well, I had no intention of submitting it (and still don't), but I did implement it a while back for folks who have needs for it in a transitional period. It's a pretty simple change. https://github.com/newren/git/commit/5330a9de77f56f20e546acc65c924fc783f092e6 https://github.com/newren/git/commit/2e7e8d79b0995d352558608b6308060fbc055fd1 :-)
Hi Dscho, One more thing I forgot to ask... On Fri, Jan 7, 2022 at 11:36 AM Johannes Schindelin <Johannes.Schindelin@gmx.de> wrote: > ... > Mind you, I did not even get to the point of analyzing things even more > deeply. My partner in crime and I only got to comparing the `merge-ort` > way to the libgit2-based way, trying to get them to compare as much > apples-to-apples as possible [*1*], and we found that even the time to > spawn the Git process (~1-3ms, with all overhead counted in) is _quite_ > noticeable, at server-side scale. > > Of course, the `merge-ort` performance was _really_ nice when doing > anything remotely complex, then `merge-ort` really blew the libgit2-based > merge out of the water. But that's not the common case. The common case > are merges that involve very few modified files, a single merge base, and > they don't conflict. And those can be processed by the libgit2-based > method within a fraction of the time it takes to even only so much as > spawn `git` (libgit2-based merges can complete in less than a fifth > millisecond, that's at most a fifth of the time it takes to merely run > `git merge-tree`). Out of curiosity, are you only doing merges, or are you also attempting server-side rebases in some fashion?
Hi Elijah, I meant to answer this mail much earlier, oh well. Sorry. I will only answer the still open questions below, clipping the quoted text to save every reader some time. On Fri, 7 Jan 2022, Elijah Newren wrote: > On Fri, Jan 7, 2022 at 11:36 AM Johannes Schindelin > <Johannes.Schindelin@gmx.de> wrote: > > > So my (somewhat hacky) patches for a proof-of-concept produced > > _either_ the hash of the tree on `stdout`, _or_ a header saying that > > there were conflicts followed by a NUL-separated list of file names. > > Did you really check that it only produced one of these? If you were > using ort, you wrote blob and tree objects to disk, even if you didn't > print their hash on stdout. D'oh. No, I had not checked that. > > Mind you, I did not even get to the point of analyzing things even more > > deeply. My partner in crime and I only got to comparing the `merge-ort` > > way to the libgit2-based way, trying to get them to compare as much > > apples-to-apples as possible [*1*], and we found that even the time to > > spawn the Git process (~1-3ms, with all overhead counted in) is _quite_ > > noticeable, at server-side scale. > > 1-3ms? I thought it was a good bit more than that. The absolute number really depends on a lot of factors, i.e. what is relevant is the relative difference between in-process vs spawning a new process. > If process execution overhead is such a big problem, perhaps you could > use these new functions via libgit.a instead of invoking a git process > and get the best of both worlds? For now, I solved this by combining multiple Git process invocations into a single one, as described here: > > In this particular instance, there is a silver-lining: the > > libgit2-based merge is not actually recursive. It is a three-way > > merge. Which means that we first have to determine a merge base. In > > our case, this is done by spawning a Git process anyway, so one of my > > ideas to move forward is to fold that merge-base logic into `git > > merge-tree`, too. > > The merge-base logic is already part of merge-tree in my patches. > What exactly did you do with merge-tree? Were you using the existing > one and feeding it with a merge-base as an input, or did you write > your own that was more like Christian's that expected a merge base? For an apples-to-apples comparison, I had to stay with the very same merge base logic as before, i.e. originally I added a new option to `merge-tree` that would take a merge base and then take a non-recursive path. In my current version, it is merely an option that even determines said merge base in the same process (and yes, it leaks memory, but that does not matter because the process is short-lived anyway). > > Anyway, the short short is: whenever we can avoid unnecessary work, we > > should do so. In the context of this patch, I would say that we should > > avoid writing out a tree (and avoid printing its hash to `stdout`) if > > there are merge conflicts. > > We can avoid printing its hash to `stdout`, but it'd take significant > work to avoid writing the tree to an object store, and it cannot be > done at the merge-tree level, it'd require replumbing some bits of > merge-ort (and making some already complex codepaths a bit more > complex, but that's a price I'm willing to pay for significant > performance wins). Yes, let's leave things as-are. It is not worth the trouble. > We could make use of the tmp_objdir API that recently merged (see > b3cecf49ea ("tmp-objdir: new API for creating temporary writable > databases", 2021-12-06)). We could put that tmp-objdir on /dev/shm or > other ramdisk, and if the merge is clean, migrate the contents into > the real object store. Perhaps we could even pack those objects first > if there are a large number of them, but If it's not clean, we can > just discard the tmp-objdir. Also, as a further variant on this > alternative... packing these objects before migrating if there are a > sufficient number of them. Now, this is rather unlikely to be needed > in general by merge-tree, because you only need to write new objects > (thus representing files modified on both sides, or whatever leading > trees are needed to reference the updated paths). However, it might > matter for big enough repos with large enough numbers of changes on > both sides. And it'd align nicely with my idea for server-side > rebases (where implementing this is on my TODO list), because > server-side rebases are much more likely to generate a large number of > objects. > > But if you really want to learn about avoiding object writes... > > If you really want to only write tree and blob objects when the merge > is clean, then as far as I can tell you have two options in regards to > the blobs: (1) you'll need to keep all files from three-way content > merges simultaneously in memory until you've determined if the result > is clean, so that you can then write the merged contents out as blobs > at the end. Or (2) doing all the three-way content merges and keeping > track of whether the result for each is clean, and if they all turn > out to be clean, then redo every single one of those three-way content > merges afterwards so that you can actually write out the merged-result > to disk that time. > > I think (2) would cost you a lot more work than you'd save, and I > worry that (1) might risk using large amounts of memory in the big > repositories if there are lots of changes on both sides. While that > may be uncommon, I've seen folks try to merge things with lots of > changes on both sides, and you do have the server side to worry about > after all. > > There are similar issues with the fact that trees are written as they > are processed as well. Those would also require re-running afterwards > to re-generate the trees from the list of relevant-files-and-trees we > operate on. > > However, if you are really curious about trying this out despite the > fact that I think you might be causing more work than you're avoiding > (or potentially requiring a lot more RAM), look for calls to > write_tree() (there are precisely two in merge-ort.c, one for > intermediate trees and one for the toplevel tree) and > write_object_file() (there are precisely two in merge-ort.c, one > within write_tree() for writing tree objects, and one in > handle_content_merge() for writing blob objects). Thank you for this thorough analysis. I am a big fan of crossing bridges when they are reached, and not miles before that. So _iff_ it turns out that the speed, or the potential cluttering with objects, should present a problem in the future, I am inclined to follow the tmp-objdir route you described above (thank you for pointing it out, I had not made the connection to this here scenario). > > Footnote *1*: I did not _quite_ get to the point of comparing the > > `merge-ort` merges to the libgit2 ones, unfortunately. I was on my way to > > add code to respect `merge.renames = false` so that we could _truly_ > > compare the `merge-ort` merges to the libgit2 merges (we really will want > > to verify that the output is identical, before even considering to enable > > recursive merges on the server side, and then only after studying the time > > difference), and then had to take off due to the holidays. If you already > > have that need to be able to turn off rename-detection on your radar, even > > if only for a transitional period, I would be _so_ delighted. > > Well, I had no intention of submitting it (and still don't), but I did > implement it a while back for folks who have needs for it in a > transitional period. It's a pretty simple change. > > https://github.com/newren/git/commit/5330a9de77f56f20e546acc65c924fc783f092e6 > https://github.com/newren/git/commit/2e7e8d79b0995d352558608b6308060fbc055fd1 Thank you _so_ much for that. It was super helpful, and it allowed me to gather enough evidence to justify continuing to work on this code. Ciao, Dscho
Hi Elijah, On Fri, 7 Jan 2022, Elijah Newren wrote: > On Fri, Jan 7, 2022 at 11:36 AM Johannes Schindelin > <Johannes.Schindelin@gmx.de> wrote: > > > ... > > Mind you, I did not even get to the point of analyzing things even more > > deeply. My partner in crime and I only got to comparing the `merge-ort` > > way to the libgit2-based way, trying to get them to compare as much > > apples-to-apples as possible [*1*], and we found that even the time to > > spawn the Git process (~1-3ms, with all overhead counted in) is _quite_ > > noticeable, at server-side scale. > > > > Of course, the `merge-ort` performance was _really_ nice when doing > > anything remotely complex, then `merge-ort` really blew the libgit2-based > > merge out of the water. But that's not the common case. The common case > > are merges that involve very few modified files, a single merge base, and > > they don't conflict. And those can be processed by the libgit2-based > > method within a fraction of the time it takes to even only so much as > > spawn `git` (libgit2-based merges can complete in less than a fifth > > millisecond, that's at most a fifth of the time it takes to merely run > > `git merge-tree`). > > Out of curiosity, are you only doing merges, or are you also > attempting server-side rebases in some fashion? One step after another. For now, I am focusing on merges. But yes, rebases are on my radar, too, and I am very grateful for the head-start you provided in `t/helper/test-fast-rebase.c` (and for the pun therein). Ciao, Dscho
diff --git a/Documentation/git-merge-tree.txt b/Documentation/git-merge-tree.txt index 4d5857b390b..542cea1a1a8 100644 --- a/Documentation/git-merge-tree.txt +++ b/Documentation/git-merge-tree.txt @@ -9,7 +9,7 @@ git-merge-tree - Perform merge without touching index or working tree SYNOPSIS -------- [verse] -'git merge-tree' --real [--messages=<file>] <branch1> <branch2> +'git merge-tree' --real [--messages=<file>] [--conflicted-list=<file>] <branch1> <branch2> 'git merge-tree' <base-tree> <branch1> <branch2> DESCRIPTION @@ -23,7 +23,9 @@ will be `0`, and if the merge has conflicts, the exit status will be `1`. The output will consist solely of the resulting toplevel tree (which may have files including conflict markers). With `--messages`, it will write any informational messages (such as "Auto-merging -<path>" and conflict notices) to the given file. +<path>" and conflict notices) to the given file. With +`--conflicted-list`, it will write a list of unmerged files, one per +line, to the given file. The second form is meant for backward compatibility and will only do a trival merge. It reads three tree-ish, and outputs trivial merge diff --git a/builtin/merge-tree.c b/builtin/merge-tree.c index 46b746b6b7c..4ae34da98b1 100644 --- a/builtin/merge-tree.c +++ b/builtin/merge-tree.c @@ -391,6 +391,7 @@ static int trivial_merge(const char *base, struct merge_tree_options { int real; char *messages_file; + char *conflicted_file; }; static int real_merge(struct merge_tree_options *o, @@ -450,6 +451,19 @@ static int real_merge(struct merge_tree_options *o, merge_display_update_messages(&opt, &result, fp); fclose(fp); } + if (o->conflicted_file) { + struct string_list conflicted_files = STRING_LIST_INIT_NODUP; + FILE *fp = xfopen(o->conflicted_file, "w"); + int i; + + merge_get_conflicted_files(&result, &conflicted_files); + for (i = 0; i < conflicted_files.nr; i++) { + fprintf(fp, "%s", conflicted_files.items[i].string); + fputc('\0', fp); + } + fclose(fp); + string_list_clear(&conflicted_files, 0); + } printf("%s\n", oid_to_hex(&result.tree->object.oid)); merge_finalize(&opt, &result); @@ -472,6 +486,8 @@ int cmd_merge_tree(int argc, const char **argv, const char *prefix) N_("do a real merge instead of a trivial merge")), OPT_STRING(0, "messages", &o.messages_file, N_("file"), N_("filename to write informational/conflict messages to")), + OPT_STRING(0, "conflicted-list", &o.conflicted_file, N_("file"), + N_("filename to write list of unmerged files")), OPT_END() }; diff --git a/merge-ort.c b/merge-ort.c index 86eebf39166..3d6dd1b234c 100644 --- a/merge-ort.c +++ b/merge-ort.c @@ -4234,6 +4234,19 @@ void merge_display_update_messages(struct merge_options *opt, trace2_region_leave("merge", "display messages", opt->repo); } +void merge_get_conflicted_files(struct merge_result *result, + struct string_list *conflicted_files) +{ + struct hashmap_iter iter; + struct strmap_entry *e; + struct merge_options_internal *opti = result->priv; + + strmap_for_each_entry(&opti->conflicted, &iter, e) { + string_list_append(conflicted_files, e->key); + } + string_list_sort(conflicted_files); +} + void merge_switch_to_result(struct merge_options *opt, struct tree *head, struct merge_result *result, diff --git a/merge-ort.h b/merge-ort.h index 55819a57da8..165cef6616f 100644 --- a/merge-ort.h +++ b/merge-ort.h @@ -79,6 +79,9 @@ void merge_display_update_messages(struct merge_options *opt, struct merge_result *result, FILE *stream); +void merge_get_conflicted_files(struct merge_result *result, + struct string_list *conflicted_files); + /* Do needed cleanup when not calling merge_switch_to_result() */ void merge_finalize(struct merge_options *opt, struct merge_result *result); diff --git a/t/t4301-merge-tree-real.sh b/t/t4301-merge-tree-real.sh index 5f3f27f504d..ec7bd8efd06 100755 --- a/t/t4301-merge-tree-real.sh +++ b/t/t4301-merge-tree-real.sh @@ -96,4 +96,13 @@ test_expect_success '--messages gives us the conflict notices and such' ' test_cmp expect MSG_FILE ' +test_expect_success '--messages gives us the conflict notices and such' ' + test_must_fail git merge-tree --real --conflicted-list=UNMERGED side1 side2 && + + cat UNMERGED | tr "\0" "\n" >actual && + test_write_lines greeting whatever~side1 >expect && + + test_cmp expect actual +' + test_done