Message ID: 67d3b2b09f9ddda616cdd0d1b12ab7afc73670ed.1630376800.git.gitgitgadget@gmail.com (mailing list archive)
State: New, archived
Series: Add a new --remerge-diff capability to show & log
On Tue, Aug 31 2021, Elijah Newren via GitGitGadget wrote:

I commented just now on how this API is duplicated between here &
another in-flight series in
https://lore.kernel.org/git/87v92lxhh4.fsf@evledraar.gmail.com/; just
comments on this patch here:

> diff --git a/tmp-objdir.c b/tmp-objdir.c
> index b8d880e3626..9f75a75d1c0 100644
> --- a/tmp-objdir.c
> +++ b/tmp-objdir.c
> @@ -288,7 +288,36 @@ const char **tmp_objdir_env(const struct tmp_objdir *t)
> 	return t->env.v;
> }
>
> +const char *tmp_objdir_path(struct tmp_objdir *t)
> +{
> +	return t->path.buf;
> +}

Not your fault & this pre-dates your patch, but FWIW I prefer our APIs
that don't have these "hidden struct" shenanigans (like, say, "struct
string_list") so a caller could just access this; we can just declare it
"const" appropriately.

We're also all in-tree friends here, so having various accessors for no
reason other than to access struct members seems a bit too much.

All of which is to say that if you + Neeraj come up with some common
API, I for one wouldn't mind moving the struct declaration to
tmp-objdir.h, but it looks like in his version it can stay "private"
more easily.

In this particular case I hadn't understood on a previous reading of
tmp-objdir.c why this "path" isn't a statically allocated "char
path[PATH_MAX]", or at least why that hardcoded "1024" shouldn't be a
PATH_MAX, as it is in some other cases. But I digress :)
On Tue, Aug 31, 2021 at 02:26:38AM +0000, Elijah Newren via GitGitGadget wrote:

> From: Elijah Newren <newren@gmail.com>
>
> The tmp_objdir API provides the ability to create temporary object
> directories, but was designed with the goal of having subprocesses
> access these object stores, followed by the main process migrating
> objects from it to the main object store or just deleting it. The
> subprocesses would view it as their primary datastore and write to it.
>
> For the --remerge-diff option we want to add to show & log, we want all
> writes of intermediate merge results (both blobs and trees) to go to
> this alternate object store; since those writes will be done by the main
> process, we need this "alternate" object store to actually be the
> primary object store. When show & log are done, they'll simply remove
> this temporary object store.

I think this is consistent with the original design of tmp_objdir. I
just never needed to do anything in-process before, so overriding the
environment of sub-processes was sufficient.

I do think these are dangerous and may cause bugs, though. Anything you
write while the tmp_objdir is marked as "primary" is going to go away
when you remove it. So any object names you reference outside of that,
either:

  - by another object that you create after the tmp_objdir is removed;
    or

  - in a ref

are going to turn into repository corruption. Of course that's true for
the existing sub-processes, too, but here we're touching a wider variety
of code.

Obviously the objects we write as part of remerge-diff are meant to be
temporary, and we'll never reference them in any other way. And usually
we would not expect a diff to write any other objects. But one thing
that comes to mind is textconv caching.

If you do a remerge diff on a blob that uses textconv, and the caching
feature is turned on, then we'll write out a new blob with the cached
value, and eventually a new tree and refs/notes/ pointer referencing it.
I'm not sure of the timing of all of that, but it seems likely to me
that at least some of that will end up in your tmp_objdir.

If you remove the tmp_objdir as the primary as soon as you're done with
the merge, but before you run the diff, you might be OK, though.

If not, then I think the solution is probably not to install this as the
"primary", but rather:

  - do the specific remerge-diff writes we want using a special "write
    to this object dir" variant of write_object_file()

  - install the tmp_objdir as an alternate, so the reading side (which
    is diff code that doesn't otherwise know about our remerge-diff
    trickery) can find them

And that lets other writers avoid writing into the temporary area
accidentally.

In that sense this is kind of like the pretend_object_file() interface,
except that it's storing the objects on disk instead of in memory. Of
course another way of doing that would be to stuff the object data into
tempfiles and just put pointers into the in-memory cached_objects array.

It's also not entirely foolproof (nor is the existing
pretend_object_file()). Any operation which is fooled into thinking we
have object X because it's in the fake-object list or the tmp_objdir may
reference that object erroneously, creating a corruption. But it's
still safer than allowing arbitrary writes into the tmp_objdir.

  Side note: The pretend_object_file() approach is actually even
  better, because we know the object is fake. So it does not confuse
  write_object_file()'s "do we already have this object" freshening
  check.

  I suspect it could even be made faster than the tmp_objdir approach.
  From our perspective, these objects really are tempfiles. So we could
  write them as such, not worrying about things like fsyncing them,
  naming them into place, etc. We could just write them out, then mmap
  the results, and put the pointers into cached_objects (currently it
  insists on malloc-ing a copy of the input buffer, but that seems like
  an easy extension to add).

  In fact, I think you could get away with just _one_ tempfile per
  merge. Open up one tempfile. Write out all of the objects you want to
  "store" into it in sequence, and record the lseek() offsets before
  and after for each object. Then mmap the whole result, and stuff the
  appropriate pointers (based on arithmetic with the offsets) into the
  cached_objects list.

  As a bonus, this tempfile can easily be in $TMPDIR, meaning
  remerge-diff could work on a read-only repository (and the OS can
  perhaps handle the data better, especially if $TMPDIR isn't a regular
  filesystem).

> We also need one more thing -- `git log --remerge-diff` can cause the
> temporary object store to fill up with loose objects. Rather than
> trying to gc objects that are known garbage anyway, we simply want to
> know the location of the temporary object store so we can purge the
> loose objects after each merge.

This paragraph confused me. At first I thought you were talking about
how to avoid calling "gc --auto" because we don't want to waste time
thinking all those loose objects were worth gc-ing. But we wouldn't do
that anyway (git-log does not expect to make objects so doesn't call it
at all, and anyway you'd expect it to happen at the end of a process,
after we've already removed the tmp_objdir).

But I think you just mean: we can build up a bunch of loose objects,
which is inefficient. We don't want to gc them, because that's even
less efficient. So we want to clean them out between merges.

But I don't see any code to do that here. I guess that's maybe why
you've added tmp_objdir_path(), to find them later. But IMHO this would
be much better encapsulated as tmp_objdir_clear_objects() or something.

But simpler still: is there any reason not to just drop and re-create
the tmp-objdir for each merge? It's not very expensive to do so (and
certainly not compared to the cost of actually writing out the objects).

-Peff
Jeff King <peff@peff.net> writes:

> Side note: The pretend_object_file() approach is actually even better,
> because we know the object is fake. So it does not confuse
> write_object_file()'s "do we already have this object" freshening
> check.
>
> I suspect it could even be made faster than the tmp_objdir approach.
> From our perspective, these objects really are tempfiles. So we could
> write them as such, not worrying about things like fsyncing them,
> naming them into place, etc. We could just write them out, then mmap
> the results, and put the pointers into cached_objects (currently it
> insists on malloc-ing a copy of the input buffer, but that seems like
> an easy extension to add).
>
> In fact, I think you could get away with just _one_ tempfile per
> merge. Open up one tempfile. Write out all of the objects you want to
> "store" into it in sequence, and record the lseek() offsets before and
> after for each object. Then mmap the whole result, and stuff the
> appropriate pointers (based on arithmetic with the offsets) into the
> cached_objects list.

Cute. The remerge diff code path creates a full tree that records the
mechanical merge result. By hooking into the lowest layer of the
write_object() interface, we'd serialize all objects in such a tree in
the order they are computed (bottom up from the leaf level, I'd presume)
into a single flat file ;-)
On Tue, Sep 28, 2021 at 1:00 AM Ævar Arnfjörð Bjarmason <avarab@gmail.com> wrote:
>
> On Tue, Aug 31 2021, Elijah Newren via GitGitGadget wrote:
>
> I commented just now on how this API is duplicated between here &
> another in-flight series in
> https://lore.kernel.org/git/87v92lxhh4.fsf@evledraar.gmail.com/; just
> comments on this patch here:
>
> > diff --git a/tmp-objdir.c b/tmp-objdir.c
> > index b8d880e3626..9f75a75d1c0 100644
> > --- a/tmp-objdir.c
> > +++ b/tmp-objdir.c
> > @@ -288,7 +288,36 @@ const char **tmp_objdir_env(const struct tmp_objdir *t)
> > 	return t->env.v;
> > }
> >
> > +const char *tmp_objdir_path(struct tmp_objdir *t)
> > +{
> > +	return t->path.buf;
> > +}
>
> Not your fault & this pre-dates your patch, but FWIW I prefer our APIs
> that don't have these "hidden struct" shenanigans (like say "struct
> string_list") so a caller could just access this, we can just declare
> it "const" appropriately.
>
> We're also all in-tree friends here, so having various accessors for
> no reason other than to access struct members seems a bit too much.

That's a fair point, but just to play counterpoint for a minute...

FWIW, I dislike when our public-facing APIs are polluted with all kinds
of internal details, merge-recursive being a case in point. When
writing merge-ort, although I had a struct with public fields, that
struct also contained an opaque struct (pointer) within it to hide
several private fields. (I would have liked to hide or remove a few
more fields, but couldn't do so while the merge_recursive_options
struct was shared between both merge-ort and merge-recursive.)

That said, I agree it can certainly be overdone, and tmp_objdir is
pretty simple. However, sometimes even in simple cases I like when
folks make use of an opaque struct, because it means folks put some
effort into thinking more about the API that should be exposed. That's
something we as a project have often overlooked in the past, as
evidenced both by our one-shot usage mentality and the existence of
external projects like libgit2 attempting to correct this design
shortcoming. I'd like git to move more towards being structured as a
reusable library as well as a useful command-line tool.

> All of which is to say if you + Neeraj come up with some common API I
> for one wouldn't mind moving the struct declaration to tmp-objdir.h,
> but it looks like in his version it can stay "private" more easily.

Peff's idea to just delete and recreate the temporary object store
after every merge would let me reduce the API extensions and keep
things more private. So I think our ideas are converging a bit. :-)

> In this particular case I hadn't understood on a previous reading of
> tmp-objdir.c why this "path" isn't a statically allocated "char
> path[PATH_MAX]", or at least why that hardcoded "1024" shouldn't be a
> PATH_MAX, as it is in some other cases.

Actually, PATH_MAX shouldn't be used:

https://lore.kernel.org/git/7d1237c7-5a83-d766-7d93-5f0d59166067@web.de/
https://lore.kernel.org/git/20180720180427.GE22486@sigill.intra.peff.net/
On Tue, Sep 28, 2021 at 4:17 PM Jeff King <peff@peff.net> wrote:
>
> On Tue, Aug 31, 2021 at 02:26:38AM +0000, Elijah Newren via GitGitGadget wrote:
>
> > From: Elijah Newren <newren@gmail.com>
> >
> > The tmp_objdir API provides the ability to create temporary object
> > directories, but was designed with the goal of having subprocesses
> > access these object stores, followed by the main process migrating
> > objects from it to the main object store or just deleting it. The
> > subprocesses would view it as their primary datastore and write to it.
> >
> > For the --remerge-diff option we want to add to show & log, we want all
> > writes of intermediate merge results (both blobs and trees) to go to
> > this alternate object store; since those writes will be done by the main
> > process, we need this "alternate" object store to actually be the
> > primary object store. When show & log are done, they'll simply remove
> > this temporary object store.
>
> I think this is consistent with the original design of tmp_objdir. I
> just never needed to do anything in-process before, so overriding the
> environment of sub-processes was sufficient.
>
> I do think these are dangerous and may cause bugs, though. Anything
> you write while the tmp_objdir is marked as "primary" is going to go
> away when you remove it. So any object names you reference outside of
> that, either:
>
>   - by another object that you create after the tmp_objdir is removed;
>     or
>
>   - in a ref
>
> are going to turn into repository corruption. Of course that's true
> for the existing sub-processes, too, but here we're touching a wider
> variety of code.
>
> Obviously the objects we write as part of remerge-diff are meant to be
> temporary, and we'll never reference them in any other way. And
> usually we would not expect a diff to write any other objects. But one
> thing that comes to mind is textconv caching.
>
> If you do a remerge diff on a blob that uses textconv, and the caching
> feature is turned on, then we'll write out a new blob with the cached
> value, and eventually a new tree and refs/notes/ pointer referencing
> it. I'm not sure of the timing of all of that, but it seems likely to
> me that at least some of that will end up in your tmp_objdir.

Interesting. Thanks for explaining this case, and for the testcase in
the other thread.

So there are other parts of running `git log` that are not read-only,
besides what I'm doing. That's unfortunate. Can we just disable those
things in this section of the code?

Also, I think it makes sense that textconv caching causes a problem via
writing new blobs and trees *and* refs that reference these blobs and
trees. However, I'm not sure I'm really grasping it, because I don't
see how your solutions are safe either. I'll mention more after each
one.

> If you remove the tmp_objdir as the primary as soon as you're done
> with the merge, but before you run the diff, you might be OK, though.

It has to be after I run the diff, because the diff needs access to the
temporary files to diff against them.

> If not, then I think the solution is probably not to install this as
> the "primary", but rather:
>
>   - do the specific remerge-diff writes we want using a special "write
>     to this object dir" variant of write_object_file()
>
>   - install the tmp_objdir as an alternate, so the reading side (which
>     is diff code that doesn't otherwise know about our remerge-diff
>     trickery) can find them
>
> And that lets other writers avoid writing into the temporary area
> accidentally.

Doesn't this have the same problem? If something similar to textconv
caching were to create new trees that referenced the existing blobs and
then added a ref that referred to that tree, then you have the exact
same problem, right? The new tree and the new ref would be corrupt as
soon as the tmp-objdir went away. Whether or not other callers write to
the temporary area isn't the issue; it's whether they write refs that
can refer to anything from the temporary area. Or am I missing
something?

> In that sense this is kind of like the pretend_object_file()
> interface, except that it's storing the objects on disk instead of in
> memory. Of course another way of doing that would be to stuff the
> object data into tempfiles and just put pointers into the in-memory
> cached_objects array.
>
> It's also not entirely foolproof (nor is the existing
> pretend_object_file()). Any operation which is fooled into thinking we
> have object X because it's in the fake-object list or the tmp_objdir
> may reference that object erroneously, creating a corruption. But it's
> still safer than allowing arbitrary writes into the tmp_objdir.

Why is the pretend_object_file() interface safer? Does it disable the
textconv caching somehow? I don't see why it helps avoid this problem.

> Side note: The pretend_object_file() approach is actually even better,
> because we know the object is fake. So it does not confuse
> write_object_file()'s "do we already have this object" freshening
> check.

Oh, that's interesting. That would be helpful.

> I suspect it could even be made faster than the tmp_objdir approach.
> From our perspective, these objects really are tempfiles. So we could
> write them as such, not worrying about things like fsyncing them,
> naming them into place, etc. We could just write them out, then mmap
> the results, and put the pointers into cached_objects (currently it
> insists on malloc-ing a copy of the input buffer, but that seems like
> an easy extension to add).
>
> In fact, I think you could get away with just _one_ tempfile per
> merge. Open up one tempfile. Write out all of the objects you want to
> "store" into it in sequence, and record the lseek() offsets before and
> after for each object. Then mmap the whole result, and stuff the
> appropriate pointers (based on arithmetic with the offsets) into the
> cached_objects list.
>
> As a bonus, this tempfile can easily be in $TMPDIR, meaning
> remerge-diff could work on a read-only repository (and the OS can
> perhaps handle the data better, especially if $TMPDIR isn't a regular
> filesystem).

Interesting. I'm not familiar with the pretend_object_file() code, but
I think I catch what you're saying. I don't see anything in
object-file.c that allows removing items from the cache, as I would
want to do, but I'm sure I could add something.

(I'm still not sure how this avoids the problem of things like textconv
caching trying to write an object that references one of these, though.
Should we just manually disable textconv caching during a
remerge-diff?)

> > We also need one more thing -- `git log --remerge-diff` can cause the
> > temporary object store to fill up with loose objects. Rather than
> > trying to gc objects that are known garbage anyway, we simply want to
> > know the location of the temporary object store so we can purge the
> > loose objects after each merge.
>
> This paragraph confused me. At first I thought you were talking about
> how to avoid calling "gc --auto" because we don't want to waste time
> thinking all those loose objects were worth gc-ing. But we wouldn't do
> that anyway (git-log does not expect to make objects so doesn't call
> it at all, and anyway you'd expect it to happen at the end of a
> process, after we've already removed the tmp_objdir).
>
> But I think you just mean: we can build up a bunch of loose objects,
> which is inefficient. We don't want to gc them, because that's even
> less efficient. So we want to clean them out between merges.

Yes, sorry that wasn't clear.

> But I don't see any code to do that here. I guess that's maybe why
> you've added tmp_objdir_path(), to find them later. But IMHO this
> would be much better encapsulated as tmp_objdir_clear_objects() or
> something.
>
> But simpler still: is there any reason not to just drop and re-create
> the tmp-objdir for each merge? It's not very expensive to do so (and
> certainly not compared to the cost of actually writing out the
> objects).

Ooh, indeed. I like that, and it turns out that the majority of the
cost of dropping the tmp_objdir is the recursive directory removal,
which I wanted anyway as part of my in-between-merges cleanup. But, of
course, if we go the pretend_object_file() route then the tmp_objdir()
stuff probably becomes moot.
On Tue, Sep 28, 2021 at 10:05:31PM -0700, Elijah Newren wrote:

> So there are other parts of running `git log` that are not read-only,
> besides what I'm doing. That's unfortunate. Can we just disable
> those things in this section of the code?

It might be possible for the cache (we'd still generate entries, but
just not write them). It's "just" an optimization, though it's possible
there are cases where it's really annoying to do so. I'd imagine those
are a stretch. I do use textconv to decrypt gpg blobs with a key on a
hardware token, for example. I don't require a touch for each one,
though, nor would I use textconv caching anyway (since the point is not
to store the unencrypted contents).

It wouldn't necessarily be OK to disable object writes in general.
Though I'd think since a diff is _conceptually_ a read-only operation,
any writes it wants to do would be in the same caching / optimization
boat. I'd also worry about the maintenance cost of having to annotate
any such case.

> Also, I think it makes sense that textconv caching causes a problem
> via writing new blobs and trees *and* refs that reference these blobs
> and trees. However, I'm not sure I'm really grasping it, because I
> don't see how your solutions are safe either. I'll mention more after
> each one.

Heh. So I originally wrote more on this in my earlier email, but
worried I was getting too into the weeds and would confuse you, so I
pared it down. But I managed to be confusing anyway. :)

Here's one mental model. If you have a temporary object store, then
these are the bad things that can happen:

  1. Some code may write an object to it, thinking the data is saved,
     but we later throw it away.

  2. Some code may _skip_ writing an object, thinking that we already
     have that object available (when in fact we'll later throw it
     away).

  3. Some code may record the fact that we have an object in memory,
     and then later (after the temporary object store is not in use),
     may point to it when writing an object or reference.

Pointing the primary object-dir at a temporary directory (like your
patch does) means we're subject to all three.

Problem (1) happens because unrelated code may not realize we've
swapped out the primary object-writing code for this throw-away
directory. But we could require writers to explicitly indicate that
they want to write to the temporary area, while allowing other writes
to go to the regular object store. This could be done with a flag or
variant of write_object_file(). Or if the temporary area isn't a
regular object directory at all (like the whole tempfile / pretend
thing), then this happens automatically, because the regular
object-write paths don't even know about our temporary area.

Problem (2) is mostly handled inside write_object_file(). It will
optimize out writes of objects we already have. So it needs to know
when we have an object "for real", and when we just have it in our
temporary store. If we make the tmp_objdir the primary, then it doesn't
know this (though we could teach it to check). In the
pretend_object_file() system, it already knows this (it uses the
"freshen" code paths which actually want to find the entry in the
filesystem).

Problem (3) is the hardest one, because we don't really distinguish
between readers and writers. So if I call has_object_file(), is it
because I want to access the object? Or is it because I want to
generate a new reference to it, which must ensure that it exists? One
of those is OK to look at the tmp_objdir (and indeed, is the whole
point of making it in the first place), and the other is the first step
to corrupting the repository. ;)

With the environment variables we set up for running external commands
(like hooks) in the receive-pack quarantine, one bit of safety we have
is to prevent ref writes entirely in those commands. That works
reasonably well. Even if they may write new objects that might get
thrown away (if the push is rejected), they'd ultimately need to point
to those objects with a ref. And because those hooks spend their
_entire_ run-time in the quarantine state, they can't ever write a ref.

Whereas doing it in-process is a little dicier. Right now the textconv
cache writes out a new tree for every entry we add, so it happens to do
the ref write while the tmp-objdir is still in place. But as this
comment notes:

  $ git grep -B6 notes_cache_write diff.c
  diff.c-		/*
  diff.c-		 * we could save up changes and flush them all at the end,
  diff.c-		 * but we would need an extra call after all diffing is done.
  diff.c-		 * Since generating a cache entry is the slow path anyway,
  diff.c-		 * this extra overhead probably isn't a big deal.
  diff.c-		 */
  diff.c:		notes_cache_write(textconv->cache);

this is slow, and it would be perfectly reasonable to flush the cache
at the end of the process (e.g., using an atexit() handler). In which
case we'd hit this problem (3) exactly: we'd generate and remember
objects during the tmp-objdir period, but only later actually reference
them. This is more likely than the receive-pack hook case, because
we're doing it all in process.

> > If you remove the tmp_objdir as the primary as soon as you're done with
> > the merge, but before you run the diff, you might be OK, though.
>
> It has to be after I run the diff, because the diff needs access to
> the temporary files to diff against them.

Right, of course. I was too fixated on the object-write part,
forgetting that the whole point of the exercise is to later read them
back. :)

> > If not, then I think the solution is probably not to install this as the
> > "primary", but rather:
> >
> >   - do the specific remerge-diff writes we want using a special "write
> >     to this object dir" variant of write_object_file()
> >
> >   - install the tmp_objdir as an alternate, so the reading side (which
> >     is diff code that doesn't otherwise know about our remerge-diff
> >     trickery) can find them
> >
> > And that lets other writers avoid writing into the temporary area
> > accidentally.
>
> Doesn't this have the same problem? If something similar to textconv
> caching were to create new trees that referenced the existing blobs
> and then added a ref that referred to that tree, then you have the
> exact same problem, right? The new tree and the new ref would be
> corrupt as soon as the tmp-objdir went away. Whether or not other
> callers write to the temporary area isn't the issue, it's whether they
> write refs that can refer to anything from the temporary area. Or am
> I missing something?

The key thing here is in the first step, where remerge-diff is
explicitly saying "I want to write to this temporary object-dir". But
the textconv-cache code does not; it gets to write to the normal spot.
So it avoids problem (1).

You're right that it does not avoid problem (3) exactly. But triggering
that would require some code not just writing other objects or
references while the tmp-objdir is in place, but specifically
referencing the objects that remerge-diff did put into the tmp-objdir.
That seems a lot less likely to me (because the thing we're most
worried about is unrelated code that just happens to write while the
tmp-objdir is in place).

> > In that sense this is kind of like the pretend_object_file() interface,
> > except that it's storing the objects on disk instead of in memory. Of
> > course another way of doing that would be to stuff the object data into
> > tempfiles and just put pointers into the in-memory cached_objects array.
> >
> > It's also not entirely foolproof (nor is the existing
> > pretend_object_file()). Any operation which is fooled into thinking we
> > have object X because it's in the fake-object list or the tmp_objdir may
> > reference that object erroneously, creating a corruption. But it's still
> > safer than allowing arbitrary writes into the tmp_objdir.
>
> Why is the pretend_object_file() interface safer? Does it disable the
> textconv caching somehow? I don't see why it helps avoid this
> problem.

So hopefully it's clearer now from what I wrote above, but just
connecting the dots:

  1. Unrelated code calling write_object_file() would still write real,
     durable objects, as usual.

  2. The write_object_file() "skip this write" optimization already
     ignores the pretend_object_file() objects while checking "do we
     have this object".

> Interesting. I'm not familiar with the pretend_object_file() code,
> but I think I catch what you're saying. I don't see anything in
> object-file.c that allows removing items from the cache, as I would
> want to do, but I'm sure I could add something.

Right, it's pretty limited now, and there's only one caller. And in
fact I've wanted to get rid of it, exactly because it can create the
danger we're discussing here. But I still think it's _less_ dangerous
than fully replacing the object directory.

> (I'm still not sure how this avoids the problem of things like
> textconv caching trying to write an object that references one of
> these, though. Should we just manually disable textconv caching
> during a remerge-diff?)

It can't avoid it completely, but in general, the textconv cache should
be writing and referencing its own objects. Even if they happen to be
the same as something remerge-diff generated, it won't know that until
it has tried to write it (and so that's why having write_object_file()
continue to really write the object is important).

Hopefully that clears up my line of thinking. I do think a lot of this
is kind of theoretical, in the sense that:

  - textconv caching is an obscure feature that we _could_ just disable
    during remerge-diffs

  - there's a reasonable chance that there isn't any other code that
    wants to write during a diff

But this is all scary and error-prone enough that I'd prefer an
approach that has the "least surprise". So any solution where random
code calling write_object_file() _can't_ accidentally write to a
throw-away directory seems like a safer, less surprising thing.

It does mean that the remerge-diff code needs to be explicit in its
object writes to say "this needs to go to the temporary area" (whether
it's a flag, or calling into a totally different function or
subsystem). I'm hoping that's not hard to do (because its writes are
done explicitly by the remerge-diff code), but I worry that it may be
(because you are relying on more generic tree code to do the write
under the hood).

-Peff
On Tue, Sep 28, 2021 at 09:08:00PM -0700, Junio C Hamano wrote:

> Jeff King <peff@peff.net> writes:
>
> > Side note: The pretend_object_file() approach is actually even better,
> > because we know the object is fake. So it does not confuse
> > write_object_file()'s "do we already have this object" freshening
> > check.
> >
> > I suspect it could even be made faster than the tmp_objdir approach.
> > From our perspective, these objects really are tempfiles. So we could
> > write them as such, not worrying about things like fsyncing them,
> > naming them into place, etc. We could just write them out, then mmap
> > the results, and put the pointers into cached_objects (currently it
> > insists on malloc-ing a copy of the input buffer, but that seems like
> > an easy extension to add).
> >
> > In fact, I think you could get away with just _one_ tempfile per
> > merge. Open up one tempfile. Write out all of the objects you want to
> > "store" into it in sequence, and record the lseek() offsets before and
> > after for each object. Then mmap the whole result, and stuff the
> > appropriate pointers (based on arithmetic with the offsets) into the
> > cached_objects list.
>
> Cute. The remerge diff code path creates a full tree that records
> the mechanical merge result. By hooking into the lowest layer of
> write_object() interface, we'd serialize all objects in such a tree
> in the order they are computed (bottom up from the leaf level, I'd
> presume) into a single flat file ;-)

I do still like this approach, but just two possible gotchas I was
thinking of:

  - This side-steps all of our usual code for getting object data into
    memory. In general, I'd expect this content to not be too enormous,
    but it _could_ be if there are many / large blobs in the result. So
    we may end up with large maps. Probably not a big deal on modern
    64-bit systems. Maybe an issue on 32-bit systems, just because of
    virtual address space.

    Likewise, we do support systems with NO_MMAP. They'd work here, but
    it would probably mean putting all that object data into the heap.
    I could live with that, given how rare such systems are these days,
    and that it only matters if you're using --remerge-diff with big
    blobs.

  - I wonder to what degree --remerge-diff benefits from omitting
    writes for objects we already have. I.e., if you are writing out a
    whole tree representing the conflicted state, then you don't want
    to write all of the trees that aren't interesting. Hopefully the
    code is already figuring out which paths the merge even touched,
    and ignoring the rest. It probably benefits performance-wise from
    write_object_file() deciding to skip some object writes, as well
    (e.g., for resolutions which the final tree already took, as they'd
    be in the merge commit). The whole pretend-we-have-this-object
    thing may want to likewise make sure we don't write out objects
    that we already have in the real odb.

-Peff
On Tue, Sep 28, 2021 at 09:22:32PM -0700, Elijah Newren wrote: > > Not your fault & this pre-dates your patch, but FWIW I prefer our APIs > > that don't have these "hidden struct" shenanigans (like say "struct > > string_list") so a caller could just access this, we can just declare it > > "const" appropriately. > > > > We're also all in-tree friends here, so having various accessors for no > > reason other than to access struct members seems a bit too much. > > That's a fair point, but just to play counterpoint for a minute... > > FWIW, I dislike when our public facing APIs are polluted with all > kinds of internal details. merge-recursive being a case in point. > When writing merge-ort, although I had a struct with public fields, > that struct also contained an opaque struct (pointer) within it to > hide several private fields. (I would have liked to hide or remove a > few more fields, but couldn't do so while the merge_recursive_options > struct was shared between both merge-ort and merge-recursive.) > > That said, I agree it can certainly be overdone, and tmp_objdir is > pretty simple. However, sometimes even in simple cases I like when > folks make use of an opaque struct because it means folks put some > effort into thinking more about the API that should be exposed. > That's something we as a project have often overlooked in the past, as > evidenced both by our one-shot usage mentality, and the existence of > external projects like libgit2 attempting to correct this design > shortcoming. I'd like git to move more towards being structured as a > reusable library as well as a useful command-line tool. Right, it was definitely a conscious decision to keep the tmp-objdir API as slim as possible, just because it's such a complicated and confusing thing in the first place. For something like a strbuf, giving direct access to the fields makes sense. 
Exposing the details of how the struct works (like accessing ".buf" as a NUL-terminated string) is part of its usefulness. But tmp_objdir represents a more abstract concept, and I wanted to insulate callers from the details.

That said, the notion of "this is the path of the objdir" is not that contentious, so I don't mind it too much (but it would be a bigger jump to expose all of the struct's details).

-Peff
On Thu, Sep 30, 2021 at 03:26:42AM -0400, Jeff King wrote:

> > > If you remove the tmp_objdir as the primary as soon as you're done with
> > > the merge, but before you run the diff, you might be OK, though.
> >
> > It has to be after I run the diff, because the diff needs access to
> > the temporary files to diff against them.
>
> Right, of course. I was too fixated on the object-write part, forgetting
> that the whole point of the exercise is to later read them back. :)

Ah, no, I remember what I was trying to say here. The distinction is between "remove the tmp_objdir" and "remove it as the primary". I.e., if you do this:

  1. create tmp_objdir

  2. make tmp_objdir primary for writes

  3. run the "merge" half of remerge-diff, writing objects into the
     temporary space

  4. stop having tmp_objdir as the primary; instead make it an alternate

  5. run the diff

  6. remove tmp_objdir totally

Then step 5 can't accidentally write objects into the temporary space, but it can still read them. So it's not entirely safe, but it's safer, and it would be a much smaller change.

Some ways it could go wrong:

  - is it possible for the merge code to ever write an object? I kind of
    wonder if we'd ever do any cache-able transformations as part of a
    content-level merge. I don't think we do now, though.

  - in step 5, write_object_file() may still be confused by the presence
    of the to-be-thrown-away objects in the alternate. This is pretty
    unlikely, as it implies that the remerge-diff wrote a blob or tree
    that is byte-identical to something that the diff wants to write.

-Peff
On Thu, Sep 30 2021, Jeff King wrote: > On Tue, Sep 28, 2021 at 09:08:00PM -0700, Junio C Hamano wrote: > >> Jeff King <peff@peff.net> writes: >> >> > Side note: The pretend_object_file() approach is actually even better, >> > because we know the object is fake. So it does not confuse >> > write_object_file()'s "do we already have this object" freshening >> > check. >> > >> > I suspect it could even be made faster than the tmp_objdir approach. >> > From our perspective, these objects really are tempfiles. So we could >> > write them as such, not worrying about things like fsyncing them, >> > naming them into place, etc. We could just write them out, then mmap >> > the results, and put the pointers into cached_objects (currently it >> > insists on malloc-ing a copy of the input buffer, but that seems like >> > an easy extension to add). >> > >> > In fact, I think you could get away with just _one_ tempfile per >> > merge. Open up one tempfile. Write out all of the objects you want to >> > "store" into it in sequence, and record the lseek() offsets before and >> > after for each object. Then mmap the whole result, and stuff the >> > appropriate pointers (based on arithmetic with the offsets) into the >> > cached_objects list. >> >> Cute. The remerge diff code path creates a full tree that records >> the mechanical merge result. By hooking into the lowest layer of >> write_object() interface, we'd serialize all objects in such a tree >> in the order they are computed (bottom up from the leaf level, I'd >> presume) into a single flat file ;-) > > I do still like this approach, but just two possible gotchas I was > thinking of: > > - This side-steps all of our usual code for getting object data into > memory. In general, I'd expect this content to not be too enormous, > but it _could_ be if there are many / large blobs in the result. So > we may end up with large maps. Probably not a big deal on modern > 64-bit systems. 
Maybe an issue on 32-bit systems, just because of > virtual address space. > > Likewise, we do support systems with NO_MMAP. They'd work here, but > it would probably mean putting all that object data into the heap. I > could live with that, given how rare such systems are these days, and > that it only matters if you're using --remerge-diff with big blobs. > > - I wonder to what degree --remerge-diff benefits from omitting writes > for objects we already have. I.e., if you are writing out a whole > tree representing the conflicted state, then you don't want to write > all of the trees that aren't interesting. Hopefully the code is > already figuring out which paths the merge even touched, and ignoring > the rest. It probably benefits performance-wise from > write_object_file() deciding to skip some object writes, as well > (e.g., for resolutions which the final tree already took, as they'd > be in the merge commit). The whole pretend-we-have-this-object thing > may want to likewise make sure we don't write out objects that we > already have in the real odb. I haven't benchmarked since my core.checkCollisions RFC patch[1] resulted in the somewhat related loose object cache patch from you, and not with something like the midx, but just a note that on some setups just writing things out is faster than exhaustively checking if we absolutely need to write things out. I also wonder how much if anything writing out the one file v.s. lots of loose objects is worthwhile on systems where we could write out those loose objects on a ramdisk, which is commonly available on e.g. Linux distros these days out of the box. If you care about performance but not about your transitory data using a ramdisk is generally much better than any other potential I/O optimization. 
Finally, and I don't mean to throw a monkey wrench into this whole discussion, so take this as a random musing: I wonder how much faster this thing could be on its second run if instead of avoiding writing to the store & cleaning up, it just wrote to the store, and then wrote another object keyed on the git version and any revision parameters etc., and then pretty much just had to do a "git cat-file -p <that-obj>" to present the result to the user :)

I suppose that would be throwing a lot more work at an eventual "git gc" than we ever do now, so maybe it's a bit crazy, but I think it might be an interesting direction in general to (ab)use either the primary or some secondary store in the .git dir as a semi-permanent cache of resolved queries from the likes of "git log".

1. https://lore.kernel.org/git/20181028225023.26427-5-avarab@gmail.com/
On Tue, Sep 28 2021, Elijah Newren wrote: > On Tue, Sep 28, 2021 at 1:00 AM Ævar Arnfjörð Bjarmason > <avarab@gmail.com> wrote: >> >> On Tue, Aug 31 2021, Elijah Newren via GitGitGadget wrote: >> >> I commented just now on how this API is duplicated between here & >> another in-flight series in >> https://lore.kernel.org/git/87v92lxhh4.fsf@evledraar.gmail.com/; Just >> comments on this patch here: >> >> > diff --git a/tmp-objdir.c b/tmp-objdir.c >> > index b8d880e3626..9f75a75d1c0 100644 >> > --- a/tmp-objdir.c >> > +++ b/tmp-objdir.c >> > @@ -288,7 +288,36 @@ const char **tmp_objdir_env(const struct tmp_objdir *t) >> > return t->env.v; >> > } >> > >> > +const char *tmp_objdir_path(struct tmp_objdir *t) >> > +{ >> > + return t->path.buf; >> > +} >> >> Not your fault & this pre-dates your patch, but FWIW I prefer our APIs >> that don't have these "hidden struct" shenanigans (like say "struct >> string_list") so a caller could just access this, we can just declare it >> "const" appropriately. >> >> We're also all in-tree friends here, so having various accessors for no >> reason other than to access struct members seems a bit too much. > > That's a fair point, but just to play counterpoint for a minute... > > FWIW, I dislike when our public facing APIs are polluted with all > kinds of internal details. merge-recursive being a case in point. > When writing merge-ort, although I had a struct with public fields, > that struct also contained an opaque struct (pointer) within it to > hide several private fields. (I would have liked to hide or remove a > few more fields, but couldn't do so while the merge_recursive_options > struct was shared between both merge-ort and merge-recursive.) > > That said, I agree it can certainly be overdone, and tmp_objdir is > pretty simple. However, sometimes even in simple cases I like when > folks make use of an opaque struct because it means folks put some > effort into thinking more about the API that should be exposed. 
> That's something we as a project have often overlooked in the past, as > evidenced both by our one-shot usage mentality, and the existence of > external projects like libgit2 attempting to correct this design > shortcoming. I'd like git to move more towards being structured as a > reusable library as well as a useful command-line tool. [This is somewhat of a continuation of https://lore.kernel.org/git/87lf3etaih.fsf@evledraar.gmail.com/]. I like our APIs being usable, but right now we're not promising any maintenance of APIs that work the same as git v2.30.0 or whatever. I think some external users directly link to libgit.a, but they're tracking whatever churn we have from release to release, just like someone maintaining out-of-tree linux kernel drivers. For maintenance of git.git itself I think it's going to stay like that for any foreseeable future[1]. And that's before we even get into the practicalities of git's license likely not winning over many (if any) libgit2 users. I think those users are if anything migrating from libgit2 to git's plumbing CLI, or in some cases folding what they needed into git.git itself (or just staying with libgit2). So assuming all of that I really don't see the point in general of having a "foo" field that has the equivalent of get_foo() and set_foo() accessors in the context of git.git. That's something that's really useful if you're maintaining API and ABI compatibility, but for us it's usually just verbosity. All the users are in-tree, so just add "const" as appropriate, or perhaps we should give the field a "priv" name or whatever. Obviously not everything should be done that way, e.g. being able to have trace2 start/end points for something you'd otherwise push/clear from a "struct string_list" or whatever directly *is* usable, so it's not black and white. 
I'm just saying that using convoluted patterns *just* to get in the way of some hypothetical code that's going to mess with your internals can make things harder in other areas. E.g. the init/release pattern via macros I mentioned in the linked E-Mail above.

> [...]
>> In this particular case I hadn't understood on a previous reading of
>> tmp-objdir.c why this "path" isn't a statically allocated "char
>> path[PATH_MAX]", or at least why that hardcoded "1024" shouldn't be a
>> PATH_MAX, as it is in some other cases.
>
> Actually, PATH_MAX shouldn't be used:
>
> https://lore.kernel.org/git/7d1237c7-5a83-d766-7d93-5f0d59166067@web.de/
> https://lore.kernel.org/git/20180720180427.GE22486@sigill.intra.peff.net/

Thanks. I didn't know about that. I've fixed (some of that) up in some semi-related WIP work I've got.

1. Aside: I *do* think it would be really useful for us to expose a stable C API eventually, using some of the same license workarounds/shenanigans (depending on your view) as the Linux kernel did. I.e. to say that "we're GPLv2, but this small API is the equivalent of our syscall interface". Rather than that API being something that would look anything like internal git.git users it would basically be a run_command() + Perl's unpack/pack + output formatter on steroids.

I.e. we'd be able to invoke say a 'git cat-file --batch' in the same process as the caller by essentially calling cmd_cat_file() directly. Just with some callback replacements for "provide this on stdin" and "this line accepts stdout" without actually involving different processes or I/O. You'd then have the equivalent of a --format output that would just so happen to be in your OS's / C ABI's understood struct format. Similar to "perldoc -f pack".

The maintenance burden of such an API would be trivial. It would basically be a handful of functions just to setup input/output callbacks.
Most of the interface would be a rather convoluted construction of command-line parameters (like our own run_command() use), not ideal, but most users are after structured data and eliminating process overhead, and they'd get that. We would need to replace fprintf() and the like with some structured formatting mechanism, but that would be useful for non-API users as new commands learning a shiny new --format parameter. You could even use such a --format to have a non-C-API user get the same struct output, perhaps for direct mapping into memory, say if you didn't trust the built-in/command in question to not leak memory yet. I don't know if this has been suggested before, just my 0.02. It's also partially why I'm interested in making even-built-ins not leak memory with some of my parallel SANITIZE=leak work. Once you've got that you can usually call the cmd_whatever() in a loop without much effort, and once you've got that, well, see above :)
On Thu, Sep 30, 2021 at 03:26:42AM -0400, Jeff King wrote: > On Tue, Sep 28, 2021 at 10:05:31PM -0700, Elijah Newren wrote: > > > So there are other parts of running `git log` that are not read-only, > > besides what I'm doing. That's unfortunate. Can we just disable > > those things in this section of the code? > > It might be possible for the cache (we'd still generate entries, but > just not write them). It's "just" an optimization, though it's possible > there are cases where it's really annoying to do so. I'd imagine those > are a stretch. I do use textconv to decrypt gpg blobs with a key on a > hardware token, for example. I don't require a touch for each one, > though, nor would I use textconv caching anyway (since the point is not > to store the unencrypted contents). > > It wouldn't necessarily be OK to disable object writes in general. > Though I'd think since a diff is _conceptually_ a read-only operation > that any writes it wants to do would be in the same caching / > optimization boat. I'd also worry about the maintenance cost of having > to annotate any such case. > > > Also, I think it makes sense that textconv caching causes a problem > > via writing new blobs and trees *and* refs that reference these blobs > > and trees. However, I'm not sure I'm really grasping it, because I > > don't see how your solutions are safe either. I'll mention more after > > each one. > > Heh. So I originally wrote more on this in my earlier email, but worried > I was getting too into the weeds and would confuse you, so I pared it > down. But I managed to be confusing anyway. :) > > Here's one mental model. If you have a temporary object store, then > these are the bad things that can happen: > > 1. Some code may write an object to it, thinking the data is saved, > but we later throw it away. > > 2. Some code may _skip_ writing an object, thinking that we already > have that object available (when in fact we'll later throw it > away). > > 3. 
Some code may record the fact that we have an object in memory, and
> then later (after the temporary object store is not in use), may
> point to it when writing an object or reference.
>
> Pointing the primary object-dir at a temporary directory (like your
> patch does) means we're subject to all three.
>
> Problem (1) happens because unrelated code may not realize we've swapped
> out the primary object writing code for this throw-away directory. But
> we could require writers to explicitly indicate that they want to write
> to the temporary area, while allowing other writes to go to the regular
> object store. This could be done with a flag or variant of
> write_object_file(). Or if the temporary area isn't a regular object
> directory at all (like the whole tempfile / pretend thing), then this
> happens automatically, because the regular object-write paths don't even
> know about our temporary area.
>
> Problem (2) is mostly handled inside write_object_file(). It will
> optimize out writes of objects we already have. So it needs to know when
> we have an object "for real", and when we just have it in our temporary
> store. If we make the tmp_objdir the primary, then it doesn't know this
> (though we could teach it to check). In the pretend_object_file() system,
> it already knows this (it uses the "freshen" code paths which actually
> want to find the entry in the filesystem).
>
> Problem (3) is the hardest one, because we don't really distinguish
> between readers and writers. So if I call has_object_file(), is it
> because I want to access the object? Or is it because I want to generate
> a new reference to it, which must ensure that it exists? One of those is
> OK to look at the tmp_objdir (and indeed, is the whole point of making
> it in the first place), and the other is the first step to corrupting
> the repository. ;)

This is a good taxonomy. I think these problems apply to any transactional system.
If you keep state outside of the transaction, then you may wind up with inconsistency. Problem (3) is the worst and kind of subsumes (2). If we think an object is durable with a specific oid, we probably are going to write that oid out somewhere else that is outside of the transaction. If we write it to the ODB (e.g. in a tree or some blob) without first caching it in memory, the references in the ODB will be just as durable as their referents.

> With the environment variables we set up for running external commands
> (like hooks) in the receive-pack quarantine, one bit of safety we have
> is to prevent ref writes entirely in those commands. That works
> reasonably well. Even if they may write new objects that might get
> thrown away (if the push is rejected), they'd ultimately need to point
> to those objects with a ref. And because those hooks spend their
> _entire_ run-time in the quarantine state, they can't ever write a ref.
>
> Whereas doing it in-process is a little dicier. Right now the textconv
> cache writes out a new tree for every entry we add, so it happens to do
> the ref write while the tmp-objdir is still in place. But as this
> comment notes:
>
>   $ git grep -B6 notes_cache_write diff.c
>   diff.c-	/*
>   diff.c-	 * we could save up changes and flush them all at the end,
>   diff.c-	 * but we would need an extra call after all diffing is done.
>   diff.c-	 * Since generating a cache entry is the slow path anyway,
>   diff.c-	 * this extra overhead probably isn't a big deal.
>   diff.c-	 */
>   diff.c:	notes_cache_write(textconv->cache);
>
> this is slow, and it would be perfectly reasonable to flush the cache at
> the end of the process (e.g., using an atexit() handler). In which case
> we'd hit this problem (3) exactly: we'd generate and remember objects
> during the tmp-objdir period, but only later actually reference them.
> This is more likely than the receive-pack hook case, because we're doing
> it all in process.
> Is there state we care about besides objects and refs?

Maybe we need to introduce a transactional system for the entire repository like we have for the refs-db. In-process code that cares about oids that might be only transactionally present can register a callback for when the transaction commits or rolls back. I'm not volunteering to write this, btw :).

Right now, disabling ref updates is a pretty good compromise, since that's the one definite piece of durable state outside the ODB that can point in. In-memory state is generally hard to solve and could wind up persisting in either the ODB or the refs or both.

> > > If you remove the tmp_objdir as the primary as soon as you're done with
> > > the merge, but before you run the diff, you might be OK, though.
> >
> > It has to be after I run the diff, because the diff needs access to
> > the temporary files to diff against them.
>
> Right, of course. I was too fixated on the object-write part, forgetting
> that the whole point of the exercise is to later read them back. :)
>
> > > If not, then I think the solution is probably not to install this as the
> > > "primary", but rather:
> > >
> > >   - do the specific remerge-diff writes we want using a special "write
> > >     to this object dir" variant of write_object_file()
> > >
> > >   - install the tmp_objdir as an alternate, so the reading side (which
> > >     is diff code that doesn't otherwise know about our remerge-diff
> > >     trickery) can find them
> > >
> > > And that lets other writers avoid writing into the temporary area
> > > accidentally.
> >
> > Doesn't this have the same problem? If something similar to textconv
> > caching were to create new trees that referenced the existing blobs
> > and then added a ref that referred to that tree, then you have the
> > exact same problem, right? The new tree and the new ref would be
> > corrupt as soon as the tmp-objdir went away.
Whether or not other > > callers write to the temporary area isn't the issue, it's whether they > > write refs that can refer to anything from the temporary area. Or am > > I missing something? > > The key thing here is in the first step, where remerge-diff is > explicitly saying "I want to write to this temporary object-dir". But > the textconv-cache code does not; it gets to write to the normal spot. > So it avoids problem (1). > > You're right that it does not avoid problem (3) exactly. But triggering > that would require some code not just writing other objects or > references while the tmp-objdir is in place, but specifically > referencing the objects that remerge-diff did put into the tmp-objdir. > That seems a lot less likely to me (because the thing we're most worried > about is unrelated code that just happens to write while the tmp-objdir > is in place). > > > > In that sense this is kind of like the pretend_object_file() interface, > > > except that it's storing the objects on disk instead of in memory. Of > > > course another way of doing that would be to stuff the object data into > > > tempfiles and just put pointers into the in-memory cached_objects array. > > > > > > It's also not entirely foolproof (nor is the existing > > > pretend_object_file()). Any operation which is fooled into thinking we > > > have object X because it's in the fake-object list or the tmp_objdir may > > > reference that object erroneously, creating a corruption. But it's still > > > safer than allowing arbitrary writes into the tmp_objdir. > > > > Why is the pretend_object_file() interface safer? Does it disable the > > textconv caching somehow? I don't see why it helps avoid this > > problem. > > So hopefully it's clearer now from what I wrote above, but just > connecting the dots: > > 1. Unrelated code calling write_object_file() would still write real, > durable objects, as usual. > > 2. 
The write_object_file() "skip this write" optimization already > ignores the pretend_object_file() objects while checking "do we > have this object". > > > Interesting. I'm not familiar with the pretend_object_file() code, > > but I think I catch what you're saying. I don't see anything in > > object-file.c that allows removing items from the cache, as I would > > want to do, but I'm sure I could add something. > > Right, it's pretty limited now, and there's only one caller. And in fact > I've wanted to get rid of it, exactly because it can create the danger > we're discussing here. But I still think it's _less_ dangerous than > fully replacing the object directory. > > > (I'm still not sure how this avoids the problem of things like > > textconv caching trying to write an object that references one of > > these, though. Should we just manually disable textconv caching > > during a remerge-diff?) > > It can't avoid it completely, but in general, the textconv cache should > be writing and referencing its own objects. Even if they happen to be > the same as something remerge-diff generated, it won't know that until > it has tried to write it (and so that's why having write_object_file() > continue to really write the object is important). > > > Hopefully that clears up my line of thinking. I do think a lot of this > is kind of theoretical, in the sense that: > > - textconv caching is an obscure feature that we _could_ just disable > during remerge-diffs > > - there's a reasonable chance that there isn't any other code that > wants to write during a diff > > But this is all scary and error-prone enough that I'd prefer an approach > that has the "least surprise". So any solution where random code calling > write_object_file() _can't_ accidentally write to a throw-away directory > seems like a safer less surprising thing. 
There is something nice to be said for having the cached state go away with the ephemeral objects, since that caching refers to objects that may never be seen again. Any solution which makes in-flight transactional state visible (either as the primary writing spot or as a readable alternate) will suffer from state leakage for non-transactionally-aware readers/writers.

The bulk-fsync code has some advantage in that it is more constrained in what it's calling while the tmp objdir is active. Also, in typical cases, the tmp objdir will be committed rather than discarded. But I believe it suffers from the same essential issue.

Can we press forward, and just try to make sure all the in-process stuff is aware of temporary object state? Maybe we can hook up the temp odb stuff with the refs-db transactions into a single system. We can make the tmp object directory stuff more prevalent by using it in git-apply and the merge strategy, and by using bulk-checkin while writing out the cached tree and commits. This should flush out any issues and keep other parts of the codebase honest.

> It does mean that the remerge-diff code needs to be explicit in its
> object writes to say "this needs to go to the temporary area" (whether
> it's a flag, or calling into a totally different function or subsystem).
> I'm hoping that's not hard to do (because its writes are done explicitly
> by the remerge-diff code), but I worry that it may be (because you are
> relying on more generic tree code to do the write under the hood).

I don't think that really helps the trickiest problems. The textconv cache, for instance, may now be writing notes that refer to nonexistent objects. That's fine from a correctness point of view, but now we're polluting the repo with state that won't be gc'ed and won't ever be accessed. IMO, it might be better for the textconv cache to have a single notes root tree (rather than a notes 'branch' with a commit history).
We would want to put a ref to this tree somewhere inside the object directory itself so that it's transactionally consistent with the rest of the object state. For instance, we could store the current textconv notes tree root in .git/objects/info/textconv-cache. Thanks, Neeraj
Jeff King <peff@peff.net> writes: >> > If not, then I think the solution is probably not to install this as the >> > "primary", but rather: >> > >> > - do the specific remerge-diff writes we want using a special "write >> > to this object dir" variant of write_object_file() >> > >> > - install the tmp_objdir as an alternate, so the reading side (which >> > is diff code that doesn't otherwise know about our remerge-diff >> > trickery) can find them >> > >> > And that lets other writers avoid writing into the temporary area >> > accidentally. >> >> ... > > The key thing here is in the first step, where remerge-diff is > explicitly saying "I want to write to this temporary object-dir". But > the textconv-cache code does not; it gets to write to the normal spot. > So it avoids problem (1). > > You're right that it does not avoid problem (3) exactly. But triggering > that would require some code not just writing other objects or > references while the tmp-objdir is in place, but specifically > referencing the objects that remerge-diff did put into the tmp-objdir. > That seems a lot less likely to me (because the thing we're most worried > about is unrelated code that just happens to write while the tmp-objdir > is in place). I do not offhand remember if a new object creation codepath that avoids writing an object that we already have considers an object we find in an alternate as "we have it". If it does not and we make a duplicated copy in our primary object store, then it would be OK, but otherwise, those that know tmp_objdir is an alternate may still try to write a new object and find that the same object already exists in an alternate. Once tmp_objdir is gone, we would end up with a corrupt repository, right? freshen_loose_object() seems to go to check_and_freshen() which consult check_and_freshen_nonlocal() before returning yes/no, so having the same object in an alternate does seem to count as having one. 
> So hopefully it's clearer now from what I wrote above, but just
> connecting the dots:
>
>   1. Unrelated code calling write_object_file() would still write real,
>      durable objects, as usual.
>
>   2. The write_object_file() "skip this write" optimization already
>      ignores the pretend_object_file() objects while checking "do we
>      have this object".

Yup for (2)---the pretend interface is safer than an alternate in that sense. But the pretend interface was designed to just hold a handful of phoney objects in core. I am not sure if it is a good fit for this use case.

> But this is all scary and error-prone enough that I'd prefer an approach
> that has the "least surprise". So any solution where random code calling
> write_object_file() _can't_ accidentally write to a throw-away directory
> seems like a safer, less surprising thing.
>
> It does mean that the remerge-diff code needs to be explicit in its
> object writes to say "this needs to go to the temporary area" (whether
> it's a flag, or calling into a totally different function or subsystem).

... and also writes by others that happen to produce the same object as what is already in the temporary area should be made to the real object store. Then we would be much safer.

> I'm hoping that's not hard to do (because its writes are done explicitly
> by the remerge-diff code), but I worry that it may be (because you are
> relying on more generic tree code to do the write under the hood).

Thanks.
Jeff King <peff@peff.net> writes: > - is it possible for the merge code to ever write an object? I kind of > wonder if we'd ever do any cache-able transformations as part of a > content-level merge. I don't think we do now, though. How is the virtual merge base, the result of an inner merge that recursively merges two merge bases, fed to an outer merge as the ancestor? Isn't it written as a tree object by the inner merge, as a tree full of blob objects, so that the outer merge does not have to treat a merge base tree that is virtual any differently from a sole merge base?
Neeraj Singh <nksingh85@gmail.com> writes: > Is there state we care about besides objects and refs? Maybe we need to > introduce a transactional system for the entire repository like we have > for the refs-db. In-process code that cares about oids that might be > only transactionally present can register a callback for when the transaction > commits or rolls back. I'm not volunteering to write this, btw :). I take it to mean by "objects and refs" that we care only about objects that are reachable from the refs? The index (the primary index array plus extensions that record object names) also has anchoring points we care about (ask "git fsck" what they are). > We can make the tmp object directory stuff more prevalent by using it in > git-apply and the merge strategy, and by using bulk-checkin while writing out > the cached tree and commits. This should flush out any issues and keep other > parts of the codebase honest. I wonder what's so wrong about cruft objects in the main object store that may be gc-ed away later, though. IOW, I wonder if it is simpler to treat the quarantine while accepting dubious foreign data via "git push" in receive-pack as an exception, not a norm.
On Thu, Sep 30, 2021 at 03:16:19PM +0200, Ævar Arnfjörð Bjarmason wrote: > I also wonder how much if anything writing out the one file v.s. lots of > loose objects is worthwhile on systems where we could write out those > loose objects on a ramdisk, which is commonly available on e.g. Linux > distros these days out of the box. If you care about performance but not > about your transitory data using a ramdisk is generally much better than > any other potential I/O optimization. I'd think in general we won't be using a ramdisk, because tmp_objdir is putting its directory inside $GIT_DIR/objects. It doesn't _have_ to, but using a ramdisk works against its original purpose (which was to store potentially quite a lot of data from an incoming push, and to be able to rename it cheaply into its final resting place). It would probably not be too hard to provide a flag that indicates the intended use, though (and then we decide where to create the temporary directory based on that). > Finally, and I don't mean to throw a monkey wrench into this whole > discussion, so take this as a random musing: I wonder how much faster > this thing could be on its second run if instead of avoiding writing to > the store & cleaning up, it just wrote to the store, and then wrote > another object keyed on the git version and any revision parameters > etc., and then pretty much just had to do a "git cat-file -p <that-obj>" > to present the result to the user :) > > I suppose that would be throwing a lot more work at an eventual "git gc" > than we ever do now, so maybe it's a bit crazy, but I think it might be > an interesting direction in general to (ab)use either the primary or > some secondary store in the .git dir as a semi-permanent cache of > resolved queries from the likes of "git log". I don't think it's crazy to just write the objects to the main object store. 
We already generate cruft objects for some other operations (Junio asked elsewhere in the thread about virtual trees for recursive merges; I don't know the answer offhand, but I'd guess we do there). They do get cleaned up eventually. I'm not sure it helps performance much by itself. In a merge (or even just writing a tree out from the index), by the time you realize you already have the object, you've done most of the work to generate it. I think what you're describing is to make some kind of cache structure on top. That might be sensible (and indeed, the index already does this with the cachetree extension). But it can also easily come later if the objects are just in the regular odb. -Peff
On Thu, Sep 30, 2021 at 12:26 AM Jeff King <peff@peff.net> wrote: > > On Tue, Sep 28, 2021 at 10:05:31PM -0700, Elijah Newren wrote: > > > So there are other parts of running `git log` that are not read-only, > > besides what I'm doing. That's unfortunate. Can we just disable > > those things in this section of the code? > > It might be possible for the cache (we'd still generate entries, but > just not write them). It's "just" an optimization, though it's possible > there are cases where it's really annoying to do so. I'd imagine those > are a stretch. I do use textconv to decrypt gpg blobs with a key on a > hardware token, for example. I don't require a touch for each one, > though, nor would I use textconv caching anyway (since the point is not > to store the unencrypted contents). > > It wouldn't necessarily be OK to disable object writes in general. > Though I'd think since a diff is _conceptually_ a read-only operation > that any writes it wants to do would be in the same caching / > optimization boat. I'd also worry about the maintenance cost of having > to annotate any such case. > > > Also, I think it makes sense that textconv caching causes a problem > > via writing new blobs and trees *and* refs that reference these blobs > > and trees. However, I'm not sure I'm really grasping it, because I > > don't see how your solutions are safe either. I'll mention more after > > each one. > > Heh. So I originally wrote more on this in my earlier email, but worried > I was getting too into the weeds and would confuse you, so I pared it > down. But I managed to be confusing anyway. :) > > Here's one mental model. If you have a temporary object store, then > these are the bad things that can happen: > > 1. Some code may write an object to it, thinking the data is saved, > but we later throw it away. > > 2. Some code may _skip_ writing an object, thinking that we already > have that object available (when in fact we'll later throw it > away). > > 3. 
Some code may record the fact that we have an object in memory, and > then later (after the temporary object store is not in use), may > point to it when writing an object or reference. > > Pointing the primary object-dir at a temporary directory (like your > patch does) means we're subject to all three. > > Problem (1) happens because unrelated code may not realize we've swapped > out the primary object writing code for this throw-away directory. But > we could require writers to explicitly indicate that they want to write > to the temporary area, while allowing other writes to go to the regular > object store. This could be done with a flag or variant of > write_object_file(). Or if the temporary area isn't a regular object > directory at all (like the whole tempfile / pretend thing), then this > happens automatically, because the regular object-write paths don't even > know about our temporary area. > > Problem (2) is mostly handled inside write_object_file(). It will > optimize out writes of objects we already have. So it needs to know when > we have an object "for real", and when we just have it in our temporary > store. If we make the tmp_objdir the primary, then it doesn't know this > (though we could teach it to check). In the pretend_object_file() system, > it already knows this (it uses the "freshen" code paths which actually > want to find the entry in the filesystem). > > Problem (3) is the hardest one, because we don't really distinguish > between readers and writers. So if I call has_object_file(), is it > because I want to access the object? Or is it because I want to generate > a new reference to it, which must ensure that it exists? One of those is > OK to look at the tmp_objdir (and indeed, is the whole point of making > it in the first place), and the other is the first step to corrupting > the repository. 
;) > > With the environment variables we set up for running external commands > (like hooks) in the receive-pack quarantine, one bit of safety we have > is to prevent ref writes entirely in those commands. That works > reasonably well. Even if they may write new objects that might get > thrown away (if the push is rejected), they'd ultimately need to point > to those objects with a ref. And because those hooks spend their > _entire_ run-time in the quarantine state, they can't ever write a ref. > > Whereas doing it in-process is a little dicier. Right now the textconv > cache writes out a new tree for every entry we add, so it happens to do > the ref write while the tmp-objdir is still in place. But as this > comment notes: > > $ git grep -B6 notes_cache_write diff.c > diff.c- /* > diff.c- * we could save up changes and flush them all at the end, > diff.c- * but we would need an extra call after all diffing is done. > diff.c- * Since generating a cache entry is the slow path anyway, > diff.c- * this extra overhead probably isn't a big deal. > diff.c- */ > diff.c: notes_cache_write(textconv->cache); > > this is slow, and it would be perfectly reasonable to flush the cache at > the end of the process (e.g., using an atexit() handler). In which case > we'd hit this problem (3) exactly: we'd generate and remember objects > during the tmp-objdir period, but only later actually reference them. > This is more likely than the receive-pack hook case, because we're doing > it all in process. > > > > If you remove the tmp_objdir as the primary as soon as you're done with > > > the merge, but before you run the diff, you might be OK, though. > > > > It has to be after I run the diff, because the diff needs access to > > the temporary files to diff against them. > > Right, of course. I was too fixated on the object-write part, forgetting > that the whole point of the exercise is to later read them back. 
:) > > > > If not, then I think the solution is probably not to install this as the > > > "primary", but rather: > > > > > > - do the specific remerge-diff writes we want using a special "write > > > to this object dir" variant of write_object_file() > > > > > > - install the tmp_objdir as an alternate, so the reading side (which > > > is diff code that doesn't otherwise know about our remerge-diff > > > trickery) can find them > > > > > > And that lets other writers avoid writing into the temporary area > > > accidentally. > > > > Doesn't this have the same problem? If something similar to textconv > > caching were to create new trees that referenced the existing blobs > > and then added a ref that referred to that tree, then you have the > > exact same problem, right? The new tree and the new ref would be > > corrupt as soon as the tmp-objdir went away. Whether or not other > > callers write to the temporary area isn't the issue, it's whether they > > write refs that can refer to anything from the temporary area. Or am > > I missing something? > > The key thing here is in the first step, where remerge-diff is > explicitly saying "I want to write to this temporary object-dir". But > the textconv-cache code does not; it gets to write to the normal spot. > So it avoids problem (1). > > You're right that it does not avoid problem (3) exactly. But triggering > that would require some code not just writing other objects or > references while the tmp-objdir is in place, but specifically > referencing the objects that remerge-diff did put into the tmp-objdir. > That seems a lot less likely to me (because the thing we're most worried > about is unrelated code that just happens to write while the tmp-objdir > is in place). > > > > In that sense this is kind of like the pretend_object_file() interface, > > > except that it's storing the objects on disk instead of in memory. 
Of > > > course another way of doing that would be to stuff the object data into > > > tempfiles and just put pointers into the in-memory cached_objects array. > > > > > > It's also not entirely foolproof (nor is the existing > > > pretend_object_file()). Any operation which is fooled into thinking we > > > have object X because it's in the fake-object list or the tmp_objdir may > > > reference that object erroneously, creating a corruption. But it's still > > > safer than allowing arbitrary writes into the tmp_objdir. > > > > Why is the pretend_object_file() interface safer? Does it disable the > > textconv caching somehow? I don't see why it helps avoid this > > problem. > > So hopefully it's clearer now from what I wrote above, but just > connecting the dots: > > 1. Unrelated code calling write_object_file() would still write real, > durable objects, as usual. > > 2. The write_object_file() "skip this write" optimization already > ignores the pretend_object_file() objects while checking "do we > have this object". > > > Interesting. I'm not familiar with the pretend_object_file() code, > > but I think I catch what you're saying. I don't see anything in > > object-file.c that allows removing items from the cache, as I would > > want to do, but I'm sure I could add something. > > Right, it's pretty limited now, and there's only one caller. And in fact > I've wanted to get rid of it, exactly because it can create the danger > we're discussing here. But I still think it's _less_ dangerous than > fully replacing the object directory. > > > (I'm still not sure how this avoids the problem of things like > > textconv caching trying to write an object that references one of > > these, though. Should we just manually disable textconv caching > > during a remerge-diff?) > > It can't avoid it completely, but in general, the textconv cache should > be writing and referencing its own objects. 
Even if they happen to be > the same as something remerge-diff generated, it won't know that until > it has tried to write it (and so that's why having write_object_file() > continue to really write the object is important). > > > Hopefully that clears up my line of thinking. I do think a lot of this > is kind of theoretical, in the sense that: > > - textconv caching is an obscure feature that we _could_ just disable > during remerge-diffs > > - there's a reasonable chance that there isn't any other code that > wants to write during a diff > > But this is all scary and error-prone enough that I'd prefer an approach > that has the "least surprise". So any solution where random code calling > write_object_file() _can't_ accidentally write to a throw-away directory > seems like a safer, less surprising thing. Yes, thanks for taking the time to go into this detail. > It does mean that the remerge-diff code needs to be explicit in its > object writes to say "this needs to go to the temporary area" (whether > it's a flag, or calling into a totally different function or subsystem). > I'm hoping that's not hard to do (because its writes are done explicitly > by the remerge-diff code), but I worry that it may be (because you are > relying on more generic tree code to do the write under the hood). All blob and tree objects written during merge-ort are done by explicit write_object_file() calls that exist within merge-ort.c. So this approach should be doable; it just needs the external caller to set some flag saying to write these objects elsewhere.
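The flag-based approach Elijah describes can be sketched as follows (all names here are invented for illustration, not git's actual API): only callers that explicitly ask for the scratch destination can write into the throw-away area, so stray writes stay durable.

```c
#include <assert.h>

/* Invented names: writes land in the throw-away area only when a caller
 * explicitly asks for WRITE_SCRATCH, so unrelated write_object_file()-style
 * calls always stay durable. */
enum write_dest { WRITE_DURABLE, WRITE_SCRATCH };

struct odb_sim {
    int durable;   /* objects in the real object store */
    int scratch;   /* objects in the temporary, throw-away area */
};

static void write_obj(struct odb_sim *odb, enum write_dest dest)
{
    if (dest == WRITE_SCRATCH)
        odb->scratch++;    /* opt-in callers only (the remerge machinery) */
    else
        odb->durable++;    /* everyone else, e.g. a textconv-cache write */
}

/* Discarding the scratch area can no longer orphan a durable write. */
static void drop_scratch(struct odb_sim *odb)
{
    odb->scratch = 0;
}
```

This addresses problem (1) by construction; problems (2) and (3) still need the skip-write check and any later references to ignore scratch objects.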
On Thu, Sep 30, 2021 at 12:46 AM Jeff King <peff@peff.net> wrote: > > On Thu, Sep 30, 2021 at 03:26:42AM -0400, Jeff King wrote: > > > > > If you remove the tmp_objdir as the primary as soon as you're done with > > > > the merge, but before you run the diff, you might be OK, though. > > > > > > It has to be after I run the diff, because the diff needs access to > > > the temporary files to diff against them. > > > > Right, of course. I was too fixated on the object-write part, forgetting > > that the whole point of the exercise is to later read them back. :) > > Ah, no, I remember what I was trying to say here. The distinction is > between "remove the tmp_objdir" and "remove it as the primary". > > I.e., if you do this: > > 1. create tmp_objdir > > 2. make tmp_objdir primary for writes > > 3. run the "merge" half of remerge-diff, writing objects into the > temporary space > > 4. stop having tmp_objdir as the primary; instead make it an alternate > > 5. run the diff > > 6. remove tmp_objdir totally > > Then step 5 can't accidentally write objects into the temporary space, > but it can still read them. So it's not entirely safe, but it's safer, > and it would be a much smaller change. Interesting. > Some ways it could go wrong: > > - is it possible for the merge code to ever write an object? I kind of > wonder if we'd ever do any cache-able transformations as part of a > content-level merge. I don't think we do now, though. Yes, of course -- otherwise there would have been no need for the tmp_objdir in the first place. In particular, it needs to write three-way-content merges of individual files, and it needs to write new tree objects. (And it needs to do this both for creating the virtual merge bases if the merge is recursive, as well as doing it for the outer merge.) It doesn't write anything for caching reasons, such as line ending normalizations (that's all kept in-memory and done on demand). 
> - in step 5, write_object_file() may still be confused by the presence > of the to-be-thrown-away objects in the alternate. This is pretty > unlikely, as it implies that the remerge-diff wrote a blob or tree > that is byte-identical to something that the diff wants to write. That's one reason it could be confused. The textconv filtering in particular was creating a new blob based on an existing object, plus a tree and a ref that reference it. What if there were some other form of caching or statistics gathering that didn't write a new object based on an existing one, but did add trees and especially refs that referenced the existing object? It's not that the diff wanted to write something byte-identical to what the merge wrote; it's just that the diff wants to reference the object somehow.
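The six-step sequence Peff lays out (primary during the merge, alternate during the diff, then removal) can be modeled as a tiny state machine. This is a toy sketch with invented names, not git's tmp_objdir API:

```c
#include <assert.h>

/* States of the throw-away object directory over a remerge-diff run. */
enum tmp_state { TMP_ABSENT, TMP_PRIMARY, TMP_ALTERNATE };

struct repo_sim {
    enum tmp_state tmp;
    int tmp_objects;    /* merge results parked in the temporary dir */
    int real_objects;   /* anything written while tmp is not primary */
};

/* Step 3: merge writes land in tmp only while it is the primary. */
static void write_obj(struct repo_sim *r)
{
    if (r->tmp == TMP_PRIMARY)
        r->tmp_objects++;
    else
        r->real_objects++;
}

/* Step 5: the diff can still read tmp objects while it is an alternate. */
static int can_read_tmp(const struct repo_sim *r)
{
    return r->tmp != TMP_ABSENT;
}

/* Step 6: removing tmp throws its contents away. */
static void remove_tmp(struct repo_sim *r)
{
    r->tmp = TMP_ABSENT;
    r->tmp_objects = 0;
}
```

The point of the demotion in step 4 is visible in the model: any write issued during the diff phase lands in `real_objects` and survives `remove_tmp()`.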
On Thu, Sep 30, 2021 at 6:31 AM Ævar Arnfjörð Bjarmason <avarab@gmail.com> wrote: > > On Thu, Sep 30 2021, Jeff King wrote: > > > On Tue, Sep 28, 2021 at 09:08:00PM -0700, Junio C Hamano wrote: > > > >> Jeff King <peff@peff.net> writes: > >> > >> > Side note: The pretend_object_file() approach is actually even better, > >> > because we know the object is fake. So it does not confuse > >> > write_object_file()'s "do we already have this object" freshening > >> > check. > >> > > >> > I suspect it could even be made faster than the tmp_objdir approach. > >> > From our perspective, these objects really are tempfiles. So we could > >> > write them as such, not worrying about things like fsyncing them, > >> > naming them into place, etc. We could just write them out, then mmap > >> > the results, and put the pointers into cached_objects (currently it > >> > insists on malloc-ing a copy of the input buffer, but that seems like > >> > an easy extension to add). > >> > > >> > In fact, I think you could get away with just _one_ tempfile per > >> > merge. Open up one tempfile. Write out all of the objects you want to > >> > "store" into it in sequence, and record the lseek() offsets before and > >> > after for each object. Then mmap the whole result, and stuff the > >> > appropriate pointers (based on arithmetic with the offsets) into the > >> > cached_objects list. > >> > >> Cute. The remerge diff code path creates a full tree that records > >> the mechanical merge result. By hooking into the lowest layer of > >> write_object() interface, we'd serialize all objects in such a tree > >> in the order they are computed (bottom up from the leaf level, I'd > >> presume) into a single flat file ;-) > > > > I do still like this approach, but just two possible gotchas I was > > thinking of: > > > > - This side-steps all of our usual code for getting object data into > > memory. 
In general, I'd expect this content to not be too enormous, > > but it _could_ be if there are many / large blobs in the result. So > > we may end up with large maps. Probably not a big deal on modern > > 64-bit systems. Maybe an issue on 32-bit systems, just because of > > virtual address space. > > > > Likewise, we do support systems with NO_MMAP. They'd work here, but > > it would probably mean putting all that object data into the heap. I > > could live with that, given how rare such systems are these days, and > > that it only matters if you're using --remerge-diff with big blobs. > > > > - I wonder to what degree --remerge-diff benefits from omitting writes > > for objects we already have. I.e., if you are writing out a whole > > tree representing the conflicted state, then you don't want to write > > all of the trees that aren't interesting. Hopefully the code is > > already figuring out which paths the merge even touched, and ignoring > > the rest. It probably benefits performance-wise from > > write_object_file() deciding to skip some object writes, as well > > (e.g., for resolutions which the final tree already took, as they'd > > be in the merge commit). The whole pretend-we-have-this-object thing > > may want to likewise make sure we don't write out objects that we > > already have in the real odb. > > I haven't benchmarked since my core.checkCollisions RFC patch[1] > resulted in the somewhat related loose object cache patch from you, and > not with something like the midx, but just a note that on some setups > just writing things out is faster than exhaustively checking if we > absolutely need to write things out. > > I also wonder how much if anything writing out the one file v.s. lots of > loose objects is worthwhile on systems where we could write out those > loose objects on a ramdisk, which is commonly available on e.g. Linux > distros these days out of the box. 
If you care about performance but not > about your transitory data using a ramdisk is generally much better than > any other potential I/O optimization. > > Finally, and I don't mean to throw a monkey wrench into this whole > discussion, so take this as a random musing: I wonder how much faster > this thing could be on its second run if instead of avoiding writing to > the store & cleaning up, it just wrote to the store, and then wrote It'd be _much_ slower. My first implementation in fact did that; it just wrote objects to the store, left them there, and didn't bother to do any auto-gcs. It slowed down quite a bit as it ran. Adding auto-gcs during the run was really slow too. But stepping back, gc'ing objects that I already knew were garbage seemed like a waste; why not just prune them pre-emptively? To do so, though, I'd have to track all the individual objects I added to make sure I didn't prune something else. Following that idea through a few different attempts eventually led me to the discovery of tmp_objdir. In case it's not clear to you why just writing all the objects to the normal store and leaving them there slows things down so much... Let's say 1 in 10 merges had originally needed some kind of conflict resolution (either in the outer merge or in the inner merge for the virtual merge bases), meaning that 9 out of 10 merges traversed by --remerge-diff don't write any objects. Now for each merge for which --remerge-diff does need to create conflicted blobs and new trees, let's say it writes on average 3 blobs and 7 trees. (I don't know the real average numbers; it could well be ~5 total, but ~10 seems a realistic first-order approximation and it makes the math easy.) Then, if we keep all objects we write, `git log --remerge-diff` on a history with 100,000 merge commits will have added 100,000 loose objects by the time it finishes. That means that all diff and merge operations slow down considerably as it runs, due to all the extra loose objects. 
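The arithmetic above, spelled out (the 1-in-10 and ~10-objects-per-conflicted-merge figures are Elijah's rough estimates, not measurements):

```c
#include <assert.h>

/* Loose objects left behind if every remerge-diff write stayed in the
 * real odb: (merges that needed conflict resolution) times (objects
 * written per such merge). */
static long leftover_objects(long merges, long conflict_every_nth,
                             long objects_per_conflicted_merge)
{
    return (merges / conflict_every_nth) * objects_per_conflicted_merge;
}
```

With 100,000 merges, 1 conflicted merge in 10, and ~10 objects per conflicted merge, this yields 100,000 leftover loose objects, matching the estimate in the text.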
> another object keyed on the git version and any revision parameters > etc., and then pretty much just had to do a "git cat-file -p <that-obj>" > to present the result to the user :) So caching the full `git log ...` output, based on a hash of the command-line flags, and then merely re-showing it later? And having that output be invalidated as soon as any head advances? Or are you thinking of caching the output per-commit based on a hash of the other command-line flags...potentially slowing non-log operations down with the huge number of loose objects? > I suppose that would be throwing a lot more work at an eventual "git gc" > than we ever do now, so maybe it's a bit crazy, but I think it might be > an interesting direction in general to (ab)use either the primary or > some secondary store in the .git dir as a semi-permanent cache of > resolved queries from the likes of "git log". If you do per-commit caching, and the user scrolls through enough output (not hard to do; just searching the output for some string often is enough), that "eventual" git-gc will be the very next git operation. If you cache the entire output, it'll be invalidated pretty quickly. So I don't see how this works. Or am I misunderstanding something you're suggesting here?
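Peff's one-tempfile idea quoted earlier in this message (append every temporary object to a single file, record offsets, mmap the result) is easy to prototype with plain POSIX calls. This is a standalone sketch with invented names, not git code:

```c
#include <assert.h>
#include <stdlib.h>
#include <string.h>
#include <sys/mman.h>
#include <sys/types.h>
#include <unistd.h>

/* One append-only stream of temporary "objects": remember each object's
 * [start,end) offsets, mmap the whole file once, and serve reads by
 * pointer arithmetic. */
struct fake_obj { off_t start, end; };

static int store_fake_obj(int fd, const void *buf, size_t len,
                          struct fake_obj *o)
{
    o->start = lseek(fd, 0, SEEK_END);
    if (o->start < 0 || write(fd, buf, len) != (ssize_t)len)
        return -1;
    o->end = o->start + (off_t)len;
    return 0;
}

static const char *obj_ptr(const char *map, const struct fake_obj *o)
{
    return map + o->start;
}

/* End-to-end check: 0 if both objects read back intact through the map. */
static int demo(void)
{
    char path[] = "/tmp/objstream-XXXXXX";
    struct fake_obj a, b;
    int fd = mkstemp(path);
    if (fd < 0)
        return -1;
    unlink(path);                /* contents vanish when fd is closed */
    if (store_fake_obj(fd, "blob one", 8, &a) ||
        store_fake_obj(fd, "tree two!", 9, &b))
        return -1;
    size_t total = (size_t)lseek(fd, 0, SEEK_END);
    char *map = mmap(NULL, total, PROT_READ, MAP_PRIVATE, fd, 0);
    if (map == MAP_FAILED)
        return -1;
    int ok = !memcmp(obj_ptr(map, &a), "blob one", 8) &&
             !memcmp(obj_ptr(map, &b), "tree two!", 9);
    munmap(map, total);
    close(fd);
    return ok ? 0 : -1;
}
```

Unlinking the file immediately after mkstemp() gives the no-fsync, no-rename, automatically-cleaned-up behavior Peff describes: the data lives only as long as the descriptor and mapping do.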
On Thu, Sep 30, 2021 at 7:39 AM Ævar Arnfjörð Bjarmason <avarab@gmail.com> wrote: > > On Tue, Sep 28 2021, Elijah Newren wrote: > > > On Tue, Sep 28, 2021 at 1:00 AM Ævar Arnfjörð Bjarmason > > <avarab@gmail.com> wrote: > >> > >> On Tue, Aug 31 2021, Elijah Newren via GitGitGadget wrote: > >> > >> I commented just now on how this API is duplicated between here & > >> another in-flight series in > >> https://lore.kernel.org/git/87v92lxhh4.fsf@evledraar.gmail.com/; Just > >> comments on this patch here: > >> > >> > diff --git a/tmp-objdir.c b/tmp-objdir.c > >> > index b8d880e3626..9f75a75d1c0 100644 > >> > --- a/tmp-objdir.c > >> > +++ b/tmp-objdir.c > >> > @@ -288,7 +288,36 @@ const char **tmp_objdir_env(const struct tmp_objdir *t) > >> > return t->env.v; > >> > } > >> > > >> > +const char *tmp_objdir_path(struct tmp_objdir *t) > >> > +{ > >> > + return t->path.buf; > >> > +} > >> > >> Not your fault & this pre-dates your patch, but FWIW I prefer our APIs > >> that don't have these "hidden struct" shenanigans (like say "struct > >> string_list") so a caller could just access this, we can just declare it > >> "const" appropriately. > >> > >> We're also all in-tree friends here, so having various accessors for no > >> reason other than to access struct members seems a bit too much. > > > > That's a fair point, but just to play counterpoint for a minute... > > > > FWIW, I dislike when our public facing APIs are polluted with all > > kinds of internal details. merge-recursive being a case in point. > > When writing merge-ort, although I had a struct with public fields, > > that struct also contained an opaque struct (pointer) within it to > > hide several private fields. (I would have liked to hide or remove a > > few more fields, but couldn't do so while the merge_recursive_options > > struct was shared between both merge-ort and merge-recursive.) > > > > That said, I agree it can certainly be overdone, and tmp_objdir is > > pretty simple. 
However, sometimes even in simple cases I like when > > folks make use of an opaque struct because it means folks put some > > effort into thinking more about the API that should be exposed. > > That's something we as a project have often overlooked in the past, as > > evidenced both by our one-shot usage mentality, and the existence of > > external projects like libgit2 attempting to correct this design > > shortcoming. I'd like git to move more towards being structured as a > > reusable library as well as a useful command-line tool. > > [This is somewhat of a continuation of > https://lore.kernel.org/git/87lf3etaih.fsf@evledraar.gmail.com/]. > > I like our APIs being usable, but right now we're not promising any > maintenance of APIs that work the same as git v2.30.0 or whatever. > > I think some external users directly link to libgit.a, but they're > tracking whatever churn we have from release to release, just like > someone maintaining out-of-tree linux kernel drivers. > > For maintenance of git.git itself I think it's going to stay like that > for any foreseeable future[1]. > > And that's before we even get into the practicalities of git's license > likely not winning over many (if any) libgit2 users. I think those users > are if anything migrating from libgit2 to git's plumbing CLI, or in some > cases folding what they needed into git.git itself (or just staying with > libgit2). > > So assuming all of that I really don't see the point in general of > having a "foo" field that has the equivalent of get_foo() and set_foo() > accessors in the context of git.git. > > That's something that's really useful if you're maintaining API and ABI > compatibility, but for us it's usually just verbosity. All the users are > in-tree, so just add "const" as appropriate, or perhaps we should give > the field a "priv" name or whatever. 
It wasn't just get or set accessors, and there was more involved than just adding const: * There was a typechange from struct strbuf -> const char *; strbuf != char * * Having an API that _enforces_ accesses to be const provides more safety. Relying on callers to remember to declare their variable access as const provides virtually none. But more importantly than either of the above, tmp_objdir was locked down because it was too easy to make big mistakes with, as Peff pointed out. That meant that folks would only use it through the API, and if folks tried to do unusual things, they'd do so by adding new API, which is easier to flag in code reviews and makes it easier to ensure the right people get cc'ed than if folks just reach into some internals from some other file. And the new API that I added resulted in a discussion that pointed out I didn't need to access that field at all and should instead create and delete the tmp_objdir after every merge. > Obviously not everything should be done that way, e.g. being able to > have trace2 start/end points for something you'd otherwise push/clear > from a "struct string_list" or whatever directly *is* usable, so it's > not black and white. > > I'm just saying that using convoluted patterns *just* to get in the way > of some hypothetical code that's going to mess with your internals can > make things harder in other areas. E.g. the init/release pattern via > macros I mentioned in the linked E-Mail above. Fair points, but I just don't think that's relevant to the context here; i.e. these aren't convoluted patterns *just* to get in the way. Further, I don't see how this would adversely affect the *_INIT macros you mentioned at all. tmp_objdir is an opaque struct so you can only declare a pointer to it. Pointers being initialized to NULL is handled quite nicely by the *_INIT macros. > > [...] 
> >> In this particular case I hadn't understood on a previous reading of > >> tmp-objdir.c why this "path" isn't a statically allocated "char > >> path[PATH_MAX]", or at least that the hardcoded "1024" should be a > >> PATH_MAX, as it is in some other cases. > > > > Actually, PATH_MAX shouldn't be used: > > > > https://lore.kernel.org/git/7d1237c7-5a83-d766-7d93-5f0d59166067@web.de/ > > https://lore.kernel.org/git/20180720180427.GE22486@sigill.intra.peff.net/ > > Thanks. I didn't know about that. I've fixed (some of that) up in some > semi-related WIP work I've got. > > 1. Aside: I *do* think it would be really useful for us to expose a > stable C API eventually, using some of the same license > workarounds/shenanigans (depending on your view) as the Linux kernel > did. I.e. to say that "we're GPLv2, but this small API is the > equivalent of our syscall interface". > > Rather than that API being something that would look anything like > internal git.git users it would basically be a run_command() + Perl's > unpack/pack + output formatter on steroids. > > I.e. we'd be able to invoke say a 'git cat-file --batch' in the same > process as the caller by essentially calling cmd_cat_file() > directly. Just with some callback replacements for "provide this on > stdin" and "this line accepts stdout" without actually involving > different processes or I/O. > > You'd then have the equivalent of a --format output that would just > so happen to be in your OS's / C ABI's understood struct > format. Similar to "perldoc -f pack". > > The maintenance burden of such an API would be trivial. It would > basically be a handful of functions just to set up input/output > callbacks. > > Most of the interface would be a rather convoluted construction of > command-line parameters (like our own run_command() use), not ideal, > but most users are after structured data and eliminating process > overhead, and they'd get that. 
> > We would need to replace fprintf() and the like with some structured > formatting mechanism, but that would be useful for non-API users as > new commands learning a shiny new --format parameter. You could even > use such a --format to have a non-C-API user get the same struct > output, perhaps for direct mapping into memory, say if you didn't > trust the built-in/command in question to not leak memory yet. > > I don't know if this has been suggested before, just my 0.02. It's > also partially why I'm interested in making even-built-ins not leak > memory with some of my parallel SANITIZE=leak work. > > Once you've got that you can usually call the cmd_whatever() in a > loop without much effort, and once you've got that, well, see above > :) This sounds really cool.
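The in-process "call cmd_cat_file() directly with I/O callbacks" idea from the quoted message could be shaped roughly like the sketch below. Everything here is a hypothetical illustration — fake_cat_file_batch, the callback typedefs, and struct batch_io are invented names, not git's API — it only shows the general shape of replacing pipes and child processes with same-process callbacks:

```c
/* Hypothetical sketch of an in-process built-in driven by callbacks
 * instead of stdin/stdout pipes; none of these names exist in git. */
#include <assert.h>
#include <stdio.h>
#include <string.h>

typedef const char *(*input_fn)(void *state);            /* next "stdin" line, or NULL */
typedef void (*output_fn)(void *state, const char *line); /* one "stdout" line */

struct batch_io {
	const char **requests;   /* NULL-terminated object names to "read" */
	char last_reply[64];     /* most recent "output" line seen */
	int replies;
};

static const char *next_request(void *state)
{
	struct batch_io *io = state;
	return *io->requests ? *io->requests++ : NULL;
}

static void collect_reply(void *state, const char *line)
{
	struct batch_io *io = state;
	snprintf(io->last_reply, sizeof(io->last_reply), "%s", line);
	io->replies++;
}

/*
 * Stand-in for an in-process `git cat-file --batch`: consume "stdin"
 * via the input callback, emit "stdout" via the output callback, with
 * no child process or I/O redirection involved.
 */
static void fake_cat_file_batch(input_fn in, output_fn out, void *state)
{
	const char *oid;
	char line[64];

	while ((oid = in(state))) {
		/* a real implementation would look the object up; here we
		 * just fabricate a --batch-style header line */
		snprintf(line, sizeof(line), "%s blob 42", oid);
		out(state, line);
	}
}
```

The caller supplies the two callbacks and a state pointer, which is essentially the "handful of functions just to set up input/output callbacks" the quoted message describes.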
On Thu, Sep 30, 2021 at 1:06 PM Junio C Hamano <gitster@pobox.com> wrote: > > Jeff King <peff@peff.net> writes: > > > - is it possible for the merge code to ever write an object? I kind of > > wonder if we'd ever do any cache-able transformations as part of a > > content-level merge. I don't think we do now, though. > > How is virtual merge base, result of an inner merge that recursively > merging two merge bases, fed to an outer merge as the ancestor? > Isn't it written as a tree object by the inner merge as a tree with > full of blob objects, so the outer merge does not have to treat a > merge base tree that is virtual any specially from a sole merge > base? Yes. And that's not unique to the inner merges; it also writes new blob and tree objects for the outer merge, and then passes the resulting toplevel tree id out as the result.
On Thu, Sep 30, 2021 at 12:33 AM Jeff King <peff@peff.net> wrote: > > On Tue, Sep 28, 2021 at 09:08:00PM -0700, Junio C Hamano wrote: > > > Jeff King <peff@peff.net> writes: > > > > > Side note: The pretend_object_file() approach is actually even better, > > > because we know the object is fake. So it does not confuse > > > write_object_file()'s "do we already have this object" freshening > > > check. > > > > > > I suspect it could even be made faster than the tmp_objdir approach. > > > From our perspective, these objects really are tempfiles. So we could > > > write them as such, not worrying about things like fsyncing them, > > > naming them into place, etc. We could just write them out, then mmap > > > the results, and put the pointers into cached_objects (currently it > > > insists on malloc-ing a copy of the input buffer, but that seems like > > > an easy extension to add). > > > > > > In fact, I think you could get away with just _one_ tempfile per > > > merge. Open up one tempfile. Write out all of the objects you want to > > > "store" into it in sequence, and record the lseek() offsets before and > > > after for each object. Then mmap the whole result, and stuff the > > > appropriate pointers (based on arithmetic with the offsets) into the > > > cached_objects list. > > > > Cute. The remerge diff code path creates a full tree that records > > the mechanical merge result. By hooking into the lowest layer of > > write_object() interface, we'd serialize all objects in such a tree > > in the order they are computed (bottom up from the leaf level, I'd > > presume) into a single flat file ;-) > > I do still like this approach, but just two possible gotchas I was > thinking of: > > - This side-steps all of our usual code for getting object data into > memory. In general, I'd expect this content to not be too enormous, > but it _could_ be if there are many / large blobs in the result. So > we may end up with large maps. 
Probably not a big deal on modern
> 64-bit systems. Maybe an issue on 32-bit systems, just because of
> virtual address space.
>
> Likewise, we do support systems with NO_MMAP. They'd work here, but
> it would probably mean putting all that object data into the heap. I
> could live with that, given how rare such systems are these days, and
> that it only matters if you're using --remerge-diff with big blobs.

Um, I'm starting to get uncomfortable with this pretend_object stuff. Part of the reason that merge-ort isn't truly "in memory" despite attempting to do exactly that, was because for large enough repos with enough files modified on both sides, I wasn't comfortable assuming that all new files from three-way content merges and all new trees fit into memory. I'm sure we'd be fine with current-day linux kernel sized repos. No big deal. In fact, most merges probably don't add more than a few dozen new files. But for microsoft-sized repos, with repos tending to grow over time (more so when the tools themselves scale nicely, which we've all been working on enabling), I worry there might be enough new objects within a single merge (especially given the recursive inner merges) that we might need to worry about this.

> - I wonder to what degree --remerge-diff benefits from omitting writes
> for objects we already have. I.e., if you are writing out a whole
> tree representing the conflicted state, then you don't want to write
> all of the trees that aren't interesting. Hopefully the code is
> already figuring out which paths the merge even touched, and ignoring
> the rest.

Not only do you want to avoid writing all of the trees that aren't interesting, you also want to avoid traversing into them in the first place and avoid doing trivial file merges for each entry underneath. Sadly, merge-recursive did so anyway, because renames and directory renames can sometimes throw a big wrench into the desire to avoid traversing into such directories.
Figuring out how to avoid it was kind of tricky, but merge-ort definitely handles this when it can safely do so; see the cover letter for my trivial directory resolution series[1]. [1] https://lore.kernel.org/git/pull.988.v4.git.1626841444.gitgitgadget@gmail.com/ > It probably benefits performance-wise from > write_object_file() deciding to skip some object writes, as well > (e.g., for resolutions which the final tree already took, as they'd > be in the merge commit). Yes, it will also benefit from that, but the much bigger side is avoiding needlessly recursing into directories unchanged on (at least) one side. > The whole pretend-we-have-this-object thing > may want to likewise make sure we don't write out objects that we > already have in the real odb. Right, so I'd have to copy the relevant logic from write_object_file() -- I think that means instead of write_object_file()'s calls to freshen_packed_object() and freshen_loose_object() that I instead call find_pack_entry() and make has_loose_object() in object-file.c not be static and then call it. Does that sound right? Of course, that's assuming we're okay with this pretend_object thing, which I'm starting to worry about.
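For reference, Peff's one-tempfile-per-merge idea quoted earlier in this subthread can be sketched in a standalone way like this — plain stdio and mmap in place of git's tempfile and cached_objects machinery, and all names invented; it only demonstrates the append/record-offsets/mmap-once scheme, not the real implementation:

```c
/* Sketch of the one-tempfile idea: append each "object" to a single
 * temp file, remember its [start,end) offsets, then mmap the whole
 * file once and hand out pointers by offset arithmetic.  Standalone
 * illustration only -- git's cached_objects/tempfile APIs differ. */
#include <assert.h>
#include <stdio.h>
#include <string.h>
#include <sys/mman.h>
#include <unistd.h>

#define MAX_OBJS 16

static long obj_start[MAX_OBJS], obj_end[MAX_OBJS];
static int nr_objs;
static unsigned char *map;
static long map_len;

static void store_obj(FILE *fh, const void *buf, size_t len)
{
	obj_start[nr_objs] = ftell(fh);   /* offset before the write */
	fwrite(buf, 1, len, fh);
	obj_end[nr_objs] = ftell(fh);     /* offset after the write */
	nr_objs++;
}

static void map_all(FILE *fh)
{
	fflush(fh);
	map_len = ftell(fh);
	/* one mapping covers every object; the OS pages it in and out,
	 * so content larger than RAM is fine on systems with mmap */
	map = mmap(NULL, map_len, PROT_READ, MAP_SHARED, fileno(fh), 0);
}

static const unsigned char *obj_ptr(int i, long *len)
{
	*len = obj_end[i] - obj_start[i];
	return map + obj_start[i];        /* pointer by arithmetic */
}
```

Since the tempfile is never fsynced or renamed into place, losing it on a crash costs nothing, which matches the throw-away nature of these objects.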
On Thu, Sep 30, 2021 at 07:31:44PM -0700, Elijah Newren wrote: > > Some ways it could go wrong: > > > > - is it possible for the merge code to ever write an object? I kind of > > wonder if we'd ever do any cache-able transformations as part of a > > content-level merge. I don't think we do now, though. > > Yes, of course -- otherwise there would have been no need for the > tmp_objdir in the first place. In particular, it needs to write > three-way-content merges of individual files, and it needs to write > new tree objects. (And it needs to do this both for creating the > virtual merge bases if the merge is recursive, as well as doing it for > the outer merge.) Right, sorry, I should have added: ...in addition to the merge-results we're writing. What I'm getting at is whether there might be _other_ parts of the code that the merge code would invoke. Say, if a .gitattribute caused us to convert a file's encoding in order to perform a more semantic merge, and we wanted to store that result somehow (e.g., in a similar cache). I don't think anything like that exists now, but I don't find it outside the realm of possibility in the future. It's sort of the merge equivalent of "textconv"; we have external diff and external merge drivers, but being able to convert contents and feed them to the regular text diff/merge code simplifies things. > > - in step 5, write_object_file() may still be confused by the presence > > of the to-be-thrown-away objects in the alternate. This is pretty > > unlikely, as it implies that the remerge-diff wrote a blob or tree > > that is byte-identical to something that the diff wants to write. > > That's one reason it could be confused. The textconv filtering in > particular was creating a new object based on an existing one, and a > tree, and a ref. 
What if there was some other form of caching or > statistic gathering that didn't write a new object based on an > existing one, but did add trees and especially refs that referenced > the existing object? It's not that the diff wanted to write something > byte-identical to what the merge wrote, it's just that the diff wants > to reference the object somehow. Yes, good point. It can help a little if the alternate-odb added by tmp_objdir was marked with a flag to say "hey, this is temporary". That would help write_object_file() decide not to depend on it. But the problem is so much wider than that that I think it will always be a bit dangerous; any code could say "can I access this object" and we don't know if its intent is simply to read it, or to create a new object that references it. -Peff
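The "mark the alternate-odb as temporary" idea could look something like the toy sketch below. This is an invented illustration — git's real struct object_directory and write_object_file() have no such flag or check today — of the rule that a hit in a throw-away odb must not suppress a durable write:

```c
/* Hypothetical sketch: flag a temporary alternate object directory so
 * that durable-write paths refuse to depend on objects that exist only
 * there.  struct fake_odb and have_durably() are invented names. */
#include <assert.h>
#include <string.h>

struct fake_odb {
	const char *path;
	int is_temporary;       /* set for a tmp_objdir-style alternate */
	const char **objects;   /* NULL-terminated list of object names */
};

static int odb_has(const struct fake_odb *o, const char *oid)
{
	for (const char **p = o->objects; *p; p++)
		if (!strcmp(*p, oid))
			return 1;
	return 0;
}

/*
 * "Do we already have this object?" for the purpose of skipping a
 * durable write: only odbs that will outlive the operation count, so
 * an object present only in the temporary odb still gets written.
 */
static int have_durably(const struct fake_odb *odbs, int nr, const char *oid)
{
	for (int i = 0; i < nr; i++)
		if (!odbs[i].is_temporary && odb_has(&odbs[i], oid))
			return 1;
	return 0;
}
```

As Peff notes, even with such a flag the harder problem remains: a mere "can I access this object" query cannot tell whether the caller intends a transient read or a durable reference.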
On Thu, Sep 30, 2021 at 09:26:37PM -0700, Elijah Newren wrote: > > - This side-steps all of our usual code for getting object data into > > memory. In general, I'd expect this content to not be too enormous, > > but it _could_ be if there are many / large blobs in the result. So > > we may end up with large maps. Probably not a big deal on modern > > 64-bit systems. Maybe an issue on 32-bit systems, just because of > > virtual address space. > > > > Likewise, we do support systems with NO_MMAP. They'd work here, but > > it would probably mean putting all that object data into the heap. I > > could live with that, given how rare such systems are these days, and > > that it only matters if you're using --remerge-diff with big blobs. > > Um, I'm starting to get uncomfortable with this pretend_object stuff. > Part of the reason that merge-ort isn't truly "in memory" despite > attempting to do exactly that, was because for large enough repos with > enough files modified on both sides, I wasn't comfortable assuming > that all new files from three-way content merges and all new trees fit > into memory. I'm sure we'd be fine with current-day linux kernel > sized repos. No big deal. In fact, most merges probably don't add > more than a few dozen new files. But for microsoft-sized repos, and > with repos tending to grow over time, more so when the tools > themselves scale nicely (which we've all been working on enabling), > makes me worry there might be enough new objects within a single merge > (especially given the recursive inner merges) that we might need to > worry about this. I do think we need to consider that the content might be larger than will comfortably fit in memory. But the point of using mmap is that we don't have to care. The OS is taking care of it for us (just like it would in regular object files). The question is just whether we're comfortable assuming that mmap exists if you're working on such a large repository. 
I'd guess that big repos are pretty painful without it (and again, I'm not even clear which systems Git runs on even lack mmap these days). So IMHO this isn't really a blocker for going in this direction.

> > The whole pretend-we-have-this-object thing
> > may want to likewise make sure we don't write out objects that we
> > already have in the real odb.
>
> Right, so I'd have to copy the relevant logic from write_object_file()
> -- I think that means instead of write_object_file()'s calls to
> freshen_packed_object() and freshen_loose_object() that I instead call
> find_pack_entry() and make has_loose_object() in object-file.c not be
> static and then call it. Does that sound right?

Yes. You _could_ call the same freshening functions, but I think since we know that our objects are throw-away anyway, we do not even need to perform that freshening operation. I think just has_object_file() would be sufficient for your needs.

-Peff
On Thu, Sep 30 2021, Elijah Newren wrote: > On Thu, Sep 30, 2021 at 6:31 AM Ævar Arnfjörð Bjarmason > <avarab@gmail.com> wrote: >> >> On Thu, Sep 30 2021, Jeff King wrote: >> >> > On Tue, Sep 28, 2021 at 09:08:00PM -0700, Junio C Hamano wrote: >> > >> >> Jeff King <peff@peff.net> writes: >> >> >> >> > Side note: The pretend_object_file() approach is actually even better, >> >> > because we know the object is fake. So it does not confuse >> >> > write_object_file()'s "do we already have this object" freshening >> >> > check. >> >> > >> >> > I suspect it could even be made faster than the tmp_objdir approach. >> >> > From our perspective, these objects really are tempfiles. So we could >> >> > write them as such, not worrying about things like fsyncing them, >> >> > naming them into place, etc. We could just write them out, then mmap >> >> > the results, and put the pointers into cached_objects (currently it >> >> > insists on malloc-ing a copy of the input buffer, but that seems like >> >> > an easy extension to add). >> >> > >> >> > In fact, I think you could get away with just _one_ tempfile per >> >> > merge. Open up one tempfile. Write out all of the objects you want to >> >> > "store" into it in sequence, and record the lseek() offsets before and >> >> > after for each object. Then mmap the whole result, and stuff the >> >> > appropriate pointers (based on arithmetic with the offsets) into the >> >> > cached_objects list. >> >> >> >> Cute. The remerge diff code path creates a full tree that records >> >> the mechanical merge result. By hooking into the lowest layer of >> >> write_object() interface, we'd serialize all objects in such a tree >> >> in the order they are computed (bottom up from the leaf level, I'd >> >> presume) into a single flat file ;-) >> > >> > I do still like this approach, but just two possible gotchas I was >> > thinking of: >> > >> > - This side-steps all of our usual code for getting object data into >> > memory. 
In general, I'd expect this content to not be too enormous, >> > but it _could_ be if there are many / large blobs in the result. So >> > we may end up with large maps. Probably not a big deal on modern >> > 64-bit systems. Maybe an issue on 32-bit systems, just because of >> > virtual address space. >> > >> > Likewise, we do support systems with NO_MMAP. They'd work here, but >> > it would probably mean putting all that object data into the heap. I >> > could live with that, given how rare such systems are these days, and >> > that it only matters if you're using --remerge-diff with big blobs. >> > >> > - I wonder to what degree --remerge-diff benefits from omitting writes >> > for objects we already have. I.e., if you are writing out a whole >> > tree representing the conflicted state, then you don't want to write >> > all of the trees that aren't interesting. Hopefully the code is >> > already figuring out which paths the merge even touched, and ignoring >> > the rest. It probably benefits performance-wise from >> > write_object_file() deciding to skip some object writes, as well >> > (e.g., for resolutions which the final tree already took, as they'd >> > be in the merge commit). The whole pretend-we-have-this-object thing >> > may want to likewise make sure we don't write out objects that we >> > already have in the real odb. >> >> I haven't benchmarked since my core.checkCollisions RFC patch[1] >> resulted in the somewhat related loose object cache patch from you, and >> not with something like the midx, but just a note that on some setups >> just writing things out is faster than exhaustively checking if we >> absolutely need to write things out. >> >> I also wonder how much if anything writing out the one file v.s. lots of >> loose objects is worthwhile on systems where we could write out those >> loose objects on a ramdisk, which is commonly available on e.g. Linux >> distros these days out of the box. 
If you care about performance but not >> about your transitory data using a ramdisk is generally much better than >> any other potential I/O optimization. >> >> Finally, and I don't mean to throw a monkey wrench into this whole >> discussion, so take this as a random musing: I wonder how much faster >> this thing could be on its second run if instead of avoiding writing to >> the store & cleaning up, it just wrote to the store, and then wrote > > It'd be _much_ slower. My first implementation in fact did that; it > just wrote objects to the store, left them there, and didn't bother to > do any auto-gcs. It slowed down quite a bit as it ran. Adding > auto-gc's during the run were really slow too. But stepping back, > gc'ing objects that I already knew were garbage seemed like a waste; > why not just prune them pre-emptively? To do so, though, I'd have to > track all the individual objects I added to make sure I didn't prune > something else. Following that idea and a few different attempts > eventually led me to the discovery of tmp_objdir. > > In case it's not clear to you why just writing all the objects to the > normal store and leaving them there slows things down so much... > > Let's say 1 in 10 merges had originally needed some kind of conflict > resolution (either in the outer merge or in the inner merge for the > virtual merge bases), meaning that 9 out of 10 merges traversed by > --remerge-diff don't write any objects. Now for each merge for which > --remerge-diff does need to create conflicted blobs and new trees, > let's say it writes on average 3 blobs and 7 trees. (I don't know the > real average numbers, it could well be ~5 total, but ~10 seems a > realistic first order approximation and it makes the math easy.) > Then, if we keep all objects we write, then `git log --remerge-diff` > on a history with 100,000 merge commits, will have added 100,000 loose > objects by the time it finishes. 
That means that all diff and merge
> operations slow down considerably as it runs due to all the extra
> loose objects.

...

>> another object keyed on the git version and any revision paramaters
>> etc., and then pretty much just had to do a "git cat-file -p <that-obj>"
>> to present the result to the user :)
>
> So caching the full `git log ...` output, based on a hash of the
> command line flags, and then merely re-showing it later? And having
> that output be invalidated as soon as any head advances? Or are you
> thinking of caching the output per-commit based on a hash of the other
> command line flags...potentially slowing non-log operations down with
> the huge number of loose objects?

Yeah I meant caching the full 'git log' output. Anyway, I think what you said above about number of loose objects clearly makes that a stupid suggestion on my part.

FWIW I meant that more as a "maybe in the future we can ..." musing than anything for this series. I.e. in general I've often wanted, say, the second run of a 'git log -G<rx>' to be faster than it is, which could use such on-the-fly caching.

But if we had such caching then something like --remerge-diff would need (from my understanding of what you're saying) both this temporary staging area for objects and perhaps to cache the output in the main (or some cache) store. I.e. they're orthogonal.

>> I suppose that would be throwing a lot more work at an eventual "git gc"
>> than we ever do now, so maybe it's a bit crazy, but I think it might be
>> an interesting direction in general to (ab)use either the primary or
>> some secondary store in the .git dir as a semi-permanent cache of
>> resolved queries from the likes of "git log".
>
> If you do per-commit caching, and the user scrolls through enough
> output (not hard to do, just searching the output for some string
> often is enough), that "eventual" git-gc will be the very next git
> operation.
If you cache the entire output, it'll be invalidated
> pretty quickly. So I don't see how this works. Or am I
> misunderstanding something you're suggesting here?

With the caveat above that I think this is all a pretty stupid suggestion on my part:

Just on the narrow aspect of how "git gc" behaves in that scenario: We'd keep those objects around for 2 weeks by default, or whatever you have gc.pruneExpire set to.

So such a setup would certainly cause lots of pathological issues (see my [1] for one), but I don't think over-eager expiry of loose objects would be one.

I'm not sure but are you referring to "invalidated pretty quickly" because it would go over the gc.auto limit? If so, that's not how it works in this scenario, see [2]. I.e. it's moderated by the gc.pruneExpire setting.

1. https://lore.kernel.org/git/87fu6bmr0j.fsf@evledraar.gmail.com/
2. https://lore.kernel.org/git/87inc89j38.fsf@evledraar.gmail.com/
On Fri, Oct 01 2021, Jeff King wrote: > On Thu, Sep 30, 2021 at 09:26:37PM -0700, Elijah Newren wrote: > >> > - This side-steps all of our usual code for getting object data into >> > memory. In general, I'd expect this content to not be too enormous, >> > but it _could_ be if there are many / large blobs in the result. So >> > we may end up with large maps. Probably not a big deal on modern >> > 64-bit systems. Maybe an issue on 32-bit systems, just because of >> > virtual address space. >> > >> > Likewise, we do support systems with NO_MMAP. They'd work here, but >> > it would probably mean putting all that object data into the heap. I >> > could live with that, given how rare such systems are these days, and >> > that it only matters if you're using --remerge-diff with big blobs. >> >> Um, I'm starting to get uncomfortable with this pretend_object stuff. >> Part of the reason that merge-ort isn't truly "in memory" despite >> attempting to do exactly that, was because for large enough repos with >> enough files modified on both sides, I wasn't comfortable assuming >> that all new files from three-way content merges and all new trees fit >> into memory. I'm sure we'd be fine with current-day linux kernel >> sized repos. No big deal. In fact, most merges probably don't add >> more than a few dozen new files. But for microsoft-sized repos, and >> with repos tending to grow over time, more so when the tools >> themselves scale nicely (which we've all been working on enabling), >> makes me worry there might be enough new objects within a single merge >> (especially given the recursive inner merges) that we might need to >> worry about this. > > I do think we need to consider that the content might be larger than > will comfortably fit in memory. But the point of using mmap is that we > don't have to care. The OS is taking care of it for us (just like it > would in regular object files). 
>
> The question is just whether we're comfortable assuming that mmap
> exists if you're working on such a large repository. I'd guess that big
> repos are pretty painful without it (and again, I'm not even clear
> which systems Git runs on even lack mmap these days). So IMHO this isn't
> really a blocker for going in this direction.

On the not a blocker: Even without mmap() such a user also has the out of increasing the size of their swap/page file. And generally I agree that it's fair for us to say that if you've got such outsized performance needs, you're going to need something more than the lowest common denominator git can be ported to.

I also wouldn't be surprised if in that scenario we'd run faster if we were using memory (and no mmap) than if we tried to fall back to the FS; i.e. your suggestion of replacing N loose objects with one file with index offsets is basically what the OS would be doing for you as it pages out memory it can't fit in RAM to disk.
On Fri, Oct 1, 2021 at 12:41 AM Ævar Arnfjörð Bjarmason <avarab@gmail.com> wrote: > > On Thu, Sep 30 2021, Elijah Newren wrote: > > > On Thu, Sep 30, 2021 at 6:31 AM Ævar Arnfjörð Bjarmason > > <avarab@gmail.com> wrote: > >> > >> On Thu, Sep 30 2021, Jeff King wrote: > >> > >> > On Tue, Sep 28, 2021 at 09:08:00PM -0700, Junio C Hamano wrote: > >> > > >> >> Jeff King <peff@peff.net> writes: > >> >> > >> >> > Side note: The pretend_object_file() approach is actually even better, > >> >> > because we know the object is fake. So it does not confuse > >> >> > write_object_file()'s "do we already have this object" freshening > >> >> > check. > >> >> > > >> >> > I suspect it could even be made faster than the tmp_objdir approach. > >> >> > From our perspective, these objects really are tempfiles. So we could > >> >> > write them as such, not worrying about things like fsyncing them, > >> >> > naming them into place, etc. We could just write them out, then mmap > >> >> > the results, and put the pointers into cached_objects (currently it > >> >> > insists on malloc-ing a copy of the input buffer, but that seems like > >> >> > an easy extension to add). > >> >> > > >> >> > In fact, I think you could get away with just _one_ tempfile per > >> >> > merge. Open up one tempfile. Write out all of the objects you want to > >> >> > "store" into it in sequence, and record the lseek() offsets before and > >> >> > after for each object. Then mmap the whole result, and stuff the > >> >> > appropriate pointers (based on arithmetic with the offsets) into the > >> >> > cached_objects list. > >> >> > >> >> Cute. The remerge diff code path creates a full tree that records > >> >> the mechanical merge result. 
By hooking into the lowest layer of > >> >> write_object() interface, we'd serialize all objects in such a tree > >> >> in the order they are computed (bottom up from the leaf level, I'd > >> >> presume) into a single flat file ;-) > >> > > >> > I do still like this approach, but just two possible gotchas I was > >> > thinking of: > >> > > >> > - This side-steps all of our usual code for getting object data into > >> > memory. In general, I'd expect this content to not be too enormous, > >> > but it _could_ be if there are many / large blobs in the result. So > >> > we may end up with large maps. Probably not a big deal on modern > >> > 64-bit systems. Maybe an issue on 32-bit systems, just because of > >> > virtual address space. > >> > > >> > Likewise, we do support systems with NO_MMAP. They'd work here, but > >> > it would probably mean putting all that object data into the heap. I > >> > could live with that, given how rare such systems are these days, and > >> > that it only matters if you're using --remerge-diff with big blobs. > >> > > >> > - I wonder to what degree --remerge-diff benefits from omitting writes > >> > for objects we already have. I.e., if you are writing out a whole > >> > tree representing the conflicted state, then you don't want to write > >> > all of the trees that aren't interesting. Hopefully the code is > >> > already figuring out which paths the merge even touched, and ignoring > >> > the rest. It probably benefits performance-wise from > >> > write_object_file() deciding to skip some object writes, as well > >> > (e.g., for resolutions which the final tree already took, as they'd > >> > be in the merge commit). The whole pretend-we-have-this-object thing > >> > may want to likewise make sure we don't write out objects that we > >> > already have in the real odb. 
> >> > >> I haven't benchmarked since my core.checkCollisions RFC patch[1] > >> resulted in the somewhat related loose object cache patch from you, and > >> not with something like the midx, but just a note that on some setups > >> just writing things out is faster than exhaustively checking if we > >> absolutely need to write things out. > >> > >> I also wonder how much if anything writing out the one file v.s. lots of > >> loose objects is worthwhile on systems where we could write out those > >> loose objects on a ramdisk, which is commonly available on e.g. Linux > >> distros these days out of the box. If you care about performance but not > >> about your transitory data using a ramdisk is generally much better than > >> any other potential I/O optimization. > >> > >> Finally, and I don't mean to throw a monkey wrench into this whole > >> discussion, so take this as a random musing: I wonder how much faster > >> this thing could be on its second run if instead of avoiding writing to > >> the store & cleaning up, it just wrote to the store, and then wrote > > > > It'd be _much_ slower. My first implementation in fact did that; it > > just wrote objects to the store, left them there, and didn't bother to > > do any auto-gcs. It slowed down quite a bit as it ran. Adding > > auto-gc's during the run were really slow too. But stepping back, > > gc'ing objects that I already knew were garbage seemed like a waste; > > why not just prune them pre-emptively? To do so, though, I'd have to > > track all the individual objects I added to make sure I didn't prune > > something else. Following that idea and a few different attempts > > eventually led me to the discovery of tmp_objdir. > > > > In case it's not clear to you why just writing all the objects to the > > normal store and leaving them there slows things down so much... 
> >
> > Let's say 1 in 10 merges had originally needed some kind of conflict
> > resolution (either in the outer merge or in the inner merge for the
> > virtual merge bases), meaning that 9 out of 10 merges traversed by
> > --remerge-diff don't write any objects. Now for each merge for which
> > --remerge-diff does need to create conflicted blobs and new trees,
> > let's say it writes on average 3 blobs and 7 trees. (I don't know the
> > real average numbers, it could well be ~5 total, but ~10 seems a
> > realistic first order approximation and it makes the math easy.)
> > Then, if we keep all objects we write, then `git log --remerge-diff`
> > on a history with 100,000 merge commits, will have added 100,000 loose
> > objects by the time it finishes. That means that all diff and merge
> > operations slow down considerably as it runs due to all the extra
> > loose objects.
>
> ...
>
> >> another object keyed on the git version and any revision paramaters
> >> etc., and then pretty much just had to do a "git cat-file -p <that-obj>"
> >> to present the result to the user :)
> >
> > So caching the full `git log ...` output, based on a hash of the
> > command line flags, and then merely re-showing it later? And having
> > that output be invalidated as soon as any head advances? Or are you
> > thinking of caching the output per-commit based on a hash of the other
> > command line flags...potentially slowing non-log operations down with
> > the huge number of loose objects?
>
> Yeah I meant caching the full 'git log' output. Anyway, I think what you
> said above about number of loose objects clearly makes that a stupid
> suggestion on my part.

Most of my ideas for merge-ort were stupid. I didn't know until I tried them, and then I just discarded them and only showed off the good ideas. :-) Brainstorming new ways of doing things is a useful exercise, I think.

> FWIW I meant that more as a "maybe in the future we can ..." musing than
> anything for this series. I.e.
in general I've often wanted, say, the
> second run of a 'git log -G<rx>' to be faster than it is, which could
> use such on-the-fly caching.
>
> But if we had such caching then something like --remerge-diff would need
> (from my understanding of what you're saying) both this
> temporary staging area for objects and perhaps to cache the output in
> the main (or some cache) store. I.e. they're orthogonal.
>
> >> I suppose that would be throwing a lot more work at an eventual "git gc"
> >> than we ever do now, so maybe it's a bit crazy, but I think it might be
> >> an interesting direction in general to (ab)use either the primary or
> >> some secondary store in the .git dir as a semi-permanent cache of
> >> resolved queries from the likes of "git log".
> >
> > If you do per-commit caching, and the user scrolls through enough
> > output (not hard to do, just searching the output for some string
> > often is enough), that "eventual" git-gc will be the very next git
> > operation. If you cache the entire output, it'll be invalidated
> > pretty quickly. So I don't see how this works. Or am I
> > misunderstanding something you're suggesting here?
>
> With the caveat above that I think this is all a pretty stupid
> suggestion on my part:
>
> Just on the narrow aspect of how "git gc" behaves in that scenario: We'd
> keep those objects around for 2 weeks by default, or whatever you have
> gc.pruneExpire set to.
>
> So such a setup would certainly cause lots of pathological issues (see
> my [1] for one), but I don't think over-eager expiry of loose objects
> would be one.
>
> I'm not sure but are you referring to "invalidated pretty quickly"
> because it would go over the gc.auto limit? If so, that's not how it
> works in this scenario, see [2]. I.e. it's moderated by the
> gc.pruneExpire setting.
By "invalidated pretty quickly" I meant that if the user ran "git log --all" and you cached the entire output in a single cache file, then as soon as any ref is updated, that cached output is invalid because "git log --all" should now show different output than before. It seems to be a case of pick your poison: either (a) cache output per commit so that it can be re-used, and suffer from extreme numbers of loose objects, OR (b) cache the entire output to avoid the large number of loose objects, resulting in the cache only being useful if the user passed the _exact_ same arguments and revisions to git-log -- and also having the cache become stale as soon as any involved ref is updated. > > 1. https://lore.kernel.org/git/87fu6bmr0j.fsf@evledraar.gmail.com/ > 2. https://lore.kernel.org/git/87inc89j38.fsf@evledraar.gmail.com/
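The loose-object estimate quoted earlier in this message is easy to sanity-check with integer arithmetic. A quick sketch (the 1-in-10 conflicted ratio and the ~3 blobs + ~7 trees per conflicted merge are the thread's own rough assumptions, not measured numbers):

```python
# Back-of-the-envelope model from the discussion above: only merges
# that needed conflict resolution write objects, and each of those
# writes roughly 3 blobs and 7 trees.
merge_commits = 100_000
conflicted_merges = merge_commits // 10       # 1 in 10 had conflicts
objects_per_conflicted_merge = 3 + 7          # ~3 blobs + ~7 trees

loose_objects = conflicted_merges * objects_per_conflicted_merge
print(loose_objects)  # -> 100000, matching the estimate in the thread
```

So even though 90% of merges write nothing, a full-history `git log --remerge-diff` still leaves behind on the order of one loose object per merge commit if nothing is discarded, which is why the temporary object directory is deleted at the end.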
Elijah Newren <newren@gmail.com> writes:

> On Thu, Sep 30, 2021 at 1:06 PM Junio C Hamano <gitster@pobox.com> wrote:
>>
>> Jeff King <peff@peff.net> writes:
>>
>> >  - is it possible for the merge code to ever write an object? I kind of
>> >    wonder if we'd ever do any cache-able transformations as part of a
>> >    content-level merge. I don't think we do now, though.
>>
>> How is a virtual merge base, the result of an inner merge that
>> recursively merges two merge bases, fed to an outer merge as the
>> ancestor?  Isn't it written as a tree object by the inner merge, as a
>> tree full of blob objects, so that the outer merge does not have to
>> treat a merge base tree that is virtual any differently from a sole
>> merge base?
>
> Yes.  And that's not unique to the inner merges; it also writes new
> blob and tree objects for the outer merge, and then passes the
> resulting toplevel tree id out as the result.

I think Peff's question was "other than what is contained in the merge
result, are there objects we create during the merge process that we
may want to keep?"  And my inner merge example was objects that do not
appear in the merge result, but they are discardable, so it wasn't a
good example.

A tentative merge, recreating what the manual resolver would have seen,
needs to be recorded somewhere before the remerge-diff gets shown, but
after showing the diff, we can lose those objects unless we need them
for some other reason (e.g. they already existed, another process
happened to have created the same blob while we were running
remerge-diff, etc.).  So in that sense, the objects that represent the
result of the outermost merge are also discardable.

The question is whether there are other things similar to the textconv
cache, which we need to create during the operation and may want to
keep for later use.  The closest thing I can think of is the rename
cache we've talked about from time to time without actually adding
one ;-).

I guess using cached textconv results, without putting newly discovered
textconv results back into the cache, would be the best we can do?  The
blobs (hence their textconv results) involved in showing a merge M with
--remerge-diff are the ones in the trees of M^@ (all parents) and
$(git merge-base $(git rev-parse M^@)), and those results may have been
cached earlier, when running "git show" on these commits or when
letting "git log -p" pass over them.  So it is not like we are
disabling the cache entirely, and allowing read-only access may still
have value.

Thanks.
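The "use the cache but don't write back" policy suggested above can be sketched as a thin wrapper around an existing cache. This is a hypothetical model, not git's actual textconv code; the names `ReadOnlyCacheView`, `textconv`, and `convert` are illustrative:

```python
class ReadOnlyCacheView:
    """Hypothetical sketch of the suggestion above: serve hits from an
    existing textconv cache, but never persist newly computed results,
    so that showing a remerge-diff cannot end up recording cache
    entries (e.g. under refs/notes/textconv/*) that reference objects
    living only in the temporary object directory."""

    def __init__(self, cache):
        self._cache = cache            # pre-existing blob-id -> text mapping

    def textconv(self, blob_id, convert):
        if blob_id in self._cache:     # read-only hit: reuse earlier work
            return self._cache[blob_id]
        return convert(blob_id)        # computed fresh, deliberately not stored


cache = {"blob1": "cached text"}
view = ReadOnlyCacheView(cache)
assert view.textconv("blob1", lambda b: "recomputed") == "cached text"
assert view.textconv("blob2", lambda b: "recomputed") == "recomputed"
assert "blob2" not in cache            # nothing was written back
```

The design trade-off is exactly the one described in the message: cache misses during a remerge-diff are recomputed every time, but hits populated by earlier plain "git show"/"git log -p" runs are still honored.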
diff --git a/tmp-objdir.c b/tmp-objdir.c
index b8d880e3626..9f75a75d1c0 100644
--- a/tmp-objdir.c
+++ b/tmp-objdir.c
@@ -288,7 +288,36 @@ const char **tmp_objdir_env(const struct tmp_objdir *t)
 	return t->env.v;
 }
 
+const char *tmp_objdir_path(struct tmp_objdir *t)
+{
+	return t->path.buf;
+}
+
 void tmp_objdir_add_as_alternate(const struct tmp_objdir *t)
 {
 	add_to_alternates_memory(t->path.buf);
 }
+
+void tmp_objdir_make_primary(struct repository *r, const struct tmp_objdir *t)
+{
+	struct object_directory *od;
+	od = xcalloc(1, sizeof(*od));
+
+	od->path = xstrdup(t->path.buf);
+	od->next = r->objects->odb;
+	r->objects->odb = od;
+}
+
+void tmp_objdir_remove_as_primary(struct repository *r,
+				  const struct tmp_objdir *t)
+{
+	struct object_directory *od;
+
+	od = r->objects->odb;
+	if (strcmp(t->path.buf, od->path))
+		BUG("expected %s as primary object store; found %s",
+		    t->path.buf, od->path);
+	r->objects->odb = od->next;
+	free(od->path);
+	free(od);
+}
diff --git a/tmp-objdir.h b/tmp-objdir.h
index b1e45b4c75d..6067da24e8c 100644
--- a/tmp-objdir.h
+++ b/tmp-objdir.h
@@ -19,6 +19,7 @@
  *
  */
 
+struct repository;
 struct tmp_objdir;
 
 /*
@@ -33,6 +34,11 @@ struct tmp_objdir *tmp_objdir_create(void);
  */
 const char **tmp_objdir_env(const struct tmp_objdir *);
 
+/*
+ * Return the path used for the temporary object directory.
+ */
+const char *tmp_objdir_path(struct tmp_objdir *t);
+
 /*
  * Finalize a temporary object directory by migrating its objects into the main
  * object database, removing the temporary directory, and freeing any
@@ -51,4 +57,14 @@ int tmp_objdir_destroy(struct tmp_objdir *);
  */
 void tmp_objdir_add_as_alternate(const struct tmp_objdir *);
 
+/*
+ * Add the temporary object directory as the *primary* object store in the
+ * current process, turning the previous primary object store into an
+ * alternate.
+ */
+void tmp_objdir_make_primary(struct repository *r,
+			     const struct tmp_objdir *t);
+void tmp_objdir_remove_as_primary(struct repository *r,
+				  const struct tmp_objdir *t);
+
 #endif /* TMP_OBJDIR_H */
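In the patch above, `tmp_objdir_make_primary()` and `tmp_objdir_remove_as_primary()` amount to pushing and popping the head of the `r->objects->odb` singly linked list: the temporary store is prepended so the old primary becomes the first alternate, and removal insists the temporary store is still the head. A small Python model of that push/pop discipline (`ObjectDirectory` and `Repository` are stand-ins for git's C structs, not a real API):

```python
class ObjectDirectory:
    """Stand-in for git's `struct object_directory` linked-list node."""
    def __init__(self, path, next_od=None):
        self.path = path
        self.next = next_od


class Repository:
    """Stand-in for `struct repository`; `odb` is the head (primary) store."""
    def __init__(self, main_path):
        self.odb = ObjectDirectory(main_path)


def make_primary(repo, tmp_path):
    # Prepend the temporary directory: the old primary becomes the
    # first alternate, as tmp_objdir_make_primary() does in the patch.
    repo.odb = ObjectDirectory(tmp_path, repo.odb)


def remove_as_primary(repo, tmp_path):
    # Mirror the BUG() guard: the store being removed must still be
    # the head of the list, as in tmp_objdir_remove_as_primary().
    if repo.odb.path != tmp_path:
        raise AssertionError("expected %s as primary object store; found %s"
                             % (tmp_path, repo.odb.path))
    repo.odb = repo.odb.next


repo = Repository("/repo/.git/objects")
make_primary(repo, "/repo/.git/objects/tmp_objdir-abc123")
assert repo.odb.path.endswith("tmp_objdir-abc123")   # tmp store is primary
remove_as_primary(repo, "/repo/.git/objects/tmp_objdir-abc123")
assert repo.odb.path == "/repo/.git/objects"          # original store restored
```

This also makes Peff's concern concrete: any object written while the temporary store sits at the head of the list vanishes when that node is popped and the directory deleted, so nothing outside the pop (another object, a ref, or a textconv cache note) may reference those object names.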