
[5/7] tmp-objdir: new API for creating and removing primary object dirs

Message ID 67d3b2b09f9ddda616cdd0d1b12ab7afc73670ed.1630376800.git.gitgitgadget@gmail.com (mailing list archive)
State New, archived
Series Add a new --remerge-diff capability to show & log

Commit Message

Elijah Newren Aug. 31, 2021, 2:26 a.m. UTC
From: Elijah Newren <newren@gmail.com>

The tmp_objdir API provides the ability to create temporary object
directories, but was designed with the goal of having subprocesses
access these object stores, followed by the main process migrating
objects from it to the main object store or just deleting it.  The
subprocesses would view it as their primary datastore and write to it.

For the --remerge-diff option we want to add to show & log, we want all
writes of intermediate merge results (both blobs and trees) to go to
this alternate object store; since those writes will be done by the main
process, we need this "alternate" object store to actually be the
primary object store.  When show & log are done, they'll simply remove
this temporary object store.

We also need one more thing -- `git log --remerge-diff` can cause the
temporary object store to fill up with loose objects.  Rather than
trying to gc objects that are known garbage anyway, we simply want to know the
location of the temporary object store so we can purge the loose objects
after each merge.

Signed-off-by: Elijah Newren <newren@gmail.com>
---
 tmp-objdir.c | 29 +++++++++++++++++++++++++++++
 tmp-objdir.h | 16 ++++++++++++++++
 2 files changed, 45 insertions(+)

Comments

Ævar Arnfjörð Bjarmason Sept. 28, 2021, 7:55 a.m. UTC | #1
On Tue, Aug 31 2021, Elijah Newren via GitGitGadget wrote:

I commented just now on how this API is duplicated between here &
another in-flight series in
https://lore.kernel.org/git/87v92lxhh4.fsf@evledraar.gmail.com/; Just
comments on this patch here:

> diff --git a/tmp-objdir.c b/tmp-objdir.c
> index b8d880e3626..9f75a75d1c0 100644
> --- a/tmp-objdir.c
> +++ b/tmp-objdir.c
> @@ -288,7 +288,36 @@ const char **tmp_objdir_env(const struct tmp_objdir *t)
>  	return t->env.v;
>  }
>  
> +const char *tmp_objdir_path(struct tmp_objdir *t)
> +{
> +	return t->path.buf;
> +}

Not your fault & this pre-dates your patch, but FWIW I prefer our APIs
that don't have these "hidden struct" shenanigans (like say "struct
string_list") so a caller could just access this, we can just declare it
"const" appropriately.

We're also all in-tree friends here, so having various accessors for no
reason other than to access struct members seems a bit too much.

All of which is to say if you + Neeraj come up with some common API I
for one wouldn't mind moving the struct declaration to tmp-objdir.h,
but it looks like in his version it can stay "private" more easily.

In this particular case I hadn't understood on a previous reading of
tmp-objdir.c why this "path" isn't a statically allocated "char
path[PATH_MAX]", or at least why that hardcoded "1024" should be a
PATH_MAX, as it is in some other cases.

But I digress :)
Jeff King Sept. 28, 2021, 11:17 p.m. UTC | #2
On Tue, Aug 31, 2021 at 02:26:38AM +0000, Elijah Newren via GitGitGadget wrote:

> From: Elijah Newren <newren@gmail.com>
> 
> The tmp_objdir API provides the ability to create temporary object
> directories, but was designed with the goal of having subprocesses
> access these object stores, followed by the main process migrating
> objects from it to the main object store or just deleting it.  The
> subprocesses would view it as their primary datastore and write to it.
> 
> For the --remerge-diff option we want to add to show & log, we want all
> writes of intermediate merge results (both blobs and trees) to go to
> this alternate object store; since those writes will be done by the main
> process, we need this "alternate" object store to actually be the
> primary object store.  When show & log are done, they'll simply remove
> this temporary object store.

I think this is consistent with the original design of tmp_objdir. I
just never needed to do anything in-process before, so overriding the
environment of sub-processes was sufficient.

I do think these are dangerous and may cause bugs, though. Anything you
write while the tmp_objdir is marked as "primary" is going to go away
when you remove it. So any object names you reference outside of that,
either:

  - by another object that you create after the tmp_objdir is removed;
    or

  - in a ref

are going to turn into repository corruption. Of course that's true for
the existing sub-processes, too, but here we're touching a wider variety
of code.

Obviously the objects we write as part of remerge-diff are meant to be
temporary, and we'll never reference them in any other way. And usually
we would not expect a diff to write any other objects. But one thing
that comes to mind is textconv caching.

If you do a remerge diff on a blob that uses textconv, and the caching
feature is turned on, then we'll write out a new blob with the cached
value, and eventually a new tree and refs/notes/ pointer referencing it.
I'm not sure of the timing of all of that, but it seems likely to me
that at least some of that will end up in your tmp_objdir.

If you remove the tmp_objdir as the primary as soon as you're done with
the merge, but before you run the diff, you might be OK, though.

If not, then I think the solution is probably not to install this as the
"primary", but rather:

  - do the specific remerge-diff writes we want using a special "write
    to this object dir" variant of write_object_file()

  - install the tmp_objdir as an alternate, so the reading side (which
    is diff code that doesn't otherwise know about our remerge-diff
    trickery) can find them

And that lets other writers avoid writing into the temporary area
accidentally.

In that sense this is kind of like the pretend_object_file() interface,
except that it's storing the objects on disk instead of in memory. Of
course another way of doing that would be to stuff the object data into
tempfiles and just put pointers into the in-memory cached_objects array.

It's also not entirely foolproof (nor is the existing
pretend_object_file()). Any operation which is fooled into thinking we
have object X because it's in the fake-object list or the tmp_objdir may
reference that object erroneously, creating a corruption. But it's still
safer than allowing arbitrary writes into the tmp_objdir.

  Side note: The pretend_object_file() approach is actually even better,
  because we know the object is fake. So it does not confuse
  write_object_file()'s "do we already have this object" freshening
  check.

  I suspect it could even be made faster than the tmp_objdir approach.
  From our perspective, these objects really are tempfiles. So we could
  write them as such, not worrying about things like fsyncing them,
  naming them into place, etc. We could just write them out, then mmap
  the results, and put the pointers into cached_objects (currently it
  insists on malloc-ing a copy of the input buffer, but that seems like
  an easy extension to add).

  In fact, I think you could get away with just _one_ tempfile per
  merge. Open up one tempfile. Write out all of the objects you want to
  "store" into it in sequence, and record the lseek() offsets before and
  after for each object. Then mmap the whole result, and stuff the
  appropriate pointers (based on arithmetic with the offsets) into the
  cached_objects list.

  As a bonus, this tempfile can easily be in $TMPDIR, meaning
  remerge-diff could work on a read-only repository (and the OS can
  perhaps handle the data better, especially if $TMPDIR isn't a regular
  filesystem).

> We also need one more thing -- `git log --remerge-diff` can cause the
> temporary object store to fill up with loose objects.  Rather than
> trying to gc objects that are known garbage anyway, we simply want to know the
> location of the temporary object store so we can purge the loose objects
> after each merge.

This paragraph confused me. At first I thought you were talking about
how to avoid calling "gc --auto" because we don't want to waste time
thinking all those loose objects were worth gc-ing. But we wouldn't do
that anyway (git-log does not expect to make objects so doesn't call it
at all, and anyway you'd expect it to happen at the end of a process,
after we've already removed the tmp_objdir).

But I think you just mean: we can build up a bunch of loose objects,
which is inefficient. We don't want to gc them, because that's even less
efficient. So we want to clean them out between merges.

But I don't see any code to do that here. I guess that's maybe why
you've added tmp_objdir_path(), to find them later. But IMHO this would
be much better encapsulated as tmp_objdir_clear_objects() or something.

But simpler still: is there any reason not to just drop and re-create
the tmp-objdir for each merge? It's not very expensive to do so (and
certainly not compared to the cost of actually writing out the objects).

-Peff
Junio C Hamano Sept. 29, 2021, 4:08 a.m. UTC | #3
Jeff King <peff@peff.net> writes:

>   Side note: The pretend_object_file() approach is actually even better,
>   because we know the object is fake. So it does not confuse
>   write_object_file()'s "do we already have this object" freshening
>   check.
>
>   I suspect it could even be made faster than the tmp_objdir approach.
>   From our perspective, these objects really are tempfiles. So we could
>   write them as such, not worrying about things like fsyncing them,
>   naming them into place, etc. We could just write them out, then mmap
>   the results, and put the pointers into cached_objects (currently it
>   insists on malloc-ing a copy of the input buffer, but that seems like
>   an easy extension to add).
>
>   In fact, I think you could get away with just _one_ tempfile per
>   merge. Open up one tempfile. Write out all of the objects you want to
>   "store" into it in sequence, and record the lseek() offsets before and
>   after for each object. Then mmap the whole result, and stuff the
>   appropriate pointers (based on arithmetic with the offsets) into the
>   cached_objects list.

Cute.  The remerge diff code path creates a full tree that records
the mechanical merge result.  By hooking into the lowest layer of
write_object() interface, we'd serialize all objects in such a tree
in the order they are computed (bottom up from the leaf level, I'd
presume) into a single flat file ;-)
Elijah Newren Sept. 29, 2021, 4:22 a.m. UTC | #4
On Tue, Sep 28, 2021 at 1:00 AM Ævar Arnfjörð Bjarmason
<avarab@gmail.com> wrote:
>
> On Tue, Aug 31 2021, Elijah Newren via GitGitGadget wrote:
>
> I commented just now on how this API is duplicated between here &
> another in-flight series in
> https://lore.kernel.org/git/87v92lxhh4.fsf@evledraar.gmail.com/; Just
> comments on this patch here:
>
> > diff --git a/tmp-objdir.c b/tmp-objdir.c
> > index b8d880e3626..9f75a75d1c0 100644
> > --- a/tmp-objdir.c
> > +++ b/tmp-objdir.c
> > @@ -288,7 +288,36 @@ const char **tmp_objdir_env(const struct tmp_objdir *t)
> >       return t->env.v;
> >  }
> >
> > +const char *tmp_objdir_path(struct tmp_objdir *t)
> > +{
> > +     return t->path.buf;
> > +}
>
> Not your fault & this pre-dates your patch, but FWIW I prefer our APIs
> that don't have these "hidden struct" shenanigans (like say "struct
> string_list") so a caller could just access this, we can just declare it
> "const" appropriately.
>
> We're also all in-tree friends here, so having various accessors for no
> reason other than to access struct members seems a bit too much.

That's a fair point, but just to play counterpoint for a minute...

FWIW, I dislike when our public facing APIs are polluted with all
kinds of internal details.  merge-recursive being a case in point.
When writing merge-ort, although I had a struct with public fields,
that struct also contained an opaque struct (pointer) within it to
hide several private fields.  (I would have liked to hide or remove a
few more fields, but couldn't do so while the merge_recursive_options
struct was shared between both merge-ort and merge-recursive.)

That said, I agree it can certainly be overdone, and tmp_objdir is
pretty simple.  However, sometimes even in simple cases I like when
folks make use of an opaque struct because it means folks put some
effort into thinking more about the API that should be exposed.
That's something we as a project have often overlooked in the past, as
evidenced both by our one-shot usage mentality, and the existence of
external projects like libgit2 attempting to correct this design
shortcoming.  I'd like git to move more towards being structured as a
reusable library as well as a useful command-line tool.

> All of which is to say if you + Neeraj come up with some common API I
> for one wouldn't mind moving the struct declaration to tmp-objdir.h,
> but it looks like in his version it can stay "private" more easily.

Peff's ideas to just delete and recreate the temporary object store
after every merge would let me reduce the API extensions and keep
things more private.  So I think our ideas are converging a bit.  :-)

> In this particular case I hadn't understood on a previous reading of
> tmp-objdir.c why this "path" isn't a statically allocated "char
> path[PATH_MAX]", or at least why that hardcoded "1024" should be a
> PATH_MAX, as it is in some other cases.

Actually, PATH_MAX shouldn't be used:

https://lore.kernel.org/git/7d1237c7-5a83-d766-7d93-5f0d59166067@web.de/
https://lore.kernel.org/git/20180720180427.GE22486@sigill.intra.peff.net/
Elijah Newren Sept. 29, 2021, 5:05 a.m. UTC | #5
On Tue, Sep 28, 2021 at 4:17 PM Jeff King <peff@peff.net> wrote:
>
> On Tue, Aug 31, 2021 at 02:26:38AM +0000, Elijah Newren via GitGitGadget wrote:
>
> > From: Elijah Newren <newren@gmail.com>
> >
> > The tmp_objdir API provides the ability to create temporary object
> > directories, but was designed with the goal of having subprocesses
> > access these object stores, followed by the main process migrating
> > objects from it to the main object store or just deleting it.  The
> > subprocesses would view it as their primary datastore and write to it.
> >
> > For the --remerge-diff option we want to add to show & log, we want all
> > writes of intermediate merge results (both blobs and trees) to go to
> > this alternate object store; since those writes will be done by the main
> > process, we need this "alternate" object store to actually be the
> > primary object store.  When show & log are done, they'll simply remove
> > this temporary object store.
>
> I think this is consistent with the original design of tmp_objdir. I
> just never needed to do anything in-process before, so overriding the
> environment of sub-processes was sufficient.
>
> I do think these are dangerous and may cause bugs, though. Anything you
> write while the tmp_objdir is marked as "primary" is going to go away
> when you remove it. So any object names you reference outside of that,
> either:
>
>   - by another object that you create after the tmp_objdir is removed;
>     or
>
>   - in a ref
>
> are going to turn into repository corruption. Of course that's true for
> the existing sub-processes, too, but here we're touching a wider variety
> of code.
>
> Obviously the objects we write as part of remerge-diff are meant to be
> temporary, and we'll never reference them in any other way. And usually
> we would not expect a diff to write any other objects. But one thing
> that comes to mind is textconv caching.
>
> If you do a remerge diff on a blob that uses textconv, and the caching
> feature is turned on, then we'll write out a new blob with the cached
> value, and eventually a new tree and refs/notes/ pointer referencing it.
> I'm not sure of the timing of all of that, but it seems likely to me
> that at least some of that will end up in your tmp_objdir.

Interesting.  Thanks for explaining this case, and for the testcase in
the other thread.

So there are other parts of running `git log` that are not read-only,
besides what I'm doing.  That's unfortunate.  Can we just disable
those things in this section of the code?

Also, I think it makes sense that textconv caching causes a problem
via writing new blobs and trees *and* refs that reference these blobs
and trees.  However, I'm not sure I'm really grasping it, because I
don't see how your solutions are safe either.  I'll mention more after
each one.

> If you remove the tmp_objdir as the primary as soon as you're done with
> the merge, but before you run the diff, you might be OK, though.

It has to be after I run the diff, because the diff needs access to
the temporary files to diff against them.

> If not, then I think the solution is probably not to install this as the
> "primary", but rather:
>
>   - do the specific remerge-diff writes we want using a special "write
>     to this object dir" variant of write_object_file()
>
>   - install the tmp_objdir as an alternate, so the reading side (which
>     is diff code that doesn't otherwise know about our remerge-diff
>     trickery) can find them
>
> And that lets other writers avoid writing into the temporary area
> accidentally.

Doesn't this have the same problem?  If something similar to textconv
caching were to create new trees that referenced the existing blobs
and then added a ref that referred to that tree, then you have the
exact same problem, right?  The new tree and the new ref would be
corrupt as soon as the tmp-objdir went away.  Whether or not other
callers write to the temporary area isn't the issue, it's whether they
write refs that can refer to anything from the temporary area.  Or am
I missing something?

> In that sense this is kind of like the pretend_object_file() interface,
> except that it's storing the objects on disk instead of in memory. Of
> course another way of doing that would be to stuff the object data into
> tempfiles and just put pointers into the in-memory cached_objects array.
>
> It's also not entirely foolproof (nor is the existing
> pretend_object_file()). Any operation which is fooled into thinking we
> have object X because it's in the fake-object list or the tmp_objdir may
> reference that object erroneously, creating a corruption. But it's still
> safer than allowing arbitrary writes into the tmp_objdir.

Why is the pretend_object_file() interface safer?  Does it disable the
textconv caching somehow?  I don't see why it helps avoid this
problem.

>   Side note: The pretend_object_file() approach is actually even better,
>   because we know the object is fake. So it does not confuse
>   write_object_file()'s "do we already have this object" freshening
>   check.

Oh, that's interesting.  That would be helpful.

>   I suspect it could even be made faster than the tmp_objdir approach.
>   From our perspective, these objects really are tempfiles. So we could
>   write them as such, not worrying about things like fsyncing them,
>   naming them into place, etc. We could just write them out, then mmap
>   the results, and put the pointers into cached_objects (currently it
>   insists on malloc-ing a copy of the input buffer, but that seems like
>   an easy extension to add).
>
>   In fact, I think you could get away with just _one_ tempfile per
>   merge. Open up one tempfile. Write out all of the objects you want to
>   "store" into it in sequence, and record the lseek() offsets before and
>   after for each object. Then mmap the whole result, and stuff the
>   appropriate pointers (based on arithmetic with the offsets) into the
>   cached_objects list.
>
>   As a bonus, this tempfile can easily be in $TMPDIR, meaning
>   remerge-diff could work on a read-only repository (and the OS can
>   perhaps handle the data better, especially if $TMPDIR isn't a regular
>   filesystem).

Interesting.  I'm not familiar with the pretend_object_file() code,
but I think I catch what you're saying.  I don't see anything in
object-file.c that allows removing items from the cache, as I would
want to do, but I'm sure I could add something.

(I'm still not sure how this avoids the problem of things like
textconv caching trying to write an object that references one of
these, though.  Should we just manually disable textconv caching
during a remerge-diff?)

> > We also need one more thing -- `git log --remerge-diff` can cause the
> > temporary object store to fill up with loose objects.  Rather than
> > trying to gc objects that are known garbage anyway, we simply want to know the
> > location of the temporary object store so we can purge the loose objects
> > after each merge.
>
> This paragraph confused me. At first I thought you were talking about
> how to avoid calling "gc --auto" because we don't want to waste time
> thinking all those loose objects were worth gc-ing. But we wouldn't do
> that anyway (git-log does not expect to make objects so doesn't call it
> at all, and anyway you'd expect it to happen at the end of a process,
> after we've already removed the tmp_objdir).
>
> But I think you just mean: we can build up a bunch of loose objects,
> which is inefficient. We don't want to gc them, because that's even less
> efficient. So we want to clean them out between merges.

Yes, sorry that wasn't clear.

> But I don't see any code to do that here. I guess that's maybe why
> you've added tmp_objdir_path(), to find them later. But IMHO this would
> be much better encapsulated as tmp_objdir_clear_objects() or something.
>
> But simpler still: is there any reason not to just drop and re-create
> the tmp-objdir for each merge? It's not very expensive to do so (and
> certainly not compared to the cost of actually writing out the objects).

Ooh, indeed.  I like that, and it turns out that the majority of the
cost of dropping the tmp_objdir is the recursive directory removal,
which I wanted anyway as part of my in-between-merges cleanup.

But, of course, if we go the pretend_object_file() route then the
tmp_objdir() stuff probably becomes moot.
Jeff King Sept. 30, 2021, 7:26 a.m. UTC | #6
On Tue, Sep 28, 2021 at 10:05:31PM -0700, Elijah Newren wrote:

> So there are other parts of running `git log` that are not read-only,
> besides what I'm doing.  That's unfortunate.  Can we just disable
> those things in this section of the code?

It might be possible for the cache (we'd still generate entries, but
just not write them). It's "just" an optimization, though it's possible
there are cases where it's really annoying to do so. I'd imagine those
are a stretch. I do use textconv to decrypt gpg blobs with a key on a
hardware token, for example. I don't require a touch for each one,
though, nor would I use textconv caching anyway (since the point is not
to store the unencrypted contents).

It wouldn't necessarily be OK to disable object writes in general.
Though I'd think since a diff is _conceptually_ a read-only operation
that any writes it wants to do would be in the same caching /
optimization boat.  I'd also worry about the maintenance cost of having
to annotate any such case.

> Also, I think it makes sense that textconv caching causes a problem
> via writing new blobs and trees *and* refs that reference these blobs
> and trees.  However, I'm not sure I'm really grasping it, because I
> don't see how your solutions are safe either.  I'll mention more after
> each one.

Heh. So I originally wrote more on this in my earlier email, but worried
I was getting too into the weeds and would confuse you, so I pared it
down. But I managed to be confusing anyway. :)

Here's one mental model. If you have a temporary object store, then
these are the bad things that can happen:

  1. Some code may write an object to it, thinking the data is saved,
     but we later throw it away.

  2. Some code may _skip_ writing an object, thinking that we already
     have that object available (when in fact we'll later throw it
     away).

  3. Some code may record the fact that we have an object in memory, and
     then later (after the temporary object store is not in use), may
     point to it when writing an object or reference.

Pointing the primary object-dir at a temporary directory (like your
patch does) means we're subject to all three.

Problem (1) happens because unrelated code may not realize we've swapped
out the primary object writing code for this throw-away directory. But
we could require writers to explicitly indicate that they want to write
to the temporary area, while allowing other writes to go to the regular
object store. This could be done with a flag or variant of
write_object_file(). Or if the temporary area isn't a regular object
directory at all (like the whole tempfile / pretend thing), then this
happens automatically, because the regular object-write paths don't even
know about our temporary area.

Problem (2) is mostly handled inside write_object_file(). It will
optimize out writes of objects we already have. So it needs to know when
we have an object "for real", and when we just have it in our temporary
store. If we make the tmp_objdir the primary, then it doesn't know this
(though we could teach it to check). In the pretend_object_file()
system, it already knows this (it uses the "freshen" code paths, which
actually want to find the entry in the filesystem).

Problem (3) is the hardest one, because we don't really distinguish
between readers and writers. So if I call has_object_file(), is it
because I want to access the object? Or is it because I want to generate
a new reference to it, which must ensure that it exists? One of those is
OK to look at the tmp_objdir (and indeed, is the whole point of making
it in the first place), and the other is the first step to corrupting
the repository. ;)

With the environment variables we set up for running external commands
(like hooks) in the receive-pack quarantine, one bit of safety we have
is to prevent ref writes entirely in those commands. That works
reasonably well. Even if they may write new objects that might get
thrown away (if the push is rejected), they'd ultimately need to point
to those objects with a ref.  And because those hooks spend their
_entire_ run-time in the quarantine state, they can't ever write a ref.

Whereas doing it in-process is a little dicier. Right now the textconv
cache writes out a new tree for every entry we add, so it happens to do
the ref write while the tmp-objdir is still in place. But as this
comment notes:

  $ git grep -B6 notes_cache_write diff.c
  diff.c-         /*
  diff.c-          * we could save up changes and flush them all at the end,
  diff.c-          * but we would need an extra call after all diffing is done.
  diff.c-          * Since generating a cache entry is the slow path anyway,
  diff.c-          * this extra overhead probably isn't a big deal.
  diff.c-          */
  diff.c:         notes_cache_write(textconv->cache);

this is slow, and it would be perfectly reasonable to flush the cache at
the end of the process (e.g., using an atexit() handler). In which case
we'd hit this problem (3) exactly: we'd generate and remember objects
during the tmp-objdir period, but only later actually reference them.
This is more likely than the receive-pack hook case, because we're doing
it all in process.

> > If you remove the tmp_objdir as the primary as soon as you're done with
> > the merge, but before you run the diff, you might be OK, though.
> 
> It has to be after I run the diff, because the diff needs access to
> the temporary files to diff against them.

Right, of course. I was too fixated on the object-write part, forgetting
that the whole point of the exercise is to later read them back. :)

> > If not, then I think the solution is probably not to install this as the
> > "primary", but rather:
> >
> >   - do the specific remerge-diff writes we want using a special "write
> >     to this object dir" variant of write_object_file()
> >
> >   - install the tmp_objdir as an alternate, so the reading side (which
> >     is diff code that doesn't otherwise know about our remerge-diff
> >     trickery) can find them
> >
> > And that lets other writers avoid writing into the temporary area
> > accidentally.
> 
> Doesn't this have the same problem?  If something similar to textconv
> caching were to create new trees that referenced the existing blobs
> and then added a ref that referred to that tree, then you have the
> exact same problem, right?  The new tree and the new ref would be
> corrupt as soon as the tmp-objdir went away.  Whether or not other
> callers write to the temporary area isn't the issue, it's whether they
> write refs that can refer to anything from the temporary area.  Or am
> I missing something?

The key thing here is in the first step, where remerge-diff is
explicitly saying "I want to write to this temporary object-dir". But
the textconv-cache code does not; it gets to write to the normal spot.
So it avoids problem (1).

You're right that it does not avoid problem (3) exactly. But triggering
that would require some code not just writing other objects or
references while the tmp-objdir is in place, but specifically
referencing the objects that remerge-diff did put into the tmp-objdir.
That seems a lot less likely to me (because the thing we're most worried
about is unrelated code that just happens to write while the tmp-objdir
is in place).

> > In that sense this is kind of like the pretend_object_file() interface,
> > except that it's storing the objects on disk instead of in memory. Of
> > course another way of doing that would be to stuff the object data into
> > tempfiles and just put pointers into the in-memory cached_objects array.
> >
> > It's also not entirely foolproof (nor is the existing
> > pretend_object_file()). Any operation which is fooled into thinking we
> > have object X because it's in the fake-object list or the tmp_objdir may
> > reference that object erroneously, creating a corruption. But it's still
> > safer than allowing arbitrary writes into the tmp_objdir.
> 
> Why is the pretend_object_file() interface safer?  Does it disable the
> textconv caching somehow?  I don't see why it helps avoid this
> problem.

So hopefully it's clearer now from what I wrote above, but just
connecting the dots:

  1. Unrelated code calling write_object_file() would still write real,
     durable objects, as usual.

  2. The write_object_file() "skip this write" optimization already
     ignores the pretend_object_file() objects while checking "do we
     have this object".

> Interesting.  I'm not familiar with the pretend_object_file() code,
> but I think I catch what you're saying.  I don't see anything in
> object-file.c that allows removing items from the cache, as I would
> want to do, but I'm sure I could add something.

Right, it's pretty limited now, and there's only one caller. And in fact
I've wanted to get rid of it, exactly because it can create the danger
we're discussing here. But I still think it's _less_ dangerous than
fully replacing the object directory.

> (I'm still not sure how this avoids the problem of things like
> textconv caching trying to write an object that references one of
> these, though.  Should we just manually disable textconv caching
> during a remerge-diff?)

It can't avoid it completely, but in general, the textconv cache should
be writing and referencing its own objects. Even if they happen to be
the same as something remerge-diff generated, it won't know that until
it has tried to write it (and so that's why having write_object_file()
continue to really write the object is important).


Hopefully that clears up my line of thinking. I do think a lot of this
is kind of theoretical, in the sense that:

  - textconv caching is an obscure feature that we _could_ just disable
    during remerge-diffs

  - there's a reasonable chance that there isn't any other code that
    wants to write during a diff

But this is all scary and error-prone enough that I'd prefer an approach
that has the "least surprise". So any solution where random code calling
write_object_file() _can't_ accidentally write to a throw-away directory
seems like a safer, less surprising thing.

It does mean that the remerge-diff code needs to be explicit in its
object writes to say "this needs to go to the temporary area" (whether
it's a flag, or calling into a totally different function or subsystem).
I'm hoping that's not hard to do (because its writes are done explicitly
by the remerge-diff code), but I worry that it may be (because you are
relying on more generic tree code to do the write under the hood).

-Peff
Jeff King Sept. 30, 2021, 7:33 a.m. UTC | #7
On Tue, Sep 28, 2021 at 09:08:00PM -0700, Junio C Hamano wrote:

> Jeff King <peff@peff.net> writes:
> 
> >   Side note: The pretend_object_file() approach is actually even better,
> >   because we know the object is fake. So it does not confuse
> >   write_object_file()'s "do we already have this object" freshening
> >   check.
> >
> >   I suspect it could even be made faster than the tmp_objdir approach.
> >   From our perspective, these objects really are tempfiles. So we could
> >   write them as such, not worrying about things like fsyncing them,
> >   naming them into place, etc. We could just write them out, then mmap
> >   the results, and put the pointers into cached_objects (currently it
> >   insists on malloc-ing a copy of the input buffer, but that seems like
> >   an easy extension to add).
> >
> >   In fact, I think you could get away with just _one_ tempfile per
> >   merge. Open up one tempfile. Write out all of the objects you want to
> >   "store" into it in sequence, and record the lseek() offsets before and
> >   after for each object. Then mmap the whole result, and stuff the
> >   appropriate pointers (based on arithmetic with the offsets) into the
> >   cached_objects list.
> 
> Cute.  The remerge diff code path creates a full tree that records
> the mechanical merge result.  By hooking into the lowest layer of
> write_object() interface, we'd serialize all objects in such a tree
> in the order they are computed (bottom up from the leaf level, I'd
> presume) into a single flat file ;-)

I do still like this approach, but just two possible gotchas I was
thinking of:

 - This side-steps all of our usual code for getting object data into
   memory. In general, I'd expect this content to not be too enormous,
   but it _could_ be if there are many / large blobs in the result. So
   we may end up with large maps. Probably not a big deal on modern
   64-bit systems. Maybe an issue on 32-bit systems, just because of
   virtual address space.

   Likewise, we do support systems with NO_MMAP. They'd work here, but
   it would probably mean putting all that object data into the heap. I
   could live with that, given how rare such systems are these days, and
   that it only matters if you're using --remerge-diff with big blobs.

 - I wonder to what degree --remerge-diff benefits from omitting writes
   for objects we already have. I.e., if you are writing out a whole
   tree representing the conflicted state, then you don't want to write
   all of the trees that aren't interesting. Hopefully the code is
   already figuring out which paths the merge even touched, and ignoring
   the rest. It probably benefits performance-wise from
   write_object_file() deciding to skip some object writes, as well
   (e.g., for resolutions which the final tree already took, as they'd
   be in the merge commit). The whole pretend-we-have-this-object thing
   may want to likewise make sure we don't write out objects that we
   already have in the real odb.

-Peff
Jeff King Sept. 30, 2021, 7:41 a.m. UTC | #8
On Tue, Sep 28, 2021 at 09:22:32PM -0700, Elijah Newren wrote:

> > Not your fault & this pre-dates your patch, but FWIW I prefer our APIs
> > that don't have these "hidden struct" shenanigans (like say "struct
> > string_list") so a caller could just access this, we can just declare it
> > "const" appropriately.
> >
> > We're also all in-tree friends here, so having various accessors for no
> > reason other than to access struct members seems a bit too much.
> 
> That's a fair point, but just to play counterpoint for a minute...
> 
> FWIW, I dislike when our public facing APIs are polluted with all
> kinds of internal details.  merge-recursive being a case in point.
> When writing merge-ort, although I had a struct with public fields,
> that struct also contained an opaque struct (pointer) within it to
> hide several private fields.  (I would have liked to hide or remove a
> few more fields, but couldn't do so while the merge_recursive_options
> struct was shared between both merge-ort and merge-recursive.)
> 
> That said, I agree it can certainly be overdone, and tmp_objdir is
> pretty simple.  However, sometimes even in simple cases I like when
> folks make use of an opaque struct because it means folks put some
> effort into thinking more about the API that should be exposed.
> That's something we as a project have often overlooked in the past, as
> evidenced both by our one-shot usage mentality, and the existence of
> external projects like libgit2 attempting to correct this design
> shortcoming.  I'd like git to move more towards being structured as a
> reusable library as well as a useful command-line tool.

Right, it was definitely a conscious decision to keep the tmp-objdir API
as slim as possible, just because it's such a complicated and confusing
thing in the first place.

For something like a strbuf, giving direct access to the fields makes
sense. Exposing the details of how the struct works (like accessing
".buf" as a NUL-terminated string) are part of its usefulness.

But tmp_objdir represents a more abstract concept, and I wanted to
insulate callers from the details.

That said, the notion of "this is the path of the objdir" is not that
contentious, so I don't mind it too much (though exposing all of the
details would be a bigger jump).

-Peff
Jeff King Sept. 30, 2021, 7:46 a.m. UTC | #9
On Thu, Sep 30, 2021 at 03:26:42AM -0400, Jeff King wrote:

> > > If you remove the tmp_objdir as the primary as soon as you're done with
> > > the merge, but before you run the diff, you might be OK, though.
> > 
> > It has to be after I run the diff, because the diff needs access to
> > the temporary files to diff against them.
> 
> Right, of course. I was too fixated on the object-write part, forgetting
> that the whole point of the exercise is to later read them back. :)

Ah, no, I remember what I was trying to say here. The distinction is
between "remove the tmp_objdir" and "remove it as the primary".

I.e., if you do this:

  1. create tmp_objdir

  2. make tmp_objdir primary for writes

  3. run the "merge" half of remerge-diff, writing objects into the
     temporary space

  4. stop having tmp_objdir as the primary; instead make it an alternate

  5. run the diff

  6. remove tmp_objdir totally

Then step 5 can't accidentally write objects into the temporary space,
but it can still read them. So it's not entirely safe, but it's safer,
and it would be a much smaller change.

Some ways it could go wrong:

  - is it possible for the merge code to ever write an object? I kind of
    wonder if we'd ever do any cache-able transformations as part of a
    content-level merge. I don't think we do now, though.

  - in step 5, write_object_file() may still be confused by the presence
    of the to-be-thrown-away objects in the alternate. This is pretty
    unlikely, as it implies that the remerge-diff wrote a blob or tree
    that is byte-identical to something that the diff wants to write.

-Peff
Ævar Arnfjörð Bjarmason Sept. 30, 2021, 1:16 p.m. UTC | #10
On Thu, Sep 30 2021, Jeff King wrote:

> On Tue, Sep 28, 2021 at 09:08:00PM -0700, Junio C Hamano wrote:
>
>> Jeff King <peff@peff.net> writes:
>> 
>> >   Side note: The pretend_object_file() approach is actually even better,
>> >   because we know the object is fake. So it does not confuse
>> >   write_object_file()'s "do we already have this object" freshening
>> >   check.
>> >
>> >   I suspect it could even be made faster than the tmp_objdir approach.
>> >   From our perspective, these objects really are tempfiles. So we could
>> >   write them as such, not worrying about things like fsyncing them,
>> >   naming them into place, etc. We could just write them out, then mmap
>> >   the results, and put the pointers into cached_objects (currently it
>> >   insists on malloc-ing a copy of the input buffer, but that seems like
>> >   an easy extension to add).
>> >
>> >   In fact, I think you could get away with just _one_ tempfile per
>> >   merge. Open up one tempfile. Write out all of the objects you want to
>> >   "store" into it in sequence, and record the lseek() offsets before and
>> >   after for each object. Then mmap the whole result, and stuff the
>> >   appropriate pointers (based on arithmetic with the offsets) into the
>> >   cached_objects list.
>> 
>> Cute.  The remerge diff code path creates a full tree that records
>> the mechanical merge result.  By hooking into the lowest layer of
>> write_object() interface, we'd serialize all objects in such a tree
>> in the order they are computed (bottom up from the leaf level, I'd
>> presume) into a single flat file ;-)
>
> I do still like this approach, but just two possible gotchas I was
> thinking of:
>
>  - This side-steps all of our usual code for getting object data into
>    memory. In general, I'd expect this content to not be too enormous,
>    but it _could_ be if there are many / large blobs in the result. So
>    we may end up with large maps. Probably not a big deal on modern
>    64-bit systems. Maybe an issue on 32-bit systems, just because of
>    virtual address space.
>
>    Likewise, we do support systems with NO_MMAP. They'd work here, but
>    it would probably mean putting all that object data into the heap. I
>    could live with that, given how rare such systems are these days, and
>    that it only matters if you're using --remerge-diff with big blobs.
>
>  - I wonder to what degree --remerge-diff benefits from omitting writes
>    for objects we already have. I.e., if you are writing out a whole
>    tree representing the conflicted state, then you don't want to write
>    all of the trees that aren't interesting. Hopefully the code is
>    already figuring out which paths the merge even touched, and ignoring
>    the rest. It probably benefits performance-wise from
>    write_object_file() deciding to skip some object writes, as well
>    (e.g., for resolutions which the final tree already took, as they'd
>    be in the merge commit). The whole pretend-we-have-this-object thing
>    may want to likewise make sure we don't write out objects that we
>    already have in the real odb.

I haven't benchmarked since my core.checkCollisions RFC patch[1]
resulted in the somewhat related loose object cache patch from you, and
not with something like the midx, but just a note that on some setups
just writing things out is faster than exhaustively checking if we
absolutely need to write things out.

I also wonder how much if anything writing out the one file v.s. lots of
loose objects is worthwhile on systems where we could write out those
loose objects on a ramdisk, which is commonly available on e.g. Linux
distros these days out of the box. If you care about performance but not
about your transitory data, using a ramdisk is generally much better than
any other potential I/O optimization.

Finally, and I don't mean to throw a monkey wrench into this whole
discussion, so take this as a random musing: I wonder how much faster
this thing could be on its second run if instead of avoiding writing to
the store & cleaning up, it just wrote to the store, and then wrote
another object keyed on the git version and any revision parameters
etc., and then pretty much just had to do a "git cat-file -p <that-obj>"
to present the result to the user :)

I suppose that would be throwing a lot more work at an eventual "git gc"
than we ever do now, so maybe it's a bit crazy, but I think it might be
an interesting direction in general to (ab)use either the primary or
some secondary store in the .git dir as a semi-permanent cache of
resolved queries from the likes of "git log".

1. https://lore.kernel.org/git/20181028225023.26427-5-avarab@gmail.com/
Ævar Arnfjörð Bjarmason Sept. 30, 2021, 2:17 p.m. UTC | #11
On Tue, Sep 28 2021, Elijah Newren wrote:

> On Tue, Sep 28, 2021 at 1:00 AM Ævar Arnfjörð Bjarmason
> <avarab@gmail.com> wrote:
>>
>> On Tue, Aug 31 2021, Elijah Newren via GitGitGadget wrote:
>>
>> I commented just now on how this API is duplicated between here &
>> another in-flight series in
>> https://lore.kernel.org/git/87v92lxhh4.fsf@evledraar.gmail.com/; Just
>> comments on this patch here:
>>
>> > diff --git a/tmp-objdir.c b/tmp-objdir.c
>> > index b8d880e3626..9f75a75d1c0 100644
>> > --- a/tmp-objdir.c
>> > +++ b/tmp-objdir.c
>> > @@ -288,7 +288,36 @@ const char **tmp_objdir_env(const struct tmp_objdir *t)
>> >       return t->env.v;
>> >  }
>> >
>> > +const char *tmp_objdir_path(struct tmp_objdir *t)
>> > +{
>> > +     return t->path.buf;
>> > +}
>>
>> Not your fault & this pre-dates your patch, but FWIW I prefer our APIs
>> that don't have these "hidden struct" shenanigans (like say "struct
>> string_list") so a caller could just access this, we can just declare it
>> "const" appropriately.
>>
>> We're also all in-tree friends here, so having various accessors for no
>> reason other than to access struct members seems a bit too much.
>
> That's a fair point, but just to play counterpoint for a minute...
>
> FWIW, I dislike when our public facing APIs are polluted with all
> kinds of internal details.  merge-recursive being a case in point.
> When writing merge-ort, although I had a struct with public fields,
> that struct also contained an opaque struct (pointer) within it to
> hide several private fields.  (I would have liked to hide or remove a
> few more fields, but couldn't do so while the merge_recursive_options
> struct was shared between both merge-ort and merge-recursive.)
>
> That said, I agree it can certainly be overdone, and tmp_objdir is
> pretty simple.  However, sometimes even in simple cases I like when
> folks make use of an opaque struct because it means folks put some
> effort into thinking more about the API that should be exposed.
> That's something we as a project have often overlooked in the past, as
> evidenced both by our one-shot usage mentality, and the existence of
> external projects like libgit2 attempting to correct this design
> shortcoming.  I'd like git to move more towards being structured as a
> reusable library as well as a useful command-line tool.

[This is somewhat of a continuation of
https://lore.kernel.org/git/87lf3etaih.fsf@evledraar.gmail.com/].

I like our APIs being usable, but right now we're not promising any
maintenance of APIs that work the same as git v2.30.0 or whatever.

I think some external users directly link to libgit.a, but they're
tracking whatever churn we have from release to release, just like
someone maintaining out-of-tree linux kernel drivers.

For maintenance of git.git itself I think it's going to stay like that
for any foreseeable future[1].

And that's before we even get into the practicalities of git's license
likely not winning over many (if any) libgit2 users. I think those users
are if anything migrating from libgit2 to git's plumbing CLI, or in some
cases folding what they needed into git.git itself (or just staying with
libgit2).

So assuming all of that I really don't see the point in general of
having a "foo" field that has the equivalent of get_foo() and set_foo()
accessors in the context of git.git.

That's something that's really useful if you're maintaining API and ABI
compatibility, but for us it's usually just verbosity. All the users are
in-tree, so just add "const" as appropriate, or perhaps we should give
the field a "priv" name or whatever.

Obviously not everything should be done that way, e.g. being able to
have trace2 start/end points for something you'd otherwise push/clear
from a "struct string_list" or whatever directly *is* usable, so it's
not black and white.

I'm just saying that using convoluted patterns *just* to get in the way
of some hypothetical code that's going to mess with your internals can
make things harder in other areas. E.g. the init/release pattern via
macros I mentioned in the linked E-Mail above.

> [...]
>> In this particular case I hadn't understood on a previous reading of
>> tmp-objdir.c why this "path" isn't a statically allocated "char
path[PATH_MAX]", or at least that the hardcoded "1024" should be a
>> PATH_MAX, as it is in some other cases.
>
> Actually, PATH_MAX shouldn't be used:
>
> https://lore.kernel.org/git/7d1237c7-5a83-d766-7d93-5f0d59166067@web.de/
> https://lore.kernel.org/git/20180720180427.GE22486@sigill.intra.peff.net/

Thanks. I didn't know about that. I've fixed (some of that) up in some
semi-related WIP work I've got.

1. Aside: I *do* think it would be really useful for us to expose a
   stable C API eventually, using some of the same license
   workarounds/shenanigans (depending on your view) as the Linux kernel
   did. I.e. to say that "we're GPLv2, but this small API is the
   equivalent of our syscall interface".

   Rather than that API being something that would look anything like
   internal git.git users it would basically be a run_command() + Perl's
   unpack/pack + output formatter on steroids.

   I.e. we'd be able to invoke say a 'git cat-file --batch' in the same
   process as the caller by essentially calling cmd_cat_file()
   directly. Just with some callback replacements for "provide this on
   stdin" and "this line accepts stdout" without actually involving
   different processes or I/O.

   You'd then have the equivalent of a --format output that would just
   so happen to be in your OS's / C ABI's understood struct
   format. Similar to "perldoc -f pack".

   The maintenance burden of such an API would be trivial. It would
   basically be a handful of functions just to setup input/output
   callbacks.

   Most of the interface would be a rather convoluted construction of
   command-line parameters (like our own run_command() use), not ideal,
   but most users are after structured data and eliminating process
   overhead, and they'd get that.

   We would need to replace fprintf() and the like with some structured
   formatting mechanism, but that would be useful for non-API users as
   new commands learning a shiny new --format parameter. You could even
   use such a --format to have a non-C-API user get the same struct
   output, perhaps for direct mapping into memory, say if you didn't
   trust the built-in/command in question to not leak memory yet.

   I don't know if this has been suggested before, just my 0.02. It's
   also partially why I'm interested in making even-built-ins not leak
   memory with some of my parallel SANITIZE=leak work.

   Once you've got that you can usually call the cmd_whatever() in a
   loop without much effort, and once you've got that, well, see above
   :)
Neeraj Singh Sept. 30, 2021, 7:25 p.m. UTC | #12
On Thu, Sep 30, 2021 at 03:26:42AM -0400, Jeff King wrote:
> On Tue, Sep 28, 2021 at 10:05:31PM -0700, Elijah Newren wrote:
> 
> > So there are other parts of running `git log` that are not read-only,
> > besides what I'm doing.  That's unfortunate.  Can we just disable
> > those things in this section of the code?
> 
> It might be possible for the cache (we'd still generate entries, but
> just not write them). It's "just" an optimization, though it's possible
> there are cases where it's really annoying to do so. I'd imagine those
> are a stretch. I do use textconv to decrypt gpg blobs with a key on a
> hardware token, for example. I don't require a touch for each one,
> though, nor would I use textconv caching anyway (since the point is not
> to store the unencrypted contents).
> 
> It wouldn't necessarily be OK to disable object writes in general.
> Though I'd think since a diff is _conceptually_ a read-only operation
> that any writes it wants to do would be in the same caching /
> optimization boat.  I'd also worry about the maintenance cost of having
> to annotate any such case.
> 
> > Also, I think it makes sense that textconv caching causes a problem
> > via writing new blobs and trees *and* refs that reference these blobs
> > and trees.  However, I'm not sure I'm really grasping it, because I
> > don't see how your solutions are safe either.  I'll mention more after
> > each one.
> 
> Heh. So I originally wrote more on this in my earlier email, but worried
> I was getting too into the weeds and would confuse you, so I pared it
> down. But I managed to be confusing anyway. :)
> 
> Here's one mental model. If you have a temporary object store, then
> these are the bad things that can happen:
> 
>   1. Some code may write an object to it, thinking the data is saved,
>      but we later throw it away.
> 
>   2. Some code may _skip_ writing an object, thinking that we already
>      have that object available (when in fact we'll later throw it
>      away).
> 
>   3. Some code may record the fact that we have an object in memory, and
>      then later (after the temporary object store is not in use), may
>      point to it when writing an object or reference.
> 
> Pointing the primary object-dir at a temporary directory (like your
> patch does) means we're subject to all three.
> 
> Problem (1) happens because unrelated code may not realize we've swapped
> out the primary object writing code for this throw-away directory. But
> we could require writers to explicitly indicate that they want to write
> to the temporary area, while allowing other writes to go to the regular
> object store. This could be done with a flag or variant of
> write_object_file(). Or if the temporary area isn't a regular object
> directory at all (like the whole tempfile / pretend thing), then this
> happens automatically, because the regular object-write paths don't even
> know about our temporary area.
> 
> Problem (2) is mostly handled inside write_object_file(). It will
> optimize out writes of objects we already have. So it needs to know when
> we have an object "for real", and when we just have it in our temporary
> store. If we make the tmp_objdir the primary, then it doesn't know this
> (though we could teach it to check). In the pretend_object_file() system,
> it already knows this (it uses the "freshen" code paths which actually
> want to find the entry in the filesystem).
> 
> Problem (3) is the hardest one, because we don't really distinguish
> between readers and writers. So if I call has_object_file(), is it
> because I want to access the object? Or is it because I want to generate
> a new reference to it, which must ensure that it exists? One of those is
> OK to look at the tmp_objdir (and indeed, is the whole point of making
> it in the first place), and the other is the first step to corrupting
> the repository. ;)
> 

This is a good taxonomy. I think these problems apply to any transactional
system. If you keep state outside of the transaction, then you may wind up
with inconsistency. Problem (3) is the worst and kind of subsumes (2). If
we think an object is durable with a specific oid, we probably are going to
write that oid out somewhere else that is outside of the transaction. If
we write it to the ODB (e.g. in a tree or some blob) without first caching
it in memory, the references in the ODB will be equally durable as their referents.


> With the environment variables we set up for running external commands
> (like hooks) in the receive-pack quarantine, one bit of safety we have
> is to prevent ref writes entirely in those commands. That works
> reasonably well. Even if they may write new objects that might get
> thrown away (if the push is rejected), they'd ultimately need to point
> to those objects with a ref.  And because those hooks spend their
> _entire_ run-time in the quarantine state, they can't ever write a ref.
> 
> Whereas doing it in-process is a little dicier. Right now the textconv
> cache writes out a new tree for every entry we add, so it happens to do
> the ref write while the tmp-objdir is still in place. But as this
> comment notes:
> 
>   $ git grep -B6 notes_cache_write diff.c
>   diff.c-         /*
>   diff.c-          * we could save up changes and flush them all at the end,
>   diff.c-          * but we would need an extra call after all diffing is done.
>   diff.c-          * Since generating a cache entry is the slow path anyway,
>   diff.c-          * this extra overhead probably isn't a big deal.
>   diff.c-          */
>   diff.c:         notes_cache_write(textconv->cache);
> 
> this is slow, and it would be perfectly reasonable to flush the cache at
> the end of the process (e.g., using an atexit() handler). In which case
> we'd hit this problem (3) exactly: we'd generate and remember objects
> during the tmp-objdir period, but only later actually reference them.
> This is more likely than the receive-pack hook case, because we're doing
> it all in process.
> 

Is there state we care about besides objects and refs? Maybe we need to
introduce a transactional system for the entire repository like we have
for the refs-db.  In-process code that cares about oids that might be
only transactionally present can register a callback for when the transaction
commits or rollsback.  I'm not volunteering to write this, btw :).

Right now, disabling Ref updates is a pretty good compromise, since that's the
one definite piece of durable state outside the ODB that can point in. In-memory
state is generally hard to solve and could wind up persisting in either the ODB
or the Refs or both.

> > > If you remove the tmp_objdir as the primary as soon as you're done with
> > > the merge, but before you run the diff, you might be OK, though.
> > 
> > It has to be after I run the diff, because the diff needs access to
> > the temporary files to diff against them.
> 
> Right, of course. I was too fixated on the object-write part, forgetting
> that the whole point of the exercise is to later read them back. :)
> 
> > > If not, then I think the solution is probably not to install this as the
> > > "primary", but rather:
> > >
> > >   - do the specific remerge-diff writes we want using a special "write
> > >     to this object dir" variant of write_object_file()
> > >
> > >   - install the tmp_objdir as an alternate, so the reading side (which
> > >     is diff code that doesn't otherwise know about our remerge-diff
> > >     trickery) can find them
> > >
> > > And that lets other writers avoid writing into the temporary area
> > > accidentally.
> > 
> > Doesn't this have the same problem?  If something similar to textconv
> > caching were to create new trees that referenced the existing blobs
> > and then added a ref that referred to that tree, then you have the
> > exact same problem, right?  The new tree and the new ref would be
> > corrupt as soon as the tmp-objdir went away.  Whether or not other
> > callers write to the temporary area isn't the issue, it's whether they
> > write refs that can refer to anything from the temporary area.  Or am
> > I missing something?
> 
> The key thing here is in the first step, where remerge-diff is
> explicitly saying "I want to write to this temporary object-dir". But
> the textconv-cache code does not; it gets to write to the normal spot.
> So it avoids problem (1).
> 
> You're right that it does not avoid problem (3) exactly. But triggering
> that would require some code not just writing other objects or
> references while the tmp-objdir is in place, but specifically
> referencing the objects that remerge-diff did put into the tmp-objdir.
> That seems a lot less likely to me (because the thing we're most worried
> about is unrelated code that just happens to write while the tmp-objdir
> is in place).
> 
> > > In that sense this is kind of like the pretend_object_file() interface,
> > > except that it's storing the objects on disk instead of in memory. Of
> > > course another way of doing that would be to stuff the object data into
> > > tempfiles and just put pointers into the in-memory cached_objects array.
> > >
> > > It's also not entirely foolproof (nor is the existing
> > > pretend_object_file()). Any operation which is fooled into thinking we
> > > have object X because it's in the fake-object list or the tmp_objdir may
> > > reference that object erroneously, creating a corruption. But it's still
> > > safer than allowing arbitrary writes into the tmp_objdir.
> > 
> > Why is the pretend_object_file() interface safer?  Does it disable the
> > textconv caching somehow?  I don't see why it helps avoid this
> > problem.
> 
> So hopefully it's clearer now from what I wrote above, but just
> connecting the dots:
> 
>   1. Unrelated code calling write_object_file() would still write real,
>      durable objects, as usual.
> 
>   2. The write_object_file() "skip this write" optimization already
>      ignores the pretend_object_file() objects while checking "do we
>      have this object".
> 
> > Interesting.  I'm not familiar with the pretend_object_file() code,
> > but I think I catch what you're saying.  I don't see anything in
> > object-file.c that allows removing items from the cache, as I would
> > want to do, but I'm sure I could add something.
> 
> Right, it's pretty limited now, and there's only one caller. And in fact
> I've wanted to get rid of it, exactly because it can create the danger
> we're discussing here. But I still think it's _less_ dangerous than
> fully replacing the object directory.
> 
> > (I'm still not sure how this avoids the problem of things like
> > textconv caching trying to write an object that references one of
> > these, though.  Should we just manually disable textconv caching
> > during a remerge-diff?)
> 
> It can't avoid it completely, but in general, the textconv cache should
> be writing and referencing its own objects. Even if they happen to be
> the same as something remerge-diff generated, it won't know that until
> it has tried to write it (and so that's why having write_object_file()
> continue to really write the object is important).
> 
> 
> Hopefully that clears up my line of thinking. I do think a lot of this
> is kind of theoretical, in the sense that:
> 
>   - textconv caching is an obscure feature that we _could_ just disable
>     during remerge-diffs
> 
>   - there's a reasonable chance that there isn't any other code that
>     wants to write during a diff
> 
> But this is all scary and error-prone enough that I'd prefer an approach
> that has the "least surprise". So any solution where random code calling
> write_object_file() _can't_ accidentally write to a throw-away directory
> seems like a safer, less surprising thing.
> 

There is something nice to be said for having the cached state go away
with the ephemeral objects, since that caching refers to objects that
may never be seen again.

Any solution which makes in-flight transactional state visible (either as
the primary writing spot or as a readable alternate) will suffer from state
leakage for non-transactionally-aware readers/writers.

The bulk-fsync code has some advantage in that it is more constrained in what
it's calling while the tmp objdir is active. Also, in typical cases, the
tmp objdir will be committed rather than discarded. But I believe it suffers
from the same essential issue.

Can we press forward, and just try to make sure all the in-process stuff
is aware of temporary object state?  Maybe we can hook up the temp odb stuff
with the refsdb transactions into a single system.

We can make the tmp object directory stuff more prevalent by using it in
git-apply and the merge strategy, and by using bulk-checkin while writing out
the cached tree and commits. This should flush out any issues and keep other
parts of the codebase honest.

> It does mean that the remerge-diff code needs to be explicit in its
> object writes to say "this needs to go to the temporary area" (whether
> it's a flag, or calling into a totally different function or subsystem).
> I'm hoping that's not hard to do (because its writes are done explicitly
> by the remerge-diff code), but I worry that it may be (because you are
> relying on more generic tree code to do the write under the hood.
> 

I don't think that really helps the trickiest problems. The textconv cache,
for instance, may now be writing notes that refer to nonexistent objects.
That's fine from a correctness point of view, but now we're polluting the repo
with state that won't be gc'ed and won't ever be accessed.

IMO, it might be better for the textconv cache to have a single notes root
tree (rather than a notes 'branch' with a commit history). We would want to
put a ref to this tree somewhere inside the object directory itself so that
it's transactionally consistent with the rest of the object state. For instance,
we could store the current textconv notes tree root in .git/objects/info/textconv-cache.

Thanks,
Neeraj
Junio C Hamano Sept. 30, 2021, 8 p.m. UTC | #13
Jeff King <peff@peff.net> writes:

>> > If not, then I think the solution is probably not to install this as the
>> > "primary", but rather:
>> >
>> >   - do the specific remerge-diff writes we want using a special "write
>> >     to this object dir" variant of write_object_file()
>> >
>> >   - install the tmp_objdir as an alternate, so the reading side (which
>> >     is diff code that doesn't otherwise know about our remerge-diff
>> >     trickery) can find them
>> >
>> > And that lets other writers avoid writing into the temporary area
>> > accidentally.
>> 
>> ...
>
> The key thing here is in the first step, where remerge-diff is
> explicitly saying "I want to write to this temporary object-dir". But
> the textconv-cache code does not; it gets to write to the normal spot.
> So it avoids problem (1).
>
> You're right that it does not avoid problem (3) exactly. But triggering
> that would require some code not just writing other objects or
> references while the tmp-objdir is in place, but specifically
> referencing the objects that remerge-diff did put into the tmp-objdir.
> That seems a lot less likely to me (because the thing we're most worried
> about is unrelated code that just happens to write while the tmp-objdir
> is in place).

I do not offhand remember if a new object creation codepath that
avoids writing an object that we already have considers an object we
find in an alternate as "we have it".  If it does not and we make a
duplicated copy in our primary object store, then it would be OK,
but otherwise, those that know tmp_objdir is an alternate may still
try to write a new object and find that the same object already
exists in an alternate.  Once tmp_objdir is gone, we would end up
with a corrupt repository, right?

freshen_loose_object() seems to go to check_and_freshen() which
consults check_and_freshen_nonlocal() before returning yes/no, so
having the same object in an alternate does seem to count as having
one.

> So hopefully it's clearer now from what I wrote above, but just
> connecting the dots:
>
>   1. Unrelated code calling write_object_file() would still write real,
>      durable objects, as usual.
>
>   2. The write_object_file() "skip this write" optimization already
>      ignores the pretend_object_file() objects while checking "do we
>      have this object".

Yup for (2)---the pretend interface is safer than alternate in that
sense.

But the pretend interface was designed to just hold a handful of
phoney objects in core.  I am not sure if it is a good fit for this
use case.

> But this is all scary and error-prone enough that I'd prefer an approach
> that has the "least surprise". So any solution where random code calling
> write_object_file() _can't_ accidentally write to a throw-away directory
> seems like a safer, less surprising thing.
>
> It does mean that the remerge-diff code needs to be explicit in its
> object writes to say "this needs to go to the temporary area" (whether
> it's a flag, or calling into a totally different function or subsystem).

... and also writes by others that happen to write the same object
as what is already in the temporary area should be made to the real
object store.  Then we would be much safer.

> I'm hoping that's not hard to do (because its writes are done explicitly
> by the remerge-diff code), but I worry that it may be (because you are
> relying on more generic tree code to do the write under the hood.

Thanks.
Junio C Hamano Sept. 30, 2021, 8:06 p.m. UTC | #14
Jeff King <peff@peff.net> writes:

>   - is it possible for the merge code to ever write an object? I kind of
>     wonder if we'd ever do any cache-able transformations as part of a
>     content-level merge. I don't think we do now, though.

How is a virtual merge base, the result of an inner merge that recursively
merges two merge bases, fed to an outer merge as the ancestor?
Isn't it written as a tree object by the inner merge as a tree with
full of blob objects, so the outer merge does not have to treat a
merge base tree that is virtual any specially from a sole merge
base?
Junio C Hamano Sept. 30, 2021, 8:19 p.m. UTC | #15
Neeraj Singh <nksingh85@gmail.com> writes:

> Is there state we care about besides objects and refs? Maybe we need to
> introduce a transactional system for the entire repository like we have
> for the refs-db.  In-process code that cares about oids that might be
> only transactionally present can register a callback for when the transaction
> commits or rollsback.  I'm not volunteering to write this, btw :).

I take it to mean by "objects and refs" that we care only about
objects that are reachable from the refs?  The index (the primary
index array plus extensions that record object names) also has
anchoring points we care about (ask "git fsck" what they are).

> We can make the tmp object directory stuff more prevalent by using it in
> git-apply and the merge strategy, and by using bulk-checkin while writing out
> the cached tree and commits. This should flush out any issues and keep other
> parts of the codebase honest.

I wonder what's so wrong about cruft objects in the main object
store that may be gc-ed away later, though.  IOW, I wonder if it is
simpler to treat the quarantine used while accepting dubious foreign
data via "git push" in receive-pack as an exception, not the norm.
Jeff King Sept. 30, 2021, 9 p.m. UTC | #16
On Thu, Sep 30, 2021 at 03:16:19PM +0200, Ævar Arnfjörð Bjarmason wrote:

> I also wonder how much if anything writing out the one file v.s. lots of
> loose objects is worthwhile on systems where we could write out those
> loose objects on a ramdisk, which is commonly available on e.g. Linux
> distros these days out of the box. If you care about performance but not
> about your transitory data using a ramdisk is generally much better than
> any other potential I/O optimization.

I'd think in general we won't be using a ramdisk, because tmp_objdir is
putting its directory inside $GIT_DIR/objects. It doesn't _have_ to, but
using a ramdisk works against its original purpose (which was to store
potentially quite a lot of data from an incoming push, and to be able to
rename it cheaply into its final resting place).

It would probably not be too hard to provide a flag that indicates the
intended use, though (and then we decide where to create the temporary
directory based on that).

> Finally, and I don't mean to throw a monkey wrench into this whole
> discussion, so take this as a random musing: I wonder how much faster
> this thing could be on its second run if instead of avoiding writing to
> the store & cleaning up, it just wrote to the store, and then wrote
> another object keyed on the git version and any revision parameters
> etc., and then pretty much just had to do a "git cat-file -p <that-obj>"
> to present the result to the user :)
> 
> I suppose that would be throwing a lot more work at an eventual "git gc"
> than we ever do now, so maybe it's a bit crazy, but I think it might be
> an interesting direction in general to (ab)use either the primary or
> some secondary store in the .git dir as a semi-permanent cache of
> resolved queries from the likes of "git log".

I don't think it's crazy to just write the objects to the main object
store. We already generate cruft objects for some other operations
(Junio asked elsewhere in the thread about virtual trees for recursive
merges; I don't know the answer offhand, but I'd guess we do there).
They do get cleaned up eventually.

I'm not sure it helps performance much by itself. In a merge (or even
just writing a tree out from the index), by the time you realize you
already have the object, you've done most of the work to generate it.

I think what you're describing is to make some kind of cache structure
on top. That might be sensible (and indeed, the index already does this
with the cachetree extension). But it can also easily come later if the
objects are just in the regular odb.

-Peff
Elijah Newren Oct. 1, 2021, 2:23 a.m. UTC | #17
On Thu, Sep 30, 2021 at 12:26 AM Jeff King <peff@peff.net> wrote:
>
> On Tue, Sep 28, 2021 at 10:05:31PM -0700, Elijah Newren wrote:
>
> > So there are other parts of running `git log` that are not read-only,
> > besides what I'm doing.  That's unfortunate.  Can we just disable
> > those things in this section of the code?
>
> It might be possible for the cache (we'd still generate entries, but
> just not write them). It's "just" an optimization, though it's possible
> there are cases where it's really annoying to do so. I'd imagine those
> are a stretch. I do use textconv to decrypt gpg blobs with a key on a
> hardware token, for example. I don't require a touch for each one,
> though, nor would I use textconv caching anyway (since the point is not
> to store the unencrypted contents).
>
> It wouldn't necessarily be OK to disable object writes in general.
> Though I'd think since a diff is _conceptually_ a read-only operation
> that any writes it wants to do would be in the same caching /
> optimization boat.  I'd also worry about the maintenance cost of having
> to annotate any such case.
>
> > Also, I think it makes sense that textconv caching causes a problem
> > via writing new blobs and trees *and* refs that reference these blobs
> > and trees.  However, I'm not sure I'm really grasping it, because I
> > don't see how your solutions are safe either.  I'll mention more after
> > each one.
>
> Heh. So I originally wrote more on this in my earlier email, but worried
> I was getting too into the weeds and would confuse you, so I pared it
> down. But I managed to be confusing anyway. :)
>
> Here's one mental model. If you have a temporary object store, then
> these are the bad things that can happen:
>
>   1. Some code may write an object to it, thinking the data is saved,
>      but we later throw it away.
>
>   2. Some code may _skip_ writing an object, thinking that we already
>      have that object available (when in fact we'll later throw it
>      away).
>
>   3. Some code may record the fact that we have an object in memory, and
>      then later (after the temporary object store is not in use), may
>      point to it when writing an object or reference.
>
> Pointing the primary object-dir at a temporary directory (like your
> patch does) means we're subject to all three.
>
> Problem (1) happens because unrelated code may not realize we've swapped
> out the primary object writing code for this throw-away directory. But
> we could require writers to explicitly indicate that they want to write
> to the temporary area, while allowing other writes to go to the regular
> object store. This could be done with a flag or variant of
> write_object_file(). Or if the temporary area isn't a regular object
> directory at all (like the whole tempfile / pretend thing), then this
> happens automatically, because the regular object-write paths don't even
> know about our temporary area.
>
> Problem (2) is mostly handled inside write_object_file(). It will
> optimize out writes of objects we already have. So it needs to know when
> we have an object "for real", and when we just have it in our temporary
> store. If we make the tmp_objdir the primary, then it doesn't know this
> (though we could teach it to check. In the pretend_object_file() system,
> it already knows this (it uses the "freshen" code paths which actually
> want to find the entry in the filesystem).
>
> Problem (3) is the hardest one, because we don't really distinguish
> between readers and writers. So if I call has_object_file(), is it
> because I want to access the object? Or is it because I want to generate
> a new reference to it, which must ensure that it exists? One of those is
> OK to look at the tmp_objdir (and indeed, is the whole point of making
> it in the first place), and the other is the first step to corrupting
> the repository. ;)
>
> With the environment variables we set up for running external commands
> (like hooks) in the receive-pack quarantine, one bit of safety we have
> is to prevent ref writes entirely in those commands. That works
> reasonably well. Even if they may write new objects that might get
> thrown away (if the push is rejected), they'd ultimately need to point
> to those objects with a ref.  And because those hooks spend their
> _entire_ run-time in the quarantine state, they can't ever write a ref.
>
> Whereas doing it in-process is a little dicier. Right now the textconv
> cache writes out a new tree for every entry we add, so it happens to do
> the ref write while the tmp-objdir is still in place. But as this
> comment notes:
>
>   $ git grep -B6 notes_cache_write diff.c
>   diff.c-         /*
>   diff.c-          * we could save up changes and flush them all at the end,
>   diff.c-          * but we would need an extra call after all diffing is done.
>   diff.c-          * Since generating a cache entry is the slow path anyway,
>   diff.c-          * this extra overhead probably isn't a big deal.
>   diff.c-          */
>   diff.c:         notes_cache_write(textconv->cache);
>
> this is slow, and it would be perfectly reasonable to flush the cache at
> the end of the process (e.g., using an atexit() handler). In which case
> we'd hit this problem (3) exactly: we'd generate and remember objects
> during the tmp-objdir period, but only later actually reference them.
> This is more likely than the receive-pack hook case, because we're doing
> it all in process.
>
> > > If you remove the tmp_objdir as the primary as soon as you're done with
> > > the merge, but before you run the diff, you might be OK, though.
> >
> > It has to be after I run the diff, because the diff needs access to
> > the temporary files to diff against them.
>
> Right, of course. I was too fixated on the object-write part, forgetting
> that the whole point of the exercise is to later read them back. :)
>
> > > If not, then I think the solution is probably not to install this as the
> > > "primary", but rather:
> > >
> > >   - do the specific remerge-diff writes we want using a special "write
> > >     to this object dir" variant of write_object_file()
> > >
> > >   - install the tmp_objdir as an alternate, so the reading side (which
> > >     is diff code that doesn't otherwise know about our remerge-diff
> > >     trickery) can find them
> > >
> > > And that lets other writers avoid writing into the temporary area
> > > accidentally.
> >
> > Doesn't this have the same problem?  If something similar to textconv
> > caching were to create new trees that referenced the existing blobs
> > and then added a ref that referred to that tree, then you have the
> > exact same problem, right?  The new tree and the new ref would be
> > corrupt as soon as the tmp-objdir went away.  Whether or not other
> > callers write to the temporary area isn't the issue, it's whether they
> > write refs that can refer to anything from the temporary area.  Or am
> > I missing something?
>
> The key thing here is in the first step, where remerge-diff is
> explicitly saying "I want to write to this temporary object-dir". But
> the textconv-cache code does not; it gets to write to the normal spot.
> So it avoids problem (1).
>
> You're right that it does not avoid problem (3) exactly. But triggering
> that would require some code not just writing other objects or
> references while the tmp-objdir is in place, but specifically
> referencing the objects that remerge-diff did put into the tmp-objdir.
> That seems a lot less likely to me (because the thing we're most worried
> about is unrelated code that just happens to write while the tmp-objdir
> is in place).
>
> > > In that sense this is kind of like the pretend_object_file() interface,
> > > except that it's storing the objects on disk instead of in memory. Of
> > > course another way of doing that would be to stuff the object data into
> > > tempfiles and just put pointers into the in-memory cached_objects array.
> > >
> > > It's also not entirely foolproof (nor is the existing
> > > pretend_object_file()). Any operation which is fooled into thinking we
> > > have object X because it's in the fake-object list or the tmp_objdir may
> > > reference that object erroneously, creating a corruption. But it's still
> > > safer than allowing arbitrary writes into the tmp_objdir.
> >
> > Why is the pretend_object_file() interface safer?  Does it disable the
> > textconv caching somehow?  I don't see why it helps avoid this
> > problem.
>
> So hopefully it's clearer now from what I wrote above, but just
> connecting the dots:
>
>   1. Unrelated code calling write_object_file() would still write real,
>      durable objects, as usual.
>
>   2. The write_object_file() "skip this write" optimization already
>      ignores the pretend_object_file() objects while checking "do we
>      have this object".
>
> > Interesting.  I'm not familiar with the pretend_object_file() code,
> > but I think I catch what you're saying.  I don't see anything in
> > object-file.c that allows removing items from the cache, as I would
> > want to do, but I'm sure I could add something.
>
> Right, it's pretty limited now, and there's only one caller. And in fact
> I've wanted to get rid of it, exactly because it can create the danger
> we're discussing here. But I still think it's _less_ dangerous than
> fully replacing the object directory.
>
> > (I'm still not sure how this avoids the problem of things like
> > textconv caching trying to write an object that references one of
> > these, though.  Should we just manually disable textconv caching
> > during a remerge-diff?)
>
> It can't avoid it completely, but in general, the textconv cache should
> be writing and referencing its own objects. Even if they happen to be
> the same as something remerge-diff generated, it won't know that until
> it has tried to write it (and so that's why having write_object_file()
> continue to really write the object is important).
>
>
> Hopefully that clears up my line of thinking. I do think a lot of this
> is kind of theoretical, in the sense that:
>
>   - textconv caching is an obscure feature that we _could_ just disable
>     during remerge-diffs
>
>   - there's a reasonable chance that there isn't any other code that
>     wants to write during a diff
>
> But this is all scary and error-prone enough that I'd prefer an approach
> that has the "least surprise". So any solution where random code calling
> write_object_file() _can't_ accidentally write to a throw-away directory
> seems like a safer, less surprising thing.

Yes, thanks for taking the time to go into this detail.

> It does mean that the remerge-diff code needs to be explicit in its
> object writes to say "this needs to go to the temporary area" (whether
> it's a flag, or calling into a totally different function or subsystem).
> I'm hoping that's not hard to do (because its writes are done explicitly
> by the remerge-diff code), but I worry that it may be (because you are
> relying on more generic tree code to do the write under the hood.

All blob and tree objects written during merge-ort are done by
explicit write_object_file() calls that exist within merge-ort.c.  So
this approach should be doable, it just needs the external caller to
set some flag saying to write these objects elsewhere.
Elijah Newren Oct. 1, 2021, 2:31 a.m. UTC | #18
On Thu, Sep 30, 2021 at 12:46 AM Jeff King <peff@peff.net> wrote:
>
> On Thu, Sep 30, 2021 at 03:26:42AM -0400, Jeff King wrote:
>
> > > > If you remove the tmp_objdir as the primary as soon as you're done with
> > > > the merge, but before you run the diff, you might be OK, though.
> > >
> > > It has to be after I run the diff, because the diff needs access to
> > > the temporary files to diff against them.
> >
> > Right, of course. I was too fixated on the object-write part, forgetting
> > that the whole point of the exercise is to later read them back. :)
>
> Ah, no, I remember what I was trying to say here. The distinction is
> between "remove the tmp_objdir" and "remove it as the primary".
>
> I.e., if you do this:
>
>   1. create tmp_objdir
>
>   2. make tmp_objdir primary for writes
>
>   3. run the "merge" half of remerge-diff, writing objects into the
>      temporary space
>
>   4. stop having tmp_objdir as the primary; instead make it an alternate
>
>   5. run the diff
>
>   6. remove tmp_objdir totally
>
> Then step 5 can't accidentally write objects into the temporary space,
> but it can still read them. So it's not entirely safe, but it's safer,
> and it would be a much smaller change.

Interesting.

> Some ways it could go wrong:
>
>   - is it possible for the merge code to ever write an object? I kind of
>     wonder if we'd ever do any cache-able transformations as part of a
>     content-level merge. I don't think we do now, though.

Yes, of course -- otherwise there would have been no need for the
tmp_objdir in the first place.  In particular, it needs to write
three-way-content merges of individual files, and it needs to write
new tree objects.  (And it needs to do this both for creating the
virtual merge bases if the merge is recursive, as well as doing it for
the outer merge.)

It doesn't write anything for caching reasons, such as line ending
normalizations (that's all kept in-memory and done on demand).

>   - in step 5, write_object_file() may still be confused by the presence
>     of the to-be-thrown-away objects in the alternate. This is pretty
>     unlikely, as it implies that the remerge-diff wrote a blob or tree
>     that is byte-identical to something that the diff wants to write.

That's one reason it could be confused.  The textconv filtering in
particular was creating a new object based on an existing one, and a
tree, and a ref.  What if there was some other form of caching or
statistic gathering that didn't write a new object based on an
existing one, but did add trees and especially refs that referenced
the existing object?  It's not that the diff wanted to write something
byte-identical to what the merge wrote, it's just that the diff wants
to reference the object somehow.
Elijah Newren Oct. 1, 2021, 3:11 a.m. UTC | #19
On Thu, Sep 30, 2021 at 6:31 AM Ævar Arnfjörð Bjarmason
<avarab@gmail.com> wrote:
>
> On Thu, Sep 30 2021, Jeff King wrote:
>
> > On Tue, Sep 28, 2021 at 09:08:00PM -0700, Junio C Hamano wrote:
> >
> >> Jeff King <peff@peff.net> writes:
> >>
> >> >   Side note: The pretend_object_file() approach is actually even better,
> >> >   because we know the object is fake. So it does not confuse
> >> >   write_object_file()'s "do we already have this object" freshening
> >> >   check.
> >> >
> >> >   I suspect it could even be made faster than the tmp_objdir approach.
> >> >   From our perspective, these objects really are tempfiles. So we could
> >> >   write them as such, not worrying about things like fsyncing them,
> >> >   naming them into place, etc. We could just write them out, then mmap
> >> >   the results, and put the pointers into cached_objects (currently it
> >> >   insists on malloc-ing a copy of the input buffer, but that seems like
> >> >   an easy extension to add).
> >> >
> >> >   In fact, I think you could get away with just _one_ tempfile per
> >> >   merge. Open up one tempfile. Write out all of the objects you want to
> >> >   "store" into it in sequence, and record the lseek() offsets before and
> >> >   after for each object. Then mmap the whole result, and stuff the
> >> >   appropriate pointers (based on arithmetic with the offsets) into the
> >> >   cached_objects list.
> >>
> >> Cute.  The remerge diff code path creates a full tree that records
> >> the mechanical merge result.  By hooking into the lowest layer of
> >> write_object() interface, we'd serialize all objects in such a tree
> >> in the order they are computed (bottom up from the leaf level, I'd
> >> presume) into a single flat file ;-)
> >
> > I do still like this approach, but just two possible gotchas I was
> > thinking of:
> >
> >  - This side-steps all of our usual code for getting object data into
> >    memory. In general, I'd expect this content to not be too enormous,
> >    but it _could_ be if there are many / large blobs in the result. So
> >    we may end up with large maps. Probably not a big deal on modern
> >    64-bit systems. Maybe an issue on 32-bit systems, just because of
> >    virtual address space.
> >
> >    Likewise, we do support systems with NO_MMAP. They'd work here, but
> >    it would probably mean putting all that object data into the heap. I
> >    could live with that, given how rare such systems are these days, and
> >    that it only matters if you're using --remerge-diff with big blobs.
> >
> >  - I wonder to what degree --remerge-diff benefits from omitting writes
> >    for objects we already have. I.e., if you are writing out a whole
> >    tree representing the conflicted state, then you don't want to write
> >    all of the trees that aren't interesting. Hopefully the code is
> >    already figuring out which paths the merge even touched, and ignoring
> >    the rest. It probably benefits performance-wise from
> >    write_object_file() deciding to skip some object writes, as well
> >    (e.g., for resolutions which the final tree already took, as they'd
> >    be in the merge commit). The whole pretend-we-have-this-object thing
> >    may want to likewise make sure we don't write out objects that we
> >    already have in the real odb.
>
> I haven't benchmarked since my core.checkCollisions RFC patch[1]
> resulted in the somewhat related loose object cache patch from you, and
> not with something like the midx, but just a note that on some setups
> just writing things out is faster than exhaustively checking if we
> absolutely need to write things out.
>
> I also wonder how much if anything writing out the one file v.s. lots of
> loose objects is worthwhile on systems where we could write out those
> loose objects on a ramdisk, which is commonly available on e.g. Linux
> distros these days out of the box. If you care about performance but not
> about your transitory data using a ramdisk is generally much better than
> any other potential I/O optimization.
>
> Finally, and I don't mean to throw a monkey wrench into this whole
> discussion, so take this as a random musing: I wonder how much faster
> this thing could be on its second run if instead of avoiding writing to
> the store & cleaning up, it just wrote to the store, and then wrote

It'd be _much_ slower.  My first implementation in fact did that; it
just wrote objects to the store, left them there, and didn't bother to
do any auto-gcs.  It slowed down quite a bit as it ran.  Adding
auto-gc's during the run were really slow too. But stepping back,
gc'ing objects that I already knew were garbage seemed like a waste;
why not just prune them pre-emptively?  To do so, though, I'd have to
track all the individual objects I added to make sure I didn't prune
something else.  Following that idea and a few different attempts
eventually led me to the discovery of tmp_objdir.

In case it's not clear to you why just writing all the objects to the
normal store and leaving them there slows things down so much...

Let's say 1 in 10 merges had originally needed some kind of conflict
resolution (either in the outer merge or in the inner merge for the
virtual merge bases), meaning that 9 out of 10 merges traversed by
--remerge-diff don't write any objects.  Now for each merge for which
--remerge-diff does need to create conflicted blobs and new trees,
let's say it writes on average 3 blobs and 7 trees.  (I don't know the
real average numbers, it could well be ~5 total, but ~10 seems a
realistic first order approximation and it makes the math easy.)
Then, if we keep all objects we write, `git log --remerge-diff`
on a history with 100,000 merge commits, will have added 100,000 loose
objects by the time it finishes.  That means that all diff and merge
operations slow down considerably as it runs due to all the extra
loose objects.

> another object keyed on the git version and any revision parameters
> etc., and then pretty much just had to do a "git cat-file -p <that-obj>"
> to present the result to the user :)

So caching the full `git log ...` output, based on a hash of the
command line flags, and then merely re-showing it later?  And having
that output be invalidated as soon as any head advances?  Or are you
thinking of caching the output per-commit based on a hash of the other
command line flags...potentially slowing non-log operations down with
the huge number of loose objects?

> I suppose that would be throwing a lot more work at an eventual "git gc"
> than we ever do now, so maybe it's a bit crazy, but I think it might be
> an interesting direction in general to (ab)use either the primary or
> some secondary store in the .git dir as a semi-permanent cache of
> resolved queries from the likes of "git log".

If you do per-commit caching, and the user scrolls through enough
output (not hard to do, just searching the output for some string
often is enough), that "eventual" git-gc will be the very next git
operation.  If you cache the entire output, it'll be invalidated
pretty quickly.  So I don't see how this works.  Or am I
misunderstanding something you're suggesting here?
Elijah Newren Oct. 1, 2021, 3:55 a.m. UTC | #20
On Thu, Sep 30, 2021 at 7:39 AM Ævar Arnfjörð Bjarmason
<avarab@gmail.com> wrote:
>
> On Tue, Sep 28 2021, Elijah Newren wrote:
>
> > On Tue, Sep 28, 2021 at 1:00 AM Ævar Arnfjörð Bjarmason
> > <avarab@gmail.com> wrote:
> >>
> >> On Tue, Aug 31 2021, Elijah Newren via GitGitGadget wrote:
> >>
> >> I commented just now on how this API is duplicated between here &
> >> another in-flight series in
> >> https://lore.kernel.org/git/87v92lxhh4.fsf@evledraar.gmail.com/; Just
> >> comments on this patch here:
> >>
> >> > diff --git a/tmp-objdir.c b/tmp-objdir.c
> >> > index b8d880e3626..9f75a75d1c0 100644
> >> > --- a/tmp-objdir.c
> >> > +++ b/tmp-objdir.c
> >> > @@ -288,7 +288,36 @@ const char **tmp_objdir_env(const struct tmp_objdir *t)
> >> >       return t->env.v;
> >> >  }
> >> >
> >> > +const char *tmp_objdir_path(struct tmp_objdir *t)
> >> > +{
> >> > +     return t->path.buf;
> >> > +}
> >>
> >> Not your fault & this pre-dates your patch, but FWIW I prefer our APIs
> >> that don't have these "hidden struct" shenanigans (like say "struct
> >> string_list") so a caller could just access this, we can just declare it
> >> "const" appropriately.
> >>
> >> We're also all in-tree friends here, so having various accessors for no
> >> reason other than to access struct members seems a bit too much.
> >
> > That's a fair point, but just to play counterpoint for a minute...
> >
> > FWIW, I dislike when our public facing APIs are polluted with all
> > kinds of internal details.  merge-recursive being a case in point.
> > When writing merge-ort, although I had a struct with public fields,
> > that struct also contained an opaque struct (pointer) within it to
> > hide several private fields.  (I would have liked to hide or remove a
> > few more fields, but couldn't do so while the merge_recursive_options
> > struct was shared between both merge-ort and merge-recursive.)
> >
> > That said, I agree it can certainly be overdone, and tmp_objdir is
> > pretty simple.  However, sometimes even in simple cases I like when
> > folks make use of an opaque struct because it means folks put some
> > effort into thinking more about the API that should be exposed.
> > That's something we as a project have often overlooked in the past, as
> > evidenced both by our one-shot usage mentality, and the existence of
> > external projects like libgit2 attempting to correct this design
> > shortcoming.  I'd like git to move more towards being structured as a
> > reusable library as well as a useful command-line tool.
>
> [This is somewhat of a continuation of
> https://lore.kernel.org/git/87lf3etaih.fsf@evledraar.gmail.com/].
>
> I like our APIs being usable, but right now we're not promising any
> maintenance of APIs that work the same as git v2.30.0 or whatever.
>
> I think some external users directly link to libgit.a, but they're
> tracking whatever churn we have from release to release, just like
> someone maintaining out-of-tree linux kernel drivers.
>
> For maintenance of git.git itself I think it's going to stay like that
> for any foreseeable future[1].
>
> And that's before we even get into the practicalities of git's license
> likely not winning over many (if any) libgit2 users. I think those users
> are if anything migrating from libgit2 to git's plumbing CLI, or in some
> cases folding what they needed into git.git itself (or just staying with
> libgit2).
>
> So assuming all of that I really don't see the point in general of
> having a "foo" field that has the equivalent of get_foo() and set_foo()
> accessors in the context of git.git.
>
> That's something that's really useful if you're maintaining API and ABI
> compatibility, but for us it's usually just verbosity. All the users are
> in-tree, so just add "const" as appropriate, or perhaps we should give
> the field a "priv" name or whatever.

It wasn't just get or set accessors, and there was more involved
than just adding const:
  * There was a typechange from struct strbuf -> const char *; strbuf != char *
  * Having an API that _enforces_ accesses to be const provides more
safety.  Relying on callers to remember to declare their variable
access as const provides virtually none.

But more importantly than either of the above, tmp_objdir was locked
down because it was too easy to make big mistakes with, as Peff
pointed out.  That meant that folks would only use it through the API,
and if folks tried to do unusual things, they'd do so by adding new
API, which is easier to flag in code reviews and make sure the right
people get cc'ed than if folks just reach into some internals from
some other file.  And the new API that I added resulted in a
discussion that pointed out I didn't need to access that field at all
and should instead create and delete the tmp_objdir after every merge.

> Obviously not everything should be done that way, e.g. being able to
> have trace2 start/end points for something you'd otherwise push/clear
> from a "struct string_list" or whatever directly *is* usable, so it's
> not black and white.
>
> I'm just saying that using convoluted patterns *just* to get in the way
> of some hypothetical code that's going to mess with your internals can
> make things harder in other areas. E.g. the init/release pattern via
> macros I mentioned in the linked E-Mail above.

Fair points, but I just don't think that's relevant to the context
here; i.e. these aren't convoluted patterns *just* to get in the way.

Further, I don't see how this would adversely affect the *_INIT macros
you mentioned at all.  tmp_objdir is an opaque struct so you can only
declare a pointer to it.  Pointers being initialized to NULL is
handled quite nicely by the *_INIT macros.

> > [...]
> >> In this particular case I hadn't understood on a previous reading of
> >> tmp-objdir.c why this "path" isn't a statically allocated "char
> >> path[PATH_MAX]", or a least we that hardcoded "1024" should be a
> >> PATH_MAX, as it is in some other cases.
> >
> > Actually, PATH_MAX shouldn't be used:
> >
> > https://lore.kernel.org/git/7d1237c7-5a83-d766-7d93-5f0d59166067@web.de/
> > https://lore.kernel.org/git/20180720180427.GE22486@sigill.intra.peff.net/
>
> Thanks. I didn't know about that. I've fixed (some of that) up in some
> semi-related WIP work I've got.
>
> 1. Aside: I *do* think it would be really useful for us to expose a
>    stable C API eventually, using some of the same license
>    workarounds/shenanigans (depending on your view) as the Linux kernel
>    did. I.e. to say that "we're GPLv2, but this small API is the
>    equivalent of our syscall interface".
>
>    Rather than that API being something that would look anything like
>    internal git.git users it would basically be a run_command() + Perl's
>    unpack/pack + output formatter on steroids.
>
>    I.e. we'd be able to invoke say a 'git cat-file --batch' in the same
>    process as the caller by essentially calling cmd_cat_file()
>    directly. Just with some callback replacements for "provide this on
>    stdin" and "this line accepts stdout" without actually involving
>    different processes or I/O.
>
>    You'd then have the equivalent of a --format output that would just
>    so happen to be in your OS's / C ABI's understood struct
>    format. Similar to "perldoc -f pack".
>
>    The maintenance burden of such an API would be trivial. It would
>    basically be a handful of functions just to setup input/output
>    callbacks.
>
>    Most of the interface would be a rather convoluted construction of
>    command-line parameters (like our own run_command() use), not ideal,
>    but most users are after structured data and eliminating process
>    overhead, and they'd get that.
>
>    We would need to replace fprintf() and the like with some structured
>    formatting mechanism, but that would be useful for non-API users as
>    new commands learning a shiny new --format parameter. You could even
>    use such a --format to have a non-C-API user get the same struct
>    output, perhaps for direct mapping into memory, say if you didn't
>    trust the built-in/command in question to not leak memory yet.
>
>    I don't know if this has been suggested before, just my 0.02. It's
>    also partially why I'm interested in making even-built-ins not leak
>    memory with some of my parallel SANITIZE=leak work.
>
>    Once you've got that you can usually call the cmd_whatever() in a
>    loop without much effort, and once you've got that, well, see above
>    :)

This sounds really cool.
Elijah Newren Oct. 1, 2021, 3:59 a.m. UTC | #21
On Thu, Sep 30, 2021 at 1:06 PM Junio C Hamano <gitster@pobox.com> wrote:
>
> Jeff King <peff@peff.net> writes:
>
> >   - is it possible for the merge code to ever write an object? I kind of
> >     wonder if we'd ever do any cache-able transformations as part of a
> >     content-level merge. I don't think we do now, though.
>
> How is virtual merge base, result of an inner merge that recursively
> merging two merge bases, fed to an outer merge as the ancestor?
> Isn't it written as a tree object by the inner merge as a tree with
> full of blob objects, so the outer merge does not have to treat a
> merge base tree that is virtual any specially from a sole merge
> base?

Yes.  And that's not unique to the inner merges; it also writes new
blob and tree objects for the outer merge, and then passes the
resulting toplevel tree id out as the result.
Elijah Newren Oct. 1, 2021, 4:26 a.m. UTC | #22
On Thu, Sep 30, 2021 at 12:33 AM Jeff King <peff@peff.net> wrote:
>
> On Tue, Sep 28, 2021 at 09:08:00PM -0700, Junio C Hamano wrote:
>
> > Jeff King <peff@peff.net> writes:
> >
> > >   Side note: The pretend_object_file() approach is actually even better,
> > >   because we know the object is fake. So it does not confuse
> > >   write_object_file()'s "do we already have this object" freshening
> > >   check.
> > >
> > >   I suspect it could even be made faster than the tmp_objdir approach.
> > >   From our perspective, these objects really are tempfiles. So we could
> > >   write them as such, not worrying about things like fsyncing them,
> > >   naming them into place, etc. We could just write them out, then mmap
> > >   the results, and put the pointers into cached_objects (currently it
> > >   insists on malloc-ing a copy of the input buffer, but that seems like
> > >   an easy extension to add).
> > >
> > >   In fact, I think you could get away with just _one_ tempfile per
> > >   merge. Open up one tempfile. Write out all of the objects you want to
> > >   "store" into it in sequence, and record the lseek() offsets before and
> > >   after for each object. Then mmap the whole result, and stuff the
> > >   appropriate pointers (based on arithmetic with the offsets) into the
> > >   cached_objects list.
> >
> > Cute.  The remerge diff code path creates a full tree that records
> > the mechanical merge result.  By hooking into the lowest layer of
> > write_object() interface, we'd serialize all objects in such a tree
> > in the order they are computed (bottom up from the leaf level, I'd
> > presume) into a single flat file ;-)
>
> I do still like this approach, but just two possible gotchas I was
> thinking of:
>
>  - This side-steps all of our usual code for getting object data into
>    memory. In general, I'd expect this content to not be too enormous,
>    but it _could_ be if there are many / large blobs in the result. So
>    we may end up with large maps. Probably not a big deal on modern
>    64-bit systems. Maybe an issue on 32-bit systems, just because of
>    virtual address space.
>
>    Likewise, we do support systems with NO_MMAP. They'd work here, but
>    it would probably mean putting all that object data into the heap. I
>    could live with that, given how rare such systems are these days, and
>    that it only matters if you're using --remerge-diff with big blobs.

Um, I'm starting to get uncomfortable with this pretend_object stuff.
Part of the reason that merge-ort isn't truly "in memory" despite
attempting to do exactly that, was because for large enough repos with
enough files modified on both sides, I wasn't comfortable assuming
that all new files from three-way content merges and all new trees fit
into memory.  I'm sure we'd be fine with current-day linux kernel
sized repos.  No big deal.  In fact, most merges probably don't add
more than a few dozen new files.  But for microsoft-sized repos, and
with repos tending to grow over time (more so when the tools
themselves scale nicely, which we've all been working on enabling), I
worry there might be enough new objects within a single merge
(especially given the recursive inner merges) that we might need to
worry about this.

>  - I wonder to what degree --remerge-diff benefits from omitting writes
>    for objects we already have. I.e., if you are writing out a whole
>    tree representing the conflicted state, then you don't want to write
>    all of the trees that aren't interesting. Hopefully the code is
>    already figuring out which paths the merge even touched, and ignoring
>    the rest.

Not only do you want to avoid writing all of the trees that aren't
interesting, you also want to avoid traversing into them in the first
place and avoid doing trivial file merges for each entry underneath.
Sadly, merge-recursive did so anyway, because renames and directory
renames can sometimes throw a big wrench in the desire to avoid
traversing into such directories.  Figuring out how to avoid it was
kind of tricky, but merge-ort definitely handles this when it can
safely do so; see the cover letter for my trivial directory resolution
series[1].

[1] https://lore.kernel.org/git/pull.988.v4.git.1626841444.gitgitgadget@gmail.com/

>    It probably benefits performance-wise from
>    write_object_file() deciding to skip some object writes, as well
>    (e.g., for resolutions which the final tree already took, as they'd
>    be in the merge commit).

Yes, it will also benefit from that, but the much bigger side is
avoiding needlessly recursing into directories unchanged on (at least)
one side.

>    The whole pretend-we-have-this-object thing
>    may want to likewise make sure we don't write out objects that we
>    already have in the real odb.

Right, so I'd have to copy the relevant logic from write_object_file()
-- I think that means instead of write_object_file()'s calls to
freshen_packed_object() and freshen_loose_object() that I instead call
find_pack_entry() and make has_loose_object() in object-file.c not be
static and then call it.  Does that sound right?

Of course, that's assuming we're okay with this pretend_object thing,
which I'm starting to worry about.
Jeff King Oct. 1, 2021, 5:21 a.m. UTC | #23
On Thu, Sep 30, 2021 at 07:31:44PM -0700, Elijah Newren wrote:

> > Some ways it could go wrong:
> >
> >   - is it possible for the merge code to ever write an object? I kind of
> >     wonder if we'd ever do any cache-able transformations as part of a
> >     content-level merge. I don't think we do now, though.
> 
> Yes, of course -- otherwise there would have been no need for the
> tmp_objdir in the first place.  In particular, it needs to write
> three-way-content merges of individual files, and it needs to write
> new tree objects.  (And it needs to do this both for creating the
> virtual merge bases if the merge is recursive, as well as doing it for
> the outer merge.)

Right, sorry, I should have added: ...in addition to the merge-results
we're writing. What I'm getting at is whether there might be _other_
parts of the code that the merge code would invoke. Say, if a
.gitattribute caused us to convert a file's encoding in order to perform
a more semantic merge, and we wanted to store that result somehow (e.g.,
in a similar cache).

I don't think anything like that exists now, but I don't find it outside
the realm of possibility in the future. It's sort of the merge
equivalent of "textconv"; we have external diff and external merge
drivers, but being able to convert contents and feed them to the regular
text diff/merge code simplifies things.

> >   - in step 5, write_object_file() may still be confused by the presence
> >     of the to-be-thrown-away objects in the alternate. This is pretty
> >     unlikely, as it implies that the remerge-diff wrote a blob or tree
> >     that is byte-identical to something that the diff wants to write.
> 
> That's one reason it could be confused.  The textconv filtering in
> particular was creating a new object based on an existing one, and a
> tree, and a ref.  What if there was some other form of caching or
> statistic gathering that didn't write a new object based on an
> existing one, but did add trees and especially refs that referenced
> the existing object?  It's not that the diff wanted to write something
> byte-identical to what the merge wrote, it's just that the diff wants
> to reference the object somehow.

Yes, good point. It can help a little if the alternate-odb added by
tmp_objdir was marked with a flag to say "hey, this is temporary". That
would help write_object_file() decide not to depend on it. But the
problem is so much wider than that that I think it will always be a bit
dangerous; any code could say "can I access this object" and we don't
know if its intent is simply to read it, or to create a new object that
references it.

-Peff
Jeff King Oct. 1, 2021, 5:27 a.m. UTC | #24
On Thu, Sep 30, 2021 at 09:26:37PM -0700, Elijah Newren wrote:

> >  - This side-steps all of our usual code for getting object data into
> >    memory. In general, I'd expect this content to not be too enormous,
> >    but it _could_ be if there are many / large blobs in the result. So
> >    we may end up with large maps. Probably not a big deal on modern
> >    64-bit systems. Maybe an issue on 32-bit systems, just because of
> >    virtual address space.
> >
> >    Likewise, we do support systems with NO_MMAP. They'd work here, but
> >    it would probably mean putting all that object data into the heap. I
> >    could live with that, given how rare such systems are these days, and
> >    that it only matters if you're using --remerge-diff with big blobs.
> 
> Um, I'm starting to get uncomfortable with this pretend_object stuff.
> Part of the reason that merge-ort isn't truly "in memory" despite
> attempting to do exactly that, was because for large enough repos with
> enough files modified on both sides, I wasn't comfortable assuming
> that all new files from three-way content merges and all new trees fit
> into memory.  I'm sure we'd be fine with current-day linux kernel
> sized repos.  No big deal.  In fact, most merges probably don't add
> more than a few dozen new files.  But for microsoft-sized repos, and
> with repos tending to grow over time, more so when the tools
> themselves scale nicely (which we've all been working on enabling),
> makes me worry there might be enough new objects within a single merge
> (especially given the recursive inner merges) that we might need to
> worry about this.

I do think we need to consider that the content might be larger than
will comfortably fit in memory. But the point of using mmap is that we
don't have to care. The OS is taking care of it for us (just like it
would in regular object files).

The question is just whether we're comfortable assuming that mmap
exists if you're working on such a large repository. I'd guess that big
repos are pretty painful without it (and again, I'm not even clear
which systems Git runs on even lack mmap these days). So IMHO this isn't
really a blocker for going in this direction.

> >    The whole pretend-we-have-this-object thing
> >    may want to likewise make sure we don't write out objects that we
> >    already have in the real odb.
> 
> Right, so I'd have to copy the relevant logic from write_object_file()
> -- I think that means instead of write_object_file()'s calls to
> freshen_packed_object() and freshen_loose_object() that I instead call
> find_pack_entry() and make has_loose_object() in object-file.c not be
> static and then call it.  Does that sound right?

Yes. You _could_ call the same freshening functions, but I think since
we know that our objects are throw-away anyway, we do not even need to
perform that freshening operation. I think just has_object_file() would
be sufficient for your needs.

-Peff
Ævar Arnfjörð Bjarmason Oct. 1, 2021, 7:30 a.m. UTC | #25
On Thu, Sep 30 2021, Elijah Newren wrote:

> On Thu, Sep 30, 2021 at 6:31 AM Ævar Arnfjörð Bjarmason
> <avarab@gmail.com> wrote:
>>
>> On Thu, Sep 30 2021, Jeff King wrote:
>>
>> > On Tue, Sep 28, 2021 at 09:08:00PM -0700, Junio C Hamano wrote:
>> >
>> >> Jeff King <peff@peff.net> writes:
>> >>
>> >> >   Side note: The pretend_object_file() approach is actually even better,
>> >> >   because we know the object is fake. So it does not confuse
>> >> >   write_object_file()'s "do we already have this object" freshening
>> >> >   check.
>> >> >
>> >> >   I suspect it could even be made faster than the tmp_objdir approach.
>> >> >   From our perspective, these objects really are tempfiles. So we could
>> >> >   write them as such, not worrying about things like fsyncing them,
>> >> >   naming them into place, etc. We could just write them out, then mmap
>> >> >   the results, and put the pointers into cached_objects (currently it
>> >> >   insists on malloc-ing a copy of the input buffer, but that seems like
>> >> >   an easy extension to add).
>> >> >
>> >> >   In fact, I think you could get away with just _one_ tempfile per
>> >> >   merge. Open up one tempfile. Write out all of the objects you want to
>> >> >   "store" into it in sequence, and record the lseek() offsets before and
>> >> >   after for each object. Then mmap the whole result, and stuff the
>> >> >   appropriate pointers (based on arithmetic with the offsets) into the
>> >> >   cached_objects list.
>> >>
>> >> Cute.  The remerge diff code path creates a full tree that records
>> >> the mechanical merge result.  By hooking into the lowest layer of
>> >> write_object() interface, we'd serialize all objects in such a tree
>> >> in the order they are computed (bottom up from the leaf level, I'd
>> >> presume) into a single flat file ;-)
>> >
>> > I do still like this approach, but just two possible gotchas I was
>> > thinking of:
>> >
>> >  - This side-steps all of our usual code for getting object data into
>> >    memory. In general, I'd expect this content to not be too enormous,
>> >    but it _could_ be if there are many / large blobs in the result. So
>> >    we may end up with large maps. Probably not a big deal on modern
>> >    64-bit systems. Maybe an issue on 32-bit systems, just because of
>> >    virtual address space.
>> >
>> >    Likewise, we do support systems with NO_MMAP. They'd work here, but
>> >    it would probably mean putting all that object data into the heap. I
>> >    could live with that, given how rare such systems are these days, and
>> >    that it only matters if you're using --remerge-diff with big blobs.
>> >
>> >  - I wonder to what degree --remerge-diff benefits from omitting writes
>> >    for objects we already have. I.e., if you are writing out a whole
>> >    tree representing the conflicted state, then you don't want to write
>> >    all of the trees that aren't interesting. Hopefully the code is
>> >    already figuring out which paths the merge even touched, and ignoring
>> >    the rest. It probably benefits performance-wise from
>> >    write_object_file() deciding to skip some object writes, as well
>> >    (e.g., for resolutions which the final tree already took, as they'd
>> >    be in the merge commit). The whole pretend-we-have-this-object thing
>> >    may want to likewise make sure we don't write out objects that we
>> >    already have in the real odb.
>>
>> I haven't benchmarked since my core.checkCollisions RFC patch[1]
>> resulted in the somewhat related loose object cache patch from you, and
>> not with something like the midx, but just a note that on some setups
>> just writing things out is faster than exhaustively checking if we
>> absolutely need to write things out.
>>
>> I also wonder how much if anything writing out the one file v.s. lots of
>> loose objects is worthwhile on systems where we could write out those
>> loose objects on a ramdisk, which is commonly available on e.g. Linux
>> distros these days out of the box. If you care about performance but not
>> about your transitory data using a ramdisk is generally much better than
>> any other potential I/O optimization.
>>
>> Finally, and I don't mean to throw a monkey wrench into this whole
>> discussion, so take this as a random musing: I wonder how much faster
>> this thing could be on its second run if instead of avoiding writing to
>> the store & cleaning up, it just wrote to the store, and then wrote
>
> It'd be _much_ slower.  My first implementation in fact did that; it
> just wrote objects to the store, left them there, and didn't bother to
> do any auto-gcs.  It slowed down quite a bit as it ran.  Adding
> auto-gc's during the run were really slow too. But stepping back,
> gc'ing objects that I already knew were garbage seemed like a waste;
> why not just prune them pre-emptively?  To do so, though, I'd have to
> track all the individual objects I added to make sure I didn't prune
> something else.  Following that idea and a few different attempts
> eventually led me to the discovery of tmp_objdir.
>
> In case it's not clear to you why just writing all the objects to the
> normal store and leaving them there slows things down so much...
>
> Let's say 1 in 10 merges had originally needed some kind of conflict
> resolution (either in the outer merge or in the inner merge for the
> virtual merge bases), meaning that 9 out of 10 merges traversed by
> --remerge-diff don't write any objects.  Now for each merge for which
> --remerge-diff does need to create conflicted blobs and new trees,
> let's say it writes on average 3 blobs and 7 trees.  (I don't know the
> real average numbers, it could well be ~5 total, but ~10 seems a
> realistic first order approximation and it makes the math easy.)
> Then, if we keep all objects we write, then `git log --remerge-diff`
> on a history with 100,000 merge commits, will have added 100,000 loose
> objects by the time it finishes.  That means that all diff and merge
> operations slow down considerably as it runs due to all the extra
> loose objects.

...

>> another object keyed on the git version and any revision parameters
>> etc., and then pretty much just had to do a "git cat-file -p <that-obj>"
>> to present the result to the user :)
>
> So caching the full `git log ...` output, based on a hash of the
> command line flags, and then merely re-showing it later?  And having
> that output be invalidated as soon as any head advances?  Or are you
> thinking of caching the output per-commit based on a hash of the other
> command line flags...potentially slowing non-log operations down with
> the huge number of loose objects?

Yeah I meant caching the full 'git log' output. Anyway, I think what you
said above about number of loose objects clearly makes that a stupid
suggestion on my part.

FWIW I meant that more as a "maybe in the future we can ..." musing than
anything for this series. I.e. in general I've often wanted say the
second run of a 'git log -G<rx>"' to be faster than it is, which could
use such on-the-fly caching.

But if we had such caching then something like --remerge-diff would need
to (from my understanding of what you're saying) need both this
temporary staging area for objects and perhaps to cache the output in
the main (or some cache) store. I.e. they're orthogonal.

>> I suppose that would be throwing a lot more work at an eventual "git gc"
>> than we ever do now, so maybe it's a bit crazy, but I think it might be
>> an interesting direction in general to (ab)use either the primary or
>> some secondary store in the .git dir as a semi-permanent cache of
>> resolved queries from the likes of "git log".
>
> If you do per-commit caching, and the user scrolls through enough
> output (not hard to do, just searching the output for some string
> often is enough), that "eventual" git-gc will be the very next git
> operation.  If you cache the entire output, it'll be invalidated
> pretty quickly.  So I don't see how this works.  Or am I
> misunderstanding something you're suggesting here?

With the caveat above that I think this is all a pretty stupid
suggestion on my part:

Just on the narrow aspect of how "git gc" behaves in that scenario: We'd
keep those objects around for 2 weeks by default, or whatever you have
gc.pruneExpire set to.

So such a setup would certainly cause lots of pathological issues (see
my [1] for one), but I don't think over-eager expiry of loose objects
would be one.

I'm not sure, but are you referring to "invalidated pretty quickly"
because it would go over the gc.auto limit? If so, that's not how it
works in this scenario, see [2]. I.e. it's moderated by the
gc.pruneExpire setting.

1. https://lore.kernel.org/git/87fu6bmr0j.fsf@evledraar.gmail.com/
2. https://lore.kernel.org/git/87inc89j38.fsf@evledraar.gmail.com/
Ævar Arnfjörð Bjarmason Oct. 1, 2021, 7:43 a.m. UTC | #26
On Fri, Oct 01 2021, Jeff King wrote:

> On Thu, Sep 30, 2021 at 09:26:37PM -0700, Elijah Newren wrote:
>
>> >  - This side-steps all of our usual code for getting object data into
>> >    memory. In general, I'd expect this content to not be too enormous,
>> >    but it _could_ be if there are many / large blobs in the result. So
>> >    we may end up with large maps. Probably not a big deal on modern
>> >    64-bit systems. Maybe an issue on 32-bit systems, just because of
>> >    virtual address space.
>> >
>> >    Likewise, we do support systems with NO_MMAP. They'd work here, but
>> >    it would probably mean putting all that object data into the heap. I
>> >    could live with that, given how rare such systems are these days, and
>> >    that it only matters if you're using --remerge-diff with big blobs.
>> 
>> Um, I'm starting to get uncomfortable with this pretend_object stuff.
>> Part of the reason that merge-ort isn't truly "in memory" despite
>> attempting to do exactly that, was because for large enough repos with
>> enough files modified on both sides, I wasn't comfortable assuming
>> that all new files from three-way content merges and all new trees fit
>> into memory.  I'm sure we'd be fine with current-day linux kernel
>> sized repos.  No big deal.  In fact, most merges probably don't add
>> more than a few dozen new files.  But for microsoft-sized repos, and
>> with repos tending to grow over time, more so when the tools
>> themselves scale nicely (which we've all been working on enabling),
>> makes me worry there might be enough new objects within a single merge
>> (especially given the recursive inner merges) that we might need to
>> worry about this.
>
> I do think we need to consider that the content might be larger than
> will comfortably fit in memory. But the point of using mmap is that we
> don't have to care. The OS is taking care of it for us (just like it
> would in regular object files).
>
> The question is just whether we're comfortable assuming that mmap
> exists if you're working on such a large repository. I'd guess that big
> repos are pretty painful without it (and again, I'm not even clear
> which systems Git runs on even lack mmap these days). So IMHO this isn't
> really a blocker for going in this direction.

On the not-a-blocker point: even without mmap() such a user also has the
option of increasing the size of their swap/page file.

And generally I agree that it's fair for us to say that if you've got
such outsized performance needs you're going to need something more than
the lowest common denominator git can be ported to.

I also wouldn't be surprised if in that scenario we'd run faster if we
were using memory (and no mmap) than if we tried to fallback to the FS,
i.e. your suggestion of replacing N loose objects with one file with
index offsets is basically what the OS would be doing for you as it
pages out memory it can't fit in RAM to disk.
Elijah Newren Oct. 1, 2021, 8:03 a.m. UTC | #27
On Fri, Oct 1, 2021 at 12:41 AM Ævar Arnfjörð Bjarmason
<avarab@gmail.com> wrote:
>
> On Thu, Sep 30 2021, Elijah Newren wrote:
>
> > On Thu, Sep 30, 2021 at 6:31 AM Ævar Arnfjörð Bjarmason
> > <avarab@gmail.com> wrote:
> >>
> >> On Thu, Sep 30 2021, Jeff King wrote:
> >>
> >> > On Tue, Sep 28, 2021 at 09:08:00PM -0700, Junio C Hamano wrote:
> >> >
> >> >> Jeff King <peff@peff.net> writes:
> >> >>
> >> >> >   Side note: The pretend_object_file() approach is actually even better,
> >> >> >   because we know the object is fake. So it does not confuse
> >> >> >   write_object_file()'s "do we already have this object" freshening
> >> >> >   check.
> >> >> >
> >> >> >   I suspect it could even be made faster than the tmp_objdir approach.
> >> >> >   From our perspective, these objects really are tempfiles. So we could
> >> >> >   write them as such, not worrying about things like fsyncing them,
> >> >> >   naming them into place, etc. We could just write them out, then mmap
> >> >> >   the results, and put the pointers into cached_objects (currently it
> >> >> >   insists on malloc-ing a copy of the input buffer, but that seems like
> >> >> >   an easy extension to add).
> >> >> >
> >> >> >   In fact, I think you could get away with just _one_ tempfile per
> >> >> >   merge. Open up one tempfile. Write out all of the objects you want to
> >> >> >   "store" into it in sequence, and record the lseek() offsets before and
> >> >> >   after for each object. Then mmap the whole result, and stuff the
> >> >> >   appropriate pointers (based on arithmetic with the offsets) into the
> >> >> >   cached_objects list.
> >> >>
> >> >> Cute.  The remerge diff code path creates a full tree that records
> >> >> the mechanical merge result.  By hooking into the lowest layer of
> >> >> write_object() interface, we'd serialize all objects in such a tree
> >> >> in the order they are computed (bottom up from the leaf level, I'd
> >> >> presume) into a single flat file ;-)
> >> >
> >> > I do still like this approach, but just two possible gotchas I was
> >> > thinking of:
> >> >
> >> >  - This side-steps all of our usual code for getting object data into
> >> >    memory. In general, I'd expect this content to not be too enormous,
> >> >    but it _could_ be if there are many / large blobs in the result. So
> >> >    we may end up with large maps. Probably not a big deal on modern
> >> >    64-bit systems. Maybe an issue on 32-bit systems, just because of
> >> >    virtual address space.
> >> >
> >> >    Likewise, we do support systems with NO_MMAP. They'd work here, but
> >> >    it would probably mean putting all that object data into the heap. I
> >> >    could live with that, given how rare such systems are these days, and
> >> >    that it only matters if you're using --remerge-diff with big blobs.
> >> >
> >> >  - I wonder to what degree --remerge-diff benefits from omitting writes
> >> >    for objects we already have. I.e., if you are writing out a whole
> >> >    tree representing the conflicted state, then you don't want to write
> >> >    all of the trees that aren't interesting. Hopefully the code is
> >> >    already figuring out which paths the merge even touched, and ignoring
> >> >    the rest. It probably benefits performance-wise from
> >> >    write_object_file() deciding to skip some object writes, as well
> >> >    (e.g., for resolutions which the final tree already took, as they'd
> >> >    be in the merge commit). The whole pretend-we-have-this-object thing
> >> >    may want to likewise make sure we don't write out objects that we
> >> >    already have in the real odb.
> >>
> >> I haven't benchmarked since my core.checkCollisions RFC patch[1]
> >> resulted in the somewhat related loose object cache patch from you, and
> >> not with something like the midx, but just a note that on some setups
> >> just writing things out is faster than exhaustively checking if we
> >> absolutely need to write things out.
> >>
> >> I also wonder how much, if anything, writing out the one file vs. lots
> >> of loose objects is worth on systems where we could write out those
> >> loose objects on a ramdisk, which is commonly available on e.g. Linux
> >> distros these days out of the box. If you care about performance but not
> >> about your transitory data using a ramdisk is generally much better than
> >> any other potential I/O optimization.
> >>
> >> Finally, and I don't mean to throw a monkey wrench into this whole
> >> discussion, so take this as a random musing: I wonder how much faster
> >> this thing could be on its second run if instead of avoiding writing to
> >> the store & cleaning up, it just wrote to the store, and then wrote
> >
> > It'd be _much_ slower.  My first implementation in fact did that; it
> > just wrote objects to the store, left them there, and didn't bother to
> > do any auto-gcs.  It slowed down quite a bit as it ran.  Adding
> > auto-gc's during the run was really slow too. But stepping back,
> > gc'ing objects that I already knew were garbage seemed like a waste;
> > why not just prune them pre-emptively?  To do so, though, I'd have to
> > track all the individual objects I added to make sure I didn't prune
> > something else.  Following that idea and a few different attempts
> > eventually led me to the discovery of tmp_objdir.
> >
> > In case it's not clear to you why just writing all the objects to the
> > normal store and leaving them there slows things down so much...
> >
> > Let's say 1 in 10 merges had originally needed some kind of conflict
> > resolution (either in the outer merge or in the inner merge for the
> > virtual merge bases), meaning that 9 out of 10 merges traversed by
> > --remerge-diff don't write any objects.  Now for each merge for which
> > --remerge-diff does need to create conflicted blobs and new trees,
> > let's say it writes on average 3 blobs and 7 trees.  (I don't know the
> > real average numbers, it could well be ~5 total, but ~10 seems a
> > realistic first order approximation and it makes the math easy.)
> > Then, if we keep all objects we write, then `git log --remerge-diff`
> > on a history with 100,000 merge commits, will have added 100,000 loose
> > objects by the time it finishes.  That means that all diff and merge
> > operations slow down considerably as it runs due to all the extra
> > loose objects.
>
> ...
>
> >> another object keyed on the git version and any revision parameters
> >> etc., and then pretty much just had to do a "git cat-file -p <that-obj>"
> >> to present the result to the user :)
> >
> > So caching the full `git log ...` output, based on a hash of the
> > command line flags, and then merely re-showing it later?  And having
> > that output be invalidated as soon as any head advances?  Or are you
> > thinking of caching the output per-commit based on a hash of the other
> > command line flags...potentially slowing non-log operations down with
> > the huge number of loose objects?
>
> Yeah I meant caching the full 'git log' output. Anyway, I think what you
> said above about number of loose objects clearly makes that a stupid
> suggestion on my part.

Most of my ideas for merge-ort were stupid.  I didn't know until I tried
them, and then I just discarded them and only showed off the good
ideas.  :-)

Brainstorming new ways of doing things is a useful exercise, I think.

> FWIW I meant that more as a "maybe in the future we can ..." musing than
> anything for this series. I.e. in general I've often wanted say the
> second run of a 'git log -G<rx>"' to be faster than it is, which could
> use such on-the-fly caching.
>
> But if we had such caching then something like --remerge-diff would
> (from my understanding of what you're saying) need both this
> temporary staging area for objects and perhaps to cache the output in
> the main (or some cache) store. I.e. they're orthogonal.
>
> >> I suppose that would be throwing a lot more work at an eventual "git gc"
> >> than we ever do now, so maybe it's a bit crazy, but I think it might be
> >> an interesting direction in general to (ab)use either the primary or
> >> some secondary store in the .git dir as a semi-permanent cache of
> >> resolved queries from the likes of "git log".
> >
> > If you do per-commit caching, and the user scrolls through enough
> > output (not hard to do, just searching the output for some string
> > often is enough), that "eventual" git-gc will be the very next git
> > operation.  If you cache the entire output, it'll be invalidated
> > pretty quickly.  So I don't see how this works.  Or am I
> > misunderstanding something you're suggesting here?
>
> With the caveat above that I think this is all a pretty stupid
> suggestion on my part:
>
> Just on the narrow aspect of how "git gc" behaves in that scenario: We'd
> keep those objects around for 2 weeks by default, or whatever you have
> gc.pruneExpire set to.
>
> So such a setup would certainly cause lots of pathological issues (see
> my [1] for one), but I don't think over-eager expiry of loose objects
> would be one.
>
> I'm not sure but are you referring to "invalidated pretty quickly"
> because it would go over the gc.auto limit? If so that's now how it
> works in this scenario, see [2]. I.e. it's moderated by the
> gc.pruneExpire setting.

By "invalidated pretty quickly" I meant that if the user ran "git log
--all" and you cached the entire output in a single cache file, then
as soon as any ref is updated, that cached output is invalid because
"git log --all" should now show different output than before.

It seems to be a case of pick your poison: either (a) cache output per
commit so that it can be re-used, and suffer from extreme numbers of
loose objects, OR (b) cache the entire output to avoid the large
number of loose objects, resulting in the cache only being useful if
the user passed the _exact_ same arguments and revisions to git-log --
and also having the cache become stale as soon as any involved ref is
updated.

>
> 1. https://lore.kernel.org/git/87fu6bmr0j.fsf@evledraar.gmail.com/
> 2. https://lore.kernel.org/git/87inc89j38.fsf@evledraar.gmail.com/
Junio C Hamano Oct. 1, 2021, 4:36 p.m. UTC | #28
Elijah Newren <newren@gmail.com> writes:

> On Thu, Sep 30, 2021 at 1:06 PM Junio C Hamano <gitster@pobox.com> wrote:
>>
>> Jeff King <peff@peff.net> writes:
>>
>> >   - is it possible for the merge code to ever write an object? I kind of
>> >     wonder if we'd ever do any cache-able transformations as part of a
>> >     content-level merge. I don't think we do now, though.
>>
>> How is virtual merge base, result of an inner merge that recursively
>> merging two merge bases, fed to an outer merge as the ancestor?
>> Isn't it written as a tree object by the inner merge as a tree with
>> full of blob objects, so the outer merge does not have to treat a
>> merge base tree that is virtual any specially from a sole merge
>> base?
>
> Yes.  And that's not unique to the inner merges; it also writes new
> blob and tree objects for the outer merge, and then passes the
> resulting toplevel tree id out as the result.

I think Peff's question was "other than what is contained in the
merge result, are there objects we create during the merge process
that we may want to keep?"  And my inner merge example was objects
that do not appear in the merge result, but they are discardable,
so it wasn't a good example.

A tentative merge to recreate what the manual resolver would have
seen while showing remerge-diff needs to be recorded somewhere
before the diff gets shown, but after showing the diff, we can lose
those objects unless we need them for some other reason (e.g. they already
existed, another process happened to have created the same blob
while we are running remerge-diff, etc.).  So in that sense, the
objects that represent the result of the outermost merge is also
discardable.

The question is if there are other things similar to the textconv
cache, where we need to create during the operation and may keep the
result for later use.  The closest thing I can think of is the
rename cache we've talked about from time to time without actually
adding one ;-).

I guess using cached textconv result, without putting newly
discovered textconv result back to the cache, would be the best we
can do?

The blobs (hence their textconv results) involved in showing a merge
M with --remerge-diff are the ones in the trees of M^@ (all parents)
and $(git merge-base $(git rev-parse M^@)), and their cached textconv
results may have been created when running "git show" on these
commits, or when letting "git log -p" pass over them, so it is not as
if we are disabling the cache entirely; allowing read-only access may
still have value.

Thanks.
diff mbox series

Patch

diff --git a/tmp-objdir.c b/tmp-objdir.c
index b8d880e3626..9f75a75d1c0 100644
--- a/tmp-objdir.c
+++ b/tmp-objdir.c
@@ -288,7 +288,36 @@  const char **tmp_objdir_env(const struct tmp_objdir *t)
 	return t->env.v;
 }
 
+const char *tmp_objdir_path(struct tmp_objdir *t)
+{
+	return t->path.buf;
+}
+
 void tmp_objdir_add_as_alternate(const struct tmp_objdir *t)
 {
 	add_to_alternates_memory(t->path.buf);
 }
+
+void tmp_objdir_make_primary(struct repository *r, const struct tmp_objdir *t)
+{
+	struct object_directory *od;
+	od = xcalloc(1, sizeof(*od));
+
+	od->path = xstrdup(t->path.buf);
+	od->next = r->objects->odb;
+	r->objects->odb = od;
+}
+
+void tmp_objdir_remove_as_primary(struct repository *r,
+				  const struct tmp_objdir *t)
+{
+	struct object_directory *od;
+
+	od = r->objects->odb;
+	if (strcmp(t->path.buf, od->path))
+		BUG("expected %s as primary object store; found %s",
+		    t->path.buf, od->path);
+	r->objects->odb = od->next;
+	free(od->path);
+	free(od);
+}
diff --git a/tmp-objdir.h b/tmp-objdir.h
index b1e45b4c75d..6067da24e8c 100644
--- a/tmp-objdir.h
+++ b/tmp-objdir.h
@@ -19,6 +19,7 @@ 
  *
  */
 
+struct repository;
 struct tmp_objdir;
 
 /*
@@ -33,6 +34,11 @@  struct tmp_objdir *tmp_objdir_create(void);
  */
 const char **tmp_objdir_env(const struct tmp_objdir *);
 
+/*
+ * Return the path used for the temporary object directory.
+ */
+const char *tmp_objdir_path(struct tmp_objdir *t);
+
 /*
  * Finalize a temporary object directory by migrating its objects into the main
  * object database, removing the temporary directory, and freeing any
@@ -51,4 +57,14 @@  int tmp_objdir_destroy(struct tmp_objdir *);
  */
 void tmp_objdir_add_as_alternate(const struct tmp_objdir *);
 
+/*
+ * Add the temporary object directory as the *primary* object store in the
+ * current process, turning the previous primary object store into an
+ * alternate.
+ */
+void tmp_objdir_make_primary(struct repository *r,
+			     const struct tmp_objdir *t);
+void tmp_objdir_remove_as_primary(struct repository *r,
+				  const struct tmp_objdir *t);
+
 #endif /* TMP_OBJDIR_H */
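
For context, a hedged sketch of how a caller such as `git log --remerge-diff`
might use this API; the surrounding code is hypothetical, and only the
tmp_objdir_* functions come from this patch:

```c
/*
 * Hypothetical caller; error handling elided.  Only the tmp_objdir_*
 * calls are from this patch; remerge_and_diff() is an invented name.
 */
static void show_remerge_diff(struct repository *r, struct commit *merge)
{
	struct tmp_objdir *t = tmp_objdir_create();

	/*
	 * All object writes from here on (the intermediate blobs and
	 * trees of the recreated merge) land in the temporary object
	 * directory, while the real object store stays readable as an
	 * alternate.
	 */
	tmp_objdir_make_primary(r, t);

	remerge_and_diff(merge);

	/*
	 * Restore the original primary object store, then discard the
	 * temporary objects -- they are known garbage by construction,
	 * so no gc is ever needed for them.
	 */
	tmp_objdir_remove_as_primary(r, t);
	tmp_objdir_destroy(t);
}
```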