diff mbox series

[v5,4/4] submodule: record superproject gitdir during 'update'

Message ID 20211104234942.3473650-5-emilyshaffer@google.com (mailing list archive)
State New, archived
Headers show
Series cache parent project's gitdir in submodules | expand

Commit Message

Emily Shaffer Nov. 4, 2021, 11:49 p.m. UTC
A recorded hint path to the superproject's gitdir might be added during
'git submodule add', but in some cases - like submodules which were
created before 'git submodule add' learned to record that info - it might
be useful to update the hint. Let's do it during 'git submodule
update', when we already have a handle to the superproject while calling
operations on the submodules.

Signed-off-by: Emily Shaffer <emilyshaffer@google.com>
---
 git-submodule.sh            | 14 ++++++++++++++
 t/t7406-submodule-update.sh | 12 ++++++++++++
 2 files changed, 26 insertions(+)

Comments

Junio C Hamano Nov. 5, 2021, 4:49 a.m. UTC | #1
Emily Shaffer <emilyshaffer@google.com> writes:

> A recorded hint path to the superproject's gitdir might be added during
> 'git submodule add', but in some cases - like submodules which were
> created before 'git submodule add' learned to record that info - it might
> be useful to update the hint. Let's do it during 'git submodule
> update', when we already have a handle to the superproject while calling
> operations on the submodules.

We are hearing repeated mention of "cache" and "hint".  Do we ever
invalidate it, or if we have such a record, do we blindly trust it
and use it without verifying if it is still fresh?

Also, this step and the previous step both say we record gitdir on
their title, but we instead record common dir.  Whichever is the
right choice to record, let's be consistent.

Thanks.
Ævar Arnfjörð Bjarmason Nov. 5, 2021, 8:43 a.m. UTC | #2
On Thu, Nov 04 2021, Junio C Hamano wrote:

> Emily Shaffer <emilyshaffer@google.com> writes:
>
>> A recorded hint path to the superproject's gitdir might be added during
>> 'git submodule add', but in some cases - like submodules which were
>> created before 'git submodule add' learned to record that info - it might
>> be useful to update the hint. Let's do it during 'git submodule
>> update', when we already have a handle to the superproject while calling
>> operations on the submodules.
>
> We are hearing repeated mention of "cache" and "hint".  Do we ever
> invalidate it, or if we have such a record, do we blindly trust it
> and use it without verifying if it is still fresh?
>
> Also, this step and the previous step both say we record gitdir on
> their title, but we instead record common dir.  Whichever is the
> right choice to record, let's be consistent.

I had similar (AFAICT still unaddressed) feedback on the v2[1]. I'd lost
track of this series, and see one reason is that the In-Reply-Chain was
broken between v3..v4.

I.e. it seems to me that this whole thing started as a way to avoid
shellscript overhead by calling git-rev-parse from git-submodule.sh, but
now that the relevant bits are moved to C we could just call some
slightly adjusted code in setup.c.

Maybe I'm entirely wrong, but I think if I am that this series has a
commit message gap between the "hint" and "cache" and this really *does*
need to be written at clone/init/update time in some way that I've
missed.

Otherwise I still don't really get it, sorry. I.e. maybe the relevant
code in setup.c really does need caching, or maybe it doesn't anymore,
and this cache-via-config is a hold-over from when figuring out the
parent repo implied needing to shell out somewhere for every operation.

1. https://lore.kernel.org/git/87r1cnfkqx.fsf@evledraar.gmail.com/
Ævar Arnfjörð Bjarmason Nov. 5, 2021, 8:51 a.m. UTC | #3
On Thu, Nov 04 2021, Emily Shaffer wrote:

> A recorded hint path to the superproject's gitdir might be added during
> 'git submodule add', but in some cases - like submodules which were
> created before 'git submodule add' learned to record that info - it might
> be useful to update the hint. Let's do it during 'git submodule
> update', when we already have a handle to the superproject while calling
> operations on the submodules.
>
> Signed-off-by: Emily Shaffer <emilyshaffer@google.com>
> ---
>  git-submodule.sh            | 14 ++++++++++++++
>  t/t7406-submodule-update.sh | 12 ++++++++++++
>  2 files changed, 26 insertions(+)
>
> diff --git a/git-submodule.sh b/git-submodule.sh
> index 652861aa66..873d64eb99 100755
> --- a/git-submodule.sh
> +++ b/git-submodule.sh
> @@ -449,6 +449,20 @@ cmd_update()
>  			;;
>  		esac
>  
> +		# Cache a pointer to the superproject's common dir. This may have
> +		# changed, unless it's a fresh clone. Writes it to worktree
> +		# if applicable, otherwise to local.
> +		if test -z "$just_cloned"
> +		then
> +			sm_gitdir="$(git -C "$sm_path" rev-parse --absolute-git-dir)"
> +			relative_gitdir="$(git rev-parse --path-format=relative \
> +							 --prefix "${sm_gitdir}" \
> +							 --git-common-dir)"
> +
> +			git -C "$sm_path" config --worktree \
> +				submodule.superprojectgitdir "$relative_gitdir"
> +		fi
> +

Aside from the "is this really needed anymore?" comment on this caching
strategy in general I had in [1] does this really need to be adding new
shell code to git-submodule.sh given that we're actively trying to get
rid of it entirely and move it over to C.

I.e. here we've just called "git submodule--helper
run-update-procedure", and we pass it "$sm_path" (but not "$recursive",
but could easily do so).

It needs to clone this new submodule, so presumably it already has a
"$sm_gitdir" equivalent, and we can turn that into the same relative
path, no?

Can't we call this with a git_config_set*() somewhere in that code?

*investigates a bit*

Okey, so I see that [2] is part of a series that Atharva Raykar had a
version of (including this new git-submodule.sh code above) [3] rebased
on top of an older version of this topic. I.e. this bit is that part of that patch:

+       git_config_set_in_file(config_path, "submodule.superprojectGitdir",
+                              relative_path(get_git_dir(),
+                                            update_data->sm_path, &sb));

I also vaguely recall that Junio ejected his topic to make room for this
topic first.

Maybe I've missed some update on this but is his [2][3] broken in
combination with your topic here? And re[1] is whatever "caching" is
being done here still beneficial once this is all moved to C?

In your [4] there seems to be an agreement to do it the other way
around, but as noted in the mail linked from [1] maybe the caching isn't
needed anymore then?

1. https://lore.kernel.org/git/211105.861r3vhtot.gmgdl@evledraar.gmail.com/
2. https://lore.kernel.org/git/20211013051805.45662-8-raykar.ath@gmail.com/
3. https://lore.kernel.org/git/20211013051805.45662-1-raykar.ath@gmail.com/
4. https://lore.kernel.org/git/YWiXL+plA7GHfuVv@google.com/
Emily Shaffer Nov. 8, 2021, 11:21 p.m. UTC | #4
On Fri, Nov 05, 2021 at 09:43:56AM +0100, Ævar Arnfjörð Bjarmason wrote:
> 
> 
> On Thu, Nov 04 2021, Junio C Hamano wrote:
> 
> > Emily Shaffer <emilyshaffer@google.com> writes:
> >
> >> A recorded hint path to the superproject's gitdir might be added during
> >> 'git submodule add', but in some cases - like submodules which were
> >> created before 'git submodule add' learned to record that info - it might
> >> be useful to update the hint. Let's do it during 'git submodule
> >> update', when we already have a handle to the superproject while calling
> >> operations on the submodules.
> >
> > We are hearing repeated mention of "cache" and "hint".  Do we ever
> > invalidate it, or if we have such a record, do we blindly trust it
> > and use it without verifying if it is still fresh?
> >
> > Also, this step and the previous step both say we record gitdir on
> > their title, but we instead record common dir.  Whichever is the
> > right choice to record, let's be consistent.
> 
> I had similar (AFAICT still unaddressed) feedback on the v2[1]. I'd lost
> track of this series, and see one reason is that the In-Reply-Chain was
> broken between v3..v4.
> 
> I.e. it seems to me that this whole thing started as a way to avoid
> shellscript overhead by calling git-rev-parse from git-submodule.sh, but
> now that the relevant bits are moved to C we could just call some
> slightly adjusted code in setup.c.

No, that is not the case. It is the case that `git -C .. rev-parse
--git-dir` is *very* expensive in the case where `../` is not, in fact,
a gitdir; when I attempted another series which relied on finding the
parent superproject's gitdir in this way, our testsuite took something
like 5x longer to run than before. In other words, the expensive part is
not the shelling out overhead - the expensive part is searching up the
entire filesystem directory structure in the worst-case ("we are not a
submodule") scenario. This is still needed, even with 'git-submodule.sh'
moving to C.

> 
> Maybe I'm entirely wrong, but I think if I am that this series has a
> commit message gap between the "hint" and "cache" and this really *does*
> need to be written at clone/init/update time in some way that I've
> missed.
> 
> Otherwise I still don't really get it, sorry. I.e. maybe the relevant
> code in setup.c really does need caching, or maybe it doesn't anymore,
> and this cache-via-config is a hold-over from when figuring out the
> parent repo implied needing to shell out somewhere for every operation.
> 
> 1. https://lore.kernel.org/git/87r1cnfkqx.fsf@evledraar.gmail.com/
Emily Shaffer Nov. 8, 2021, 11:22 p.m. UTC | #5
On Fri, Nov 05, 2021 at 09:51:12AM +0100, Ævar Arnfjörð Bjarmason wrote:
> 
> 
> On Thu, Nov 04 2021, Emily Shaffer wrote:
> 
> > A recorded hint path to the superproject's gitdir might be added during
> > 'git submodule add', but in some cases - like submodules which were
> > created before 'git submodule add' learned to record that info - it might
> > be useful to update the hint. Let's do it during 'git submodule
> > update', when we already have a handle to the superproject while calling
> > operations on the submodules.
> >
> > Signed-off-by: Emily Shaffer <emilyshaffer@google.com>
> > ---
> >  git-submodule.sh            | 14 ++++++++++++++
> >  t/t7406-submodule-update.sh | 12 ++++++++++++
> >  2 files changed, 26 insertions(+)
> >
> > diff --git a/git-submodule.sh b/git-submodule.sh
> > index 652861aa66..873d64eb99 100755
> > --- a/git-submodule.sh
> > +++ b/git-submodule.sh
> > @@ -449,6 +449,20 @@ cmd_update()
> >  			;;
> >  		esac
> >  
> > +		# Cache a pointer to the superproject's common dir. This may have
> > +		# changed, unless it's a fresh clone. Writes it to worktree
> > +		# if applicable, otherwise to local.
> > +		if test -z "$just_cloned"
> > +		then
> > +			sm_gitdir="$(git -C "$sm_path" rev-parse --absolute-git-dir)"
> > +			relative_gitdir="$(git rev-parse --path-format=relative \
> > +							 --prefix "${sm_gitdir}" \
> > +							 --git-common-dir)"
> > +
> > +			git -C "$sm_path" config --worktree \
> > +				submodule.superprojectgitdir "$relative_gitdir"
> > +		fi
> > +
> 
> Aside from the "is this really needed anymore?" comment on this caching
> strategy in general I had in [1] does this really need to be adding new
> shell code to git-submodule.sh given that we're actively trying to get
> rid of it entirely and move it over to C.
> 
> I.e. here we've just called "git submodule--helper
> run-update-procedure", and we pass it "$sm_path" (but not "$recursive",
> but could easily do so).
> 
> It needs to clone this new submodule, so presumably it already has a
> "$sm_gitdir" equivalent, and we can turn that into the same relative
> path, no?
> 
> Can't we call this with a git_config_set*() somewhere in that code?
> 
> *investigates a bit*
> 
> Okey, so I see that [2] is part of a series that Atharva Raykar had a
> version of (including this new git-submodule.sh code above) [3] rebased
> on top of an older version of this topic. I.e. this bit is that part of that patch:
> 
> +       git_config_set_in_file(config_path, "submodule.superprojectGitdir",
> +                              relative_path(get_git_dir(),
> +                                            update_data->sm_path, &sb));
> 
> I also vaguely recall that Junio ejected his topic to make room for this
> topic first.
> 
> Maybe I've missed some update on this but is his [2][3] broken in
> combination with your topic here? And re[1] is whatever "caching" is
> being done here still beneficial once this is all moved to C?
> 
> In your [4] there seems to be an agreement to do it the other way
> around, but as noted in the mail linked from [1] maybe the caching isn't
> needed anymore then?

I answered vs. your other mail; yes, it's still needed, and last I
checked Atharva's series had been kicked out to make room for this one.

> 
> 1. https://lore.kernel.org/git/211105.861r3vhtot.gmgdl@evledraar.gmail.com/
> 2. https://lore.kernel.org/git/20211013051805.45662-8-raykar.ath@gmail.com/
> 3. https://lore.kernel.org/git/20211013051805.45662-1-raykar.ath@gmail.com/
> 4. https://lore.kernel.org/git/YWiXL+plA7GHfuVv@google.com/
Ævar Arnfjörð Bjarmason Nov. 9, 2021, 12:42 a.m. UTC | #6
On Mon, Nov 08 2021, Emily Shaffer wrote:

> On Fri, Nov 05, 2021 at 09:43:56AM +0100, Ævar Arnfjörð Bjarmason wrote:
>> 
>> 
>> On Thu, Nov 04 2021, Junio C Hamano wrote:
>> 
>> > Emily Shaffer <emilyshaffer@google.com> writes:
>> >
>> >> A recorded hint path to the superproject's gitdir might be added during
>> >> 'git submodule add', but in some cases - like submodules which were
>> >> created before 'git submodule add' learned to record that info - it might
>> >> be useful to update the hint. Let's do it during 'git submodule
>> >> update', when we already have a handle to the superproject while calling
>> >> operations on the submodules.
>> >
>> > We are hearing repeated mention of "cache" and "hint".  Do we ever
>> > invalidate it, or if we have such a record, do we blindly trust it
>> > and use it without verifying if it is still fresh?
>> >
>> > Also, this step and the previous step both say we record gitdir on
>> > their title, but we instead record common dir.  Whichever is the
>> > right choice to record, let's be consistent.
>> 
>> I had similar (AFAICT still unaddressed) feedback on the v2[1]. I'd lost
>> track of this series, and see one reason is that the In-Reply-Chain was
>> broken between v3..v4.
>> 
>> I.e. it seems to me that this whole thing started as a way to avoid
>> shellscript overhead by calling git-rev-parse from git-submodule.sh, but
>> now that the relevant bits are moved to C we could just call some
>> slightly adjusted code in setup.c.
>
> No, that is not the case. It is the case that `git -C .. rev-parse
> --git-dir` is *very* expensive in the case where `../` is not, in fact,
> a gitdir; when I attempted another series which relied on finding the
> parent superproject's gitdir in this way, our testsuite took something
> like 5x longer to run than before. In other words, the expensive part is
> not the shelling out overhead - the expensive part is searching up the
> entire filesystem directory structure in the worst-case ("we are not a
> submodule") scenario. This is still needed, even with 'git-submodule.sh'
> moving to C.

Do you have that test code somewhere? I tried to reproduce this &
can't. I run my tests in /home/avar/*, and just created this:

    $ find /tmp/some/ -name '.git' -type d
    /tmp/some/dir/.git
    /tmp/some/dir/a/b/c/d/e/f/g/i/j/k/.git

I.e. a deeply nested structure in /tmp, if you ask for the git-dir in
/tmp/some/**/k you'll need to search several levels up.

Then with the patch below we'll instrument almost all git commands to
optionally do that search, i.e. anything that does setup_git_directory()
at all:

    $ GIT_TEST_SETUP=true GIT_TEST_SETUP_PRINT=true ~/g/git/git rev-parse HEAD
    warning: from '/tmp/some/dir/a/b/c/d/e/f/g/i/j' found '/tmp/some/dir' ('/tmp/some/dir/.git')
    <some hash here>

And as a quick test to run a few tests I tried:

    rm -rf test-results/; GIT_TEST_SETUP=true GIT_TEST_SETUP_PRINT=true prove -j8 t741*submod*.sh :: -V

Which runs quickly enough for a tight test loop, and does that work >600 times:
    
    $ cat test-results/*.out|grep -c warning.*from.*found
    662

I can't get that to show me any meaningful difference, just to pick on
one test (since it was easier to run repeatedly):
    
    $ hyperfine -L v true,false "GIT_TEST_SETUP={v} ./t7416-submodule-dash-url.sh --root=/dev/shm/git"
    Benchmark #1: GIT_TEST_SETUP=true ./t7416-submodule-dash-url.sh --root=/dev/shm/git
      Time (mean ± σ):     527.5 ms ±   7.2 ms    [User: 431.6 ms, System: 125.9 ms]
      Range (min … max):   522.6 ms … 542.5 ms    10 runs
     
    Benchmark #2: GIT_TEST_SETUP=false ./t7416-submodule-dash-url.sh --root=/dev/shm/git
      Time (mean ± σ):     526.7 ms ±  10.8 ms    [User: 421.1 ms, System: 131.6 ms]
      Range (min … max):   518.2 ms … 546.8 ms    10 runs
     
    Summary
      'GIT_TEST_SETUP=false ./t7416-submodule-dash-url.sh --root=/dev/shm/git' ran
        1.00 ± 0.02 times faster than 'GIT_TEST_SETUP=true ./t7416-submodule-dash-url.sh --root=/dev/shm/git'

I.e. it's all fuzzy and within the error margins.

Now if we do e.g.:

    GIT_TEST_SETUP=false strace -f -c -U calls,name,time -S calls ./t7416-submodule-dash-url.sh 2>&1 >/dev/null | grep -A9000 calls >a
    GIT_TEST_SETUP=true strace -f -c -U calls,name,time -S calls ./t7416-submodule-dash-url.sh 2>&1 >/dev/null | grep -A9000 calls >b

We'll see the syscall difference, in summary:

    -   110086 total            100.00
    +   114765 total            100.00

And some of the real big differences are:
    
    $ diff -u <(head -n 12 a) <(head -n 12 b)
    --- /dev/fd/63  2021-11-09 01:57:16.023991556 +0100
    +++ /dev/fd/62  2021-11-09 01:57:16.019991593 +0100
    @@ -1,12 +1,12 @@
         calls syscall          % time
     --------- ---------------- ------
    -    11504 openat             2.15
    -    11496 close              2.07
    -    10672 read               4.15
    -    10465 rt_sigaction       0.32
    -     9913 lstat              1.50
    -     9456 mmap               0.67
    -     7545 stat               0.81
    -     6349 fstat              0.62
    -     3896 access             0.61
    -     3490 mprotect           0.26
    +    11783 lstat              1.80
    +    11600 close              1.89
    +    11599 openat             2.19
    +    10887 read               4.36
    +    10465 rt_sigaction       0.33
    +     9742 stat               1.07
    +     9455 mmap               0.75
    +     6346 fstat              0.57
    +     4113 access             0.65
    +     3490 mprotect           0.28

But syscalls are fast, so it doesn't show up in real results.

Now, of course a real implementation could be less stupid, e.g. even if
we think we need *a cache* if these are the performance numbers why do
we need to risk the cache being incorrect v.s. say just writing "I am a
submodule" somewhere in the config (if we don't have that).

Then we'll only do that work for submodules, so not all git invocations
will pay the cost (and it this point we'll usually have read config
anyway).

But I really just don't think it's that expensive at all. I can see how
it would be for actually shelling out, but we don't need to do that.

This could also just be that I'm running this on a really fast FS, which
is true. So I went and tested on an AIX machine I have access to.

I/O on AIX is slow, *really slow*, so slow that if you "rm -rfv"
something you'll have time to read individual lines scrolling by.

That ~500ms t7416-submodule-dash-url.sh test runs in around 50s on that
AIX box (power-aix.osuosl.org), most of which is I/O overhead, I created
the same /tmp/ directory structure and tried with
GIT_TEST_SETUP=[false|true] and it's ~55s with/without the env variable,
with no clear winner.

I don't have access to hyperfine on that box, or the patience to wait
for AIX I/O to wait for meaningful results, but to a first approximation
that seems to indicate that it doesn't really matter there either.

diff --git a/setup.c b/setup.c
index 347d7181ae9..8453d397676 100644
--- a/setup.c
+++ b/setup.c
@@ -1209,6 +1209,8 @@ const char *setup_git_directory_gently(int *nongit_ok)
        struct strbuf dir = STRBUF_INIT, gitdir = STRBUF_INIT;
        const char *prefix = NULL;
        struct repository_format repo_fmt = REPOSITORY_FORMAT_INIT;
+       const char *str = "/tmp/some/dir/a/b/c/d/e/f/g/i/j";
+       struct strbuf a = STRBUF_INIT, b = STRBUF_INIT;
 
        /*
         * We may have read an incomplete configuration before
@@ -1231,6 +1233,14 @@ const char *setup_git_directory_gently(int *nongit_ok)
                die_errno(_("Unable to read current working directory"));
        strbuf_addbuf(&dir, &cwd);
 
+       if (git_env_bool("GIT_TEST_SETUP", 0)) {
+               strbuf_addstr(&a, str);
+               setup_git_directory_gently_1(&a, &b, 0);
+
+               if (strcmp(a.buf, str) && git_env_bool("GIT_TEST_SETUP_PRINT", 0))
+                       warning("from '%s' found '%s' ('%s/%s')", str, a.buf, a.buf, b.buf);
+       }
+
        switch (setup_git_directory_gently_1(&dir, &gitdir, 1)) {
        case GIT_DIR_EXPLICIT:
                prefix = setup_explicit_git_dir(gitdir.buf, &cwd, &repo_fmt, nongit_ok);
Ævar Arnfjörð Bjarmason Nov. 9, 2021, 1:12 a.m. UTC | #7
On Mon, Nov 08 2021, Emily Shaffer wrote:

> On Fri, Nov 05, 2021 at 09:51:12AM +0100, Ævar Arnfjörð Bjarmason wrote:
>> 
>> 
>> On Thu, Nov 04 2021, Emily Shaffer wrote:
>> 
>> > A recorded hint path to the superproject's gitdir might be added during
>> > 'git submodule add', but in some cases - like submodules which were
>> > created before 'git submodule add' learned to record that info - it might
>> > be useful to update the hint. Let's do it during 'git submodule
>> > update', when we already have a handle to the superproject while calling
>> > operations on the submodules.
>> >
>> > Signed-off-by: Emily Shaffer <emilyshaffer@google.com>
>> > ---
>> >  git-submodule.sh            | 14 ++++++++++++++
>> >  t/t7406-submodule-update.sh | 12 ++++++++++++
>> >  2 files changed, 26 insertions(+)
>> >
>> > diff --git a/git-submodule.sh b/git-submodule.sh
>> > index 652861aa66..873d64eb99 100755
>> > --- a/git-submodule.sh
>> > +++ b/git-submodule.sh
>> > @@ -449,6 +449,20 @@ cmd_update()
>> >  			;;
>> >  		esac
>> >  
>> > +		# Cache a pointer to the superproject's common dir. This may have
>> > +		# changed, unless it's a fresh clone. Writes it to worktree
>> > +		# if applicable, otherwise to local.
>> > +		if test -z "$just_cloned"
>> > +		then
>> > +			sm_gitdir="$(git -C "$sm_path" rev-parse --absolute-git-dir)"
>> > +			relative_gitdir="$(git rev-parse --path-format=relative \
>> > +							 --prefix "${sm_gitdir}" \
>> > +							 --git-common-dir)"
>> > +
>> > +			git -C "$sm_path" config --worktree \
>> > +				submodule.superprojectgitdir "$relative_gitdir"
>> > +		fi
>> > +
>> 
>> Aside from the "is this really needed anymore?" comment on this caching
>> strategy in general I had in [1] does this really need to be adding new
>> shell code to git-submodule.sh given that we're actively trying to get
>> rid of it entirely and move it over to C.
>> 
>> I.e. here we've just called "git submodule--helper
>> run-update-procedure", and we pass it "$sm_path" (but not "$recursive",
>> but could easily do so).
>> 
>> It needs to clone this new submodule, so presumably it already has a
>> "$sm_gitdir" equivalent, and we can turn that into the same relative
>> path, no?
>> 
>> Can't we call this with a git_config_set*() somewhere in that code?
>> 
>> *investigates a bit*
>> 
>> Okey, so I see that [2] is part of a series that Atharva Raykar had a
>> version of (including this new git-submodule.sh code above) [3] rebased
>> on top of an older version of this topic. I.e. this bit is that part of that patch:
>> 
>> +       git_config_set_in_file(config_path, "submodule.superprojectGitdir",
>> +                              relative_path(get_git_dir(),
>> +                                            update_data->sm_path, &sb));
>> 
>> I also vaguely recall that Junio ejected his topic to make room for this
>> topic first.
>> 
>> Maybe I've missed some update on this but is his [2][3] broken in
>> combination with your topic here? And re[1] is whatever "caching" is
>> being done here still beneficial once this is all moved to C?
>> 
>> In your [4] there seems to be an agreement to do it the other way
>> around, but as noted in the mail linked from [1] maybe the caching isn't
>> needed anymore then?
>
> I answered vs. your other mail; yes, it's still needed, [...]

I replied just now (in [1], but it's not on lore yet, maybe vger ate my
mail again).

tl;dr: Ran some quick performance numbers, and can't reproduce any
slowdown at all when instrumenting the test suite to run another
setup_git_directory() that'll do a nested lookup on pretty much every
git command, and running the test suite (you mentioned a ~5x overall
slowdown).

Anyway, I think the two replies you've got here only partially address
what I pointed out, in particular in [2]:
    
    But I'm a bit iffy on a series that's structured in a way as to not
    start by asserting that we should have given semantics without the
    cache, and then adds the cache later as an optional optimization.

I.e. even if it was 10x as slow I think it should still be structured in
such a way as to at least have some GIT_TEST_* knob to turn it off in
favor of a slow path.

E.g. we've got commit-graph paths that are probably 100x or 1000x faster
than the slow path, but having the cache-less ones means we can validate
their correctness.
    
> and last I checked Atharva's series had been kicked out to make room
> for this one.

Indeed, but as a comment on this proposed series that doesn't really
address this in my above-quoted ([3])

    I.e. here we've just called "git submodule--helper
    run-update-procedure", and we pass it "$sm_path" (but not "$recursive",
    but could easily do so).

I.e. my understanding is that Junio ejected Atharva's because this
seemed like the smaller change, but perhaps it wasn't realized that we'd
be adding shellscript only to remove it again?

But in any case, even if we're patching git-submodule.sh not having
Atharva's go first in its entirety doesn't answer why we can't extract
the bits he/you came up with in C for this series.

I.e. that git-submodule.sh wouldn't need that shelling out (since it
just called a helper, that helper could just return this data too),
that's just a few lines above the hunk in this series.

I.e. if some remaining performance issue I couldn't reproduce in [1] is
due to the shellscript version of this v.s. calling a C function in
libgit isn't it better to focus on closing that gap than having the
caching mechanism?

(Per the above & linked mails I'm still not 100% sure this even *is* a
caching mechanism, or if we store primary data in it, but I'm just
trying to go by your commit message descriptions).

1. https://lore.kernel.org/git/211109.86v912dtfw.gmgdl@evledraar.gmail.com/
2. https://lore.kernel.org/git/87r1cnfkqx.fsf@evledraar.gmail.com/
3. https://lore.kernel.org/git/211105.86wnlngebr.gmgdl@evledraar.gmail.com/

>> 
>> 1. https://lore.kernel.org/git/211105.861r3vhtot.gmgdl@evledraar.gmail.com/
>> 2. https://lore.kernel.org/git/20211013051805.45662-8-raykar.ath@gmail.com/
>> 3. https://lore.kernel.org/git/20211013051805.45662-1-raykar.ath@gmail.com/
>> 4. https://lore.kernel.org/git/YWiXL+plA7GHfuVv@google.com/
Emily Shaffer Nov. 9, 2021, 8:36 p.m. UTC | #8
On Tue, Nov 09, 2021 at 01:42:03AM +0100, Ævar Arnfjörð Bjarmason wrote:
> 
> 
> On Mon, Nov 08 2021, Emily Shaffer wrote:
> 
> > On Fri, Nov 05, 2021 at 09:43:56AM +0100, Ævar Arnfjörð Bjarmason wrote:
> >> 
> >> 
> >> On Thu, Nov 04 2021, Junio C Hamano wrote:
> >> 
> >> > Emily Shaffer <emilyshaffer@google.com> writes:
> >> >
> >> >> A recorded hint path to the superproject's gitdir might be added during
> >> >> 'git submodule add', but in some cases - like submodules which were
> >> >> created before 'git submodule add' learned to record that info - it might
> >> >> be useful to update the hint. Let's do it during 'git submodule
> >> >> update', when we already have a handle to the superproject while calling
> >> >> operations on the submodules.
> >> >
> >> > We are hearing repeated mention of "cache" and "hint".  Do we ever
> >> > invalidate it, or if we have such a record, do we blindly trust it
> >> > and use it without verifying if it is still fresh?
> >> >
> >> > Also, this step and the previous step both say we record gitdir on
> >> > their title, but we instead record common dir.  Whichever is the
> >> > right choice to record, let's be consistent.
> >> 
> >> I had similar (AFAICT still unaddressed) feedback on the v2[1]. I'd lost
> >> track of this series, and see one reason is that the In-Reply-Chain was
> >> broken between v3..v4.
> >> 
> >> I.e. it seems to me that this whole thing started as a way to avoid
> >> shellscript overhead by calling git-rev-parse from git-submodule.sh, but
> >> now that the relevant bits are moved to C we could just call some
> >> slightly adjusted code in setup.c.
> >
> > No, that is not the case. It is the case that `git -C .. rev-parse
> > --git-dir` is *very* expensive in the case where `../` is not, in fact,
> > a gitdir; when I attempted another series which relied on finding the
> > parent superproject's gitdir in this way, our testsuite took something
> > like 5x longer to run than before. In other words, the expensive part is
> > not the shelling out overhead - the expensive part is searching up the
> > entire filesystem directory structure in the worst-case ("we are not a
> > submodule") scenario. This is still needed, even with 'git-submodule.sh'
> > moving to C.
> 
> Do you have that test code somewhere?

I messed around with it a little more, rebasing the no-caches-involved
older implementation and using an in-process lookup with
setup_git_directory_gently_1.
https://github.com/nasamuffin/git/tree/config-inheritance-no-cache

The recent experiments are in the tip commit, and the original series is
in the two commits prior if you're interested.

The upshot, though, is that I think there is still not a way around a
second subprocess. Before, we determined the superproject's gitdir like
so:

  # Does a git project at .. think I belong to it?
  git -C .. ls-files <args> -- path/to/submodule
  # Where does that git project's gitdir live?
  git -C .. rev-parse --absolute-git-dir

Even if we can do the second call in-process, we still will be
performing this ls-files call to ensure that the parent repo is actually
our superproject. (One good example of a time when the parent repo is
*not*: the entire Git test suite, where '/path/to/git/t/trash directory.t1234-abcd'
is not a submodule of '/path/to/git/.git'.)

We could reverse the checks, which will make this much less painful in
the real world, but will still slow down our test suites (and hopefully
you'll forgive me for combining C and bash so brazenly, but it's for
illustration purposes only):

  # Is there a git project above us?
  setup_git_directory_gently_1("..", out, 0);
  # Does it think we're its submodule?
  git -C $out rev-parse --absolute-git-dir

That will still result in an extra out-of-process call for every line in
the Git test suite, though, because of the trash directory layout.

I looked briefly at `git ls-files` // `cmd_ls_files()` and it's fairly
close to being callable on an arbitrary 'struct repository', but not
quite there. But I am pretty afraid of the rabbit hole ;)


And anyway, even with those possible changes, it turns out that
`setup_git_directory_gently_1` - which I had to munge a bit to make
non-private, anyway - wants to take shortcuts like looking at
getenv(GIT_DIR_ENVIRONMENT), which means it's will notice the
submodule's envvar before it will notice the path passed to it. I
actually am a little surprised that your experiment worked at all,
because of this wrinkle, but your printf lines show that it did somehow.


Taken as a whole, I'm not quite certain which is worse.

Approach "cache a pointer from the submodule":
- all the normal caching gotchas - creation, invalidation, staleness,
  etc
+ very easy lookup without need for significant refactoring
- since it's a config, if we do it wrong we're stuck supporting it
  forever anyway

Approach "do lots of in-process heuristics":
- need to refactor code areas unrelated to submodules or configs, like
  setup.c
- performance might differ based on filesystem speed
+ still pretty fast (compared to subprocess calls)
+ avoid all cache correctness issues
- still kind of based on heuristics; will someone envision a wonky way
  of organizing a submodule that breaks the heuristic?

I'll think on it more....

 - Emily

> I tried to reproduce this &
> can't. I run my tests in /home/avar/*, and just created this:
> 
>     $ find /tmp/some/ -name '.git' -type d
>     /tmp/some/dir/.git
>     /tmp/some/dir/a/b/c/d/e/f/g/i/j/k/.git
> 
> I.e. a deeply nested structure in /tmp, if you ask for the git-dir in
> /tmp/some/**/k you'll need to search several levels up.
> 
> Then with the patch below we'll instrument almost all git commands to
> optionally do that search, i.e. anything that does setup_git_directory()
> at all:
> 
>     $ GIT_TEST_SETUP=true GIT_TEST_SETUP_PRINT=true ~/g/git/git rev-parse HEAD
>     warning: from '/tmp/some/dir/a/b/c/d/e/f/g/i/j' found '/tmp/some/dir' ('/tmp/some/dir/.git')
>     <some hash here>
> 
> And as a quick test to run a few tests I tried:
> 
>     rm -rf test-results/; GIT_TEST_SETUP=true GIT_TEST_SETUP_PRINT=true prove -j8 t741*submod*.sh :: -V
> 
> Which runs quickly enough for a tight test loop, and does that work >600 times:
>     
>     $ cat test-results/*.out|grep -c warning.*from.*found
>     662
> 
> I can't get that to show me any meaningful difference, just to pick on
> one test (since it was easier to run repeatedly):
>     
>     $ hyperfine -L v true,false "GIT_TEST_SETUP={v} ./t7416-submodule-dash-url.sh --root=/dev/shm/git"
>     Benchmark #1: GIT_TEST_SETUP=true ./t7416-submodule-dash-url.sh --root=/dev/shm/git
>       Time (mean ± σ):     527.5 ms ±   7.2 ms    [User: 431.6 ms, System: 125.9 ms]
>       Range (min … max):   522.6 ms … 542.5 ms    10 runs
>      
>     Benchmark #2: GIT_TEST_SETUP=false ./t7416-submodule-dash-url.sh --root=/dev/shm/git
>       Time (mean ± σ):     526.7 ms ±  10.8 ms    [User: 421.1 ms, System: 131.6 ms]
>       Range (min … max):   518.2 ms … 546.8 ms    10 runs
>      
>     Summary
>       'GIT_TEST_SETUP=false ./t7416-submodule-dash-url.sh --root=/dev/shm/git' ran
>         1.00 ± 0.02 times faster than 'GIT_TEST_SETUP=true ./t7416-submodule-dash-url.sh --root=/dev/shm/git'
> 
> I.e. it's all fuzzy and within the error margins.
> 
> Now if we do e.g.:
> 
>     GIT_TEST_SETUP=false strace -f -c -U calls,name,time -S calls ./t7416-submodule-dash-url.sh 2>&1 >/dev/null | grep -A9000 calls >a
>     GIT_TEST_SETUP=true strace -f -c -U calls,name,time -S calls ./t7416-submodule-dash-url.sh 2>&1 >/dev/null | grep -A9000 calls >b
> 
> We'll see the syscall difference, in summary:
> 
>     -   110086 total            100.00
>     +   114765 total            100.00
> 
> And some of the real big differences are:
>     
>     $ diff -u <(head -n 12 a) <(head -n 12 b)
>     --- /dev/fd/63  2021-11-09 01:57:16.023991556 +0100
>     +++ /dev/fd/62  2021-11-09 01:57:16.019991593 +0100
>     @@ -1,12 +1,12 @@
>          calls syscall          % time
>      --------- ---------------- ------
>     -    11504 openat             2.15
>     -    11496 close              2.07
>     -    10672 read               4.15
>     -    10465 rt_sigaction       0.32
>     -     9913 lstat              1.50
>     -     9456 mmap               0.67
>     -     7545 stat               0.81
>     -     6349 fstat              0.62
>     -     3896 access             0.61
>     -     3490 mprotect           0.26
>     +    11783 lstat              1.80
>     +    11600 close              1.89
>     +    11599 openat             2.19
>     +    10887 read               4.36
>     +    10465 rt_sigaction       0.33
>     +     9742 stat               1.07
>     +     9455 mmap               0.75
>     +     6346 fstat              0.57
>     +     4113 access             0.65
>     +     3490 mprotect           0.28
> 
> But syscalls are fast, so it doesn't show up in real results.
> 
> Now, of course a real implementation could be less stupid, e.g. even if
> we think we need *a cache* if these are the performance numbers why do
> we need to risk the cache being incorrect v.s. say just writing "I am a
> submodule" somewhere in the config (if we don't have that).
> 
> Then we'll only do that work for submodules, so not all git invocations
> will pay the cost (and it this point we'll usually have read config
> anyway).
> 
> But I really just don't think it's that expensive at all. I can see how
> it would be for actually shelling out, but we don't need to do that.
> 
> This could also just be that I'm running this on a really fast FS, which
> is true. So I went and tested on an AIX machine I have access to.
> 
> I/O on AIX is slow, *really slow*, so slow that if you "rm -rfv"
> something you'll have time to read individual lines scrolling by.
> 
> That ~500ms t7416-submodule-dash-url.sh test runs in around 50s on that
> AIX box (power-aix.osuosl.org), most of which is I/O overhead, I created
> the same /tmp/ directory structure and tried with
> GIT_TEST_SETUP=[false|true] and it's ~55s with/without the env variable,
> with no clear winner.
> 
> I don't have access to hyperfine on that box, or the patience to wait
> for AIX I/O to wait for meaningful results, but to a first approximation
> that seems to indicate that it doesn't really matter there either.
> 
> diff --git a/setup.c b/setup.c
> index 347d7181ae9..8453d397676 100644
> --- a/setup.c
> +++ b/setup.c
> @@ -1209,6 +1209,8 @@ const char *setup_git_directory_gently(int *nongit_ok)
>         struct strbuf dir = STRBUF_INIT, gitdir = STRBUF_INIT;
>         const char *prefix = NULL;
>         struct repository_format repo_fmt = REPOSITORY_FORMAT_INIT;
> +       const char *str = "/tmp/some/dir/a/b/c/d/e/f/g/i/j";
> +       struct strbuf a = STRBUF_INIT, b = STRBUF_INIT;
>  
>         /*
>          * We may have read an incomplete configuration before
> @@ -1231,6 +1233,14 @@ const char *setup_git_directory_gently(int *nongit_ok)
>                 die_errno(_("Unable to read current working directory"));
>         strbuf_addbuf(&dir, &cwd);
>  
> +       if (git_env_bool("GIT_TEST_SETUP", 0)) {
> +               strbuf_addstr(&a, str);
> +               setup_git_directory_gently_1(&a, &b, 0);
> +
> +               if (strcmp(a.buf, str) && git_env_bool("GIT_TEST_SETUP_PRINT", 0))
> +                       warning("from '%s' found '%s' ('%s/%s')", str, a.buf, a.buf, b.buf);
> +       }
> +
>         switch (setup_git_directory_gently_1(&dir, &gitdir, 1)) {
>         case GIT_DIR_EXPLICIT:
>                 prefix = setup_explicit_git_dir(gitdir.buf, &cwd, &repo_fmt, nongit_ok);
>
Emily Shaffer Nov. 9, 2021, 9:46 p.m. UTC | #9
On Tue, Nov 09, 2021 at 12:36:50PM -0800, Emily Shaffer wrote:
> 
> On Tue, Nov 09, 2021 at 01:42:03AM +0100, Ævar Arnfjörð Bjarmason wrote:
> > 
> > 
> > On Mon, Nov 08 2021, Emily Shaffer wrote:
> > 
> > > On Fri, Nov 05, 2021 at 09:43:56AM +0100, Ævar Arnfjörð Bjarmason wrote:
> > >> 
> > >> 
> > >> On Thu, Nov 04 2021, Junio C Hamano wrote:
> > >> 
> > >> > Emily Shaffer <emilyshaffer@google.com> writes:
> > >> >
> > >> >> A recorded hint path to the superproject's gitdir might be added during
> > >> >> 'git submodule add', but in some cases - like submodules which were
> > >> >> created before 'git submodule add' learned to record that info - it might
> > >> >> be useful to update the hint. Let's do it during 'git submodule
> > >> >> update', when we already have a handle to the superproject while calling
> > >> >> operations on the submodules.
> > >> >
> > >> > We are hearing repeated mention of "cache" and "hint".  Do we ever
> > >> > invalidate it, or if we have such a record, do we blindly trust it
> > >> > and use it without verifying if it is still fresh?
> > >> >
> > >> > Also, this step and the previous step both say we record gitdir on
> > >> > their title, but we instead record common dir.  Whichever is the
> > >> > right choice to record, let's be consistent.
> > >> 
> > >> I had similar (AFAICT still unaddressed) feedback on the v2[1]. I'd lost
> > >> track of this series, and see one reason is that the In-Reply-Chain was
> > >> broken between v3..v4.
> > >> 
> > >> I.e. it seems to me that this whole thing started as a way to avoid
> > >> shellscript overhead by calling git-rev-parse from git-submodule.sh, but
> > >> now that the relevant bits are moved to C we could just call some
> > >> slightly adjusted code in setup.c.
> > >
> > > No, that is not the case. It is the case that `git -C .. rev-parse
> > > --git-dir` is *very* expensive in the case where `../` is not, in fact,
> > > a gitdir; when I attempted another series which relied on finding the
> > > parent superproject's gitdir in this way, our testsuite took something
> > > like 5x longer to run than before. In other words, the expensive part is
> > > not the shelling out overhead - the expensive part is searching up the
> > > entire filesystem directory structure in the worst-case ("we are not a
> > > submodule") scenario. This is still needed, even with 'git-submodule.sh'
> > > moving to C.
> > 
> > Do you have that test code somewhere?
> 
> I messed around with it a little more, rebasing the no-caches-involved
> older implementation and using an in-process lookup with
> setup_git_directory_gently_1.
> https://github.com/nasamuffin/git/tree/config-inheritance-no-cache
> 
> The recent experiments are in the tip commit, and the original series is
> in the two commits prior if you're interested.
> 
> The upshot, though, is that I think there is still not a way around a
> second subprocess. Before, we determined the superproject's gitdir like
> so:
> 
>   # Does a git project at .. think I belong to it?
>   git -C .. ls-files <args> -- path/to/submodule
>   # Where does that git project's gitdir live?
>   git -C .. rev-parse --absolute-git-dir
> 
> Even if we can do the second call in-process, we still will be
> performing this ls-files call to ensure that the parent repo is actually
> our superproject. (One good example of a time when the parent repo is
> *not*: the entire Git test suite, where '/path/to/git/t/trash directory.t1234-abcd'
> is not a submodule of '/path/to/git/.git'.)
> 
> We could reverse the checks, which will make this much less painful in
> the real world, but will still slow down our test suites (and hopefully
> you'll forgive me for combining C and bash so brazenly, but it's for
> illustration purposes only):
> 
>   # Is there a git project above us?
>   setup_git_directory_gently_1("..", out, 0);
>   # Does it think we're its submodule?
>   git -C $out rev-parse --absolute-git-dir
> 
> That will still result in an extra out-of-process call for every line in
> the Git test suite, though, because of the trash directory layout.
> 
> I looked briefly at `git ls-files` // `cmd_ls_files()` and it's fairly
> close to being callable on an arbitrary 'struct repository', but not
> quite there. But I am pretty afraid of the rabbit hole ;)

Jonathan Nieder and Glen Choo pointed out that I can read in the index
to an arbitrary struct index_state from a path, and then call
'index_dir_exists()', so this part of it is not as scary as I thought.

I'll mess around today and see if I can come up with an in-process
version of 'get_superproject_working_tree()' and
'get_superproject_gitdir()'.

Thanks for the lead - a fine example of how useful it is to receive
high-level input from someone removed from the problem ;)

 - Emily
diff mbox series

Patch

diff --git a/git-submodule.sh b/git-submodule.sh
index 652861aa66..873d64eb99 100755
--- a/git-submodule.sh
+++ b/git-submodule.sh
@@ -449,6 +449,20 @@  cmd_update()
 			;;
 		esac
 
+		# Cache a pointer to the superproject's common dir. This may have
+		# changed, unless it's a fresh clone. Writes it to worktree
+		# if applicable, otherwise to local.
+		if test -z "$just_cloned"
+		then
+			sm_gitdir="$(git -C "$sm_path" rev-parse --absolute-git-dir)"
+			relative_gitdir="$(git rev-parse --path-format=relative \
+							 --prefix "${sm_gitdir}" \
+							 --git-common-dir)"
+
+			git -C "$sm_path" config --worktree \
+				submodule.superprojectgitdir "$relative_gitdir"
+		fi
+
 		if test -n "$recursive"
 		then
 			(
diff --git a/t/t7406-submodule-update.sh b/t/t7406-submodule-update.sh
index 11cccbb333..5146237abc 100755
--- a/t/t7406-submodule-update.sh
+++ b/t/t7406-submodule-update.sh
@@ -1061,4 +1061,16 @@  test_expect_success 'submodule update --quiet passes quietness to fetch with a s
 	)
 '
 
+test_expect_success 'submodule update adds superproject gitdir to older repos' '
+	(cd super &&
+	 git -C submodule config --unset submodule.superprojectGitdir &&
+	 git submodule update &&
+	 test-tool path-utils relative_path \
+		"$(git rev-parse --path-format=absolute --git-common-dir)" \
+		"$(git -C submodule rev-parse --path-format=absolute --git-common-dir)" >expect &&
+	 git -C submodule config submodule.superprojectGitdir >actual &&
+	 test_cmp expect actual
+	)
+'
+
 test_done