diff mbox series

p5302: create the repo in each index-pack test

Message ID 20190422211952.GA4728@sigill.intra.peff.net (mailing list archive)
State New, archived
Headers show
Series p5302: create the repo in each index-pack test | expand

Commit Message

Jeff King April 22, 2019, 9:19 p.m. UTC
On Mon, Apr 22, 2019 at 04:56:53PM -0400, Jeff King wrote:

> > I am running some index packs to test the theory, I can tell you already that 
> > the 56 thread versions was much slower, it took 397m25.622s. I am running a 
> > few other tests also, but it will take a while to get an answer. Since things 
> > take hours to test, I made a repo with a single branch (and the tags for that 
> > branch) from this bigger repo using a git init/git fetch. The single branch 
> > repo takes about 12s to clone, but it takes around 14s with 3 threads to run 
> > index-pack, any ideas why it is slower than a clone?
> 
> Are you running it in the same repo, or in another newly-created repo?
> Or alternatively, in a new repo but repeatedly running index-pack? After
> the first run, that repo will have all of the objects. And so for each
> object it sees, index-pack will say "woah, we already had that one;
> let's double check that they're byte for byte identical" which carries
> extra overhead (and probably makes the lock contention way worse, too,
> because accessing existing objects just has one big coarse lock).
> 
> So definitely do something like:
> 
>   for threads in 1 2 3 4 5 12 56; do
> 	rm -rf repo.git
> 	git init --bare repo.git
> 	GIT_FORCE_THREADS=$threads \
> 	  git -C repo.git index-pack -v --stdin </path/to/pack
>   done
> 
> to test.

This is roughly what p5302 is going, though it does not go as high as
56 (though it seems like there is probably not much point in doing so).

However, I did notice this slight bug in it. After this fix, here are my
numbers from indexing git.git:

  Test                                           HEAD             
  ----------------------------------------------------------------
  5302.2: index-pack 0 threads                   22.72(22.55+0.16)
  5302.3: index-pack 1 thread                    23.26(23.02+0.24)
  5302.4: index-pack 2 threads                   13.19(24.06+0.23)
  5302.5: index-pack 4 threads                   7.96(24.65+0.25) 
  5302.6: index-pack 8 threads                   7.94(45.06+0.38) 
  5302.7: index-pack default number of threads   9.37(23.82+0.18) 

So it looks like "4" is slightly better than the default of "3" for me.

I'm running it on linux.git now, but it will take quite a while to come
up with a result.

-- >8 --
Subject: [PATCH] p5302: create the repo in each index-pack test

The p5302 script runs "index-pack --stdin" in each timing test. It does
two things to try to get good timings:

  1. we do the repo creation in a separate (non-timed) setup test, so
     that our timing is purely the index-pack run

  2. we use a separate repo for each test; this is important because the
     presence of existing objects in the repo influences the result
     (because we'll end up doing collision checks against them)

But this forgets one thing: we generally run each timed test multiple
times to reduce the impact of noise. Which means that repeats of each
test after the first will be subject to the collision slowdown from
point 2, and we'll generally just end up taking the first time anyway.

Instead, let's create the repo in the test (effectively undoing point
1). That does add a constant amount of extra work to each iteration, but
it's quite small compared to the actual effects we're interested in
measuring.

Signed-off-by: Jeff King <peff@peff.net>
---
The very first 0-thread one will run faster because it has less to "rm
-rf", but I think we can ignore that.

 t/perf/p5302-pack-index.sh | 31 ++++++++++++++++++-------------
 1 file changed, 18 insertions(+), 13 deletions(-)

Comments

Junio C Hamano April 23, 2019, 1:09 a.m. UTC | #1
Jeff King <peff@peff.net> writes:

> Subject: [PATCH] p5302: create the repo in each index-pack test
>
> The p5302 script runs "index-pack --stdin" in each timing test. It does
> two things to try to get good timings:
>
>   1. we do the repo creation in a separate (non-timed) setup test, so
>      that our timing is purely the index-pack run
>
>   2. we use a separate repo for each test; this is important because the
>      presence of existing objects in the repo influences the result
>      (because we'll end up doing collision checks against them)
>
> But this forgets one thing: we generally run each timed test multiple
> times to reduce the impact of noise. Which means that repeats of each
> test after the first will be subject to the collision slowdown from
> point 2, and we'll generally just end up taking the first time anyway.

The above is very cleanly written to convince anybody that what the
current test does contradicts with wish #2 above, and that the two
wishes #1 and #2 are probably mutually incompatible.

But isn't the collision check a part of the real-life workload that
Git users are made waiting for and care about the performance of?
Or are we purely interested in the cost of resolving delta,
computing the object name, and writing the result out to the disk in
this test and the "overall experience" benchmark is left elsewhere?

The reason why I got confused is because the test_description of the
script leaves "the actual effects we're interested in measuring"
unsaid, I think.  The log message of b8a2486f ("index-pack: support
multithreaded delta resolving", 2012-05-06) that created this test
does not help that much, either.

In any case, the above "this forgets one thing" makes it clear that
we at this point in time declare what we are interested in very
clearly, and I agree that the solution described in the paragraph
below clearly matches the goal.  Looks good.

> Instead, let's create the repo in the test (effectively undoing point
> 1). That does add a constant amount of extra work to each iteration, but
> it's quite small compared to the actual effects we're interested in
> measuring.


> Signed-off-by: Jeff King <peff@peff.net>
> ---
> The very first 0-thread one will run faster because it has less to "rm
> -rf", but I think we can ignore that.

OK.

> -	GIT_DIR=t1 git index-pack --threads=1 --stdin < $PACK
> +	rm -rf repo.git &&
> +	git init --bare repo.git &&
> +	GIT_DIR=repo.git git index-pack --threads=1 --stdin < $PACK

This is obviously inherited from the original, but do we get scolded
by some versions of bash for this line, without quoting the source path
of the redirection, i.e.

	... --stdin <"$PACK"
Jeff King April 23, 2019, 2:07 a.m. UTC | #2
On Tue, Apr 23, 2019 at 10:09:54AM +0900, Junio C Hamano wrote:

> The above is very cleanly written to convince anybody that what the
> current test does contradicts with wish #2 above, and that the two
> wishes #1 and #2 are probably mutually incompatible.
> 
> But isn't the collision check a part of the real-life workload that
> Git users are made waiting for and care about the performance of?
> Or are we purely interested in the cost of resolving delta,
> computing the object name, and writing the result out to the disk in
> this test and the "overall experience" benchmark is left elsewhere?

I think we _are_ just interested in the resolving delta cost (after all,
we're testing it with various thread levels). What's more, the old code
would run the test $GIT_PERF_COUNT times, once without any objects, and
then the other N-1 times with objects. And then take the smallest time,
which would generally be the one-off!  So we're really just measuring
that case more consistently now.

But even if you left all of that aside, I think the case without objects
is actually the realistic one. It represents the equivalent a full
clone, where we would not have any objects already.

The case where we are fetching into a repository with objects already is
also potentially of interest, but this test wouldn't show that very
well. Because there the main added cost is looking up each object and
saying "ah, we do not have it; no need to do a collision check", because
we'd generally not expect the other side to be sending us duplicates.

But because this test would be repeating itself on the same pack each
time, we'd be seeing a collision on _every_ object. And the added time
would be dominated by us saying "oops, a collision; let's take the slow
path and reconstruct that object from disk so we can compare its bytes".

> The reason why I got confused is because the test_description of the
> script leaves "the actual effects we're interested in measuring"
> unsaid, I think.  The log message of b8a2486f ("index-pack: support
> multithreaded delta resolving", 2012-05-06) that created this test
> does not help that much, either.
> 
> In any case, the above "this forgets one thing" makes it clear that
> we at this point in time declare what we are interested in very
> clearly, and I agree that the solution described in the paragraph
> below clearly matches the goal.  Looks good.

So I think you convinced yourself even before my email that this was the
right path, but let me know if you think it's worth trying to revise the
commit message to include some of the above reasoning.

> > -	GIT_DIR=t1 git index-pack --threads=1 --stdin < $PACK
> > +	rm -rf repo.git &&
> > +	git init --bare repo.git &&
> > +	GIT_DIR=repo.git git index-pack --threads=1 --stdin < $PACK
> 
> This is obviously inherited from the original, but do we get scolded
> by some versions of bash for this line, without quoting the source path
> of the redirection, i.e.
> 
> 	... --stdin <"$PACK"

In general, yes, but I think we are OK in this instance because we
generated $PACK ourselves in the setup step, and we know that it is just
a relative .git/objects/pack/xyz.pack with no spaces. I almost touched
it just to get rid of the style-violating space after the "<" though. ;)

-Peff
Junio C Hamano April 23, 2019, 2:27 a.m. UTC | #3
Jeff King <peff@peff.net> writes:

>> This is obviously inherited from the original, but do we get scolded
>> by some versions of bash for this line, without quoting the source path
>> of the redirection, i.e.
>> 
>> 	... --stdin <"$PACK"
>
> In general, yes, but I think we are OK in this instance because we
> generated $PACK ourselves in the setup step, and we know that it is just
> a relative .git/objects/pack/xyz.pack with no spaces.

I know we are OK, but the issue with some versions of bash AFAIU is
that bash is not OK regardless of the contents of $variable that is
not quoted and used as the target or the source of a redirection,
issuing an unnecessary warning.
Jeff King April 23, 2019, 2:36 a.m. UTC | #4
On Tue, Apr 23, 2019 at 11:27:02AM +0900, Junio C Hamano wrote:

> Jeff King <peff@peff.net> writes:
> 
> >> This is obviously inherited from the original, but do we get scolded
> >> by some versions of bash for this line, without quoting the source path
> >> of the redirection, i.e.
> >> 
> >> 	... --stdin <"$PACK"
> >
> > In general, yes, but I think we are OK in this instance because we
> > generated $PACK ourselves in the setup step, and we know that it is just
> > a relative .git/objects/pack/xyz.pack with no spaces.
> 
> I know we are OK, but the issue with some versions of bash AFAIU is
> that bash is not OK regardless of the contents of $variable that is
> not quoted and used as the target or the source of a redirection,
> issuing an unnecessary warning.

Is it? I thought the issue was specifically when there were spaces. I
get:

  $ bash
  $ file=ok
  $ echo foo >$file
  $ file='not ok'
  $ echo foo >$file
  bash: $file: ambiguous redirect

And that is AFAIK what the recent 7951a016a5 (t4038-diff-combined: quote
paths with whitespace, 2019-03-17) was about (because our trash
directory always has a space in it).

Did I miss a report where it happens on some versions even without
spaces? If so, we have quite a number of things to fix judging from the
output of:

  git grep '>\$'

-Peff
Junio C Hamano April 23, 2019, 2:40 a.m. UTC | #5
Jeff King <peff@peff.net> writes:

> Is it? I thought the issue was specifically when there were spaces. I
> get:
>
>   $ bash
>   $ file=ok
>   $ echo foo >$file
>   $ file='not ok'
>   $ echo foo >$file
>   bash: $file: ambiguous redirect

OK, so I misremembered.  Then we are good.  Thanks.
diff mbox series

Patch

diff --git a/t/perf/p5302-pack-index.sh b/t/perf/p5302-pack-index.sh
index 99bdb16c85..a9b3e112d9 100755
--- a/t/perf/p5302-pack-index.sh
+++ b/t/perf/p5302-pack-index.sh
@@ -13,35 +13,40 @@  test_expect_success 'repack' '
 	export PACK
 '
 
-test_expect_success 'create target repositories' '
-	for repo in t1 t2 t3 t4 t5 t6
-	do
-		git init --bare $repo
-	done
-'
-
 test_perf 'index-pack 0 threads' '
-	GIT_DIR=t1 git index-pack --threads=1 --stdin < $PACK
+	rm -rf repo.git &&
+	git init --bare repo.git &&
+	GIT_DIR=repo.git git index-pack --threads=1 --stdin < $PACK
 '
 
 test_perf 'index-pack 1 thread ' '
-	GIT_DIR=t2 GIT_FORCE_THREADS=1 git index-pack --threads=1 --stdin < $PACK
+	rm -rf repo.git &&
+	git init --bare repo.git &&
+	GIT_DIR=repo.git GIT_FORCE_THREADS=1 git index-pack --threads=1 --stdin < $PACK
 '
 
 test_perf 'index-pack 2 threads' '
-	GIT_DIR=t3 git index-pack --threads=2 --stdin < $PACK
+	rm -rf repo.git &&
+	git init --bare repo.git &&
+	GIT_DIR=repo.git git index-pack --threads=2 --stdin < $PACK
 '
 
 test_perf 'index-pack 4 threads' '
-	GIT_DIR=t4 git index-pack --threads=4 --stdin < $PACK
+	rm -rf repo.git &&
+	git init --bare repo.git &&
+	GIT_DIR=repo.git git index-pack --threads=4 --stdin < $PACK
 '
 
 test_perf 'index-pack 8 threads' '
-	GIT_DIR=t5 git index-pack --threads=8 --stdin < $PACK
+	rm -rf repo.git &&
+	git init --bare repo.git &&
+	GIT_DIR=repo.git git index-pack --threads=8 --stdin < $PACK
 '
 
 test_perf 'index-pack default number of threads' '
-	GIT_DIR=t6 git index-pack --stdin < $PACK
+	rm -rf repo.git &&
+	git init --bare repo.git &&
+	GIT_DIR=repo.git git index-pack --stdin < $PACK
 '
 
 test_done