
t/perf/perf-lib.sh: remove test_times.* at the end test_perf_()

Message ID pull.1051.git.1633386543759.gitgitgadget@gmail.com (mailing list archive)
State Accepted
Commit 473a26166c0a4d527b0568384bbacdb9e29b30d2
Series t/perf/perf-lib.sh: remove test_times.* at the end test_perf_()

Commit Message

Jeff Hostetler Oct. 4, 2021, 10:29 p.m. UTC
From: Jeff Hostetler <jeffhost@microsoft.com>

Teach test_perf_() to remove the temporary test_times.* files
at the end of each test.

test_perf_() runs a particular GIT_PERF_REPEAT_COUNT times and creates
./test_times.[123...].  It then uses a perl script to find the minimum
over "./test_times.*" (note the wildcard) and writes that time to
"test-results/<testname>.<testnumber>.result".

If the repeat count is changed during the pXXXX test script, stale
test_times.* files (from previous steps) may be included in the min()
computation.  For example:

...
GIT_PERF_REPEAT_COUNT=3 \
test_perf "status" "
	git status
"

GIT_PERF_REPEAT_COUNT=1 \
test_perf "checkout other" "
	git checkout other
"
...

The time reported in the summary for "XXXX.2 checkout other" would
be "min( checkout[1], status[2], status[3] )".

We prevent that error by removing the test_times.* files at the end of
each test.
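
For illustration, here is a minimal sketch of the failure mode with
fabricated numbers (the on-disk files are named test_time.*, as in the
patch below; only the wildcard behavior matters here):

  # The second test ran with GIT_PERF_REPEAT_COUNT=1, so it only
  # rewrote test_time.1; test_time.2 and test_time.3 are stale
  # leftovers from the first test.
  printf '33.51 2.75 15.56\n' >test_time.1   # checkout, the real cost
  printf '0.74 0.21 4.70\n'   >test_time.2   # stale status timing
  printf '0.73 0.13 6.85\n'   >test_time.3   # stale status timing

  # The wildcard feeds all three files to the minimum computation:
  #   "$TEST_DIRECTORY"/perf/min_time.perl test_time.*
  # which reports ~0.73s for the checkout instead of ~33.5s.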

Signed-off-by: Jeff Hostetler <jeffhost@microsoft.com>
---

Published-As: https://github.com/gitgitgadget/git/releases/tag/pr-1051%2Fjeffhostetler%2Fperf-test-remove-test-times-v1
Fetch-It-Via: git fetch https://github.com/gitgitgadget/git pr-1051/jeffhostetler/perf-test-remove-test-times-v1
Pull-Request: https://github.com/gitgitgadget/git/pull/1051

 t/perf/perf-lib.sh | 1 +
 1 file changed, 1 insertion(+)


base-commit: 0785eb769886ae81e346df10e88bc49ffc0ac64e

Comments

Taylor Blau Oct. 5, 2021, 5:45 p.m. UTC | #1
On Mon, Oct 04, 2021 at 10:29:03PM +0000, Jeff Hostetler via GitGitGadget wrote:
> From: Jeff Hostetler <jeffhost@microsoft.com>
>
> Teach test_perf_() to remove the temporary test_times.* files

Small nit: s/test_times/test_time here and throughout.

> at the end of each test.
>
> test_perf_() runs a particular GIT_PERF_REPEAT_COUNT times and creates
> ./test_times.[123...].  It then uses a perl script to find the minimum
> over "./test_times.*" (note the wildcard) and writes that time to
> "test-results/<testname>.<testnumber>.result".
>
> If the repeat count is changed during the pXXXX test script, stale
> test_times.* files (from previous steps) may be included in the min()
> computation.  For example:
>
> ...
> GIT_PERF_REPEAT_COUNT=3 \
> test_perf "status" "
> 	git status
> "
>
> GIT_PERF_REPEAT_COUNT=1 \
> test_perf "checkout other" "
> 	git checkout other
> "
> ...
>
> The time reported in the summary for "XXXX.2 checkout other" would
> be "min( checkout[1], status[2], status[3] )".
>
> We prevent that error by removing the test_times.* files at the end of
> each test.

Well explained, and makes sense to me. I didn't know we set
GIT_PERF_REPEAT_COUNT inline with the performance tests themselves, but
grepping shows that we do it in the fsmonitor tests.

Dropping any test_times files makes sense as the right thing to do. I
have no opinion on whether it should happen before running a perf test,
or after generating the results. So what you did here looks good to me.

An alternative approach might be to only read the test_time.n files we
know should exist based on GIT_PERF_REPEAT_COUNT, perhaps like:

    test_seq "$GIT_PERF_REPEAT_COUNT" | perl -lne 'print "test_time.$_"' |
    xargs "$TEST_DIRECTORY/perf/min_time.perl" >"$base".result

but I'm not convinced that the above is at all better than what you
wrote, since leaving the extra files around is the footgun we're trying
to avoid in the first place.

Thanks,
Taylor
Jeff King Oct. 6, 2021, 7:24 p.m. UTC | #2
On Tue, Oct 05, 2021 at 01:45:03PM -0400, Taylor Blau wrote:

> > GIT_PERF_REPEAT_COUNT=3 \
> > test_perf "status" "
> > 	git status
> > "
> >
> > GIT_PERF_REPEAT_COUNT=1 \
> > test_perf "checkout other" "
> > 	git checkout other
> > "
> [...]
> 
> Well explained, and makes sense to me. I didn't know we set
> GIT_PERF_REPEAT_COUNT inline with the performance tests themselves, but
> grepping shows that we do it in the fsmonitor tests.

Neither did I. IMHO that is a hack that we would do better to avoid, as
the point of it is to let the user drive the decision of time versus
quality of results. So the first example above is spending extra time
that the user may have asked us not to, and the second is getting less
significant results by not repeating the trial.

Presumably the issue in the second one is that the test modifies state.
The "right" solution there is to give test_perf() a way to set up the
state between trials (you can do it in the test_perf block, but you'd
want to avoid letting the setup step affect the timing).

I'd also note that

  GIT_PERF_REPEAT_COUNT=1 \
  test_perf ...

in the commit message is a bad pattern. On some shells, the one-shot
variable before a function will persist after the function returns (so
it would accidentally tweak the count for later tests, too).
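
For illustration only (this is not what the patch does), one portable
way to avoid that leak is to save and restore the variable explicitly;
"saved_repeat_count" is just an ad-hoc name:

  saved_repeat_count=$GIT_PERF_REPEAT_COUNT
  GIT_PERF_REPEAT_COUNT=1
  test_perf "checkout other" "
  	git checkout other
  "
  GIT_PERF_REPEAT_COUNT=$saved_repeat_count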

All that said, I do think cleaning up the test_time files after each
test_perf is a good precaution, even if I don't think it's a good idea
in general to flip the REPEAT_COUNT variable in the middle of a test.

-Peff
Taylor Blau Oct. 6, 2021, 7:26 p.m. UTC | #3
On Wed, Oct 06, 2021 at 03:24:07PM -0400, Jeff King wrote:
> All that said, I do think cleaning up the test_time files after each
> test_perf is a good precaution, even if I don't think it's a good idea
> in general to flip the REPEAT_COUNT variable in the middle of a test.

Thanks for putting it so concisely, I couldn't agree more.

Thanks,
Taylor
Jeff Hostetler Oct. 7, 2021, 5:49 p.m. UTC | #4
On 10/6/21 3:24 PM, Jeff King wrote:
> On Tue, Oct 05, 2021 at 01:45:03PM -0400, Taylor Blau wrote:
> 
>>> GIT_PERF_REPEAT_COUNT=3 \
>>> test_perf "status" "
>>> 	git status
>>> "
>>>
>>> GIT_PERF_REPEAT_COUNT=1 \
>>> test_perf "checkout other" "
>>> 	git checkout other
>>> "
>> [...]
>>
>> Well explained, and makes sense to me. I didn't know we set
>> GIT_PERF_REPEAT_COUNT inline with the performance tests themselves, but
>> grepping shows that we do it in the fsmonitor tests.
> 
> Neither did I. IMHO that is a hack that we would do better to avoid, as
> the point of it is to let the user drive the decision of time versus
> quality of results. So the first example above is spending extra time
> that the user may have asked us not to, and the second is getting less
> significant results by not repeating the trial.
> 
> Presumably the issue in the second one is that the test modifies state.
> The "right" solution there is to give test_perf() a way to set up the
> state between trials (you can do it in the test_perf block, but you'd
> want to avoid letting the setup step affect the timing).
> 
> I'd also note that
> 
>    GIT_PERF_REPEAT_COUNT=1 \
>    test_perf ...
> 
> in the commit message is a bad pattern. On some shells, the one-shot
> variable before a function will persist after the function returns (so
> it would accidentally tweak the count for later tests, too).
> 
> All that said, I do think cleaning up the test_time files after each
> test_perf is a good precaution, even if I don't think it's a good idea
> in general to flip the REPEAT_COUNT variable in the middle of a test.
> 
> -Peff
> 

Yeah, I don't think I want to keep switching the value of _REPEAT_COUNT
in the body of the test.  (It did feel a little "against the spirit" of
the framework.)  I'm in the process of redoing the test to not need
that.



There's a problem with the perf test assumptions here and I'm curious
if there's a better way to use the perf-lib that I'm not thinking of.

When working with big repos (in this case 100K files), the actual
checkout takes 33 seconds, but the repetitions are fast -- since they
just print a warning and stop.  In the 1M file case that number is ~7
minutes for the first instance.)  With the code in min_time.perl
silently taking the min() of the runs, it looks like the checkout was
really fast when it wasn't.  That fact gets hidden in the summary report
printed at the end.

$ time ~/work/core/git checkout p0006-ballast
Updating files: 100% (100000/100000), done.
Switched to branch 'p0006-ballast'

real	0m33.510s
user	0m2.757s
sys	0m15.565s

$ time ~/work/core/git checkout p0006-ballast
Already on 'p0006-ballast'

real	0m0.745s
user	0m0.214s
sys	0m4.705s

$ time ~/work/core/git checkout p0006-ballast
Already on 'p0006-ballast'

real	0m0.738s
user	0m0.134s
sys	0m6.850s


I could use test_expect_success() for anything that does want
to change state, and then save test_perf() for status calls
and other read-only tests, but I think we lose some opportunities
here.

I'm open to suggestions here.

Thanks,
Jeff
Jeff King Oct. 8, 2021, 2:55 a.m. UTC | #5
On Thu, Oct 07, 2021 at 01:49:15PM -0400, Jeff Hostetler wrote:

> Yeah, I don't think I want to keep switching the value of _REPEAT_COUNT
> in the body of the test.  (It did feel a little "against the spirit" of
> the framework.)  I'm in the process of redoing the test to not need
> that.

Sounds good to me. :)

> There's a problem with the perf test assumptions here and I'm curious
> if there's a better way to use the perf-lib that I'm not thinking of.
> 
> When working with big repos (in this case 100K files), the actual
> checkout takes 33 seconds, but the repetitions are fast -- since they
> just print a warning and stop.  (In the 1M file case that number is ~7
> minutes for the first instance.)  With the code in min_time.perl
> silently taking the min() of the runs, it looks like the checkout was
> really fast when it wasn't.  That fact gets hidden in the summary report
> printed at the end.

Right. So your option now is basically something like:

  test_perf 'checkout' '
	git reset --hard the_original_state &&
	git checkout
  '

I.e., reset the state _and_ perform the operation you want to time,
within a single test_perf(). (Here it's a literal "git reset", but
obviously it could be anything to adjust the state back to the
baseline). But that reset operation will get lumped into your timing.
That might not matter if it's comparatively cheap, but it might throw
off all your results.

What I'd propose instead is that we ought to have:

  test_perf 'checkout'
            --prepare '
	        git reset --hard the_original_state
	    ' '
	        git checkout
	    '

Having two multi-line snippets is a bit ugly (check out that awful
middle line), but I think this could be added without breaking existing
tests (they just wouldn't have a --prepare option).

If that syntax is too horrendous, we could have:

  # this saves the snippet in a variable internally, and runs
  # it before each trial of the next test_perf(), after which
  # it is discarded
  test_perf_prepare '
          git reset --hard the_original_state
  '

  test_perf 'checkout' '
          git checkout
  '

I think that would be pretty easy to implement, and would solve the most
common form of this problem. And there's plenty of prior art; just about
every decent benchmarking system has a "do this before each trial"
mechanism. Our t/perf suite (as you probably noticed) is rather more
ad-hoc and less mature.
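
As a self-contained sketch of that "prepare before each trial" idea
(every name below is hypothetical and the timing is deliberately crude;
the real perf-lib.sh plumbing would look different):

  perf_prepare_snippet=

  test_perf_prepare () {
  	perf_prepare_snippet=$1
  }

  run_trials () {
  	# $1 = timed snippet; writes test_time.1 .. test_time.$GIT_PERF_REPEAT_COUNT
  	i=1
  	while test "$i" -le "${GIT_PERF_REPEAT_COUNT:-3}"
  	do
  		# run the prepare snippet outside the timed region
  		test -z "$perf_prepare_snippet" || eval "$perf_prepare_snippet"
  		start=$(date +%s)
  		eval "$1"
  		echo $(( $(date +%s) - start )) >"test_time.$i"
  		i=$((i + 1))
  	done
  	# discard the snippet so it only applies to this one test
  	perf_prepare_snippet=
  }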

There are cases it doesn't help, though. For instance, in one of the
scripts we measure the time to run "git repack -adb" to generate
bitmaps. But the first run has to do more work, because we can reuse
results for subsequent ones! It would help to "rm -f
objects/pack/*.bitmap", but even that's not entirely fair, as it will be
repacking from a single pack, versus whatever state we started with.

And there I think the whole "take the best run" strategy is hampering
us. These inaccuracies in our timings go unseen, because we don't do any
statistical analysis of the results. Whereas a tool like hyperfine (for
example) will run trials until the mean stabilizes, and then let you
know when there were trials outside of a standard deviation.
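
For concreteness, a trial along those lines could look roughly like
this with hyperfine (branch names are placeholders; --prepare runs
before every timed run and --warmup adds an unmeasured run up front):

  hyperfine \
  	--warmup 1 \
  	--prepare 'git checkout -q the_original_state' \
  	'git checkout -q other'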

I know we're hesitant to introduce dependencies, but I do wonder if we
could have much higher quality perf results if we accepted a dependency
on a tool like that. I'd never want that for the regular test suite, but
my feelings for the perf suite are much looser. I suspect not many
people run it at all, and its main utility is showing off improvements
and looking for broad regressions. It's possible somebody would want to
track down a performance change on a specific obscure platform, but in
general I'd suspect they'd be much better off timing things manually in
such a case.

So there. That was probably more than you wanted to hear, and further
than you want to go right now. In the near-term for the tests you're
interested in, something like the "prepare" feature I outlined above
would probably not be too hard to add, and would address your immediate
problem.

-Peff
Ævar Arnfjörð Bjarmason Oct. 8, 2021, 7:47 a.m. UTC | #6
On Thu, Oct 07 2021, Jeff King wrote:

> And there I think the whole "take the best run" strategy is hampering
> us. These inaccuracies in our timings go unseen, because we don't do any
> statistical analysis of the results. Whereas a tool like hyperfine (for
> example) will run trials until the mean stabilizes, and then let you
> know when there were trials outside of a standard deviation.
>
> I know we're hesitant to introduce dependencies, but I do wonder if we
> could have much higher quality perf results if we accepted a dependency
> on a tool like that. I'd never want that for the regular test suite, but
> my feelings for the perf suite are much looser. I suspect not many
> people run it at all, and its main utility is showing off improvements
> and looking for broad regressions. It's possible somebody would want to
> track down a performance change on a specific obscure platform, but in
> general I'd suspect they'd be much better off timing things manually in
> such a case.
>
> So there. That was probably more than you wanted to hear, and further
> than you want to go right now. In the near-term for the tests you're
> interested in, something like the "prepare" feature I outlined above
> would probably not be too hard to add, and would address your immediate
> problem.

I'd really like that, as you point out the statistics in t/perf now are
quite bad.

A tool like hyperfine is, for the purposes of the test suite, ultimately
just something that can run templated code with labels. If anyone cared,
I don't see why we couldn't ship a hyperfine-fallback.pl or whatever
that accepted the same parameters and ran our current (and worse)
end-to-end statistics.

If that is something you're encouraged to work on and are taking
requests :), it would be really nice if t/perf could, say, emit a one-off
Makefile and run the tests via that, rather than the one-off nproc=1
./run script we've got now.

With the same sort of templating a "hyperfine" invocation would need
(and some prep/teardown phases), that would make it easy to run perf
tests in parallel across all cores, or even across N machines.
Junio C Hamano Oct. 8, 2021, 5:30 p.m. UTC | #7
Jeff King <peff@peff.net> writes:

> What I'd propose instead is that we ought to have:
>
>   test_perf 'checkout'
>             --prepare '
> 	        git reset --hard the_original_state
> 	    ' '
> 	        git checkout
> 	    '
>
> Having two multi-line snippets is a bit ugly (check out that awful
> middle line), but I think this could be added without breaking existing
> tests (they just wouldn't have a --prepare option).
>
> If that syntax is too horrendous, we could have:
>
>   # this saves the snippet in a variable internally, and runs
>   # it before each trial of the next test_perf(), after which
>   # it is discarded
>   test_perf_prepare '
>           git reset --hard the_original_state
>   '
>
>   test_perf 'checkout' '
>           git checkout
>   '
>
> I think that would be pretty easy to implement, and would solve the most
> common form of this problem. And there's plenty of prior art; just about
> every decent benchmarking system has a "do this before each trial"
> mechanism. Our t/perf suite (as you probably noticed) is rather more
> ad-hoc and less mature.

Nice.

> There are cases it doesn't help, though. For instance, in one of the
> scripts we measure the time to run "git repack -adb" to generate
> bitmaps. But the first run has to do more work, because we can reuse
> results for subsequent ones! It would help to "rm -f
> objects/pack/*.bitmap", but even that's not entirely fair, as it will be
> repacking from a single pack, versus whatever state we started with.

You need a "do this too for each iteration but do not time it", i.e.

    test_perf 'repack performance' --prepare '
	make a messy original repository
    ' --per-iteration-prepare '
	prepare a test repository from the messy original
    ' --time-this-part-only '
        git repack -adb
    '

Syntactically, eh, Yuck.
Jeff King Oct. 8, 2021, 7:57 p.m. UTC | #8
On Fri, Oct 08, 2021 at 10:30:09AM -0700, Junio C Hamano wrote:

> > There are cases it doesn't help, though. For instance, in one of the
> > scripts we measure the time to run "git repack -adb" to generate
> > bitmaps. But the first run has to do more work, because we can reuse
> > results for subsequent ones! It would help to "rm -f
> > objects/pack/*.bitmap", but even that's not entirely fair, as it will be
> > repacking from a single pack, versus whatever state we started with.
> 
> You need a "do this too for each iteration but do not time it", i.e.
> 
>     test_perf 'repack performance' --prepare '
> 	make a messy original repository
>     ' --per-iteration-prepare '
> 	prepare a test repository from the messy original
>     ' --time-this-part-only '
>         git repack -adb
>     '
> 
> Syntactically, eh, Yuck.

If any step doesn't need to be per-iteration, you can do it in a
separate test_expect_success block. So --prepare would always be
per-iteration, I think.

The tricky thing I meant to highlight in that example is that the
preparation step is non-obvious, since the timed command actually throws
away the original state. But we can probably stash away what we need.
With the test_perf_prepare I showed earlier, maybe:

  test_expect_success 'stash original state' '
	cp -al objects objects.orig
  '

  test_perf_prepare 'set up original state' '
	rm -rf objects &&
	cp -al objects.orig objects
  '

  test_perf 'repack' '
	git repack -adb
  '

which is not too bad. Of course it is made a lot easier by my use of the
unportable "cp -l", but you get the general idea. ;) This is all a rough
sketch anyway, and not something I plan on working on anytime soon.

For Jeff's case, the preparation would hopefully just be some sequence
of reset/read-tree/etc to manipulate the index and working tree to the
original state.

-Peff
SZEDER Gábor Oct. 10, 2021, 9:26 p.m. UTC | #9
On Tue, Oct 05, 2021 at 01:45:03PM -0400, Taylor Blau wrote:
> On Mon, Oct 04, 2021 at 10:29:03PM +0000, Jeff Hostetler via GitGitGadget wrote:
> > From: Jeff Hostetler <jeffhost@microsoft.com>
> >
> > Teach test_perf_() to remove the temporary test_times.* files
> 
> Small nit: s/test_times/test_time here and throughout.
> 
> > at the end of each test.
> >
> > test_perf_() runs a particular GIT_PERF_REPEAT_COUNT times and creates
> > ./test_times.[123...].  It then uses a perl script to find the minimum
> > over "./test_times.*" (note the wildcard) and writes that time to
> > "test-results/<testname>.<testnumber>.result".
> >
> > If the repeat count is changed during the pXXXX test script, stale
> > test_times.* files (from previous steps) may be included in the min()
> > computation.  For example:
> >
> > ...
> > GIT_PERF_REPEAT_COUNT=3 \
> > test_perf "status" "
> > 	git status
> > "
> >
> > GIT_PERF_REPEAT_COUNT=1 \
> > test_perf "checkout other" "
> > 	git checkout other
> > "
> > ...
> >
> > The time reported in the summary for "XXXX.2 checkout other" would
> > be "min( checkout[1], status[2], status[3] )".
> >
> > We prevent that error by removing the test_times.* files at the end of
> > each test.
> 
> Well explained, and makes sense to me. I didn't know we set
> GIT_PERF_REPEAT_COUNT inline with the performance tests themselves, but
> grepping shows that we do it in the fsmonitor tests.
> 
> Dropping any test_times files makes sense as the right thing to do. I
> have no opinion on whether it should happen before running a perf test,
> or after generating the results. So what you did here looks good to me.

I think it's better to remove those files before running the perf
test, and leave them behind after the test finishes.  This would give
developers an opportunity to use the timing results for whatever other
statistics they might be interested in.
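
As an illustration of that alternative ordering (this is not the patch
as applied), the cleanup would move ahead of the timing loop and the
result computation would stay the last step:

  rm -f test_time.*	# drop leftovers from an earlier test_perf
  # ... the timed repetitions then write test_time.1 .. test_time.$GIT_PERF_REPEAT_COUNT ...
  "$TEST_DIRECTORY"/perf/min_time.perl test_time.* >"$base".result
  # no "rm test_time.*" afterwards, so the raw timings survive the run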

And yes, I think it would be better if 'make test' left behind
't/test-results' with all the test trace output for later analysis as
well.  E.g. grepping through the test logs can uncover bugs like this:

  https://public-inbox.org/git/20211010172809.1472914-1-szeder.dev@gmail.com/

and I've fixed several similar test bugs that I've found when looking
through 'test-results/*.out'.  Alas, it's always a bit of a hassle to
comment out 'make clean-except-prove-cache' in 't/Makefile'.
Jeff Hostetler Oct. 13, 2021, 9:09 p.m. UTC | #10
On 10/10/21 5:26 PM, SZEDER Gábor wrote:
> On Tue, Oct 05, 2021 at 01:45:03PM -0400, Taylor Blau wrote:
>> On Mon, Oct 04, 2021 at 10:29:03PM +0000, Jeff Hostetler via GitGitGadget wrote:
>>> From: Jeff Hostetler <jeffhost@microsoft.com>
>>>
>>> Teach test_perf_() to remove the temporary test_times.* files
>>
>> Small nit: s/test_times/test_time here and throughout.
>>
>>> at the end of each test.
>>>
>>> test_perf_() runs a particular GIT_PERF_REPEAT_COUNT times and creates
>>> ./test_times.[123...].  It then uses a perl script to find the minimum
>>> over "./test_times.*" (note the wildcard) and writes that time to
>>> "test-results/<testname>.<testnumber>.result".
>>>
>>> If the repeat count is changed during the pXXXX test script, stale
>>> test_times.* files (from previous steps) may be included in the min()
>>> computation.  For example:
>>>
>>> ...
>>> GIT_PERF_REPEAT_COUNT=3 \
>>> test_perf "status" "
>>> 	git status
>>> "
>>>
>>> GIT_PERF_REPEAT_COUNT=1 \
>>> test_perf "checkout other" "
>>> 	git checkout other
>>> "
>>> ...
>>>
>>> The time reported in the summary for "XXXX.2 checkout other" would
>>> be "min( checkout[1], status[2], status[3] )".
>>>
>>> We prevent that error by removing the test_times.* files at the end of
>>> each test.
>>
>> Well explained, and makes sense to me. I didn't know we set
>> GIT_PERF_REPEAT_COUNT inline with the performance tests themselves, but
>> grepping shows that we do it in the fsmonitor tests.
>>
>> Dropping any test_times files makes sense as the right thing to do. I
>> have no opinion on whether it should happen before running a perf test,
>> or after generating the results. So what you did here looks good to me.
> 
> I think it's better to remove those files before running the perf
> test, and leave them behind after the test finishes.  This would give
> developers an opportunity to use the timing results for whatever other
> statistics they might be interested in.

I could see doing it before.  I'd like to leave it as is for now.
Let's fix the correctness now and we can fine tune it later (with
your suggestion below).

That makes me wonder if it would be better to have the script
keep all of the test time values?  That is, create something like
test_time.$test_count.$test_seq.  Then you could look at all of
the timings over the whole test script, rather than just those of
the one where you stopped it.
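
A sketch of that idea ($test_count is the counter test-lib.sh keeps for
the current test number; the renaming is hypothetical, not an existing
feature), run after min_time.perl has written the .result file:

  for f in test_time.*
  do
  	mv "$f" "test_time.$test_count.${f#test_time.}"
  done

so the second test would leave behind test_time.2.1, test_time.2.2, ...
instead of having them overwritten or deleted.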


> 
> And yes, I think it would be better if 'make test' left behind
> 't/test-results' with all the test trace output for later analysis as
> well.  E.g. grepping through the test logs can uncover bugs like this:
> 
>    https://public-inbox.org/git/20211010172809.1472914-1-szeder.dev@gmail.com/
> 
> and I've fixed several similar test bugs that I've found when looking
> through 'test-results/*.out'.  Alas, it's always a bit of a hassle to
> comment out 'make clean-except-prove-cache' in 't/Makefile'.
> 

Jeff

Patch

diff --git a/t/perf/perf-lib.sh b/t/perf/perf-lib.sh
index f5ed092ee59..a1b5d2804dc 100644
--- a/t/perf/perf-lib.sh
+++ b/t/perf/perf-lib.sh
@@ -230,6 +230,7 @@  test_perf_ () {
 		test_ok_ "$1"
 	fi
 	"$TEST_DIRECTORY"/perf/min_time.perl test_time.* >"$base".result
+	rm test_time.*
 }
 
 test_perf () {