Message ID | pull.1051.git.1633386543759.gitgitgadget@gmail.com
---|---
State | Accepted
Commit | 473a26166c0a4d527b0568384bbacdb9e29b30d2
Series | t/perf/perf-lib.sh: remove test_times.* at the end test_perf_()
On Mon, Oct 04, 2021 at 10:29:03PM +0000, Jeff Hostetler via GitGitGadget wrote:
> From: Jeff Hostetler <jeffhost@microsoft.com>
>
> Teach test_perf_() to remove the temporary test_times.* files

Small nit: s/test_times/test_time here and throughout.

> at the end of each test.
>
> test_perf_() runs a particular test GIT_PERF_REPEAT_COUNT times and
> creates ./test_times.[123...]. It then uses a perl script to find the
> minimum over "./test_times.*" (note the wildcard) and writes that time
> to "test-results/<testname>.<testnumber>.result".
>
> If the repeat count is changed during the pXXXX test script, stale
> test_times.* files (from previous steps) may be included in the min()
> computation. For example:
>
>	...
>	GIT_PERF_REPEAT_COUNT=3 \
>	test_perf "status" "
>		git status
>	"
>
>	GIT_PERF_REPEAT_COUNT=1 \
>	test_perf "checkout other" "
>		git checkout other
>	"
>	...
>
> The time reported in the summary for "XXXX.2 checkout other" would
> be "min( checkout[1], status[2], status[3] )".
>
> We prevent that error by removing the test_times.* files at the end of
> each test.

Well explained, and makes sense to me. I didn't know we set
GIT_PERF_REPEAT_COUNT inline with the performance tests themselves, but
grepping shows that we do it in the fsmonitor tests.

Dropping any test_times files makes sense as the right thing to do. I
have no opinion on whether it should happen before running a perf test,
or after generating the results. So what you did here looks good to me.

An alternative approach might be to only read the test_time.n files we
know should exist based on GIT_PERF_REPEAT_COUNT, perhaps like:

	test_seq "$GIT_PERF_REPEAT_COUNT" |
	perl -lne 'print "test_time.$_"' |
	xargs "$TEST_DIRECTORY/perf/min_time.perl" >"$base".result

but I'm not convinced that the above is at all better than what you
wrote, since leaving the extra files around is the footgun we're trying
to avoid in the first place.

Thanks,
Taylor
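A minimal sketch of the same "only read the files we expect" idea
without the extra perl invocation, assuming min_time.perl keeps
accepting file names as arguments as it does today:

	# build an explicit list of timing files for this test's repeat count
	files=
	for i in $(test_seq "$GIT_PERF_REPEAT_COUNT")
	do
		files="$files test_time.$i"
	done
	# word splitting of $files is intentional here
	"$TEST_DIRECTORY"/perf/min_time.perl $files >"$base".result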
On Tue, Oct 05, 2021 at 01:45:03PM -0400, Taylor Blau wrote:

> >	GIT_PERF_REPEAT_COUNT=3 \
> >	test_perf "status" "
> >		git status
> >	"
> >
> >	GIT_PERF_REPEAT_COUNT=1 \
> >	test_perf "checkout other" "
> >		git checkout other
> >	"
> [...]
>
> Well explained, and makes sense to me. I didn't know we set
> GIT_PERF_REPEAT_COUNT inline with the performance tests themselves, but
> grepping shows that we do it in the fsmonitor tests.

Neither did I. IMHO that is a hack that we would do better to avoid, as
the point of it is to let the user drive the decision of time versus
quality of results. So the first example above is spending extra time
that the user may have asked us not to, and the second is getting less
significant results by not repeating the trial.

Presumably the issue in the second one is that the test modifies state.
The "right" solution there is to give test_perf() a way to set up the
state between trials (you can do it in the test_perf block, but you'd
want to avoid letting the setup step affect the timing).

I'd also note that:

	GIT_PERF_REPEAT_COUNT=1 \
	test_perf ...

in the commit message is a bad pattern. On some shells, the one-shot
variable before a function call will persist after the function returns
(so it would accidentally tweak the count for later tests, too).

All that said, I do think cleaning up the test_time files after each
test_perf is a good precaution, even if I don't think it's a good idea
in general to flip the REPEAT_COUNT variable in the middle of a test.

-Peff
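If a script really does need a different repeat count for a single
test, one way to sidestep the one-shot-assignment portability trap is
an explicit save/override/restore around the call; a sketch, reusing
the example from the commit message:

	# save the user's setting, override it for one test, then restore it
	saved_count=$GIT_PERF_REPEAT_COUNT
	GIT_PERF_REPEAT_COUNT=1
	test_perf "checkout other" "
		git checkout other
	"
	GIT_PERF_REPEAT_COUNT=$saved_count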
On Wed, Oct 06, 2021 at 03:24:07PM -0400, Jeff King wrote:

> All that said, I do think cleaning up the test_time files after each
> test_perf is a good precaution, even if I don't think it's a good idea
> in general to flip the REPEAT_COUNT variable in the middle of a test.

Thanks for putting it so concisely; I couldn't agree more.

Thanks,
Taylor
On 10/6/21 3:24 PM, Jeff King wrote:
> On Tue, Oct 05, 2021 at 01:45:03PM -0400, Taylor Blau wrote:
>
>>>	GIT_PERF_REPEAT_COUNT=3 \
>>>	test_perf "status" "
>>>		git status
>>>	"
>>>
>>>	GIT_PERF_REPEAT_COUNT=1 \
>>>	test_perf "checkout other" "
>>>		git checkout other
>>>	"
>> [...]
>>
>> Well explained, and makes sense to me. I didn't know we set
>> GIT_PERF_REPEAT_COUNT inline with the performance tests themselves, but
>> grepping shows that we do it in the fsmonitor tests.
>
> Neither did I. IMHO that is a hack that we would do better to avoid, as
> the point of it is to let the user drive the decision of time versus
> quality of results. So the first example above is spending extra time
> that the user may have asked us not to, and the second is getting less
> significant results by not repeating the trial.
>
> Presumably the issue in the second one is that the test modifies state.
> The "right" solution there is to give test_perf() a way to set up the
> state between trials (you can do it in the test_perf block, but you'd
> want to avoid letting the setup step affect the timing).
>
> I'd also note that:
>
>	GIT_PERF_REPEAT_COUNT=1 \
>	test_perf ...
>
> in the commit message is a bad pattern. On some shells, the one-shot
> variable before a function call will persist after the function returns
> (so it would accidentally tweak the count for later tests, too).
>
> All that said, I do think cleaning up the test_time files after each
> test_perf is a good precaution, even if I don't think it's a good idea
> in general to flip the REPEAT_COUNT variable in the middle of a test.
>
> -Peff

Yeah, I don't think I want to keep switching the value of _REPEAT_COUNT
in the body of the test. (It did feel a little "against the spirit" of
the framework.) I'm in the process of redoing the test to not need that.

There's a problem with the perf test assumptions here, and I'm curious
whether there's a better way to use the perf-lib that I'm not thinking
of.

When working with big repos (in this case 100K files), the actual
checkout takes 33 seconds, but the repetitions are fast -- since they
just print a warning and stop. (In the 1M file case that number is ~7
minutes for the first instance.) With the code in min_time.perl
silently taking the min() of the runs, it looks like the checkout was
really fast when it wasn't. That fact gets hidden in the summary report
printed at the end.

	$ time ~/work/core/git checkout p0006-ballast
	Updating files: 100% (100000/100000), done.
	Switched to branch 'p0006-ballast'

	real	0m33.510s
	user	0m2.757s
	sys	0m15.565s

	$ time ~/work/core/git checkout p0006-ballast
	Already on 'p0006-ballast'

	real	0m0.745s
	user	0m0.214s
	sys	0m4.705s

	$ time ~/work/core/git checkout p0006-ballast
	Already on 'p0006-ballast'

	real	0m0.738s
	user	0m0.134s
	sys	0m6.850s

I could use test_expect_success() for anything that does want to change
state, and then save test_perf() for status calls and other read-only
tests, but I think we lose some opportunities there.

I'm open to suggestions here.

Thanks,
Jeff
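The test_expect_success()/test_perf() split mentioned above could look
roughly like this sketch (branch names reused from the examples earlier
in the thread):

	# the state change happens once, untimed, in a regular test...
	test_expect_success 'checkout ballast branch' '
		git checkout p0006-ballast
	'

	# ...and only read-only commands are timed across repetitions
	test_perf 'status on ballast branch' '
		git status
	'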
On Thu, Oct 07, 2021 at 01:49:15PM -0400, Jeff Hostetler wrote:

> Yeah, I don't think I want to keep switching the value of _REPEAT_COUNT
> in the body of the test. (It did feel a little "against the spirit" of
> the framework.) I'm in the process of redoing the test to not need
> that.

Sounds good to me. :)

> There's a problem with the perf test assumptions here, and I'm curious
> whether there's a better way to use the perf-lib that I'm not thinking
> of.
>
> When working with big repos (in this case 100K files), the actual
> checkout takes 33 seconds, but the repetitions are fast -- since they
> just print a warning and stop. (In the 1M file case that number is ~7
> minutes for the first instance.) With the code in min_time.perl
> silently taking the min() of the runs, it looks like the checkout was
> really fast when it wasn't. That fact gets hidden in the summary report
> printed at the end.

Right. So your option now is basically something like:

	test_perf 'checkout' '
		git reset --hard the_original_state &&
		git checkout
	'

I.e., reset the state _and_ perform the operation you want to time,
within a single test_perf(). (Here it's a literal "git reset", but
obviously it could be anything that adjusts the state back to the
baseline.) But that reset operation will get lumped into your timing.
That might not matter if it's comparatively cheap, but it might throw
off all your results.

What I'd propose instead is that we ought to have:

	test_perf 'checkout' --prepare '
		git reset --hard the_original_state
	' '
		git checkout
	'

Having two multi-line snippets is a bit ugly (check out that awful
middle line), but I think this could be added without breaking existing
tests (they just wouldn't have a --prepare option).

If that syntax is too horrendous, we could have:

	# this saves the snippet in a variable internally, and runs
	# it before each trial of the next test_perf(), after which
	# it is discarded
	test_perf_prepare '
		git reset --hard the_original_state
	'

	test_perf 'checkout' '
		git checkout
	'

I think that would be pretty easy to implement, and would solve the
most common form of this problem. And there's plenty of prior art; just
about every decent benchmarking system has a "do this before each
trial" mechanism. Our t/perf suite (as you probably noticed) is rather
more ad-hoc and less mature.

There are cases it doesn't help, though. For instance, in one of the
scripts we measure the time to run "git repack -adb" to generate
bitmaps. But the first run has to do more work, because we can reuse
results for subsequent ones! It would help to "rm -f
objects/pack/*.bitmap", but even that's not entirely fair, as it will
be repacking from a single pack, versus whatever state we started with.

And there I think the whole "take the best run" strategy is hampering
us. These inaccuracies in our timings go unseen, because we don't do
any statistical analysis of the results. Whereas a tool like hyperfine
(for example) will run trials until the mean stabilizes, and then let
you know when there were trials outside of a standard deviation.

I know we're hesitant to introduce dependencies, but I do wonder if we
could have much higher quality perf results if we accepted a dependency
on a tool like that. I'd never want that for the regular test suite,
but my feelings for the perf suite are much looser. I suspect not many
people run it at all, and its main utility is showing off improvements
and looking for broad regressions. It's possible somebody would want to
track down a performance change on a specific obscure platform, but in
general I'd suspect they'd be much better off timing things manually in
such a case.

So there. That was probably more than you wanted to hear, and further
than you want to go right now. In the near term, for the tests you're
interested in, something like the "prepare" feature I outlined above
would probably not be too hard to add, and would address your immediate
problem.

-Peff
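A rough sketch of how the test_perf_prepare variant could hang off the
existing perf-lib.sh machinery (the helper and variable names here are
illustrative, and the hook into the timing loop is only indicated in
comments):

	test_perf_prepare () {
		# remember the snippet; it applies to the next test_perf() only
		test_perf_prepare_=$1
	}

	# inside test_perf_()'s per-trial loop, before the timed run:
	#
	#	if test -n "$test_perf_prepare_"
	#	then
	#		test_eval_ "$test_perf_prepare_" || break
	#	fi
	#	... run and time the test body as today ...
	#
	# and at the end of test_perf_(), discard the snippet again:
	#
	#	test_perf_prepare_=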
On Thu, Oct 07 2021, Jeff King wrote:

> And there I think the whole "take the best run" strategy is hampering
> us. These inaccuracies in our timings go unseen, because we don't do
> any statistical analysis of the results. Whereas a tool like hyperfine
> (for example) will run trials until the mean stabilizes, and then let
> you know when there were trials outside of a standard deviation.
>
> I know we're hesitant to introduce dependencies, but I do wonder if we
> could have much higher quality perf results if we accepted a dependency
> on a tool like that. I'd never want that for the regular test suite,
> but my feelings for the perf suite are much looser. I suspect not many
> people run it at all, and its main utility is showing off improvements
> and looking for broad regressions. It's possible somebody would want to
> track down a performance change on a specific obscure platform, but in
> general I'd suspect they'd be much better off timing things manually in
> such a case.
>
> So there. That was probably more than you wanted to hear, and further
> than you want to go right now. In the near term, for the tests you're
> interested in, something like the "prepare" feature I outlined above
> would probably not be too hard to add, and would address your immediate
> problem.

I'd really like that; as you point out, the statistics in t/perf are
quite bad right now.

A tool like hyperfine is ultimately generalized (for the purposes of
the test suite) as something that can run templated code with labels.
If anyone cared, I don't see why we couldn't ship a hyperfine-fallback.pl
or whatever that accepted the same parameters and ran our current (and
worse) end-to-end statistics.

If that is something you're inclined to work on and are taking requests
:), it would be really nice if t/perf could, say, emit a one-off
Makefile and run the tests via that, rather than the one-off nproc=1
./run script we've got now. With the same sort of templating a
"hyperfine" invocation would need (and some prep/teardown phases), that
would make it easy to run perf tests in parallel across all cores, or
even across N machines.
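For reference, hyperfine already ships the per-trial preparation hook
being alluded to; a hedged example of the kind of invocation such a
runner could emit (the repository path and branch names are made up for
illustration):

	# time "git checkout", resetting the worktree before every trial
	hyperfine \
		--warmup 1 \
		--prepare 'git -C big-repo checkout -f main' \
		'git -C big-repo checkout other'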
Jeff King <peff@peff.net> writes:

> What I'd propose instead is that we ought to have:
>
>	test_perf 'checkout' --prepare '
>		git reset --hard the_original_state
>	' '
>		git checkout
>	'
>
> Having two multi-line snippets is a bit ugly (check out that awful
> middle line), but I think this could be added without breaking existing
> tests (they just wouldn't have a --prepare option).
>
> If that syntax is too horrendous, we could have:
>
>	# this saves the snippet in a variable internally, and runs
>	# it before each trial of the next test_perf(), after which
>	# it is discarded
>	test_perf_prepare '
>		git reset --hard the_original_state
>	'
>
>	test_perf 'checkout' '
>		git checkout
>	'
>
> I think that would be pretty easy to implement, and would solve the
> most common form of this problem. And there's plenty of prior art; just
> about every decent benchmarking system has a "do this before each
> trial" mechanism. Our t/perf suite (as you probably noticed) is rather
> more ad-hoc and less mature.

Nice.

> There are cases it doesn't help, though. For instance, in one of the
> scripts we measure the time to run "git repack -adb" to generate
> bitmaps. But the first run has to do more work, because we can reuse
> results for subsequent ones! It would help to "rm -f
> objects/pack/*.bitmap", but even that's not entirely fair, as it will
> be repacking from a single pack, versus whatever state we started with.

You need a "do this too for each iteration but do not time it", i.e.

	test_perf 'repack performance' --prepare '
		make a messy original repository
	' --per-iteration-prepare '
		prepare a test repository from the messy original
	' --time-this-part-only '
		git repack -adb
	'

Syntactically, eh, yuck.
On Fri, Oct 08, 2021 at 10:30:09AM -0700, Junio C Hamano wrote:

> > There are cases it doesn't help, though. For instance, in one of the
> > scripts we measure the time to run "git repack -adb" to generate
> > bitmaps. But the first run has to do more work, because we can reuse
> > results for subsequent ones! It would help to "rm -f
> > objects/pack/*.bitmap", but even that's not entirely fair, as it will
> > be repacking from a single pack, versus whatever state we started with.
>
> You need a "do this too for each iteration but do not time it", i.e.
>
>	test_perf 'repack performance' --prepare '
>		make a messy original repository
>	' --per-iteration-prepare '
>		prepare a test repository from the messy original
>	' --time-this-part-only '
>		git repack -adb
>	'
>
> Syntactically, eh, yuck.

If any step doesn't need to be per-iteration, you can do it in a
separate test_expect_success block. So --prepare would always be
per-iteration, I think.

The tricky thing I meant to highlight in that example is that the
preparation step is non-obvious, since the timed command actually
throws away the original state. But we can probably stash away what we
need. With the test_perf_prepare I showed earlier, maybe:

	test_expect_success 'stash original state' '
		cp -al objects objects.orig
	'

	test_perf_prepare 'set up original state' '
		rm -rf objects &&
		cp -al objects.orig objects
	'

	test_perf 'repack' '
		git repack -adb
	'

which is not too bad. Of course it is made a lot easier by my use of
the unportable "cp -l", but you get the general idea. ;)

This is all a rough sketch anyway, and not something I plan on working
on anytime soon. For Jeff's case, the preparation would hopefully just
be some sequence of reset/read-tree/etc. to manipulate the index and
working tree back to the original state.

-Peff
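For the checkout case, that preparation could be as simple as the
following sketch (again assuming a test_perf_prepare helper along the
lines discussed above; the branch names are illustrative):

	# put the index and working tree back where the timed test expects them
	test_perf_prepare '
		git checkout -f master &&
		git reset --hard
	'

	test_perf 'checkout other' '
		git checkout other
	'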
On Tue, Oct 05, 2021 at 01:45:03PM -0400, Taylor Blau wrote:
> On Mon, Oct 04, 2021 at 10:29:03PM +0000, Jeff Hostetler via GitGitGadget wrote:
> > From: Jeff Hostetler <jeffhost@microsoft.com>
> >
> > Teach test_perf_() to remove the temporary test_times.* files
>
> Small nit: s/test_times/test_time here and throughout.
>
> > at the end of each test.
> >
> > test_perf_() runs a particular test GIT_PERF_REPEAT_COUNT times and
> > creates ./test_times.[123...]. It then uses a perl script to find the
> > minimum over "./test_times.*" (note the wildcard) and writes that time
> > to "test-results/<testname>.<testnumber>.result".
> >
> > If the repeat count is changed during the pXXXX test script, stale
> > test_times.* files (from previous steps) may be included in the min()
> > computation. For example:
> >
> >	...
> >	GIT_PERF_REPEAT_COUNT=3 \
> >	test_perf "status" "
> >		git status
> >	"
> >
> >	GIT_PERF_REPEAT_COUNT=1 \
> >	test_perf "checkout other" "
> >		git checkout other
> >	"
> >	...
> >
> > The time reported in the summary for "XXXX.2 checkout other" would
> > be "min( checkout[1], status[2], status[3] )".
> >
> > We prevent that error by removing the test_times.* files at the end of
> > each test.
>
> Well explained, and makes sense to me. I didn't know we set
> GIT_PERF_REPEAT_COUNT inline with the performance tests themselves, but
> grepping shows that we do it in the fsmonitor tests.
>
> Dropping any test_times files makes sense as the right thing to do. I
> have no opinion on whether it should happen before running a perf test,
> or after generating the results. So what you did here looks good to me.

I think it's better to remove those files before running the perf
test, and leave them behind after the test has finished. This would
give developers an opportunity to use the timing results for whatever
other statistics they might be interested in.

And yes, I think it would be better if 'make test' left behind
't/test-results' with all the test trace output for later analysis as
well. E.g. grepping through the test logs can uncover bugs like this:

  https://public-inbox.org/git/20211010172809.1472914-1-szeder.dev@gmail.com/

and I've fixed several similar test bugs that I've found when looking
through 'test-results/*.out'. Alas, it's always a bit of a hassle to
comment out 'make clean-except-prove-cache' in 't/Makefile'.
On 10/10/21 5:26 PM, SZEDER Gábor wrote:
> On Tue, Oct 05, 2021 at 01:45:03PM -0400, Taylor Blau wrote:
>> On Mon, Oct 04, 2021 at 10:29:03PM +0000, Jeff Hostetler via GitGitGadget wrote:
>>> [...]
>>>
>>> We prevent that error by removing the test_times.* files at the end of
>>> each test.
>>
>> [...]
>>
>> Dropping any test_times files makes sense as the right thing to do. I
>> have no opinion on whether it should happen before running a perf test,
>> or after generating the results. So what you did here looks good to me.
>
> I think it's better to remove those files before running the perf
> test, and leave them behind after the test has finished. This would
> give developers an opportunity to use the timing results for whatever
> other statistics they might be interested in.

I could see doing it before, but I'd like to leave it as is for now.
Let's fix the correctness problem now; we can fine-tune it later (with
your suggestion below).

That makes me wonder whether it would be better to have the script keep
all of the test time values. That is, create something like
test_time.$test_count.$test_seq. Then you could look at all of the
timings over the whole test script, rather than just those of the one
where you stopped it.

> And yes, I think it would be better if 'make test' left behind
> 't/test-results' with all the test trace output for later analysis as
> well. E.g. grepping through the test logs can uncover bugs like this:
>
>   https://public-inbox.org/git/20211010172809.1472914-1-szeder.dev@gmail.com/
>
> and I've fixed several similar test bugs that I've found when looking
> through 'test-results/*.out'. Alas, it's always a bit of a hassle to
> comment out 'make clean-except-prove-cache' in 't/Makefile'.

Jeff
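A sketch of that "keep all the timing values" idea: instead of deleting
the raw files, rename them under the per-test results prefix so a later
test cannot pick them up by accident (here "$base" is assumed to be the
same prefix the min_time.perl output already uses; the naming is
illustrative):

	# after min_time.perl has produced "$base".result, keep the raw
	# timings under a per-test name instead of deleting them
	for t in test_time.*
	do
		mv "$t" "$base.$t" || break
	done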
diff --git a/t/perf/perf-lib.sh b/t/perf/perf-lib.sh
index f5ed092ee59..a1b5d2804dc 100644
--- a/t/perf/perf-lib.sh
+++ b/t/perf/perf-lib.sh
@@ -230,6 +230,7 @@ test_perf_ () {
 		test_ok_ "$1"
 	fi
 	"$TEST_DIRECTORY"/perf/min_time.perl test_time.* >"$base".result
+	rm test_time.*
 }
 
 test_perf () {
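For completeness, the "remove before running, keep afterwards" variant
discussed above would amount to a sketch like this instead of the
cleanup at the end:

	# at the top of test_perf_(), before the GIT_PERF_REPEAT_COUNT loop,
	# drop any stale timings left over from a previous test
	rm -f test_time.*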