mbox series

[0/2] Set GIT_TRACE2_EVENT_NESTING in test-lib.sh

Message ID pull.1085.git.1638193666.gitgitgadget@gmail.com (mailing list archive)
Headers show
Series Set GIT_TRACE2_EVENT_NESTING in test-lib.sh | expand

Message

Philippe Blain via GitGitGadget Nov. 29, 2021, 1:47 p.m. UTC
As reported by Ævar [1] and diagnosed by Peff in a reply, the default
GIT_TRACE2_EVENT_NESTING is set so low that tests that look for trace2
events can start failing due to an added trace2 region. This can even be
subtle, including the fact that the progress API adds trace2 regions when in
use (which can depend on the verbose output of a test script).

Thanks, -Stolee

Derrick Stolee (2):
  test-lib.sh: set GIT_TRACE2_EVENT_NESTING
  t/t*: remove custom GIT_TRACE2_EVENT_NESTING

 t/t1092-sparse-checkout-compatibility.sh | 6 +++---
 t/t4216-log-bloom.sh                     | 2 --
 t/t5310-pack-bitmaps.sh                  | 2 +-
 t/t5705-session-id-in-capabilities.sh    | 2 --
 t/t7519-status-fsmonitor.sh              | 2 +-
 t/test-lib.sh                            | 7 +++++++
 6 files changed, 12 insertions(+), 9 deletions(-)


base-commit: 35151cf0720460a897cde9b8039af364743240e7
Published-As: https://github.com/gitgitgadget/git/releases/tag/pr-1085%2Fderrickstolee%2Ftrace2-depth-v1
Fetch-It-Via: git fetch https://github.com/gitgitgadget/git pr-1085/derrickstolee/trace2-depth-v1
Pull-Request: https://github.com/gitgitgadget/git/pull/1085

Comments

Jeff King Nov. 29, 2021, 6:40 p.m. UTC | #1
On Mon, Nov 29, 2021 at 01:47:44PM +0000, Derrick Stolee via GitGitGadget wrote:

> As reported by Ævar [1] and diagnosed by Peff in a reply, the default
> GIT_TRACE2_EVENT_NESTING is set so low that tests that look for trace2
> events can start failing due to an added trace2 region. This can even be
> subtle, including the fact that the progress API adds trace2 regions when in
> use (which can depend on the verbose output of a test script).

I think this is a good change for fixing the immediate problem of the
test suite failing.

But I have to wonder if this is masking a problem that real users will
see. Aren't we silently discarding trace2 output that could be useful to
somebody debugging or trying to get performance metrics?

I.e., should we be bumping the internal nesting max of 2 to something
higher? If that would that cause problems with existing metrics, should
we be designing new metrics to avoid nesting?

-Peff
Jeff King Nov. 29, 2021, 6:44 p.m. UTC | #2
On Mon, Nov 29, 2021 at 01:40:00PM -0500, Jeff King wrote:

> On Mon, Nov 29, 2021 at 01:47:44PM +0000, Derrick Stolee via GitGitGadget wrote:
> 
> > As reported by Ævar [1] and diagnosed by Peff in a reply, the default
> > GIT_TRACE2_EVENT_NESTING is set so low that tests that look for trace2
> > events can start failing due to an added trace2 region. This can even be
> > subtle, including the fact that the progress API adds trace2 regions when in
> > use (which can depend on the verbose output of a test script).
> 
> I think this is a good change for fixing the immediate problem of the
> test suite failing.
> 
> But I have to wonder if this is masking a problem that real users will
> see. Aren't we silently discarding trace2 output that could be useful to
> somebody debugging or trying to get performance metrics?
> 
> I.e., should we be bumping the internal nesting max of 2 to something
> higher? If that would that cause problems with existing metrics, should
> we be designing new metrics to avoid nesting?

One thing I should have added to the first paragraph: the patches
themselves look fine and I'm OK with this as a band-aid in the short
term.

-Peff
Taylor Blau Nov. 30, 2021, 12:04 a.m. UTC | #3
On Mon, Nov 29, 2021 at 01:44:59PM -0500, Jeff King wrote:
> On Mon, Nov 29, 2021 at 01:40:00PM -0500, Jeff King wrote:
>
> > On Mon, Nov 29, 2021 at 01:47:44PM +0000, Derrick Stolee via GitGitGadget wrote:
> >
> > > As reported by Ævar [1] and diagnosed by Peff in a reply, the default
> > > GIT_TRACE2_EVENT_NESTING is set so low that tests that look for trace2
> > > events can start failing due to an added trace2 region. This can even be
> > > subtle, including the fact that the progress API adds trace2 regions when in
> > > use (which can depend on the verbose output of a test script).
> >
> > I think this is a good change for fixing the immediate problem of the
> > test suite failing.
> >
> > But I have to wonder if this is masking a problem that real users will
> > see. Aren't we silently discarding trace2 output that could be useful to
> > somebody debugging or trying to get performance metrics?
> >
> > I.e., should we be bumping the internal nesting max of 2 to something
> > higher? If that would that cause problems with existing metrics, should
> > we be designing new metrics to avoid nesting?
>
> One thing I should have added to the first paragraph: the patches
> themselves look fine and I'm OK with this as a band-aid in the short
> term.

Yeah, I agree. This test in particular must have bad luck, because I
noticed the same failure in GitHub's fork when merging with 2.33 (but
due to some custom trace2 regions causing us to blow past the limit even
without `-v`).

It might be nice to allow setting 'GIT_TRACE2_EVENT_NESTING=false' or
something similar, but let's deal with that elsewhere.

hanks,
Taylor
Derrick Stolee Nov. 30, 2021, 3:34 p.m. UTC | #4
On 11/29/2021 1:40 PM, Jeff King wrote:
> On Mon, Nov 29, 2021 at 01:47:44PM +0000, Derrick Stolee via GitGitGadget wrote:
> 
>> As reported by Ævar [1] and diagnosed by Peff in a reply, the default
>> GIT_TRACE2_EVENT_NESTING is set so low that tests that look for trace2
>> events can start failing due to an added trace2 region. This can even be
>> subtle, including the fact that the progress API adds trace2 regions when in
>> use (which can depend on the verbose output of a test script).
> 
> I think this is a good change for fixing the immediate problem of the
> test suite failing.
> 
> But I have to wonder if this is masking a problem that real users will
> see. Aren't we silently discarding trace2 output that could be useful to
> somebody debugging or trying to get performance metrics?
> 
> I.e., should we be bumping the internal nesting max of 2 to something
> higher? If that would that cause problems with existing metrics, should
> we be designing new metrics to avoid nesting?

One thing this makes me think about is that we might want to
have items such as trace2_data_intmax() and trace2_data_string()
not be subject to the nesting limit. Only have the nesting apply
to the regions (which nest).

CC'ing Jeff Hostetler for his thoughts on that idea. We will have
a discussion offline to see if the progress regions have started
to cause inconsistencies in our existing telemetry stream of
internal users.

Thanks,
-Stolee
Jeff Hostetler Nov. 30, 2021, 10:43 p.m. UTC | #5
On 11/30/21 10:34 AM, Derrick Stolee wrote:
> On 11/29/2021 1:40 PM, Jeff King wrote:
>> On Mon, Nov 29, 2021 at 01:47:44PM +0000, Derrick Stolee via GitGitGadget wrote:
>>
>>> As reported by Ævar [1] and diagnosed by Peff in a reply, the default
>>> GIT_TRACE2_EVENT_NESTING is set so low that tests that look for trace2
>>> events can start failing due to an added trace2 region. This can even be
>>> subtle, including the fact that the progress API adds trace2 regions when in
>>> use (which can depend on the verbose output of a test script).
>>
>> I think this is a good change for fixing the immediate problem of the
>> test suite failing.
>>
>> But I have to wonder if this is masking a problem that real users will
>> see. Aren't we silently discarding trace2 output that could be useful to
>> somebody debugging or trying to get performance metrics?
>>
>> I.e., should we be bumping the internal nesting max of 2 to something
>> higher? If that would that cause problems with existing metrics, should
>> we be designing new metrics to avoid nesting?
> 
> One thing this makes me think about is that we might want to
> have items such as trace2_data_intmax() and trace2_data_string()
> not be subject to the nesting limit. Only have the nesting apply
> to the regions (which nest).
> 
> CC'ing Jeff Hostetler for his thoughts on that idea. We will have
> a discussion offline to see if the progress regions have started
> to cause inconsistencies in our existing telemetry stream of
> internal users.
> 
> Thanks,
> -Stolee
> 


yeah, I set a fairly conservative nesting limit.  my fear was that
someone would add a region or data event to something that recurses
deeply -- like treewalking or directory scanning -- which in a monorepo
would be bad since we might get millions of rows of data.

most of the useful timing data comes from the higher level regions,
such as we spent x% scanning tracked files and y% scanning untracked
files.  once we get down inside a particular "concept", the directory-
by-directory numbers aren't usually that useful (and/or the overhead of
logging every directory we visit dominates).

WRT allowing data events to ignore the nesting limit, we have the
same monorepo risk if for example someone logs how many files were
in each directory visited, the name of every untracked file, or every
file that had some property of interest.  That might look fine on a
shallow WT, but might generate an unusable amount of data on a large
repo.  so i'd recommend we proceed carefully here.  (i'm not against
it, i'm just saying we should think about it.)


using trace2 in a test to confirm that a piece of code was visited
or not visited is cool, but something i don't think i anticipated.
so i agree that we should increase the limit in the test suite.


and yes, the addition of trace2 calls inside the progress code
complicates the nesting.  I'm wondering if the progress code should
always emit the trace2 events, regardless of whether the progress msg
itself appears on the user's console.  That way we have consistent
logging -- the whole point of adding the region in the progress
code was to bracket the time spent doing the thing that needed
progress messaging after all.  the fact that you don't have a console
doesn't change the fact that there's an expensive computation there.

(i'd also consider changing the "total_objects" data event to include
include the region title so that they are more easily identified in a
grep, but I digress.)


Thanks,
Jeff
Jeff King Dec. 1, 2021, 7:46 p.m. UTC | #6
On Tue, Nov 30, 2021 at 05:43:28PM -0500, Jeff Hostetler wrote:

> and yes, the addition of trace2 calls inside the progress code
> complicates the nesting.  I'm wondering if the progress code should
> always emit the trace2 events, regardless of whether the progress msg
> itself appears on the user's console.  That way we have consistent
> logging -- the whole point of adding the region in the progress
> code was to bracket the time spent doing the thing that needed
> progress messaging after all.  the fact that you don't have a console
> doesn't change the fact that there's an expensive computation there.

Yeah, that would at least make these things deterministic, and we'd
notice a nesting problem in the tests.

I do think bumping the default nesting level a few notches may be worth
it. Even if we didn't get rid of it entirely (and I am sympathetic to
the notion that recursive operations could produce gigantic logs), that
would give some headroom for reasonable levels of logging.

-Peff