Message ID | pull.1085.git.1638193666.gitgitgadget@gmail.com (mailing list archive) |
---|---|
Headers | show |
Series | Set GIT_TRACE2_EVENT_NESTING in test-lib.sh | expand |
On Mon, Nov 29, 2021 at 01:47:44PM +0000, Derrick Stolee via GitGitGadget wrote: > As reported by Ævar [1] and diagnosed by Peff in a reply, the default > GIT_TRACE2_EVENT_NESTING is set so low that tests that look for trace2 > events can start failing due to an added trace2 region. This can even be > subtle, including the fact that the progress API adds trace2 regions when in > use (which can depend on the verbose output of a test script). I think this is a good change for fixing the immediate problem of the test suite failing. But I have to wonder if this is masking a problem that real users will see. Aren't we silently discarding trace2 output that could be useful to somebody debugging or trying to get performance metrics? I.e., should we be bumping the internal nesting max of 2 to something higher? If that would that cause problems with existing metrics, should we be designing new metrics to avoid nesting? -Peff
On Mon, Nov 29, 2021 at 01:40:00PM -0500, Jeff King wrote: > On Mon, Nov 29, 2021 at 01:47:44PM +0000, Derrick Stolee via GitGitGadget wrote: > > > As reported by Ævar [1] and diagnosed by Peff in a reply, the default > > GIT_TRACE2_EVENT_NESTING is set so low that tests that look for trace2 > > events can start failing due to an added trace2 region. This can even be > > subtle, including the fact that the progress API adds trace2 regions when in > > use (which can depend on the verbose output of a test script). > > I think this is a good change for fixing the immediate problem of the > test suite failing. > > But I have to wonder if this is masking a problem that real users will > see. Aren't we silently discarding trace2 output that could be useful to > somebody debugging or trying to get performance metrics? > > I.e., should we be bumping the internal nesting max of 2 to something > higher? If that would that cause problems with existing metrics, should > we be designing new metrics to avoid nesting? One thing I should have added to the first paragraph: the patches themselves look fine and I'm OK with this as a band-aid in the short term. -Peff
On Mon, Nov 29, 2021 at 01:44:59PM -0500, Jeff King wrote: > On Mon, Nov 29, 2021 at 01:40:00PM -0500, Jeff King wrote: > > > On Mon, Nov 29, 2021 at 01:47:44PM +0000, Derrick Stolee via GitGitGadget wrote: > > > > > As reported by Ævar [1] and diagnosed by Peff in a reply, the default > > > GIT_TRACE2_EVENT_NESTING is set so low that tests that look for trace2 > > > events can start failing due to an added trace2 region. This can even be > > > subtle, including the fact that the progress API adds trace2 regions when in > > > use (which can depend on the verbose output of a test script). > > > > I think this is a good change for fixing the immediate problem of the > > test suite failing. > > > > But I have to wonder if this is masking a problem that real users will > > see. Aren't we silently discarding trace2 output that could be useful to > > somebody debugging or trying to get performance metrics? > > > > I.e., should we be bumping the internal nesting max of 2 to something > > higher? If that would that cause problems with existing metrics, should > > we be designing new metrics to avoid nesting? > > One thing I should have added to the first paragraph: the patches > themselves look fine and I'm OK with this as a band-aid in the short > term. Yeah, I agree. This test in particular must have bad luck, because I noticed the same failure in GitHub's fork when merging with 2.33 (but due to some custom trace2 regions causing us to blow past the limit even without `-v`). It might be nice to allow setting 'GIT_TRACE2_EVENT_NESTING=false' or something similar, but let's deal with that elsewhere. hanks, Taylor
On 11/29/2021 1:40 PM, Jeff King wrote: > On Mon, Nov 29, 2021 at 01:47:44PM +0000, Derrick Stolee via GitGitGadget wrote: > >> As reported by Ævar [1] and diagnosed by Peff in a reply, the default >> GIT_TRACE2_EVENT_NESTING is set so low that tests that look for trace2 >> events can start failing due to an added trace2 region. This can even be >> subtle, including the fact that the progress API adds trace2 regions when in >> use (which can depend on the verbose output of a test script). > > I think this is a good change for fixing the immediate problem of the > test suite failing. > > But I have to wonder if this is masking a problem that real users will > see. Aren't we silently discarding trace2 output that could be useful to > somebody debugging or trying to get performance metrics? > > I.e., should we be bumping the internal nesting max of 2 to something > higher? If that would that cause problems with existing metrics, should > we be designing new metrics to avoid nesting? One thing this makes me think about is that we might want to have items such as trace2_data_intmax() and trace2_data_string() not be subject to the nesting limit. Only have the nesting apply to the regions (which nest). CC'ing Jeff Hostetler for his thoughts on that idea. We will have a discussion offline to see if the progress regions have started to cause inconsistencies in our existing telemetry stream of internal users. Thanks, -Stolee
On 11/30/21 10:34 AM, Derrick Stolee wrote: > On 11/29/2021 1:40 PM, Jeff King wrote: >> On Mon, Nov 29, 2021 at 01:47:44PM +0000, Derrick Stolee via GitGitGadget wrote: >> >>> As reported by Ævar [1] and diagnosed by Peff in a reply, the default >>> GIT_TRACE2_EVENT_NESTING is set so low that tests that look for trace2 >>> events can start failing due to an added trace2 region. This can even be >>> subtle, including the fact that the progress API adds trace2 regions when in >>> use (which can depend on the verbose output of a test script). >> >> I think this is a good change for fixing the immediate problem of the >> test suite failing. >> >> But I have to wonder if this is masking a problem that real users will >> see. Aren't we silently discarding trace2 output that could be useful to >> somebody debugging or trying to get performance metrics? >> >> I.e., should we be bumping the internal nesting max of 2 to something >> higher? If that would that cause problems with existing metrics, should >> we be designing new metrics to avoid nesting? > > One thing this makes me think about is that we might want to > have items such as trace2_data_intmax() and trace2_data_string() > not be subject to the nesting limit. Only have the nesting apply > to the regions (which nest). > > CC'ing Jeff Hostetler for his thoughts on that idea. We will have > a discussion offline to see if the progress regions have started > to cause inconsistencies in our existing telemetry stream of > internal users. > > Thanks, > -Stolee > yeah, I set a fairly conservative nesting limit. my fear was that someone would add a region or data event to something that recurses deeply -- like treewalking or directory scanning -- which in a monorepo would be bad since we might get millions of rows of data. most of the useful timing data comes from the higher level regions, such as we spent x% scanning tracked files and y% scanning untracked files. once we get down inside a particular "concept", the directory- by-directory numbers aren't usually that useful (and/or the overhead of logging every directory we visit dominates). WRT allowing data events to ignore the nesting limit, we have the same monorepo risk if for example someone logs how many files were in each directory visited, the name of every untracked file, or every file that had some property of interest. That might look fine on a shallow WT, but might generate an unusable amount of data on a large repo. so i'd recommend we proceed carefully here. (i'm not against it, i'm just saying we should think about it.) using trace2 in a test to confirm that a piece of code was visited or not visited is cool, but something i don't think i anticipated. so i agree that we should increase the limit in the test suite. and yes, the addition of trace2 calls inside the progress code complicates the nesting. I'm wondering if the progress code should always emit the trace2 events, regardless of whether the progress msg itself appears on the user's console. That way we have consistent logging -- the whole point of adding the region in the progress code was to bracket the time spent doing the thing that needed progress messaging after all. the fact that you don't have a console doesn't change the fact that there's an expensive computation there. (i'd also consider changing the "total_objects" data event to include include the region title so that they are more easily identified in a grep, but I digress.) Thanks, Jeff
On Tue, Nov 30, 2021 at 05:43:28PM -0500, Jeff Hostetler wrote: > and yes, the addition of trace2 calls inside the progress code > complicates the nesting. I'm wondering if the progress code should > always emit the trace2 events, regardless of whether the progress msg > itself appears on the user's console. That way we have consistent > logging -- the whole point of adding the region in the progress > code was to bracket the time spent doing the thing that needed > progress messaging after all. the fact that you don't have a console > doesn't change the fact that there's an expensive computation there. Yeah, that would at least make these things deterministic, and we'd notice a nesting problem in the tests. I do think bumping the default nesting level a few notches may be worth it. Even if we didn't get rid of it entirely (and I am sympathetic to the notion that recursive operations could produce gigantic logs), that would give some headroom for reasonable levels of logging. -Peff