Message ID | 20240510172243.3529851-1-emilyshaffer@google.com (mailing list archive) |
---|---|
State | New, archived |
Headers | show |
Series | trace2: intercept all common signals | expand |
On Fri, May 10, 2024 at 10:22 AM Emily Shaffer <emilyshaffer@google.com> wrote: > > From: Emily Shaffer <nasamuffin@google.com> > > We already use trace2 to find out about unexpected pipe breakages, which > is nice for detecting bugs or system problems, by adding a handler for > SIGPIPE which simply writes a trace2 line. However, there are a handful > of other common signals which we might want to snoop on: > > - SIGINT, SIGTERM, or SIGQUIT, when a user manually cancels a command in > frustration or mistake (via Ctrl-C, Ctrl-D, or `kill`) > - SIGHUP, when the network closes unexpectedly (indicating there may be > a problem to solve) > > There are lots more signals which we might find useful later, but at > least let's teach trace2 to report these egregious ones. Conveniently, > they're also already covered by the `_common` variants in sigchain.[ch]. > > Sigchain itself is already tested via helper/test-sigchain.c, and trace2 > is tested in a number of places - let's also add tests demonstrating > that sigchain + trace2 works correctly. > > Signed-off-by: Emily Shaffer <nasamuffin@google.com> > --- Missed including the CI results. They're passing[1] with the exception of the osx-gcc run, which seems to also be failing on the latest 'master'[2] and looks to be failing in setup rather than in test run. 1: https://github.com/nasamuffin/git/actions/runs/9035666915 2: https://github.com/nasamuffin/git/actions/runs/9036080205/job/24832209129 > t/helper/test-trace2.c | 17 +++++++++++++++++ > t/t0210-trace2-normal.sh | 22 ++++++++++++++++++++++ > trace2.c | 2 +- > 3 files changed, 40 insertions(+), 1 deletion(-) > > diff --git a/t/helper/test-trace2.c b/t/helper/test-trace2.c > index 1adac29a57..8970956ea8 100644 > --- a/t/helper/test-trace2.c > +++ b/t/helper/test-trace2.c > @@ -231,6 +231,22 @@ static int ut_010bug_BUG(int argc UNUSED, const char **argv UNUSED) > BUG("a %s message", "BUG"); > } > > +static int ut_011signal(int argc, const char **argv) > +{ > + const char *usage_error = "expect <bool common>"; > + int common = 0; > + > + if (argc != 1 || get_i(&common, argv[0])) > + die("%s", usage_error); > + > + /* > + * There is no strong reason SIGSEGV is ignored by trace2 - it's just > + * not included by sigchain_push_common(). > + */ > + raise(common ? SIGINT : SIGSEGV); > + return 0; /*unreachable*/ > +} > + > /* > * Single-threaded timer test. Create several intervals using the > * TEST1 timer. The test script can verify that an aggregate Trace2 > @@ -482,6 +498,7 @@ static struct unit_test ut_table[] = { > { ut_008bug, "008bug", "" }, > { ut_009bug_BUG, "009bug_BUG","" }, > { ut_010bug_BUG, "010bug_BUG","" }, > + { ut_011signal, "011signal","" }, > > { ut_100timer, "100timer", "<count> <ms_delay>" }, > { ut_101timer, "101timer", "<count> <ms_delay> <threads>" }, > diff --git a/t/t0210-trace2-normal.sh b/t/t0210-trace2-normal.sh > index c312657a12..c34ccc518c 100755 > --- a/t/t0210-trace2-normal.sh > +++ b/t/t0210-trace2-normal.sh > @@ -244,6 +244,28 @@ test_expect_success 'bug messages followed by BUG() are written to trace2' ' > test_cmp expect actual > ' > > +test_expect_success 'trace2 reports common signals' ' > + test_when_finished "rm trace.normal actual" && > + > + # signals are fatal, so expect this to fail > + ! env GIT_TRACE2="$(pwd)/trace.normal" test-tool trace2 011signal 1 && > + > + perl "$TEST_DIRECTORY/t0210/scrub_normal.perl" <trace.normal >actual && > + > + test_grep "signal elapsed:" actual > +' > + > +test_expect_success 'trace2 ignores uncommon signals' ' > + test_when_finished "rm trace.normal actual" && > + > + # signals are fatal, so expect this to fail > + ! env GIT_TRACE2="$(pwd)/trace.normal" test-tool trace2 011signal 0 && > + > + perl "$TEST_DIRECTORY/t0210/scrub_normal.perl" <trace.normal >actual && > + > + ! test_grep "signal elapsed:" actual > +' > + > sane_unset GIT_TRACE2_BRIEF > > # Now test without environment variables and get all Trace2 settings > diff --git a/trace2.c b/trace2.c > index f894532d05..3692010f5d 100644 > --- a/trace2.c > +++ b/trace2.c > @@ -231,7 +231,7 @@ void trace2_initialize_fl(const char *file, int line) > tr2_sid_get(); > > atexit(tr2main_atexit_handler); > - sigchain_push(SIGPIPE, tr2main_signal_handler); > + sigchain_push_common(tr2main_signal_handler); > tr2tls_init(); > > /* > -- > 2.45.0.118.g7fe29c98d7-goog >
Emily Shaffer <emilyshaffer@google.com> writes: > From: Emily Shaffer <nasamuffin@google.com> > > We already use trace2 to find out about unexpected pipe breakages, which > is nice for detecting bugs or system problems, by adding a handler for > SIGPIPE which simply writes a trace2 line. However, there are a handful > of other common signals which we might want to snoop on: > > - SIGINT, SIGTERM, or SIGQUIT, when a user manually cancels a command in > frustration or mistake (via Ctrl-C, Ctrl-D, or `kill`) > - SIGHUP, when the network closes unexpectedly (indicating there may be > a problem to solve) > > There are lots more signals which we might find useful later, but at > least let's teach trace2 to report these egregious ones. Conveniently, > they're also already covered by the `_common` variants in sigchain.[ch]. > > Sigchain itself is already tested via helper/test-sigchain.c, and trace2 > is tested in a number of places - let's also add tests demonstrating > that sigchain + trace2 works correctly. > > Signed-off-by: Emily Shaffer <nasamuffin@google.com> > --- > t/helper/test-trace2.c | 17 +++++++++++++++++ > t/t0210-trace2-normal.sh | 22 ++++++++++++++++++++++ > trace2.c | 2 +- > 3 files changed, 40 insertions(+), 1 deletion(-) > > diff --git a/t/helper/test-trace2.c b/t/helper/test-trace2.c > index 1adac29a57..8970956ea8 100644 > --- a/t/helper/test-trace2.c > +++ b/t/helper/test-trace2.c > @@ -231,6 +231,22 @@ static int ut_010bug_BUG(int argc UNUSED, const char **argv UNUSED) > BUG("a %s message", "BUG"); > } > > +static int ut_011signal(int argc, const char **argv) > +{ > + const char *usage_error = "expect <bool common>"; > + int common = 0; > + > + if (argc != 1 || get_i(&common, argv[0])) > + die("%s", usage_error); > + > + /* > + * There is no strong reason SIGSEGV is ignored by trace2 - it's just > + * not included by sigchain_push_common(). > + */ > + raise(common ? SIGINT : SIGSEGV); > + return 0; /*unreachable*/ > +} > + > /* > * Single-threaded timer test. Create several intervals using the > * TEST1 timer. The test script can verify that an aggregate Trace2 > @@ -482,6 +498,7 @@ static struct unit_test ut_table[] = { > { ut_008bug, "008bug", "" }, > { ut_009bug_BUG, "009bug_BUG","" }, > { ut_010bug_BUG, "010bug_BUG","" }, > + { ut_011signal, "011signal","" }, > > { ut_100timer, "100timer", "<count> <ms_delay>" }, > { ut_101timer, "101timer", "<count> <ms_delay> <threads>" }, > diff --git a/t/t0210-trace2-normal.sh b/t/t0210-trace2-normal.sh > index c312657a12..c34ccc518c 100755 > --- a/t/t0210-trace2-normal.sh > +++ b/t/t0210-trace2-normal.sh > @@ -244,6 +244,28 @@ test_expect_success 'bug messages followed by BUG() are written to trace2' ' > test_cmp expect actual > ' > > +test_expect_success 'trace2 reports common signals' ' > + test_when_finished "rm trace.normal actual" && > + > + # signals are fatal, so expect this to fail > + ! env GIT_TRACE2="$(pwd)/trace.normal" test-tool trace2 011signal 1 && Is it deliberate that this does not use test_must_fail or is it an oversight? The same comment applies to all other uses of "! env". We often see the use of "env" in conjunction with a test that is expected to fail because test_must_fail VAR=VAL cmd simply does not work. If you are not using test_expect_fail, then ! VAR=VAL cmd should be sufficient, but it would mean that you will be happy even if the way "cmd" dies is not in a controlled way (e.g. due to receiving a signal). Ah, perhaps that is it? Is "test-tool trace2 011signal 1" raise a signal to kill itself and after showing the event in the trace stream it is expected to die the narual death of receiving the same signal by re-raising it? If that is what is happening here, not using test_must_fail is absolutely the right thing to do, but then I doubt you need "env" there. Also, if we know what signal is raised, then we should also know the exit status from this (i.e. signal number plus 128 or something) that we want to validate? I dunno. > diff --git a/trace2.c b/trace2.c > index f894532d05..3692010f5d 100644 > --- a/trace2.c > +++ b/trace2.c > @@ -231,7 +231,7 @@ void trace2_initialize_fl(const char *file, int line) > tr2_sid_get(); > > atexit(tr2main_atexit_handler); > - sigchain_push(SIGPIPE, tr2main_signal_handler); > + sigchain_push_common(tr2main_signal_handler); > tr2tls_init(); > > /*
On Fri, May 10, 2024 at 11:47 AM Junio C Hamano <gitster@pobox.com> wrote: > > Emily Shaffer <emilyshaffer@google.com> writes: > > > From: Emily Shaffer <nasamuffin@google.com> > > > > We already use trace2 to find out about unexpected pipe breakages, which > > is nice for detecting bugs or system problems, by adding a handler for > > SIGPIPE which simply writes a trace2 line. However, there are a handful > > of other common signals which we might want to snoop on: > > > > - SIGINT, SIGTERM, or SIGQUIT, when a user manually cancels a command in > > frustration or mistake (via Ctrl-C, Ctrl-D, or `kill`) > > - SIGHUP, when the network closes unexpectedly (indicating there may be > > a problem to solve) > > > > There are lots more signals which we might find useful later, but at > > least let's teach trace2 to report these egregious ones. Conveniently, > > they're also already covered by the `_common` variants in sigchain.[ch]. > > > > Sigchain itself is already tested via helper/test-sigchain.c, and trace2 > > is tested in a number of places - let's also add tests demonstrating > > that sigchain + trace2 works correctly. > > > > Signed-off-by: Emily Shaffer <nasamuffin@google.com> > > --- > > t/helper/test-trace2.c | 17 +++++++++++++++++ > > t/t0210-trace2-normal.sh | 22 ++++++++++++++++++++++ > > trace2.c | 2 +- > > 3 files changed, 40 insertions(+), 1 deletion(-) > > > > diff --git a/t/helper/test-trace2.c b/t/helper/test-trace2.c > > index 1adac29a57..8970956ea8 100644 > > --- a/t/helper/test-trace2.c > > +++ b/t/helper/test-trace2.c > > @@ -231,6 +231,22 @@ static int ut_010bug_BUG(int argc UNUSED, const char **argv UNUSED) > > BUG("a %s message", "BUG"); > > } > > > > +static int ut_011signal(int argc, const char **argv) > > +{ > > + const char *usage_error = "expect <bool common>"; > > + int common = 0; > > + > > + if (argc != 1 || get_i(&common, argv[0])) > > + die("%s", usage_error); > > + > > + /* > > + * There is no strong reason SIGSEGV is ignored by trace2 - it's just > > + * not included by sigchain_push_common(). > > + */ > > + raise(common ? SIGINT : SIGSEGV); > > + return 0; /*unreachable*/ > > +} > > + > > /* > > * Single-threaded timer test. Create several intervals using the > > * TEST1 timer. The test script can verify that an aggregate Trace2 > > @@ -482,6 +498,7 @@ static struct unit_test ut_table[] = { > > { ut_008bug, "008bug", "" }, > > { ut_009bug_BUG, "009bug_BUG","" }, > > { ut_010bug_BUG, "010bug_BUG","" }, > > + { ut_011signal, "011signal","" }, > > > > { ut_100timer, "100timer", "<count> <ms_delay>" }, > > { ut_101timer, "101timer", "<count> <ms_delay> <threads>" }, > > diff --git a/t/t0210-trace2-normal.sh b/t/t0210-trace2-normal.sh > > index c312657a12..c34ccc518c 100755 > > --- a/t/t0210-trace2-normal.sh > > +++ b/t/t0210-trace2-normal.sh > > @@ -244,6 +244,28 @@ test_expect_success 'bug messages followed by BUG() are written to trace2' ' > > test_cmp expect actual > > ' > > > > +test_expect_success 'trace2 reports common signals' ' > > + test_when_finished "rm trace.normal actual" && > > + > > + # signals are fatal, so expect this to fail > > + ! env GIT_TRACE2="$(pwd)/trace.normal" test-tool trace2 011signal 1 && > > Is it deliberate that this does not use test_must_fail or is it an > oversight? The same comment applies to all other uses of "! env". > > We often see the use of "env" in conjunction with a test that is > expected to fail because > > test_must_fail VAR=VAL cmd > > simply does not work. If you are not using test_expect_fail, then > > ! VAR=VAL cmd > > should be sufficient, but it would mean that you will be happy even > if the way "cmd" dies is not in a controlled way (e.g. due to > receiving a signal). > > Ah, perhaps that is it? Is "test-tool trace2 011signal 1" raise a > signal to kill itself and after showing the event in the trace > stream it is expected to die the narual death of receiving the same > signal by re-raising it? Yes, it is because test_must_fail expects "natural" death. You can tell test_must_fail which signal you'd expect to receive, in theory, but I didn't get it to work (and it will be tricky to provide the correct signal in shell - I had originally hardcoded signal ints in sh, but then moved the signal enum->int resolution into helper/test-trace2.c because the alternative is doing some nasty grepping on other shell utility outputs, since the signal codes aren't platform/arch consistent). Anyway, I will try it without `env`. > > If that is what is happening here, not using test_must_fail is > absolutely the right thing to do, but then I doubt you need "env" > there. Also, if we know what signal is raised, then we should also > know the exit status from this (i.e. signal number plus 128 or > something) that we want to validate? I dunno. We could? But I don't feel strongly about it. If I specify the exit status, the test will be brittle if we change the exit codes later (for example, 128 or -1 for all error exits is kind of an antipattern; it might be nice to ask Git to return meaningful error codes depending on what went wrong, in the future). We are already checking later on during the test_grep that we exited due to a fatal signal. Thanks for the feedback - I'll get going on a v2 and aim to send it later today, since I don't hear you saying that the patch's overall goal is objectionable. While I'm at it, since you pointed out ! instead of test_must_fail, I wondered if I should change "! test_grep" as well - but when I grep t/ it looks like it's not usual to use `test_must_fail test_grep`, but instead to use `test_grep ! <omitted pattern> <file>`. I'll change that too. I also wonder - do we want to capture SIGKILL as well? Some people may have muscle memory for `kill -9` (I do, for better or worse) instead of gentler `kill`. My intent was to notice any indication of user frustration resulting in manual termination, which would include `kill -9` too... - Emily
On Fri, May 10, 2024 at 10:22:43AM -0700, Emily Shaffer wrote: > From: Emily Shaffer <nasamuffin@google.com> > > We already use trace2 to find out about unexpected pipe breakages, which > is nice for detecting bugs or system problems, by adding a handler for > SIGPIPE which simply writes a trace2 line. However, there are a handful > of other common signals which we might want to snoop on: > > - SIGINT, SIGTERM, or SIGQUIT, when a user manually cancels a command in > frustration or mistake (via Ctrl-C, Ctrl-D, or `kill`) > - SIGHUP, when the network closes unexpectedly (indicating there may be > a problem to solve) > > There are lots more signals which we might find useful later, but at > least let's teach trace2 to report these egregious ones. Conveniently, > they're also already covered by the `_common` variants in sigchain.[ch]. I think this would be a useful thing to have, but having looked at the trace2 signal code, this is almost certain to cause racy deadlocks. The exact details depend on the specific trace2 target backend, but looking at the various fn_signal() methods, they rely on allocations via strbufs. This is a problem in signal handlers because we can get a signal at any time, including when other code is inside malloc() holding a lock. And then further calls to malloc() will block forever on that lock. We should be able to do a quick experiment. Try this snippet, which repeatedly kills "git log -p" (which is likely to be allocating memory) and waits for it to exit. Eventually each invocation will stall on a deadlock: -- >8 -- doit() { me=$1 i=0 while true; do GIT_TRACE2=1 ./git log -p >/dev/null 2>&1 & sleep 0.1 kill $! wait $! 2>/dev/null i=$((i+1)) echo $me:$i done } for i in $(seq 1 64); do doit $i & done -- >8 -- I didn't have the patience to wait for all of them to stall, but if you let it run for a bit and check "ps", you'll see some git processes which are hanging. Stracing shows them stuck on a lock, like: $ strace -p 1838693 strace: Process 1838693 attached futex(0x7facf02df3e0, FUTEX_WAIT_PRIVATE, 2, NULL^Cstrace: Process 1838693 detached <detached ...> This problem existed before your patch. I imagine it was much less likely (or perhaps even impossible) with SIGPIPE though, because we'd see that signal only when in a write() syscall, which implies we're not in malloc(). Whereas we can get SIGTERM, etc, any time. Obviously the script above is meant to exacerbate the situation, and most runs would be fine. But over the course of normal use across many users and many runs, I think we would see this in practice. I think your test won't because it triggers the signal only from raise(). So I think before doing this, we'd need to clean up the trace2 signal code to avoid any allocations. -Peff
On Fri, May 10, 2024 at 12:34:07PM -0700, Emily Shaffer wrote: > Yes, it is because test_must_fail expects "natural" death. You can > tell test_must_fail which signal you'd expect to receive, in theory, > but I didn't get it to work (and it will be tricky to provide the > correct signal in shell - I had originally hardcoded signal ints in > sh, but then moved the signal enum->int resolution into > helper/test-trace2.c because the alternative is doing some nasty > grepping on other shell utility outputs, since the signal codes aren't > platform/arch consistent). We have test_match_signal(). Unfortunately it's not integrated with test_expect_code(), so you have to do: { thing_which_fails; OUT=$?; } && test_match_signal 15 "$OUT" See 5263e22cba (t7006: simplify exit-code checks for sigpipe tests, 2021-11-21) for an example. > I also wonder - do we want to capture SIGKILL as well? Some people may > have muscle memory for `kill -9` (I do, for better or worse) instead > of gentler `kill`. My intent was to notice any indication of user > frustration resulting in manual termination, which would include `kill > -9` too... You can't catch SIGKILL; its whole purpose is to be un-catchable. -Peff
On Fri, May 10, 2024 at 12:46 PM Jeff King <peff@peff.net> wrote: > > On Fri, May 10, 2024 at 12:34:07PM -0700, Emily Shaffer wrote: > > > Yes, it is because test_must_fail expects "natural" death. You can > > tell test_must_fail which signal you'd expect to receive, in theory, > > but I didn't get it to work (and it will be tricky to provide the > > correct signal in shell - I had originally hardcoded signal ints in > > sh, but then moved the signal enum->int resolution into > > helper/test-trace2.c because the alternative is doing some nasty > > grepping on other shell utility outputs, since the signal codes aren't > > platform/arch consistent). > > We have test_match_signal(). Unfortunately it's not integrated with > test_expect_code(), so you have to do: > > { thing_which_fails; OUT=$?; } && > test_match_signal 15 "$OUT" Right, what I meant above is that `15` isn't portable, I'd have to get the correct int value of SIGINT/SIGSEGV from some other shell utility at test time. > > See 5263e22cba (t7006: simplify exit-code checks for sigpipe tests, > 2021-11-21) for an example. > > > I also wonder - do we want to capture SIGKILL as well? Some people may > > have muscle memory for `kill -9` (I do, for better or worse) instead > > of gentler `kill`. My intent was to notice any indication of user > > frustration resulting in manual termination, which would include `kill > > -9` too... > > You can't catch SIGKILL; its whole purpose is to be un-catchable. Duh :') Thanks. > > -Peff
On Fri, May 10, 2024 at 12:49:06PM -0700, Emily Shaffer wrote: > > We have test_match_signal(). Unfortunately it's not integrated with > > test_expect_code(), so you have to do: > > > > { thing_which_fails; OUT=$?; } && > > test_match_signal 15 "$OUT" > > Right, what I meant above is that `15` isn't portable, I'd have to get > the correct int value of SIGINT/SIGSEGV from some other shell utility > at test time. Yes, but we already rely on it elsewhere (like t0005), and the idea of test_match_signal is that it would convert from "standard" numbers to something platform specific. Though aside from Windows (where the signal number is sometimes lost entirely) we've never had to actually do such conversion so far; "15" really is standard. If your primary concern, though, is the trace2 output and not the exit code of the program, then it may not be worth worrying too much about. -Peff
Emily Shaffer <nasamuffin@google.com> writes: > While I'm at it, since you pointed out ! instead of test_must_fail, I > wondered if I should change "! test_grep" as well - but when I grep t/ > it looks like it's not usual to use `test_must_fail test_grep`, but > instead to use `test_grep ! <omitted pattern> <file>`. I'll change > that too. "! test_grep" is an anti-pattern. We should have a documentation somewhere in t/README or nearby (if we don't, somebody please add one). The point of test_grep is "when we expect to see hits, we do show them to the standard output even if we just used a bare 'grep', but when such a test fails, we can easily miss the failure, because the failure is signalled only by $? and no output---hence, test_grep helper loudly says that we expected to find something but we did not see any". Using "! test_grep" will make "! grep" louder in a wrong case. That is the whole reason why "test_grep !" exists. > I also wonder - do we want to capture SIGKILL as well? An eternally interesting question is "How would you catch an uncatchable signal?" ;-)
On Fri, May 10, 2024 at 01:34:42PM -0700, Junio C Hamano wrote: > Emily Shaffer <nasamuffin@google.com> writes: > > > While I'm at it, since you pointed out ! instead of test_must_fail, I > > wondered if I should change "! test_grep" as well - but when I grep t/ > > it looks like it's not usual to use `test_must_fail test_grep`, but > > instead to use `test_grep ! <omitted pattern> <file>`. I'll change > > that too. > > "! test_grep" is an anti-pattern. We should have a documentation > somewhere in t/README or nearby (if we don't, somebody please add > one). Better than documentation, maybe something like: diff --git a/t/check-non-portable-shell.pl b/t/check-non-portable-shell.pl index b2b28c2ced..7de2c30aa0 100755 --- a/t/check-non-portable-shell.pl +++ b/t/check-non-portable-shell.pl @@ -51,6 +51,7 @@ sub err { err q(quote "$val" in 'local var=$val'); /^\s*([A-Z0-9_]+=(\w*|(["']).*?\3)\s+)+(\w+)/ and exists($func{$4}) and err '"FOO=bar shell_func" assignment extends beyond "shell_func"'; + /! test_grep/ and err 'do not invert test_* functions'; $line = ''; # this resets our $. for each file close ARGV if eof; There's at least one other case already. If you shorten it to just "! test_" to catch more functions, you can see there are a lot of wrong test_cmp invocations, too (maybe not quite as bad because we don't produce a specific message, but we'd yield a confusing diff output). But I think we can't cover everything; there are some like test_have_prereq which obviously are invertible. -Peff
On Fri, May 10, 2024 at 12:41 PM Jeff King <peff@peff.net> wrote: > > On Fri, May 10, 2024 at 10:22:43AM -0700, Emily Shaffer wrote: > > > From: Emily Shaffer <nasamuffin@google.com> > > > > We already use trace2 to find out about unexpected pipe breakages, which > > is nice for detecting bugs or system problems, by adding a handler for > > SIGPIPE which simply writes a trace2 line. However, there are a handful > > of other common signals which we might want to snoop on: > > > > - SIGINT, SIGTERM, or SIGQUIT, when a user manually cancels a command in > > frustration or mistake (via Ctrl-C, Ctrl-D, or `kill`) > > - SIGHUP, when the network closes unexpectedly (indicating there may be > > a problem to solve) > > > > There are lots more signals which we might find useful later, but at > > least let's teach trace2 to report these egregious ones. Conveniently, > > they're also already covered by the `_common` variants in sigchain.[ch]. > > I think this would be a useful thing to have, but having looked at the > trace2 signal code, this is almost certain to cause racy deadlocks. > > The exact details depend on the specific trace2 target backend, but > looking at the various fn_signal() methods, they rely on allocations via > strbufs. This is a problem in signal handlers because we can get a > signal at any time, including when other code is inside malloc() holding > a lock. And then further calls to malloc() will block forever on that > lock. > > We should be able to do a quick experiment. Try this snippet, which > repeatedly kills "git log -p" (which is likely to be allocating memory) > and waits for it to exit. Eventually each invocation will stall on a > deadlock: > > -- >8 -- > doit() { > me=$1 > i=0 > while true; do > GIT_TRACE2=1 ./git log -p >/dev/null 2>&1 & > sleep 0.1 > kill $! > wait $! 2>/dev/null > i=$((i+1)) > echo $me:$i > done > } > > for i in $(seq 1 64); do > doit $i & > done > -- >8 -- > > I didn't have the patience to wait for all of them to stall, but if you > let it run for a bit and check "ps", you'll see some git processes which > are hanging. Stracing shows them stuck on a lock, like: > > $ strace -p 1838693 > strace: Process 1838693 attached > futex(0x7facf02df3e0, FUTEX_WAIT_PRIVATE, 2, NULL^Cstrace: Process 1838693 detached > <detached ...> > > This problem existed before your patch. I imagine it was much less > likely (or perhaps even impossible) with SIGPIPE though, because we'd > see that signal only when in a write() syscall, which implies we're not > in malloc(). Whereas we can get SIGTERM, etc, any time. > > Obviously the script above is meant to exacerbate the situation, and > most runs would be fine. But over the course of normal use across many > users and many runs, I think we would see this in practice. I think your > test won't because it triggers the signal only from raise(). > > So I think before doing this, we'd need to clean up the trace2 signal > code to avoid any allocations. I started to look into doing this, and it's actually really tricky. I've got a sample diff here (https://github.com/git/git/commit/bf8a5084ede2b9c476e0cf90b7f198c52573fba7); I'll need to do it for the other two trace formats as well. But, the entire trace2 library relies heavily on strbuf, which doesn't have a stack-allocated form. I'm also not sure how we can guarantee the no-alloc-ness of these - maybe there's some flag we can give to one of the analyzers or something? - so I'm worried about backsliding in the future. Anyway, I won't have time to work on these again until the end of next week. If this looks like a reasonable direction I'll pick it up again then; otherwise, maybe it makes sense for the fn_signal() dispatcher to just time out if the handler process doesn't terminate in, say, 1s? - Emily > > -Peff
On Mon, May 13, 2024 at 09:21:54AM -0700, Emily Shaffer wrote: > I started to look into doing this, and it's actually really tricky. > I've got a sample diff here > (https://github.com/git/git/commit/bf8a5084ede2b9c476e0cf90b7f198c52573fba7); > I'll need to do it for the other two trace formats as well. But, the > entire trace2 library relies heavily on strbuf, which doesn't have a > stack-allocated form. I'm also not sure how we can guarantee the > no-alloc-ness of these - maybe there's some flag we can give to one of > the analyzers or something? - so I'm worried about backsliding in the > future. Looking briefly over that patch, a few thoughts: - rather than try to make the generic trace2 printing functions handle both the alloc and fixed-buffer cases, if the signal handlers only need a limited set of functions, it might be easier to just let them live in a totally parallel universe. For the simple printing case that's not too much extra code, and then the complications are limited to the signal-handling functions themselves. It's a bit more tricky with json, but we might be able to get away with just hand-forming it into a buffer, given the relative simplicity of it. In some cases you might need to precompute and stash buffers ahead of time that could be used by the signal handler (e.g., the whole tr2_sid thing). - the opposite approach might be: stop using any allocating functions in the trace2 code. There's a certain simplicity there, even for non-signal functions, that we know we're just touching a few fixed-size buffers, and you can never create a weird DoS by tweaking the tracing code. But it would mean rewriting a lot of it (including json formatting stuff) without many of our usual strbuf niceties. This is more or less the approach we take with error(), die(), etc, which are built on vreportf() and its fixed buffer. - you probably don't want to use xsnprintf(). That function is for cases where it's a bug to truncate, and we're just asserting that everything fit as expected. For your purposes, you probably want regular snprintf(). Again, see vreportf(). I don't think there's an easy static analysis solution here. It's more than just allocation, too. Anything that holds a lock is a potential problem (e.g., stdio streams), and likewise anything that looks at global state that might be in the middle of being mutated. So overall it is a pretty thorny problem, and for the most part we've just tried to keep what we do inside signal handlers to a minimum (usually cleanup, but even there we have to be careful not to do things like build up allocated paths for recursive removal). > Anyway, I won't have time to work on these again until the end of next > week. If this looks like a reasonable direction I'll pick it up again > then; otherwise, maybe it makes sense for the fn_signal() dispatcher > to just time out if the handler process doesn't terminate in, say, 1s? The timeout would help with locks, but not other weird logic bugs you can get into. Fundamentally you really want to do as little as possible from a signal handler. -Peff
Jeff King <peff@peff.net> writes: > - the opposite approach might be: stop using any allocating functions > in the trace2 code. There's a certain simplicity there, even for > non-signal functions, that we know we're just touching a few > fixed-size buffers, and you can never create a weird DoS by tweaking > the tracing code. But it would mean rewriting a lot of it (including > json formatting stuff) without many of our usual strbuf niceties. > > This is more or less the approach we take with error(), die(), etc, > which are built on vreportf() and its fixed buffer. Would another approach be to add various trace2 functions that use strbuf() allocation a way to tell if they are called from a signal handing codepath, and punt (by doing nothing if needed, but hopefully we have enough slop in the buffer to say "hey we got interrupted so no more detailed report for you, sorry") if that is the case? > So overall it is a pretty thorny problem, and for the most part we've > just tried to keep what we do inside signal handlers to a minimum > (usually cleanup, but even there we have to be careful not to do things > like build up allocated paths for recursive removal). Yes, I agree that it is the right approach to do very little in a signal handler.
On Thu, May 16, 2024 at 09:32:36AM -0700, Junio C Hamano wrote: > Jeff King <peff@peff.net> writes: > > > - the opposite approach might be: stop using any allocating functions > > in the trace2 code. There's a certain simplicity there, even for > > non-signal functions, that we know we're just touching a few > > fixed-size buffers, and you can never create a weird DoS by tweaking > > the tracing code. But it would mean rewriting a lot of it (including > > json formatting stuff) without many of our usual strbuf niceties. > > > > This is more or less the approach we take with error(), die(), etc, > > which are built on vreportf() and its fixed buffer. > > Would another approach be to add various trace2 functions that use > strbuf() allocation a way to tell if they are called from a signal > handing codepath, and punt (by doing nothing if needed, but > hopefully we have enough slop in the buffer to say "hey we got > interrupted so no more detailed report for you, sorry") if that is > the case? We do use that "in_signal" flag in other handlers. E.g., when run-command avoids calling free() in a signal, and the tempfile code avoids using stdio. But in the case of these trace functions, I think they'd all need to be rewritten to avoid strbufs. That message formatting is the whole point, and there is no way to have a strbuf which truncates rather than growing (though it is something we've discussed). So I think we either need to rip strbufs out of most of trace2, or let these signal paths re-implement the formatting in a super-simple way. -Peff
diff --git a/t/helper/test-trace2.c b/t/helper/test-trace2.c index 1adac29a57..8970956ea8 100644 --- a/t/helper/test-trace2.c +++ b/t/helper/test-trace2.c @@ -231,6 +231,22 @@ static int ut_010bug_BUG(int argc UNUSED, const char **argv UNUSED) BUG("a %s message", "BUG"); } +static int ut_011signal(int argc, const char **argv) +{ + const char *usage_error = "expect <bool common>"; + int common = 0; + + if (argc != 1 || get_i(&common, argv[0])) + die("%s", usage_error); + + /* + * There is no strong reason SIGSEGV is ignored by trace2 - it's just + * not included by sigchain_push_common(). + */ + raise(common ? SIGINT : SIGSEGV); + return 0; /*unreachable*/ +} + /* * Single-threaded timer test. Create several intervals using the * TEST1 timer. The test script can verify that an aggregate Trace2 @@ -482,6 +498,7 @@ static struct unit_test ut_table[] = { { ut_008bug, "008bug", "" }, { ut_009bug_BUG, "009bug_BUG","" }, { ut_010bug_BUG, "010bug_BUG","" }, + { ut_011signal, "011signal","" }, { ut_100timer, "100timer", "<count> <ms_delay>" }, { ut_101timer, "101timer", "<count> <ms_delay> <threads>" }, diff --git a/t/t0210-trace2-normal.sh b/t/t0210-trace2-normal.sh index c312657a12..c34ccc518c 100755 --- a/t/t0210-trace2-normal.sh +++ b/t/t0210-trace2-normal.sh @@ -244,6 +244,28 @@ test_expect_success 'bug messages followed by BUG() are written to trace2' ' test_cmp expect actual ' +test_expect_success 'trace2 reports common signals' ' + test_when_finished "rm trace.normal actual" && + + # signals are fatal, so expect this to fail + ! env GIT_TRACE2="$(pwd)/trace.normal" test-tool trace2 011signal 1 && + + perl "$TEST_DIRECTORY/t0210/scrub_normal.perl" <trace.normal >actual && + + test_grep "signal elapsed:" actual +' + +test_expect_success 'trace2 ignores uncommon signals' ' + test_when_finished "rm trace.normal actual" && + + # signals are fatal, so expect this to fail + ! env GIT_TRACE2="$(pwd)/trace.normal" test-tool trace2 011signal 0 && + + perl "$TEST_DIRECTORY/t0210/scrub_normal.perl" <trace.normal >actual && + + ! test_grep "signal elapsed:" actual +' + sane_unset GIT_TRACE2_BRIEF # Now test without environment variables and get all Trace2 settings diff --git a/trace2.c b/trace2.c index f894532d05..3692010f5d 100644 --- a/trace2.c +++ b/trace2.c @@ -231,7 +231,7 @@ void trace2_initialize_fl(const char *file, int line) tr2_sid_get(); atexit(tr2main_atexit_handler); - sigchain_push(SIGPIPE, tr2main_signal_handler); + sigchain_push_common(tr2main_signal_handler); tr2tls_init(); /*