diff mbox series

a less-invasive racy-leak fix, was Re: What's cooking in git.git (Dec 2024, #11; Mon, 30)

Message ID 20250101191422.GC1391912@coredump.intra.peff.net (mailing list archive)
State New
Headers show
Series a less-invasive racy-leak fix, was Re: What's cooking in git.git (Dec 2024, #11; Mon, 30) | expand

Commit Message

Jeff King Jan. 1, 2025, 7:14 p.m. UTC
On Mon, Dec 30, 2024 at 09:33:20AM -0800, Junio C Hamano wrote:

> * jk/lsan-race-with-barrier (2024-12-30) 5 commits
>   (merged to 'next' on 2024-12-30 at 3fc0e14928)
>  + grep: work around LSan threading race with barrier
>  + index-pack: work around LSan threading race with barrier
>  + thread-utils: introduce optional barrier type
>  + Revert "index-pack: spawn threads atomically"
>  + test-lib: use individual lsan dir for --stress runs
> 
>  CI jobs that run threaded programs under LSan has been giving false
>  positives from time to time, which has been worked around.
> 
>  Will merge to 'master'.
>  source: <20241230042325.GA112439@coredump.intra.peff.net>

This graduated faster than I expected. :)

I think the first two patches are obviously good, and we should take
them as-is. But while poking at the problem a bit more, I found
something quite interesting that might let us fix it without any code
change at all.

I wanted a minimal reproduction of the problem so that I could report it
to the sanitizer folks upstream. So I thought this would work:

-- >8 --

cat >foo.c <<\EOF
#include <stdlib.h>
#include <pthread.h>

static void *run(void *data)
{
	return NULL;
}

int main(void)
{
	pthread_t threads[256];

	for (int i = 0; i < sizeof(threads)/sizeof(*threads); i++)
		pthread_create(&threads[i], NULL, run, NULL);
	exit(1);
}
EOF

gcc -fsanitize=leak foo.c
./a.out

-- 8< --

But it doesn't! Even if I run it thousands of times. And yet if I put
similar code inside of git.c and run it within our test suite, I see the
failure almost immediately.  So there's something about our test suite
that triggers it.  And indeed, the secret sauce turns out to be this
option (depending on system load you might need to run it a couple times
to trigger the failure):

  $ LSAN_OPTIONS=fast_unwind_on_malloc=0 ./a.out
  =================================================================
  ==1836417==ERROR: LeakSanitizer: detected memory leaks
  
  Direct leak of 32 byte(s) in 1 object(s) allocated from:
      #0 0x7ff8bb014556 in realloc ../../../../src/libsanitizer/lsan/lsan_interceptors.cpp:98
      #1 0x7ff8bae9d2c1 in __pthread_getattr_np nptl/pthread_getattr_np.c:180
      #2 0x7ff8bb02500d in __sanitizer::GetThreadStackTopAndBottom(bool, unsigned long*, unsigned long*) ../../../../src/libsanitizer/sanitizer_common/sanitizer_linux_libcdep.cpp:150
      #3 0x7ff8bb025187 in __sanitizer::GetThreadStackAndTls(bool, unsigned long*, unsigned long*, unsigned long*, unsigned long*) ../../../../src/libsanitizer/sanitizer_common/sanitizer_linux_libcdep.cpp:614
      #4 0x7ff8bb017d18 in __lsan::ThreadStart(unsigned int, unsigned long long, __sanitizer::ThreadType) ../../../../src/libsanitizer/lsan/lsan_posix.cpp:53
      #5 0x7ff8bb0143a9 in ThreadStartFunc<false> ../../../../src/libsanitizer/lsan/lsan_interceptors.cpp:431
      #6 0x7ff8bae9bf51 in start_thread nptl/pthread_create.c:447
      #7 0x7ff8baf1a677 in __clone3 ../sysdeps/unix/sysv/linux/x86_64/clone3.S:78
  
  SUMMARY: LeakSanitizer: 32 byte(s) leaked in 1 allocation(s).


It seems a little funny to me that the different stack unwinder would
also have different thread setup code, but I'm not at all familiar with
lsan's internals, so who knows. I might dig into that but haven't yet
(and I haven't yet reported it upstream). But anyway, that implies that
we can get rid of the race with just:


A little hacky, but it lets us have our cake and eat it, too. No changes
to the code, and no bad stack traces.

What do you think?

-Peff

Comments

Junio C Hamano Jan. 2, 2025, 12:25 a.m. UTC | #1
Jeff King <peff@peff.net> writes:

> On Mon, Dec 30, 2024 at 09:33:20AM -0800, Junio C Hamano wrote:
>
>> * jk/lsan-race-with-barrier (2024-12-30) 5 commits
> ...
> This graduated faster than I expected. :)

Heh, it is before -rc2 and the change is only about tests, so ...

> ...
> So that line is doing something useful. But it may not be worth the racy
> pain it's causing. So some alternatives are:
>
>   - we drop that line by default, and then when people are investigating
>     a specific leak, they can override LSAN_OPTIONS themselves to get
>     better output (though of course knowing that you can even do is
>     tricky)
>
>   - we keep that line by default, but override LSAN_OPTIONS for CI to
>     avoid the race. That makes all local leak-checking traces
>     informative by default. But CI ones may be truncated. I'm not sure
>     if people use the CI ones directly, or investigate further
>     themselves.
>
>   - we could annotate individual scripts or even tests to disable the
>     option (since it's really just threaded programs). This is more
>     hassle, but would limit the blast radius.
>
> I don't love any of those, but they may be less bad than all of the
> barrier trickery. And it may be that this is even something we could get
> fixed in LSan upstream, and it would just be a temporary workaround. I'm
> still going to pursue that.
>
> And finally, one other option (that I'm not sure why I didn't consider
> before): can we just ignore the false positives, similar to what we did
> in 370ef7e40d (test-lib: ignore uninteresting LSan output, 2023-08-28).

Good point.

> I think we'd have to stop doing abort_on_error for the leak checker and
> just rely on the logs, but that's OK (we always check the logs these
> days).
> ...
> A little hacky, but it lets us have our cake and eat it, too. No changes
> to the code, and no bad stack traces.
>
> What do you think?

I like the small hack.  "This is ultimately LSan's racy-ness and not
ours, so let's avoid changing our code to work it around when we can
do the workaround somewhere else" is an attitude that I would endorse
fully.

Thanks.
Jeff King Jan. 2, 2025, 2:32 a.m. UTC | #2
On Wed, Jan 01, 2025 at 04:25:02PM -0800, Junio C Hamano wrote:

> Jeff King <peff@peff.net> writes:
> 
> > On Mon, Dec 30, 2024 at 09:33:20AM -0800, Junio C Hamano wrote:
> >
> >> * jk/lsan-race-with-barrier (2024-12-30) 5 commits
> > ...
> > This graduated faster than I expected. :)
> 
> Heh, it is before -rc2 and the change is only about tests, so ...

Yeah, I figured as much. I also considered it of relatively low
importance during -rc, but I guess CI false positives do tend to annoy
everybody and waste their time. :)

It looks like you pushed out the version of 'master' with it merged. I
had figured you'd revert jk/lsan-race-with-barrier out of next, so I
wondered how we would proceed (revert the whole merge from master to
rebuild, or do a moral revert of the final three).

Looking at jk/lsan-race-ignore-false-positive, it looks like you did the
moral revert via fc89d14c63 (Revert barrier-based LSan threading race
workaround, 2025-01-01). That commit's tree matches what I'd expect (I
guess you probably used "revert -n HEAD~3..HEAD" just like I did).

It would be nice if the 3-commit revert mentioned the specific commits
it was reverting. If you revert the whole merge, you get the merge
commit's id and you can find the original commits by inspecting the
history (but of course here we were reverting only part of it, so we
couldn't do that here). If you revert the sequence without "-n" you get
three individual reverts. Which is informative, but a little clunky in
the history.

I wonder if revert should have a "squash" mode that reverts all of the
commits (perhaps in reverse order of application in case they depend on
each other textually), and then gives you a commit message template
similar to git-fmt-merge-msg, where we list all of the commits, one per
line (though probably with their commit ids in this case).

Probably not a big deal either way, and certainly not a blocker for the
series. Thanks (as usual) for doing all the maintainer juggling.

-Peff
Chris Torek Jan. 2, 2025, 2:41 a.m. UTC | #3
On Wed, Jan 1, 2025 at 6:32 PM Jeff King <peff@peff.net> wrote:
> I wonder if revert should have a "squash" mode that reverts all of the
> commits (perhaps in reverse order of application in case they depend on
> each other textually), and then gives you a commit message template
> similar to git-fmt-merge-msg, where we list all of the commits, one per
> line (though probably with their commit ids in this case).

I have kind of wanted this in the past, but never enough to build it, I just
did a few separate reverts.

Chris
Jeff King Jan. 2, 2025, 3:24 a.m. UTC | #4
On Wed, Jan 01, 2025 at 04:25:02PM -0800, Junio C Hamano wrote:

> > What do you think?
> 
> I like the small hack.  "This is ultimately LSan's racy-ness and not
> ours, so let's avoid changing our code to work it around when we can
> do the workaround somewhere else" is an attitude that I would endorse
> fully.

BTW, in case anybody wants to follow along with what happens upstream, I
reported it here:

  https://github.com/google/sanitizers/issues/1836

-Peff
Junio C Hamano Jan. 2, 2025, 2:42 p.m. UTC | #5
Jeff King <peff@peff.net> writes:

> On Wed, Jan 01, 2025 at 04:25:02PM -0800, Junio C Hamano wrote:
>
>> Jeff King <peff@peff.net> writes:
>> 
>> > On Mon, Dec 30, 2024 at 09:33:20AM -0800, Junio C Hamano wrote:
>> >
>> >> * jk/lsan-race-with-barrier (2024-12-30) 5 commits
>> > ...
>> > This graduated faster than I expected. :)
>> 
>> Heh, it is before -rc2 and the change is only about tests, so ...
>
> Yeah, I figured as much. I also considered it of relatively low
> importance during -rc, but I guess CI false positives do tend to annoy
> everybody and waste their time. :)
>
> It looks like you pushed out the version of 'master' with it merged. I
> had figured you'd revert jk/lsan-race-with-barrier out of next, so I
> wondered how we would proceed (revert the whole merge from master to
> rebuild, or do a moral revert of the final three).

Revert the effect of the tip-part (except for the bottom two) and
then queue the new ones, which would allow me to merge the whole
thing in one go without losing the bottom two's effect (which would
happen if we reverted the whole thing first, and then reused the
bottom two commits to build the new iteration on top).

> Looking at jk/lsan-race-ignore-false-positive, it looks like you did the
> moral revert via fc89d14c63 (Revert barrier-based LSan threading race
> workaround, 2025-01-01). That commit's tree matches what I'd expect (I
> guess you probably used "revert -n HEAD~3..HEAD" just like I did).

I actually did "read-tree -u -m" followed by "commit" ;-) 

> It would be nice if the 3-commit revert mentioned the specific commits
> it was reverting.

True.  I should probably amend while I can.

> I wonder if revert should have a "squash" mode that reverts all of the
> commits (perhaps in reverse order of application in case they depend on
> each other textually), and then gives you a commit message template
> similar to git-fmt-merge-msg, where we list all of the commits, one per
> line (though probably with their commit ids in this case).

I am not sure if I follow.  Should "revert HEAD~3..HEAD" give such
concatenation of messages, something similar to what "rebase -i"
gives us when seeing multiple "squash"es in a row?
Jeff King Jan. 2, 2025, 7:06 p.m. UTC | #6
On Thu, Jan 02, 2025 at 06:42:30AM -0800, Junio C Hamano wrote:

> > I wonder if revert should have a "squash" mode that reverts all of the
> > commits (perhaps in reverse order of application in case they depend on
> > each other textually), and then gives you a commit message template
> > similar to git-fmt-merge-msg, where we list all of the commits, one per
> > line (though probably with their commit ids in this case).
> 
> I am not sure if I follow.  Should "revert HEAD~3..HEAD" give such
> concatenation of messages, something similar to what "rebase -i"
> gives us when seeing multiple "squash"es in a row?

I don't think we need to concatenate all of the individual revert
messages. I was thinking of producing something more like:

  <SUBJECT: DESCRIBE YOUR REVERT HERE>

  Revert the following commits:

     - 7a8d9efc26 (grep: work around LSan threading race with barrier, 2024-12-29)
     - 526c0a851b (index-pack: work around LSan threading race with barrier, 2024-12-29)
     - 7d0037b59a (thread-utils: introduce optional barrier type, 2024-12-29)

You could perhaps even auto-populate the subject with:

  Revert jk/lsan-race-with-barrier~3..jk/lsan-race-with-barrier

similar to how git-merge uses "Merge branch ...". But it's a little
clunky to read, and unlike merge, it's a lot easier to use names that
are not very meaningful (e.g., I checked out a new branch based on that
one and then used HEAD~3..HEAD, which is worthless to mention).

-Peff
Junio C Hamano Jan. 2, 2025, 7:33 p.m. UTC | #7
Jeff King <peff@peff.net> writes:

>   <SUBJECT: DESCRIBE YOUR REVERT HERE>
>
>   Revert the following commits:
>
>      - 7a8d9efc26 (grep: work around LSan threading race with barrier, 2024-12-29)
>      - 526c0a851b (index-pack: work around LSan threading race with barrier, 2024-12-29)
>      - 7d0037b59a (thread-utils: introduce optional barrier type, 2024-12-29)

Ah, I love it.

> You could perhaps even auto-populate the subject with:
>
>   Revert jk/lsan-race-with-barrier~3..jk/lsan-race-with-barrier
>
> similar to how git-merge uses "Merge branch ...". But it's a little
> clunky to read, and unlike merge, it's a lot easier to use names that
> are not very meaningful (e.g., I checked out a new branch based on that
> one and then used HEAD~3..HEAD, which is worthless to mention).

Yup, agreed.  The commit title is much less interesting to automate
than the "we revert these three" list.

Thanks.
diff mbox series

Patch

diff --git a/t/test-lib.sh b/t/test-lib.sh
index 96f2dfb69d..2936e810b5 100644
--- a/t/test-lib.sh
+++ b/t/test-lib.sh
@@ -80,7 +80,6 @@  prepend_var ASAN_OPTIONS : detect_leaks=0
 export ASAN_OPTIONS
 
 prepend_var LSAN_OPTIONS : $GIT_SAN_OPTIONS
-prepend_var LSAN_OPTIONS : fast_unwind_on_malloc=0
 export LSAN_OPTIONS
 
 prepend_var UBSAN_OPTIONS : $GIT_SAN_OPTIONS

And indeed, that lets t5309 run successfully under --stress for me (with
the log-dir fix from the base of my series).

That line comes from in 71f26798f2 (test-lib: add "fast_unwind_on_malloc=0"
to LSAN_OPTIONS, 2022-02-27), with the goal of making the backtraces a
bit more readable. I'm not sure how important it is. I tried reverting a
random commit from one of Patrick's recent leak-fix series:

  git revert 58e7568c619cce872b17987f0d14b3de3703129a
  make SANITIZE=leak
  cd t
  ./t1091-sparse-checkout-builtin.sh -i

and the trace looks quite reasonable to me. That old commit 71f2678f2
gives an example from t0006. That script was marked leak-free by
974c919d36 (date API: add and use a date_mode_release(), 2022-02-16). If
we revert its change to test-date like so:

  git diff-tree -Rp 974c919d36 -- t/helper/test-date.c | git apply -3
  make SANITIZE=leak
  cd t
  ./t0006-date.sh -i

then we do get the same crappy trace as mentioned in 71f26798f2 (if I
understand correctly, this is because libc functions which allocate are
not built with -fno-omit-frame-pointer, so we can't walk past them).

So that line is doing something useful. But it may not be worth the racy
pain it's causing. So some alternatives are:

  - we drop that line by default, and then when people are investigating
    a specific leak, they can override LSAN_OPTIONS themselves to get
    better output (though of course knowing that you can even do is
    tricky)

  - we keep that line by default, but override LSAN_OPTIONS for CI to
    avoid the race. That makes all local leak-checking traces
    informative by default. But CI ones may be truncated. I'm not sure
    if people use the CI ones directly, or investigate further
    themselves.

  - we could annotate individual scripts or even tests to disable the
    option (since it's really just threaded programs). This is more
    hassle, but would limit the blast radius.

I don't love any of those, but they may be less bad than all of the
barrier trickery. And it may be that this is even something we could get
fixed in LSan upstream, and it would just be a temporary workaround. I'm
still going to pursue that.

And finally, one other option (that I'm not sure why I didn't consider
before): can we just ignore the false positives, similar to what we did
in 370ef7e40d (test-lib: ignore uninteresting LSan output, 2023-08-28).
I think we'd have to stop doing abort_on_error for the leak checker and
just rely on the logs, but that's OK (we always check the logs these
days). And detecting the false positive is a little involved. But this
seems to work:

diff --git a/t/test-lib.sh b/t/test-lib.sh
index 96f2dfb69d..ab7d2d321e 100644
--- a/t/test-lib.sh
+++ b/t/test-lib.sh
@@ -80,6 +80,7 @@  prepend_var ASAN_OPTIONS : detect_leaks=0
 export ASAN_OPTIONS
 
 prepend_var LSAN_OPTIONS : $GIT_SAN_OPTIONS
+prepend_var LSAN_OPTIONS : exitcode=0
 prepend_var LSAN_OPTIONS : fast_unwind_on_malloc=0
 export LSAN_OPTIONS
 
@@ -346,7 +347,8 @@  nr_san_dir_leaks_ () {
 	find "$TEST_RESULTS_SAN_DIR" \
 		-type f \
 		-name "$TEST_RESULTS_SAN_FILE_PFX.*" 2>/dev/null |
-	xargs grep -lv "Unable to get registers from thread" |
+	xargs grep ^DEDUP_TOKEN |
+	grep -v sanitizer::GetThreadStackTopAndBottom |
 	wc -l
 }