diff mbox series

t5500.43: make the check a bit more robust

Message ID pull.753.git.1602600323973.gitgitgadget@gmail.com (mailing list archive)
State New, archived
Headers show
Series t5500.43: make the check a bit more robust | expand

Commit Message

Johannes Schindelin Oct. 13, 2020, 2:45 p.m. UTC
From: Johannes Schindelin <johannes.schindelin@gmx.de>

In 2b695ecd74d (t5500: count objects through stderr, not trace,
2020-05-06) we tried to ensure that the "Total 3" message could be
grepped in Git's output, even if it sometimes got chopped up into
multiple lines in the trace machinery.

However, the first instance where this mattered now goes through the
sideband machinery, where it is _still_ possible for messages to get
chopped up.

Note: we have code in `demultiplex_sideband()` _specifically_ to stitch
back together lines that were delivered in separate sideband packets.
However, this stitching fails when a primary packet is delivered in
between the two sideband packets: since a primary packet was received,
`demultiplex_sideband()` has to return that (and cannot wait for other
sideband packets, and therefore has to flush any incomplete line it
received).

This seems only to happen occasionally in the `vs-test` part of our CI
builds, i.e. with binaries built using Visual C, but not when building
with GCC or clang; The symptom is that t5500.43 fails to find a line
matching `remote: Total 3` in the `log` file, which ends in something
along these lines:

	remote: Tota
	remote: l 3 (delta 0), reused 0 (delta 0), pack-reused 0

To work around that, use some `sed` magic to re-stitch together those
split lines, after the fact.

The other test case touched by 2b695ecd74d (t5500: count objects through
stderr, not trace, 2020-05-06) is not affected by this issue because the
sideband machinery is not involved there.

Signed-off-by: Johannes Schindelin <johannes.schindelin@gmx.de>
---
    Work around flakiness in t5500.43
    
    It seems that this test became flaky only recently, although I have to
    admit that I have no idea why: the involved code does not seem to have
    changed recently at all. It should have been fixed by 
    https://lore.kernel.org/git/20200506220741.71021-1-jonathantanmy@google.com/
    , but apparently wasn't completely fixed, despite what I said in that
    thread.

Published-As: https://github.com/gitgitgadget/git/releases/tag/pr-753%2Fdscho%2Funflake-t5500.43-v1
Fetch-It-Via: git fetch https://github.com/gitgitgadget/git pr-753/dscho/unflake-t5500.43-v1
Pull-Request: https://github.com/gitgitgadget/git/pull/753

 t/t5500-fetch-pack.sh | 15 ++++++++++++++-
 1 file changed, 14 insertions(+), 1 deletion(-)


base-commit: d4a392452e292ff924e79ec8458611c0f679d6d4

Comments

Junio C Hamano Oct. 13, 2020, 5:53 p.m. UTC | #1
"Johannes Schindelin via GitGitGadget" <gitgitgadget@gmail.com>
writes:

> Note: we have code in `demultiplex_sideband()` _specifically_ to stitch
> back together lines that were delivered in separate sideband packets.
> However, this stitching fails when a primary packet is delivered in
> between the two sideband packets: since a primary packet was received,
> `demultiplex_sideband()` has to return that (and cannot wait for other
> sideband packets, and therefore has to flush any incomplete line it
> received).

Ouch.  Good find.

>  	git -C myclient fetch --progress origin 2>log &&
> -	test_i18ngrep "remote: Total 3 " log
> +	if ! test_i18ngrep "remote: Total 3 " log
> +	then
> +		# It is possible that the "Total 3" line is delivered in
> +		# multiple sideband packets, and that a primary packet is
> +		# delivered in between. When that happens, the line will be
> +		# presented on multiple "remote:" lines.
> +		sed "/^remote: T/{
> +			:a
> +			N
> +			s/\nremote: //
> +			ba

OK, so if we see multiple and adjacent "remote:" lines, we strip the
leading "remote: " prefix from the second and subsequent lines and
concatenate them into one giant line?  Sounds good.

Will queue.  Thanks.

> +		}" log >log.unsplit &&
> +		test_i18ngrep "remote: Total 3 " log.unsplit
> +	fi
>  '
>  
>  test_expect_success 'in_vain resetted upon ACK' '
>
> base-commit: d4a392452e292ff924e79ec8458611c0f679d6d4
Jeff King Oct. 13, 2020, 6:55 p.m. UTC | #2
On Tue, Oct 13, 2020 at 02:45:23PM +0000, Johannes Schindelin via GitGitGadget wrote:

> Note: we have code in `demultiplex_sideband()` _specifically_ to stitch
> back together lines that were delivered in separate sideband packets.
> However, this stitching fails when a primary packet is delivered in
> between the two sideband packets: since a primary packet was received,
> `demultiplex_sideband()` has to return that (and cannot wait for other
> sideband packets, and therefore has to flush any incomplete line it
> received).

This sounds like a bug. We should accumulate multiple sideband 2
packets, even across data packets. And I think we _used_ to do that. The
original recv_sideband() looked roughly like this:

	while (!retval) {
		 packet_read();
		 band = buf[0] & 0xff;
		 switch (band) {
		 case 3:
		   ...stuff data into outbuf...
		   retval = SIDEBAND_ERROR_REMOTE;
		   break;
		 case 2:
		   ...print full lines, stuff remainder into outbuf...
		   break; /* note, does _not_ set retval */
		 case 1:
		   write_or_die(out, buf + 1, len);
		   break; /* likewise, does not set retval */
		 default:
		   ...complain about broken sideband...
		   retval = SIDEBAND_PROTOCOL_ERROR;
		}
	}
	if (outbuf.len)
		...print outbuf...

So we would keep looping until we see EOF, sideband 3, or a protocol
error. But notably we would not break out of the loop on bands 1 or 2,
and we do not flush band 2 until we break out of the loop.

But then in fbd76cd450 (sideband: reverse its dependency on pkt-line,
2019-01-16), the function became demultiplex_sideband(). The loop went
away, and we pump only a single packet on each call. When we see
sideband 2, we do an early return. But on sideband 1, we continue to the
cleanup: label, which flushes the scratch buffer.

I think we need to return from there without hitting that cleanup label,
like this:

diff --git a/sideband.c b/sideband.c
index 0a60662fa6..a5405b9aaa 100644
--- a/sideband.c
+++ b/sideband.c
@@ -190,7 +190,7 @@ int demultiplex_sideband(const char *me, char *buf, int len,
 		return 0;
 	case 1:
 		*sideband_type = SIDEBAND_PRIMARY;
-		break;
+		return 1;
 	default:
 		strbuf_addf(scratch, "%s%s: protocol error: bad band #%d",
 			    scratch->len ? "\n" : "", me, band);

Does that make the problem go away for you?

>     Work around flakiness in t5500.43
>     
>     It seems that this test became flaky only recently, although I have to
>     admit that I have no idea why: the involved code does not seem to have
>     changed recently at all. It should have been fixed by 
>     https://lore.kernel.org/git/20200506220741.71021-1-jonathantanmy@google.com/
>     , but apparently wasn't completely fixed, despite what I said in that
>     thread.

I think this is a real bug, and we shouldn't be changing the tests to
accommodate. Users may actually see the broken lines, and our tests are
telling us that.

I suspect it's uncommon in practice because it requires the server side
also splitting the line across two packets. And while the server-side
upload-pack might get a partial write from a child pack-objects or
whatever, that should only happen if:

  - the pipe buffer fills up. Which is hard to do since our messages
    tend to be very small. So perhaps it implies a very tiny pipe
    buffer, and/or slow scheduling of the receiving side to actually
    clean it out.

  - the writer is fully buffering its stderr writes instead of sending
    them a line at a time

The latter seems a more likely candidate for switching from gcc to
Visual C (which presumably has a different libc / CRT).  I think the
client should be handling this more robustly, but it may be worth
digging into the buffering issue if it means progress may not be
delivered in quite as timely a way as we expect it to be.

-Peff
Junio C Hamano Oct. 13, 2020, 7:07 p.m. UTC | #3
Jeff King <peff@peff.net> writes:

>>     Work around flakiness in t5500.43
>>     
>>     It seems that this test became flaky only recently, although I have to
>>     admit that I have no idea why: the involved code does not seem to have
>>     changed recently at all. It should have been fixed by 
>>     https://lore.kernel.org/git/20200506220741.71021-1-jonathantanmy@google.com/
>>     , but apparently wasn't completely fixed, despite what I said in that
>>     thread.
>
> I think this is a real bug, and we shouldn't be changing the tests to
> accommodate. Users may actually see the broken lines, and our tests are
> telling us that.

True; I reacted too hastily.  Will not queue the workaround ;-)
Jeff King Oct. 13, 2020, 7:09 p.m. UTC | #4
On Tue, Oct 13, 2020 at 02:55:15PM -0400, Jeff King wrote:

> But then in fbd76cd450 (sideband: reverse its dependency on pkt-line,
> 2019-01-16), the function became demultiplex_sideband(). The loop went
> away, and we pump only a single packet on each call. When we see
> sideband 2, we do an early return. But on sideband 1, we continue to the
> cleanup: label, which flushes the scratch buffer.
> 
> I think we need to return from there without hitting that cleanup label,
> like this:

By the way, the reason this works is that the "scratch" buffer persists
beyond individual calls to demultiplex_sideband(). So we can get away
with letting it persist between the two.

However unlike the original recv_sideband(), which broke out of the loop
on error and then flushed scratch, our post-fbd76cd450 does not do the
same. It now looks like:

int recv_sideband(const char *me, int in_stream, int out)
{
        char buf[LARGE_PACKET_MAX + 1];
        int len;
        struct strbuf scratch = STRBUF_INIT;
        enum sideband_type sideband_type;

        while (1) {
                len = packet_read(in_stream, NULL, NULL, buf, LARGE_PACKET_MAX,
                                  0);
                if (!demultiplex_sideband(me, buf, len, 0, &scratch,
                                          &sideband_type))
                        continue;
                switch (sideband_type) {
                case SIDEBAND_PRIMARY:
                        write_or_die(out, buf + 1, len - 1);
                        break;
                default: /* errors: message already written */
                        return sideband_type;
                }
        }
}

I wondered if we could ever have a case where we broke out of the loop
with data left over in "scratch" (or its buffer left allocated). I think
the answer is no. We only break out of the loop if
demultiplex_sideband() returned non-zero _and_ its not the primary
sideband. So both before and after my suggested fix, we'd have hit the
cleanup label in demultiplex_sideband(), which flushes and frees the
buffer.

I do have to say that the original loop before that commit was a lot
easier to follow, though.

Another weirdness I noticed is that it doesn't distinguish a flush
packet (expected) from a zero-byte packet (an error). But neither did
the original. I would have guessed the zero-byte packet was meant to
trigger this second conditional:

        if (len == 0) {
                *sideband_type = SIDEBAND_FLUSH;
                goto cleanup;
        }
        if (len < 1) {
                strbuf_addf(scratch,
                            "%s%s: protocol error: no band designator",
                            scratch->len ? "\n" : "", me);
                *sideband_type = SIDEBAND_PROTOCOL_ERROR;
                goto cleanup;
        }

but we'd hit the first conditional before then. We would still trigger
the second if we saw EOF while reading the packet (because we set the
length to -1 then), but then it's arguably the wrong error to be
showing.

So I think this could be improved a bit by using
packet_read_with_status() in the recv_sideband() caller.

-Peff
Johannes Schindelin Oct. 17, 2020, 2:34 a.m. UTC | #5
Hi Peff,

On Tue, 13 Oct 2020, Jeff King wrote:

> On Tue, Oct 13, 2020 at 02:45:23PM +0000, Johannes Schindelin via GitGitGadget wrote:
>
> > Note: we have code in `demultiplex_sideband()` _specifically_ to stitch
> > back together lines that were delivered in separate sideband packets.
> > However, this stitching fails when a primary packet is delivered in
> > between the two sideband packets: since a primary packet was received,
> > `demultiplex_sideband()` has to return that (and cannot wait for other
> > sideband packets, and therefore has to flush any incomplete line it
> > received).
>
> This sounds like a bug. We should accumulate multiple sideband 2
> packets, even across data packets. And I think we _used_ to do that. The
> original recv_sideband() looked roughly like this:
>
> 	while (!retval) {
> 		 packet_read();
> 		 band = buf[0] & 0xff;
> 		 switch (band) {
> 		 case 3:
> 		   ...stuff data into outbuf...
> 		   retval = SIDEBAND_ERROR_REMOTE;
> 		   break;
> 		 case 2:
> 		   ...print full lines, stuff remainder into outbuf...
> 		   break; /* note, does _not_ set retval */
> 		 case 1:
> 		   write_or_die(out, buf + 1, len);
> 		   break; /* likewise, does not set retval */
> 		 default:
> 		   ...complain about broken sideband...
> 		   retval = SIDEBAND_PROTOCOL_ERROR;
> 		}
> 	}
> 	if (outbuf.len)
> 		...print outbuf...
>
> So we would keep looping until we see EOF, sideband 3, or a protocol
> error. But notably we would not break out of the loop on bands 1 or 2,
> and we do not flush band 2 until we break out of the loop.
>
> But then in fbd76cd450 (sideband: reverse its dependency on pkt-line,
> 2019-01-16), the function became demultiplex_sideband(). The loop went
> away, and we pump only a single packet on each call. When we see
> sideband 2, we do an early return. But on sideband 1, we continue to the
> cleanup: label, which flushes the scratch buffer.
>
> I think we need to return from there without hitting that cleanup label,
> like this:
>
> diff --git a/sideband.c b/sideband.c
> index 0a60662fa6..a5405b9aaa 100644
> --- a/sideband.c
> +++ b/sideband.c
> @@ -190,7 +190,7 @@ int demultiplex_sideband(const char *me, char *buf, int len,
>  		return 0;
>  	case 1:
>  		*sideband_type = SIDEBAND_PRIMARY;
> -		break;
> +		return 1;
>  	default:
>  		strbuf_addf(scratch, "%s%s: protocol error: bad band #%d",
>  			    scratch->len ? "\n" : "", me, band);
>
> Does that make the problem go away for you?

Yes. But it took a substantial amount of time for myself to convince
myself that this oneliner is actually correct.

>
> >     Work around flakiness in t5500.43
> >
> >     It seems that this test became flaky only recently, although I have to
> >     admit that I have no idea why: the involved code does not seem to have
> >     changed recently at all. It should have been fixed by
> >     https://lore.kernel.org/git/20200506220741.71021-1-jonathantanmy@google.com/
> >     , but apparently wasn't completely fixed, despite what I said in that
> >     thread.
>
> I think this is a real bug, and we shouldn't be changing the tests to
> accommodate. Users may actually see the broken lines, and our tests are
> telling us that.
>
> I suspect it's uncommon in practice because it requires the server side
> also splitting the line across two packets. And while the server-side
> upload-pack might get a partial write from a child pack-objects or
> whatever, that should only happen if:
>
>   - the pipe buffer fills up. Which is hard to do since our messages
>     tend to be very small. So perhaps it implies a very tiny pipe
>     buffer, and/or slow scheduling of the receiving side to actually
>     clean it out.
>
>   - the writer is fully buffering its stderr writes instead of sending
>     them a line at a time
>
> The latter seems a more likely candidate for switching from gcc to
> Visual C (which presumably has a different libc / CRT).  I think the
> client should be handling this more robustly, but it may be worth
> digging into the buffering issue if it means progress may not be
> delivered in quite as timely a way as we expect it to be.

FWIW I _think_ the issue at hand is that newer MSVC runtimes deliver
`stderr` byte-by-byte. While that is totally allowed, it is different from
how things seem to be done on Linux/Unix/Darwin.

In my tests, I frequently saw the first two or four characters of that
"Total 3" line be delivered in its own sideband packet, hence the
breakage.

Thank you for keeping me honest. I would much rather have avoided spending
several hours on this investigation, but in the end, it is something that
is better made correct than left incorrect.

Having said that it is neither a very new bug nor does it seem to matter
on Linux/macOS (and not even in Git for Windows because it is built using
mingw-w64-gcc, which uses an older MSVC runtime, which apparently does not
share that issue with newer versions), therefore I plan on sending off v2
only after v2.29.0 was published.

Ciao,
Dscho
Johannes Schindelin Oct. 17, 2020, 3:31 a.m. UTC | #6
Hi Peff,

On Tue, 13 Oct 2020, Jeff King wrote:

> On Tue, Oct 13, 2020 at 02:55:15PM -0400, Jeff King wrote:
>
> > But then in fbd76cd450 (sideband: reverse its dependency on pkt-line,
> > 2019-01-16), the function became demultiplex_sideband(). The loop went
> > away, and we pump only a single packet on each call. When we see
> > sideband 2, we do an early return. But on sideband 1, we continue to the
> > cleanup: label, which flushes the scratch buffer.
> >
> > I think we need to return from there without hitting that cleanup label,
> > like this:
>
> By the way, the reason this works is that the "scratch" buffer persists
> beyond individual calls to demultiplex_sideband(). So we can get away
> with letting it persist between the two.

The thing that threw me was that I somehow expected `recv_sideband()` to
return primary data as it is read, much like `read()` operates. And yes,
I also found the split version of the code (`recv_sideband()` contains the
`while` loop and lives in `pkt-line.c` while `demultiplex_sideband()`
contains the `scratch` handling and the `switch` between packet types and
it lives in `sideband.c`) was much harder to read than the original
version.

> However unlike the original recv_sideband(), which broke out of the loop
> on error and then flushed scratch, our post-fbd76cd450 does not do the
> same. It now looks like:
>
> int recv_sideband(const char *me, int in_stream, int out)
> {
>         char buf[LARGE_PACKET_MAX + 1];
>         int len;
>         struct strbuf scratch = STRBUF_INIT;
>         enum sideband_type sideband_type;
>
>         while (1) {
>                 len = packet_read(in_stream, NULL, NULL, buf, LARGE_PACKET_MAX,
>                                   0);
>                 if (!demultiplex_sideband(me, buf, len, 0, &scratch,
>                                           &sideband_type))
>                         continue;
>                 switch (sideband_type) {
>                 case SIDEBAND_PRIMARY:
>                         write_or_die(out, buf + 1, len - 1);
>                         break;
>                 default: /* errors: message already written */
>                         return sideband_type;
>                 }
>         }
> }
>
> I wondered if we could ever have a case where we broke out of the loop
> with data left over in "scratch" (or its buffer left allocated). I think
> the answer is no. We only break out of the loop if
> demultiplex_sideband() returned non-zero _and_ its not the primary
> sideband. So both before and after my suggested fix, we'd have hit the
> cleanup label in demultiplex_sideband(), which flushes and frees the
> buffer.

Right.

It took me quite a while to convince myself of that, too.

And since I am really worried that the complexity of the code makes it
easy to introduce a regression, I spent quite a bit of time to figure out
how to implement a good regression test for exactly this issue.

Even so, the regression test will not necessarily catch problems where the
`while` loop is abandoned without processing any unfinished sideband
message. I introduced a `BUG()` for that case, but it is quite a bit
unsatisfying that I could not come up with a better way to ensure that
this does not happen.

> I do have to say that the original loop before that commit was a lot
> easier to follow, though.
>
> Another weirdness I noticed is that it doesn't distinguish a flush
> packet (expected) from a zero-byte packet (an error). But neither did
> the original. I would have guessed the zero-byte packet was meant to
> trigger this second conditional:
>
>         if (len == 0) {
>                 *sideband_type = SIDEBAND_FLUSH;
>                 goto cleanup;
>         }
>         if (len < 1) {
>                 strbuf_addf(scratch,
>                             "%s%s: protocol error: no band designator",
>                             scratch->len ? "\n" : "", me);
>                 *sideband_type = SIDEBAND_PROTOCOL_ERROR;
>                 goto cleanup;
>         }
>
> but we'd hit the first conditional before then. We would still trigger
> the second if we saw EOF while reading the packet (because we set the
> length to -1 then), but then it's arguably the wrong error to be
> showing.
>
> So I think this could be improved a bit by using
> packet_read_with_status() in the recv_sideband() caller.

Possibly. But is it really a bug to send a zero-byte packet? It is
inefficient, sure. But I could see how it could potentially simplify code,
or serve e.g. as some sort of a "keep-alive" mechanism or whatever.

In other words, I am not sure that  we should treat this as an error, but
yes, we should probably not treat it as a flush (even if it is likely that
our current sideband users simply won't ever send empty primary packets).

Ciao,
Dscho
diff mbox series

Patch

diff --git a/t/t5500-fetch-pack.sh b/t/t5500-fetch-pack.sh
index 3557374312..5e556313af 100755
--- a/t/t5500-fetch-pack.sh
+++ b/t/t5500-fetch-pack.sh
@@ -400,7 +400,20 @@  test_expect_success 'in_vain not triggered before first ACK' '
 	test_commit -C myserver bar &&
 
 	git -C myclient fetch --progress origin 2>log &&
-	test_i18ngrep "remote: Total 3 " log
+	if ! test_i18ngrep "remote: Total 3 " log
+	then
+		# It is possible that the "Total 3" line is delivered in
+		# multiple sideband packets, and that a primary packet is
+		# delivered in between. When that happens, the line will be
+		# presented on multiple "remote:" lines.
+		sed "/^remote: T/{
+			:a
+			N
+			s/\nremote: //
+			ba
+		}" log >log.unsplit &&
+		test_i18ngrep "remote: Total 3 " log.unsplit
+	fi
 '
 
 test_expect_success 'in_vain resetted upon ACK' '