[4/1] t3920: replace two cats with a tee

Message ID	203cb627-2423-8a35-d280-9f9ffc66e072@web.de (mailing list archive)
State	New, archived
Headers	Return-Path: <git-owner@kernel.org>
	Message-ID: <203cb627-2423-8a35-d280-9f9ffc66e072@web.de>
	Date: Fri, 2 Dec 2022 17:51:22 +0100
	MIME-Version: 1.0
	User-Agent: Mozilla/5.0 (Windows NT 10.0; Win64; x64; rv:102.0) Gecko/20100101 Thunderbird/102.5.1
	Subject: [PATCH 4/1] t3920: replace two cats with a tee
	To: Johannes Sixt <j6t@kdbg.org>, Philippe Blain <levraiphilippeblain@gmail.com>
	Cc: Git Mailing List <git@vger.kernel.org>, Junio C Hamano <gitster@pobox.com>
	References: <febcfb0a-c410-fb71-cff9-92acfcb269e2@kdbg.org>
	From: René Scharfe <l.s.r@web.de>
	In-Reply-To: <febcfb0a-c410-fb71-cff9-92acfcb269e2@kdbg.org>
	Content-Type: text/plain; charset=UTF-8
	Content-Transfer-Encoding: quoted-printable
	Precedence: bulk
Series	t3920: don't ignore errors of more than one command with `|| true`
	t3920: don't ignore errors of more than one command with `|| true`
	[2/1] t3920: support CR-eating grep
	[3/1] t3920: simplify redirection of loop output
	[4/1] t3920: replace two cats with a tee

René Scharfe Dec. 2, 2022, 4:51 p.m. UTC

Use tee(1) to replace two calls of cat(1) for writing files with
different line endings.  That's shorter and spawns fewer processes.

It has a small, but measurable performance impact on my Windows machine.
Here are the numbers before:

   $ (cd t && hyperfine.exe -w3 "sh.exe t3920-crlf-messages.sh")
   Benchmark 1: sh.exe t3920-crlf-messages.sh
     Time (mean ± σ):      5.705 s ±  0.047 s    [User: 0.000 s, System: 0.001 s]
     Range (min … max):    5.632 s …  5.772 s    10 runs

... and with this patch:

   $ (cd t && hyperfine.exe -w3 "sh.exe t3920-crlf-messages.sh")
   Benchmark 1: sh.exe t3920-crlf-messages.sh
     Time (mean ± σ):      5.616 s ±  0.021 s    [User: 0.001 s, System: 0.002 s]
     Range (min … max):    5.577 s …  5.644 s    10 runs

Signed-off-by: René Scharfe <l.s.r@web.de>
---
 t/t3920-crlf-messages.sh | 3 +--
 1 file changed, 1 insertion(+), 2 deletions(-)

--
2.38.1.windows.1

Eric Sunshine Dec. 3, 2022, 5:09 a.m. UTC | #1

On Fri, Dec 2, 2022 at 11:51 AM René Scharfe <l.s.r@web.de> wrote:
> Use tee(1) to replace two calls of cat(1) for writing files with
> different line endings.  That's shorter and spawns fewer processes.
> [...]
> Signed-off-by: René Scharfe <l.s.r@web.de>
> ---
> diff --git a/t/t3920-crlf-messages.sh b/t/t3920-crlf-messages.sh
> @@ -9,8 +9,7 @@ LIB_CRLF_BRANCHES=""
>  create_crlf_ref () {
> -       cat >.crlf-orig-$branch.txt &&
> -       cat .crlf-orig-$branch.txt | append_cr >.crlf-message-$branch.txt &&
> +       tee .crlf-orig-$branch.txt | append_cr >.crlf-message-$branch.txt &&

This feels slightly magical and more difficult to reason about than
using simple redirection to eliminate the second `cat`. Wouldn't this
work just as well?

    cat >.crlf-orig-$branch.txt &&
    append_cr <.crlf-orig-$branch.txt >.crlf-message-$branch.txt &&

(Plus, this avoids introducing `tee` into the test suite, more or
less. The few existing instances are all from the same test author and
don't seem particularly legitimate -- they appear to be aids the
author used while developing the test to be able to watch its output
as it ran.)
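
To make the comparison concrete, here is a minimal sketch that feeds the same input to the three variants under discussion (the two cats, the tee, and the plain redirection) and checks that each leaves the same pair of files behind. The `append_cr` definition is only a stand-in modeled on the test library helper, and the temporary directory and the `orig-*`/`msg-*` file names are made up for this illustration.

```shell
#!/bin/sh
# Stand-in for the test suite's append_cr helper (an assumption for this
# sketch): append a carriage return to the end of every input line.
append_cr () {
	sed -e 's/$/Q/' | tr Q '\015'
}

tmp=$(mktemp -d) && cd "$tmp" || exit 1

# Variant 1 (before the patch): two cats.
printf 'one\ntwo\n' | {
	cat >orig-cat &&
	cat orig-cat | append_cr >msg-cat
}

# Variant 2 (the patch): one tee.
printf 'one\ntwo\n' | {
	tee orig-tee | append_cr >msg-tee
}

# Variant 3 (the redirection alternative): one cat, no second reader process.
printf 'one\ntwo\n' | {
	cat >orig-redir &&
	append_cr <orig-redir >msg-redir
}

# All three should have produced identical LF and CRLF files.
cmp orig-cat orig-tee && cmp orig-cat orig-redir &&
cmp msg-cat msg-tee && cmp msg-cat msg-redir &&
echo "all variants agree"
```

The three variants differ only in how many processes they start and how the data reaches `append_cr`, which is what the readability argument is about.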

René Scharfe Dec. 3, 2022, 8:43 a.m. UTC | #2

Am 03.12.22 um 06:09 schrieb Eric Sunshine:
> On Fri, Dec 2, 2022 at 11:51 AM René Scharfe <l.s.r@web.de> wrote:
>> Use tee(1) to replace two calls of cat(1) for writing files with
>> different line endings.  That's shorter and spawns fewer processes.
>> [...]
>> Signed-off-by: René Scharfe <l.s.r@web.de>
>> ---
>> diff --git a/t/t3920-crlf-messages.sh b/t/t3920-crlf-messages.sh
>> @@ -9,8 +9,7 @@ LIB_CRLF_BRANCHES=""
>>  create_crlf_ref () {
>> -       cat >.crlf-orig-$branch.txt &&
>> -       cat .crlf-orig-$branch.txt | append_cr >.crlf-message-$branch.txt &&
>> +       tee .crlf-orig-$branch.txt | append_cr >.crlf-message-$branch.txt &&
>
> This feels slightly magical and more difficult to reason about than
> using simple redirection to eliminate the second `cat`. Wouldn't this
> work just as well?
>
>     cat >.crlf-orig-$branch.txt &&
>     append_cr <.crlf-orig-$branch.txt >.crlf-message-$branch.txt &&

It would work, of course, but this is the exact use case for tee(1).  No
repetition, no extra redirection symbols, just a nicely fitting piece
of pipework.  Don't fear the tee! ;-)

(I'm delighted to learn from https://en.wikipedia.org/wiki/Tee_(command)
that PowerShell has a tee command as well.)

> (Plus, this avoids introducing `tee` into the test suite, more or
> less. The few existing instances are all from the same test author and
> don't seem particularly legitimate -- they appear to be aids the
> author used while developing the test to be able to watch its output
> as it ran.)

I agree that the tee calls in t1001 and t5523 are unnecessary.

René

Ævar Arnfjörð Bjarmason Dec. 3, 2022, 12:53 p.m. UTC | #3

On Sat, Dec 03 2022, René Scharfe wrote:

> Am 03.12.22 um 06:09 schrieb Eric Sunshine:
>> On Fri, Dec 2, 2022 at 11:51 AM René Scharfe <l.s.r@web.de> wrote:
>>> Use tee(1) to replace two calls of cat(1) for writing files with
>>> different line endings.  That's shorter and spawns fewer processes.
>>> [...]
>>> Signed-off-by: René Scharfe <l.s.r@web.de>
>>> ---
>>> diff --git a/t/t3920-crlf-messages.sh b/t/t3920-crlf-messages.sh
>>> @@ -9,8 +9,7 @@ LIB_CRLF_BRANCHES=""
>>>  create_crlf_ref () {
>>> -       cat >.crlf-orig-$branch.txt &&
>>> -       cat .crlf-orig-$branch.txt | append_cr >.crlf-message-$branch.txt &&
>>> +       tee .crlf-orig-$branch.txt | append_cr >.crlf-message-$branch.txt &&
>>
>> This feels slightly magical and more difficult to reason about than
>> using simple redirection to eliminate the second `cat`. Wouldn't this
>> work just as well?
>>
>>     cat >.crlf-orig-$branch.txt &&
>>     append_cr <.crlf-orig-$branch.txt >.crlf-message-$branch.txt &&
>
> It would work, of course, but this is the exact use case for tee(1).  No
> repetition, no extra redirection symbols, just a nicely fitting piece
> of pipework.  Don't fear the tee! ;-)
>
> (I'm delighted to learn from https://en.wikipedia.org/wiki/Tee_(command)
> that PowerShell has a tee command as well.)

I don't really care, but I must say I agree with Eric here. Not having
surprising patterns in the test suite has a value of its own.

In this case I wonder, if you want to optimize this, whether we couldn't
do much better with "test_commit_bulk", maybe by teaching it a small set
of new tricks.

I.e. if I do:

	git fast-export --all

At the end of the setup test it seems we just end up with refs with
names that correspond to their contents, and with double newlines in
them or whatever. This is a lot of "grep", "sed", "tr" etc. just to end
up with that.

So maybe we can create them as a patch, possibly with some slight "sed"
munging on the input stream, just teach it to accept a "ref prefix"
and "commit message contents". That could just be an argument that you
"$(printf "...")", so we don't even need a sub-process....

Also this:

     perl -wE 'say for 1..1024*100' | tee /tmp/x | perl -nE 'print "in: $_"; exit 1 if $_ == 512'; tail -n 1 /tmp/x

Isn't deterministic. Now, in this case I doubt it matters, but it's nice
to have intermediate files in the test suite be deterministic, i.e. to
always have the full content in the file after the "tee".

With a "tee" you need to worry about the "append_cr" function it's being
piped into closing its stdin early.

I don't think it matters in this case, but in general as a pattern: I do
fear the "tee" a bit :)

René Scharfe Dec. 3, 2022, 5:22 p.m. UTC | #4

Am 03.12.22 um 13:53 schrieb Ævar Arnfjörð Bjarmason:
>
> On Sat, Dec 03 2022, René Scharfe wrote:
>
>> Am 03.12.22 um 06:09 schrieb Eric Sunshine:
>>> On Fri, Dec 2, 2022 at 11:51 AM René Scharfe <l.s.r@web.de> wrote:
>>>> Use tee(1) to replace two calls of cat(1) for writing files with
>>>> different line endings.  That's shorter and spawns fewer processes.
>>>> [...]
>>>> Signed-off-by: René Scharfe <l.s.r@web.de>
>>>> ---
>>>> diff --git a/t/t3920-crlf-messages.sh b/t/t3920-crlf-messages.sh
>>>> @@ -9,8 +9,7 @@ LIB_CRLF_BRANCHES=""
>>>>  create_crlf_ref () {
>>>> -       cat >.crlf-orig-$branch.txt &&
>>>> -       cat .crlf-orig-$branch.txt | append_cr >.crlf-message-$branch.txt &&
>>>> +       tee .crlf-orig-$branch.txt | append_cr >.crlf-message-$branch.txt &&
>>>
>>> This feels slightly magical and more difficult to reason about than
>>> using simple redirection to eliminate the second `cat`. Wouldn't this
>>> work just as well?
>>>
>>>     cat >.crlf-orig-$branch.txt &&
>>>     append_cr <.crlf-orig-$branch.txt >.crlf-message-$branch.txt &&
>>
>> It would work, of course, but this is the exact use case for tee(1).  No
>> repetition, no extra redirection symbols, just a nicely fitting piece
>> of pipework.  Don't fear the tee! ;-)
>>
>> (I'm delighted to learn from https://en.wikipedia.org/wiki/Tee_(command)
>> that PowerShell has a tee command as well.)
>
> I don't really care, but I must say I agree with Eric here. Not having
> surprising patterns in the test suite has a value of its own.

That's a good general guideline, but I wouldn't have expected a pipe
with three holes to startle anyone. *shrug*

> In this case I wonder, if you want to optimize this, whether we couldn't
> do much better with "test_commit_bulk", maybe by teaching it a small set
> of new tricks.
>
> I.e. if I do:
>
> 	git fast-export --all
>
> At the end of the setup test it seems we just end up with refs with
> names that correspond to their contents, and with double newlines in
> them or whatever. This is a lot of "grep", "sed", "tr" etc. just to end
> up with that.
>
> So maybe we can create them as a patch, possibly with some slight "sed"
> munging on the input stream, just teach it to accept a "ref prefix"
> and "commit message contents". That could just be an argument that you
> "$(printf "...")", so we don't even need a sub-process....

The files are used later for verification, so their contents can't just
be passed on via parameters.

Had a similar idea and spent too much time on creating the four files in
a single awk invocation.  The code was too verbose and yet hard to read
for my taste.

> Also this:
>
>      perl -wE 'say for 1..1024*100' | tee /tmp/x | perl -nE 'print "in: $_"; exit 1 if $_ == 512'; tail -n 1 /tmp/x
>
> Isn't deterministic. Now, in this case I doubt it matters, but it's nice
> to have intermediate files in the test suite be deterministic, i.e. to
> always have the full content in the file after the "tee".

Whoa, such a one-liner is a good argument for banishing Perl.

So to rephrase it in a way that I can understand, you say that something
like this:

	$ cd /tmp; seq 100000 | tee x | head -1 >/dev/null; wc -l x

... will probably report fewer than 100000 lines because the downstream
command ends the whole pipeline early.

> With a "tee" you need to worry about the "append_cr" function it's being
> piped into closing its stdin early.
>
> I don't think it matters in this case, but in general as a pattern: I do
> fear the "tee" a bit :)

Right, append_cr reads until EOF.
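
A minimal sketch of that difference, assuming `seq`, `tee` and `head` are available. The sed/tr pipeline is only a stand-in for append_cr, and the file names `early` and `full` are made up for the example. The first count depends on timing and may vary from run to run; the second should always be complete.

```shell
#!/bin/sh
tmp=$(mktemp -d) && cd "$tmp" || exit 1

# Downstream command exits after the first line: tee can be killed by
# SIGPIPE before it has written everything, so this count may vary.
seq 100000 | tee early | head -1 >/dev/null
early_lines=$(wc -l <early)

# Downstream command reads until EOF (stand-in for append_cr): tee gets
# to write every line before the pipeline finishes.
seq 100000 | tee full | sed -e 's/$/Q/' | tr Q '\015' >/dev/null
full_lines=$(wc -l <full)

echo "early exit: $early_lines lines, read to EOF: $full_lines lines"
```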

René

Ævar Arnfjörð Bjarmason Dec. 4, 2022, 9:34 a.m. UTC | #5

On Sat, Dec 03 2022, René Scharfe wrote:

> Am 03.12.22 um 13:53 schrieb Ævar Arnfjörð Bjarmason:
>>
>> On Sat, Dec 03 2022, René Scharfe wrote:
>>
>>> Am 03.12.22 um 06:09 schrieb Eric Sunshine:
>>>> On Fri, Dec 2, 2022 at 11:51 AM René Scharfe <l.s.r@web.de> wrote:
>>>>> Use tee(1) to replace two calls of cat(1) for writing files with
>>>>> different line endings.  That's shorter and spawns fewer processes.
>>>>> [...]
>>>>> Signed-off-by: René Scharfe <l.s.r@web.de>
>>>>> ---
>>>>> diff --git a/t/t3920-crlf-messages.sh b/t/t3920-crlf-messages.sh
>>>>> @@ -9,8 +9,7 @@ LIB_CRLF_BRANCHES=""
>>>>>  create_crlf_ref () {
>>>>> -       cat >.crlf-orig-$branch.txt &&
>>>>> -       cat .crlf-orig-$branch.txt | append_cr >.crlf-message-$branch.txt &&
>>>>> +       tee .crlf-orig-$branch.txt | append_cr >.crlf-message-$branch.txt &&
>>>>
>>>> This feels slightly magical and more difficult to reason about than
>>>> using simple redirection to eliminate the second `cat`. Wouldn't this
>>>> work just as well?
>>>>
>>>>     cat >.crlf-orig-$branch.txt &&
>>>>     append_cr <.crlf-orig-$branch.txt >.crlf-message-$branch.txt &&
>>>
>>> It would work, of course, but this is the exact use case for tee(1).  No
>>> repetition, no extra redirection symbols, just a nicely fitting piece
>>> of pipework.  Don't fear the tee! ;-)
>>>
>>> (I'm delighted to learn from https://en.wikipedia.org/wiki/Tee_(command)
>>> that PowerShell has a tee command as well.)
>>
>> I don't really care, but I must say I agree with Eric here. Not having
>> surprising patterns in the test suite has a value of its own.
>
> That's a good general guideline, but I wouldn't have expected a pipe
> with three holes to startle anyone. *shrug*

It's more that you're used to seeing one thing, the "cat >in" at the
start of a function is a common pattern.

Then it takes some time to stop and grok a new pattern. If I was
hacking on a function like that I'd probably stop to try to understand
"why", even though I understood the "what".

I'd then find it was to try to optimize things on Windows a bit... :)

I'm not saying it's not worth it in this case, just pointing out that
boring "standard" patterns have a value of their own because we all
already understand them. Whether optimizing a test case outweighs that
is another matter (sometimes it would).

>> In this case I wonder, if you want to optimize this, whether we couldn't
>> do much better with "test_commit_bulk", maybe by teaching it a small set
>> of new tricks.
>>
>> I.e. if I do:
>>
>> 	git fast-export --all
>>
>> At the end of the setup test it seems we just end up with refs with
>> names that correspond to their contents, and with double newlines in
>> them or whatever. This is a lot of "grep", "sed", "tr" etc. just to end
>> up with that.
>>
>> So maybe we can create them as a patch, possibly with some slight "sed"
>> munging on the input stream, just teach it to accept a "ref prefix"
>> and "commit message contents". That could just be an argument that you
>> "$(printf "...")", so we don't even need a sub-process....
>
> The files are used later for verification, so their contents can't just
> be passed on via parameters.
>
> Had a similar idea and spent too much time on creating the four files in
> a single awk invocation.  The code was too verbose and yet hard to read
> for my taste.

Hah, I didn't try. Just a suggestion in case it made sense :)

>> Also this:
>>
>>      perl -wE 'say for 1..1024*100' | tee /tmp/x | perl -nE 'print "in: $_"; exit 1 if $_ == 512'; tail -n 1 /tmp/x
>>
>> Isn't deterministic. Now, in this case I doubt it matters, but it's nice
>> to have intermediate files in the test suite be deterministic, i.e. to
>> always have the full content in the file after the "tee".
>
> Whoa, such a one-liner is a good argument for banishing Perl.
>
> So to rephrase it in a way that I can understand, you say that something
> like this:
>
> 	$ cd /tmp; seq 100000 | tee x | head -1 >/dev/null; wc -l x
>
> ... will probably report fewer than 100000 lines because the downstream
> command ends the whole pipeline early.

Yes, the "perl" line was just a quick demo hack.

But the point is that the "tee" will be killed with a SIGPIPE once the
"perl" on the RHS stops reading, and the SIGPIPE then propagates up the
chain to the initial perl process on the LHS.

I don't think it matters in this case, but just pointing out that it
*is* an edge case this sort of pattern introduces.

I've sometimes resorted to recursively diffing the trash directories of
two test runs to see if they're the same. E.g. I've caught cases where
the stderr of programs unexpectedly changes, but we had no test coverage
for it.

I think it's good to avoid patterns in general that make test runs
nondeterministic.

In this case it's only nondeterministic on failure, so it's probably
fine.
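
The recursive comparison mentioned above could look roughly like this. It is only a sketch: `run_once` is a hypothetical stand-in for a single test run, and `trash-a`/`trash-b` are made-up directory names rather than the test suite's real trash directories.

```shell
#!/bin/sh
tmp=$(mktemp -d) && cd "$tmp" || exit 1

# Hypothetical stand-in for one test run that leaves files behind in "$1".
run_once () {
	mkdir -p "$1" &&
	printf 'one\ntwo\n' >"$1/orig.txt" &&
	sed -e 's/$/Q/' <"$1/orig.txt" | tr Q '\015' >"$1/message.txt"
}

run_once trash-a &&
run_once trash-b &&

# diff -r exits 0 only if both trees hold the same files with the same
# contents, so any nondeterministic leftover shows up as a difference.
if diff -r trash-a trash-b >/dev/null
then
	echo "runs left identical files behind"
else
	echo "runs differ"
fi
```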

Eric Sunshine Dec. 4, 2022, 4:39 p.m. UTC | #6

On Sun, Dec 4, 2022 at 4:41 AM Ævar Arnfjörð Bjarmason <avarab@gmail.com> wrote:
> On Sat, Dec 03 2022, René Scharfe wrote:
> > Am 03.12.22 um 13:53 schrieb Ævar Arnfjörð Bjarmason:
> >> On Sat, Dec 03 2022, René Scharfe wrote:
> >>> Am 03.12.22 um 06:09 schrieb Eric Sunshine:
> >>>> On Fri, Dec 2, 2022 at 11:51 AM René Scharfe <l.s.r@web.de> wrote:
> >>>>> -       cat >.crlf-orig-$branch.txt &&
> >>>>> -       cat .crlf-orig-$branch.txt | append_cr >.crlf-message-$branch.txt &&
> >>>>> +       tee .crlf-orig-$branch.txt | append_cr >.crlf-message-$branch.txt &&
> >>>>
> >>>> This feels slightly magical and more difficult to reason about than
> >>>> using simple redirection to eliminate the second `cat`. Wouldn't this
> >>>> work just as well?
> >>>>
> >>>>     cat >.crlf-orig-$branch.txt &&
> >>>>     append_cr <.crlf-orig-$branch.txt >.crlf-message-$branch.txt &&
> >>>
> >>> It would work, of course, but this is the exact use case for tee(1).  No
> >>> repetition, no extra redirection symbols, just a nicely fitting piece
> >>> of pipework.  Don't fear the tee! ;-)
> >>
> >> I don't really care, but I must say I agree with Eric here. Not having
> >> surprising patterns in the test suite has a value of its own.
> >
> > That's a good general guideline, but I wouldn't have expected a pipe
> > with three holes to startle anyone. *shrug*
>
> It's more that you're used to seeing one thing, the "cat >in" at the
> start of a function is a common pattern.
>
> Then it takes some time to stop and grok a new pattern. If I was
> hacking on a function like that I'd probably stop to try to understand
> "why", even though I understood the "what".
>
> I'm not saying it's not worth it in this case, just pointing out that
> boring "standard" patterns have a value of their own because we all
> already understand them. Whether optimizing a test case outweighs that
> is another matter (sometimes it would).

Perhaps my experience is atypical, but in decades of using Unix, my
use of `tee` can (probably) be counted on a single finger, so the
patch, as implemented, did have higher cognitive load for me than a
patch using simple redirection would have had. Anyhow, I mentioned the
redirection approach, not to ask for a change, but only in case you
had overlooked the (to me) simpler approach. I didn't expect it to
spark so much discussion (though I do agree with everything Ævar has
said about following established patterns).

That said, I'm still rather unclear on the purpose of this patch. In a
sense, it feels like mere churn for 1/100 of a second gain (assuming
I'm reading the `hyperfine` output correctly).
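
For reference, the size of the gain can be worked out from the two means in the commit message (5.705 s before, 5.616 s after). The sketch below only does that subtraction with awk; it says nothing about run-to-run noise beyond what hyperfine itself reported. The result is closer to nine hundredths of a second, or about 1.6% of the run time.

```shell
#!/bin/sh
# Means copied from the two hyperfine runs in the commit message.
before=5.705
after=5.616

# awk handles the floating-point arithmetic that plain sh lacks.
delta=$(awk -v b="$before" -v a="$after" \
	'BEGIN { printf "%.3f s (%.1f%%)", b - a, (b - a) / b * 100 }')
echo "mean difference: $delta"
```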
