diff mbox series

[2/3] Revert "send-email: extract email-parsing code into a subroutine"

Message ID 20231020101310.GB2673716@coredump.intra.peff.net (mailing list archive)
State Accepted
Commit 637e8944a13af5eae2dcaef99d4d84645f2b60ac
Headers show
Series some send-email --compose fixes | expand

Commit Message

Jeff King Oct. 20, 2023, 10:13 a.m. UTC
This reverts commit b6049542b97e7b135e0e82bf996084d461224d32.

Prior to that commit, we read the results of the user editing the
"--compose" message in a loop, picking out parts we cared about, and
streaming the result out to a ".final" file. That commit split the
reading/interpreting into two phases; we'd now read into a hash, and
then pick things out of the hash.

The goal was making the code more readable. And in some ways it did,
because the ugly regexes are confined to the reading phase. But it also
introduced several bugs, because now the two phases need to match each
other. In particular:

  - we pick out headers like "Subject: foo" with a case-insensitive
    regex, and then use the user-provided header name as the key in a
    case-sensitive hash. So if the user wrote "subject: foo", we'd no
    longer recognize it as a subject.

  - the namespace for the hash keys conflates header names with meta
    information like "body". If you put "body: foo" in your message, it
    would be misinterpreted as the actual message body (nobody is likely
    to do that in practice, but it seems like an unnecessary danger).

  - the handling for to/cc/bcc is totally broken. The behavior before
    that commit is to recognize and skip those headers, with a note to
    the user that they are not yet handled. Not great, but OK. But
    after the patch, the reading side now splits the addresses into a
    perl array-ref. But the interpreting side doesn't handle this at
    all, and blindly prints the stringified array-ref value. This leads
    to garbage like:

      (mbox) Adding to: ARRAY (0x555b4345c428) from line 'To: ARRAY(0x555b4345c428)'
      error: unable to extract a valid address from: ARRAY (0x555b4345c428)
      What to do with this address? ([q]uit|[d]rop|[e]dit):

    Probably not a huge deal, since nobody should even try to use those
    headers in the first place (since they were not implemented). But
    the new behavior is worse, and indicative of the sorts of problems
    that come from having the two layers.

The revert had a few conflicts, due to later work in this area from
15dc3b9161 (send-email: rename variable for clarity, 2018-03-04) and
d11c943c78 (send-email: support separate Reply-To address, 2018-03-04).
I've ported the changes from those commits over as part of the conflict
resolution.

The new tests show the bugs. Note the use of GIT_SEND_EMAIL_NOTTY in the
second one. Without it, the test is happy to reach outside the test
harness to the developer's actual terminal (when run with the buggy
state before this patch).

Signed-off-by: Jeff King <peff@peff.net>
---
I guess "readable" is up for debate here, but I find the inline handling
a lot easier to follow (and it's half as many lines; most of the
diffstat is the new tests).

But one thing that gives me pause is that the neither before or after
this patch do we handle continuation lines like:

  Subject: this is the beginning
    and this is more subject

And it would probably be a lot easier to add when storing the headers in
a hash (it's not impossible to do it the other way, but you basically
have to delay processing each line with a small state machine).

So another option is to just fix the individual bugs separately.

 git-send-email.perl   | 120 ++++++++++++++----------------------------
 t/t9001-send-email.sh |  35 ++++++++++++
 2 files changed, 75 insertions(+), 80 deletions(-)

Comments

Oswald Buddenhagen Oct. 20, 2023, 10:45 a.m. UTC | #1
On Fri, Oct 20, 2023 at 06:13:10AM -0400, Jeff King wrote:
>But one thing that gives me pause is that the neither before or after
>this patch do we handle continuation lines like:
>
>  Subject: this is the beginning
>    and this is more subject
>
>And it would probably be a lot easier to add when storing the headers in
>a hash (it's not impossible to do it the other way, but you basically
>have to delay processing each line with a small state machine).
>
that seems like a rather significant point, doesn't it?

>So another option is to just fix the individual bugs separately.
>
... so that seems preferable to me, given that the necessary fixes seem 
rather trivial.

> I guess "readable" is up for debate here, but I find the inline handling
> a lot easier to follow
>
any particular reason for that?

> (and it's half as many lines; most of the diffstat is the new tests).

>-	if ($parsed_email{'From'}) {
>-		$sender = delete($parsed_email{'From'});
>-	}

this verbosity could be cut down somewhat using just

   $sender = delete($parsed_email{'From'});

and if the value can be pre-set and needs to be preserved,

   $sender = delete($parsed_email{'From'}) // $sender;

but this seems kind of counter-productive legibility-wise.

regards
Junio C Hamano Oct. 20, 2023, 9:45 p.m. UTC | #2
Jeff King <peff@peff.net> writes:

>   - the handling for to/cc/bcc is totally broken.

It is good to see another evidence that "--compose" is probably not
as often as used as we thought.  With enough bugs discovered,
perhaps someday we can declare "it cannot be that the feature is
used in the wild, without anybody getting hit by these bugs---let's
deprecate and eventually remove it" ;-)
Jeff King Oct. 23, 2023, 6:40 p.m. UTC | #3
On Fri, Oct 20, 2023 at 12:45:43PM +0200, Oswald Buddenhagen wrote:

> On Fri, Oct 20, 2023 at 06:13:10AM -0400, Jeff King wrote:
> > But one thing that gives me pause is that the neither before or after
> > this patch do we handle continuation lines like:
> > 
> >  Subject: this is the beginning
> >    and this is more subject
> > 
> > And it would probably be a lot easier to add when storing the headers in
> > a hash (it's not impossible to do it the other way, but you basically
> > have to delay processing each line with a small state machine).
> > 
> that seems like a rather significant point, doesn't it?

Maybe. It depends on whether anybody is interested in adding
continuation support. Nobody has in the previous 18 years, and nobody
has asked for it.

> > So another option is to just fix the individual bugs separately.
> > 
> ... so that seems preferable to me, given that the necessary fixes seem
> rather trivial.

They're not too bad. Probably:

  1. lc() the keys we put into the hash

  2. match to/cc/bcc and dereference their arrays

  3. maybe handle 'body' separately from headers to avoid confusion

But there may be other similar bugs lurking. One I didn't mention: the
hash-based version randomly reorders headers!

> > I guess "readable" is up for debate here, but I find the inline handling
> > a lot easier to follow
> > 
> any particular reason for that?

For the reasons I gave in the commit message: namely that the matching
and logic is in one place and doesn't need to be duplicated (e.g., the
special handling of to/cc/bcc, which caused a bug here).

> > (and it's half as many lines; most of the diffstat is the new tests).
> 
> > -	if ($parsed_email{'From'}) {
> > -		$sender = delete($parsed_email{'From'});
> > -	}
> 
> this verbosity could be cut down somewhat using just
> 
>   $sender = delete($parsed_email{'From'});
> 
> and if the value can be pre-set and needs to be preserved,
> 
>   $sender = delete($parsed_email{'From'}) // $sender;
> 
> but this seems kind of counter-productive legibility-wise.

We do need to avoid overwriting the pre-set value. The "//" one would
work, but we support perl versions old enough that they don't have it.

-Peff
Jeff King Oct. 23, 2023, 6:47 p.m. UTC | #4
On Fri, Oct 20, 2023 at 02:45:30PM -0700, Junio C Hamano wrote:

> Jeff King <peff@peff.net> writes:
> 
> >   - the handling for to/cc/bcc is totally broken.
> 
> It is good to see another evidence that "--compose" is probably not
> as often as used as we thought.  With enough bugs discovered,
> perhaps someday we can declare "it cannot be that the feature is
> used in the wild, without anybody getting hit by these bugs---let's
> deprecate and eventually remove it" ;-)

I'm not sure if that is evidence or not. The to/cc/bcc feature was just
never implemented. The commit from 2017 made it more broken than saying
"not yet implemented", but that may only be an indication that nobody
wants it or tried to use it.

I dunno. As I noted, the same feature exists when reading the
cover-letter from a set of format-patch files. And of course it is
implemented using totally separate code (in pre_process_file). One
possible cleanup would be to unify those two, but I'm sure there would
be behavior changes. Some of them perhaps good (e.g., it looks like
pre_process_file is more careful about rfc2047 handling) and some of
them I'm not so sure of (e.g., support for --header-cmd in the --compose
letter).

I think an interested person could champion such changes, but I am not
that interested in send-email (I don't use it, and some of its code is
pretty ancient and gross). My goal was to fix the bug I saw with minimal
regression (I waffled even on my patch 2).

-Peff
Oswald Buddenhagen Oct. 23, 2023, 7:50 p.m. UTC | #5
On Mon, Oct 23, 2023 at 02:40:10PM -0400, Jeff King wrote:
>On Fri, Oct 20, 2023 at 12:45:43PM +0200, Oswald Buddenhagen wrote:
>> that seems like a rather significant point, doesn't it?
>
>Maybe. It depends on whether anybody is interested in adding
>continuation support. Nobody has in the previous 18 years, and nobody
>has asked for it.
>
dunno, it seems like a bug to me. so if i cared at all about this 
functionality, i'd fix it just because. so at least it doesn't seem nice 
to make it harder for a potential volunteer.

>> > So another option is to just fix the individual bugs separately.
>> > 
>> ... so that seems preferable to me, given that the necessary fixes 
>> seem
>> rather trivial.
>
>They're not too bad. Probably:
>
>  1. lc() the keys we put into the hash
>
>  2. match to/cc/bcc and dereference their arrays
>
>  3. maybe handle 'body' separately from headers to avoid confusion
>
with the header keys lowercased, one could simply use BODY as the key 
and be done with it.

>But there may be other similar bugs lurking.

>One I didn't mention: the
>hash-based version randomly reorders headers!
>
hmm, yeah, that would mean using Tie::IxHash if one wanted to do it 
elegantly, at the cost of depending on another non-core module.

also, it means that another hash with non-lowercased versions of the 
keys would have to be kept.

ok, that's stupid. it would be easier to just keep an additional array 
of the original keys for iteration, and check the hash before emitting 
them.

>> > I guess "readable" is up for debate here, but I find the inline handling
>> > a lot easier to follow
>> > 
>> any particular reason for that?
>
>For the reasons I gave in the commit message: namely that the matching
>and logic is in one place and doesn't need to be duplicated (e.g., the
>special handling of to/cc/bcc, which caused a bug here).
>
from what i can see, there isn't really anything to "match", apart from 
agreeing on the data structure (which the code partially failed to do, 
but that's trivial enough). and layering/abstracting things is usually 
considered a good thing, unless the cost/benefit ratio is completely 
backwards.

>The "//" one would
>work, but we support perl versions old enough that they don't have it.
>
according to my grepping, that ship has sailed.
also, why _would_ you support such ancient perl versions? that makes 
even less sense to me than supporting ancient c compilers.

regards
Jeff King Oct. 25, 2023, 6:11 a.m. UTC | #6
On Mon, Oct 23, 2023 at 09:50:54PM +0200, Oswald Buddenhagen wrote:

> > The "//" one would
> > work, but we support perl versions old enough that they don't have it.
> > 
> according to my grepping, that ship has sailed.
> also, why _would_ you support such ancient perl versions? that makes even
> less sense to me than supporting ancient c compilers.

It may be reasonable to bump the default perl version for the script.
But that would require somebody digging into what tends to ship these
days (which can be sometimes be surprising; witness macos using old
versions of bash due to license issues), and then updating the "use
5.008" in the script.

The "//" operator was added in perl 5.10. I'm not sure what you found
that makes you think the ship has sailed. The only hits for "//" I see
look like the end of substitution regexes ("s/foo//" and similar). But
if we are not consistent with the "use" claim, that is worth fixing.

-Peff
Oswald Buddenhagen Oct. 25, 2023, 9:23 a.m. UTC | #7
On Wed, Oct 25, 2023 at 02:11:20AM -0400, Jeff King wrote:
>The "//" operator was added in perl 5.10. I'm not sure what you found
>that makes you think the ship has sailed. The only hits for "//" I see
>look like the end of substitution regexes ("s/foo//" and similar).
>
grep with spaces around the operator, then you can see the instance in 
git-credential-netrc.perl easily.

regards
Junio C Hamano Oct. 27, 2023, 10:31 p.m. UTC | #8
Oswald Buddenhagen <oswald.buddenhagen@gmx.de> writes:

> On Wed, Oct 25, 2023 at 02:11:20AM -0400, Jeff King wrote:
>>The "//" operator was added in perl 5.10. I'm not sure what you found
>>that makes you think the ship has sailed. The only hits for "//" I see
>>look like the end of substitution regexes ("s/foo//" and similar).
>>
> grep with spaces around the operator, then you can see the instance in
> git-credential-netrc.perl easily.

Good find, but given the relative prevalence in use between netrc
helper and send-email, my conclusion is rather opposite.  It seems
to indicate that avoiding "//" would still be prudent if the only
tool we can find find broken on 5.008 is the netrc helper.
Jeff King Oct. 30, 2023, 9:13 a.m. UTC | #9
On Wed, Oct 25, 2023 at 11:23:01AM +0200, Oswald Buddenhagen wrote:

> On Wed, Oct 25, 2023 at 02:11:20AM -0400, Jeff King wrote:
> > The "//" operator was added in perl 5.10. I'm not sure what you found
> > that makes you think the ship has sailed. The only hits for "//" I see
> > look like the end of substitution regexes ("s/foo//" and similar).
> > 
> grep with spaces around the operator, then you can see the instance in
> git-credential-netrc.perl easily.

Ah, yeah, there is one instance there. That script does not have a "use"
marker, though, and we do not necessarily need or want to be as strict
with contrib/ scripts, which are quite optional compared to core
functionality like send-email.

That said, I do suspect that requiring 5.10 or later would not be too
burdensome these days. If we want to do so, then the first step would be
updating the text in INSTALL, along with the "use" directives in most
files.  Probably d48b284183 (perl: bump the required Perl version to 5.8
from 5.6.[21], 2010-09-24) could serve as a template.

-Peff
diff mbox series

Patch

diff --git a/git-send-email.perl b/git-send-email.perl
index 288ea1ae80..bbda2a931b 100755
--- a/git-send-email.perl
+++ b/git-send-email.perl
@@ -888,73 +888,59 @@  sub get_patch_subject {
 		do_edit($compose_filename);
 	}
 
+	open my $c2, ">", $compose_filename . ".final"
+		or die sprintf(__("Failed to open %s.final: %s"), $compose_filename, $!);
+
 	open $c, "<", $compose_filename
 		or die sprintf(__("Failed to open %s: %s"), $compose_filename, $!);
 
+	my $need_8bit_cte = file_has_nonascii($compose_filename);
+	my $in_body = 0;
+	my $summary_empty = 1;
 	if (!defined $compose_encoding) {
 		$compose_encoding = "UTF-8";
 	}
-
-	my %parsed_email;
-	while (my $line = <$c>) {
-		next if $line =~ m/^GIT:/;
-		parse_header_line($line, \%parsed_email);
-		if ($line =~ /^$/) {
-			$parsed_email{'body'} = filter_body($c);
+	while(<$c>) {
+		next if m/^GIT:/;
+		if ($in_body) {
+			$summary_empty = 0 unless (/^\n$/);
+		} elsif (/^\n$/) {
+			$in_body = 1;
+			if ($need_8bit_cte) {
+				print $c2 "MIME-Version: 1.0\n",
+					 "Content-Type: text/plain; ",
+					   "charset=$compose_encoding\n",
+					 "Content-Transfer-Encoding: 8bit\n";
+			}
+		} elsif (/^MIME-Version:/i) {
+			$need_8bit_cte = 0;
+		} elsif (/^Subject:\s*(.+)\s*$/i) {
+			$initial_subject = $1;
+			my $subject = $initial_subject;
+			$_ = "Subject: " .
+				quote_subject($subject, $compose_encoding) .
+				"\n";
+		} elsif (/^In-Reply-To:\s*(.+)\s*$/i) {
+			$initial_in_reply_to = $1;
+			next;
+		} elsif (/^Reply-To:\s*(.+)\s*$/i) {
+			$reply_to = $1;
+		} elsif (/^From:\s*(.+)\s*$/i) {
+			$sender = $1;
+			next;
+		} elsif (/^(?:To|Cc|Bcc):/i) {
+			print __("To/Cc/Bcc fields are not interpreted yet, they have been ignored\n");
+			next;
 		}
+		print $c2 $_;
 	}
 	close $c;
+	close $c2;
 
-	open my $c2, ">", $compose_filename . ".final"
-	or die sprintf(__("Failed to open %s.final: %s"), $compose_filename, $!);
-
-
-	if ($parsed_email{'From'}) {
-		$sender = delete($parsed_email{'From'});
-	}
-	if ($parsed_email{'In-Reply-To'}) {
-		$initial_in_reply_to = delete($parsed_email{'In-Reply-To'});
-	}
-	if ($parsed_email{'Reply-To'}) {
-		$reply_to = delete($parsed_email{'Reply-To'});
-	}
-	if ($parsed_email{'Subject'}) {
-		$initial_subject = delete($parsed_email{'Subject'});
-		print $c2 "Subject: " .
-			quote_subject($initial_subject, $compose_encoding) .
-			"\n";
-	}
-
-	if ($parsed_email{'MIME-Version'}) {
-		print $c2 "MIME-Version: $parsed_email{'MIME-Version'}\n",
-				"Content-Type: $parsed_email{'Content-Type'};\n",
-				"Content-Transfer-Encoding: $parsed_email{'Content-Transfer-Encoding'}\n";
-		delete($parsed_email{'MIME-Version'});
-		delete($parsed_email{'Content-Type'});
-		delete($parsed_email{'Content-Transfer-Encoding'});
-	} elsif (file_has_nonascii($compose_filename)) {
-		my $content_type = (delete($parsed_email{'Content-Type'}) or
-			"text/plain; charset=$compose_encoding");
-		print $c2 "MIME-Version: 1.0\n",
-			"Content-Type: $content_type\n",
-			"Content-Transfer-Encoding: 8bit\n";
-	}
-	# Preserve unknown headers
-	foreach my $key (keys %parsed_email) {
-		next if $key eq 'body';
-		print $c2 "$key: $parsed_email{$key}";
-	}
-
-	if ($parsed_email{'body'}) {
-		print $c2 "\n$parsed_email{'body'}\n";
-		delete($parsed_email{'body'});
-	} else {
+	if ($summary_empty) {
 		print __("Summary email is empty, skipping it\n");
 		$compose = -1;
 	}
-
-	close $c2;
-
 } elsif ($annotate) {
 	do_edit(@files);
 }
@@ -1009,32 +995,6 @@  sub ask {
 	return;
 }
 
-sub parse_header_line {
-	my $lines = shift;
-	my $parsed_line = shift;
-	my $addr_pat = join "|", qw(To Cc Bcc);
-
-	foreach (split(/\n/, $lines)) {
-		if (/^($addr_pat):\s*(.+)$/i) {
-		        $parsed_line->{$1} = [ parse_address_line($2) ];
-		} elsif (/^([^:]*):\s*(.+)\s*$/i) {
-		        $parsed_line->{$1} = $2;
-		}
-	}
-}
-
-sub filter_body {
-	my $c = shift;
-	my $body = "";
-	while (my $body_line = <$c>) {
-		if ($body_line !~ m/^GIT:/) {
-			$body .= $body_line;
-		}
-	}
-	return $body;
-}
-
-
 my %broken_encoding;
 
 sub file_declares_8bit_cte {
diff --git a/t/t9001-send-email.sh b/t/t9001-send-email.sh
index 263db3ad17..9644ff5793 100755
--- a/t/t9001-send-email.sh
+++ b/t/t9001-send-email.sh
@@ -2505,4 +2505,39 @@  test_expect_success $PREREQ 'test forbidSendmailVariables behavior override' '
 		HEAD^
 '
 
+test_expect_success $PREREQ '--compose handles lowercase headers' '
+	write_script fake-editor <<-\EOF &&
+	sed "s/^From:.*/from: edited-from@example.com/i" "$1" >"$1.tmp" &&
+	mv "$1.tmp" "$1"
+	EOF
+	clean_fake_sendmail &&
+	git send-email \
+		--compose \
+		--from="Example <from@example.com>" \
+		--to=nobody@example.com \
+		--smtp-server="$(pwd)/fake.sendmail" \
+		HEAD^ &&
+	grep "From: edited-from@example.com" msgtxt1
+'
+
+test_expect_success $PREREQ '--compose handles to headers' '
+	write_script fake-editor <<-\EOF &&
+	sed "s/^$/To: edited-to@example.com\n/" <"$1" >"$1.tmp" &&
+	echo this is the body >>"$1.tmp" &&
+	mv "$1.tmp" "$1"
+	EOF
+	clean_fake_sendmail &&
+	GIT_SEND_EMAIL_NOTTY=1 \
+	git send-email \
+		--compose \
+		--from="Example <from@example.com>" \
+		--to=nobody@example.com \
+		--smtp-server="$(pwd)/fake.sendmail" \
+		HEAD^ &&
+	# Ideally the "to" header we specified would be used,
+	# but the program explicitly warns that these are
+	# ignored. For now, just make sure we did not abort.
+	grep "To:" msgtxt1
+'
+
 test_done