Message ID | 20240621092721.2980939-2-csokas.bence@prolan.hu (mailing list archive) |
---|---|
State | Superseded |
Headers | show |
Series | [resend] git-send-email: Use sanitized address when reading mbox body | expand |
"Csókás, Bence" <csokas.bence@prolan.hu> writes: > Commas and other punctuation marks in 'Cc: ', 'Signed-off-by: ' > etc. lines mess with git-send-email. In parsing the mbox headers, > this is handled by calling `sanitize_address()`. This function > is called when parsing the message body as well, but was only > used for comparing it to $author. Now we add it to @cc too. > > Signed-off-by: Csókás, Bence <csokas.bence@prolan.hu> > --- > git-send-email.perl | 4 ++-- > 1 file changed, 2 insertions(+), 2 deletions(-) > > diff --git a/git-send-email.perl b/git-send-email.perl > index f0be4b4560..72044e5ef3 100755 > --- a/git-send-email.perl > +++ b/git-send-email.perl > @@ -1847,9 +1847,9 @@ sub pre_process_file { > $what, $_) unless $quiet; > next; > } > - push @cc, $c; > + push @cc, $sc; > printf(__("(body) Adding cc: %s from line '%s'\n"), > - $c, $_) unless $quiet; > + $sc, $_) unless $quiet; > } > } Hmph, there is this piece of code before the block: if ($c !~ /.+@.+|<.+>/) { printf("(body) Ignoring %s from line '%s'\n", $what, $_) unless $quiet; next; } This is to reject strings that do not even look like an e-mail address, but we called sanitize_address() call on $c way before this check. I wonder if we should move this block way up, even before the call to santize_address()? That was a relatively unrelated tangent. In the same function, there is this snippet about Cc: (I didn't check if the same issue is shared with other header fields): elsif (/^Cc:\s+(.*)$/i) { foreach my $addr (parse_address_line($1)) { my $qaddr = unquote_rfc2047($addr); my $saddr = sanitize_address($qaddr); if ($saddr eq $sender) { next if ($suppress_cc{'self'}); } else { next if ($suppress_cc{'cc'}); } printf(__("(mbox) Adding cc: %s from line '%s'\n"), $addr, $_) unless $quiet; push @cc, $addr; } } We seem to use sanitized address only for the purpose of suppressing Cc, and use the original address given in the input for e-mail purposes (@cc is the same variable as the patch under discussion sends the "fixed" address to, which holds the data to formulate the Cc: header of the outgoing message, I presume?). So in that way, we seem to be very consistent. Possibly we are being consistent in a broken way, but I am not yet convinced that we are. It looks to me that there are many other places that we try to be as faithful to the original as possible. In the same block as the one that handles "Cc:" I quoted above, an address on "From:" is also sent intact into @cc and addresses on "To:" are handled the same way. The patch under discussion singles out the addresses on the trailers in the message body and treat them differently from others, which I am not sure is what we want to do.
Hi! On 6/21/24 19:27, Junio C Hamano wrote: > Hmph, there is this piece of code before the block: > > if ($c !~ /.+@.+|<.+>/) { > printf("(body) Ignoring %s from line '%s'\n", > $what, $_) unless $quiet; > next; > } > > This is to reject strings that do not even look like an e-mail > address, but we called sanitize_address() call on $c way before this > check. I wonder if we should move this block way up, even before > the call to santize_address()? > > That was a relatively unrelated tangent. Hm, maybe. Should I add this to the patch as well? > In the same function, there is this snippet about Cc: (I didn't > check if the same issue is shared with other header fields): Yes, I've seen that. However, the mbox headers are more likely to conform to RFC 2047, or at least `git format-patch` generates correct mbox headers. (Before sending the original patch, I was playing around with purposefully broken mbox files, and at least Thunderbird seems to correctly parse these non-conforming mbox headers in _most_ cases, but that's probably just extra caution on Mozilla's side). > It looks to me that there are many other places that we try to be as > faithful to the original as possible. In the same block as the one > that handles "Cc:" I quoted above, an address on "From:" is also > sent intact into @cc and addresses on "To:" are handled the same > way. > > The patch under discussion singles out the addresses on the trailers > in the message body and treat them differently from others, which I > am not sure is what we want to do. I think it is a reasonable assumption that the mbox headers will be conforming, whereas the message body is just freeform text and no such guarantees exist. But if we want to be paranoid about it, we could try and sanitize everything. Bence
Csókás Bence <csokas.bence@prolan.hu> writes: > On 6/21/24 19:27, Junio C Hamano wrote: >> Hmph, there is this piece of code before the block: >> if ($c !~ /.+@.+|<.+>/) { >> printf("(body) Ignoring %s from line '%s'\n", >> $what, $_) unless $quiet; >> next; >> } >> This is to reject strings that do not even look like an e-mail >> address, but we called sanitize_address() call on $c way before this >> check. I wonder if we should move this block way up, even before >> the call to santize_address()? >> That was a relatively unrelated tangent. > > Hm, maybe. Should I add this to the patch as well? No, it is an unrelated tangent. We try not to mix unrelated things into a single patch. Perhaps as a separate "clean-up" patch after the dust from the main topic settles, or as a preliminary "clean-up" before the main topic, would be good (but see below). >> It looks to me that there are many other places that we try to be as >> faithful to the original as possible. In the same block as the one >> that handles "Cc:" I quoted above, an address on "From:" is also >> sent intact into @cc and addresses on "To:" are handled the same >> way. >> The patch under discussion singles out the addresses on the trailers >> in the message body and treat them differently from others, which I >> am not sure is what we want to do. > > I think it is a reasonable assumption that the mbox headers will be > conforming, whereas the message body is just freeform text and no such > guarantees exist. But if we want to be paranoid about it, we could try > and sanitize everything. I actually do not think we want to be more paranoid. I'd rather trust what the user gave us, which was where my response came from. Having said that, given that the "Ignoring ..." check we saw earlier in this message triggers only for trailers, it may be a sensible position to take that mail headers are more trustworthy and a random address-looking strings on the trailer lines should be checked more carefully. Because the "Ignoring ..." check is primarily to reject strings that are not addresses (things like bug IDs, or just people's names without e-mail addresses to credit for a new feature request) that we do not ever intend to actually Cc: the message to, another sensible choice coming from that "strings on trailers may not even be addresses" position may be not to add the $c to the Cc: list, if $sc (the sanitized address) and $c (the original address) are different. That is, "the simple regexp check currently used to trigger 'Ignoring ...' message thought that the string looked like an address, but when we ask sanitize_address, it turns out that it was not, really." And if we were to take that route, "Ignoring ..." check, which I called an unrelated tangent, would become very much related to the topic at hand, as it would mean the resulting logic would look more like this: my $sc = sanitize_address($c); if ($c !~ /.+@.+|<.+>/ || $sc ne $c) { printf("(body) Ignoring %s from line '%s'\n", $what, $_) unless $quiet; next; } In any case, if we were to move forward with this topic (whether "send to corrected $sc instead" or "$c is suspicious, do not add it to $cc" is picked as the direction), the motivation behind the design decision to treat the address taken from a trailer line differently needs to be explained better, I think. I had a chance to ask you directly on list, and you gave a good explanation to me in the message I am responding to, but you will not be around when future readers of "git log" want to ask the same question "why are we singling this out while we use the original address in all the other places?". Your proposed commit log message is the place to help them. Oh, before I forget, is this something we can test easily in t9001? We would want to protect a new behaviour from accidental breakage, so adding a test or two would be a good thing. Thanks.
Hi! On 6/24/24 17:50, Junio C Hamano wrote: > another sensible choice coming from that "strings on trailers may > not even be addresses" position may be not to add the $c to the Cc: > list, if $sc (the sanitized address) and $c (the original address) > are different. That is, "the simple regexp check currently used to > trigger 'Ignoring ...' message thought that the string looked like > an address, but when we ask sanitize_address, it turns out that it > was not, really." Maybe. Though we would need to run through the sanitized address through `unquote_rfc2047()` first. But I don't think it's necessary; if someone feeds us an "unsanitary" address (for instance, there is whitespace between the angle brackets), we should try to make sense of it, and worst case, the SMTP server rejects it, as it does now. > In any case, if we were to move forward with this topic (whether > "send to corrected $sc instead" or "$c is suspicious, do not add it > to $cc" is picked as the direction), the motivation behind the > design decision to treat the address taken from a trailer line > differently needs to be explained better, I think. [...] > Your proposed commit log message is the place to help them. Okay. So in short, I should add justification for trusting mbox headers and not the message body, correct? > Oh, before I forget, is this something we can test easily in t9001? > We would want to protect a new behaviour from accidental breakage, > so adding a test or two would be a good thing. Maybe, I'll look into it. Bence
Csókás Bence <csokas.bence@prolan.hu> writes: >> In any case, if we were to move forward with this topic (whether >> "send to corrected $sc instead" or "$c is suspicious, do not add it >> to $cc" is picked as the direction), the motivation behind the >> design decision to treat the address taken from a trailer line >> differently needs to be explained better, I think. [...] >> Your proposed commit log message is the place to help them. > > Okay. So in short, I should add justification for trusting mbox > headers and not the message body, correct? We want these to be explained for future readers: (1) we stuff the sanitized address to @cc in this particular place, but we still let all other places copy the original taken from the message to @to or @cc (as in the original code). (2) the reason behind the above difference is because we trust less about the "address looking" strings that appear on the trailer lines. So, not just (2), but in order for the readers to understand why they should care about (2), they need to be told (1) as well. Thanks.
diff --git a/git-send-email.perl b/git-send-email.perl index f0be4b4560..72044e5ef3 100755 --- a/git-send-email.perl +++ b/git-send-email.perl @@ -1847,9 +1847,9 @@ sub pre_process_file { $what, $_) unless $quiet; next; } - push @cc, $c; + push @cc, $sc; printf(__("(body) Adding cc: %s from line '%s'\n"), - $c, $_) unless $quiet; + $sc, $_) unless $quiet; } } close $fh;
Commas and other punctuation marks in 'Cc: ', 'Signed-off-by: ' etc. lines mess with git-send-email. In parsing the mbox headers, this is handled by calling `sanitize_address()`. This function is called when parsing the message body as well, but was only used for comparing it to $author. Now we add it to @cc too. Signed-off-by: Csókás, Bence <csokas.bence@prolan.hu> --- git-send-email.perl | 4 ++-- 1 file changed, 2 insertions(+), 2 deletions(-)