Message ID | pull.1420.git.1668536405563.gitgitgadget@gmail.com (mailing list archive) |
---|---|
State | Superseded |
Headers | show |
Series | range-diff: support reading mbox files | expand |
On Tue, Nov 15, 2022 at 06:20:05PM +0000, Johannes Schindelin via GitGitGadget wrote: > Let's offer a way to read those patches from pre-prepared MBox files > instead when an argument "mbox:<filename>" is passed instead of a commit > range. Very cool. Thanks for working on this. I don't have time to review the whole parser right now (and I agree that it would be nice to see it hook into the existing stuff in mailinfo.c), but the idea sounds delightful. If it's possible to use the battle-tested parts of mailinfo.c without much effort, I'd be in favor of waiting on that. But in case that it's not, I agree with you that we shouldn't let perfect be the enemy of the good[^1], either. Thanks, Taylor [^1]: Though let's make sure that "good" doesn't have any buffer overruns in it ;-)
Hi Dscho On 15/11/2022 18:20, Johannes Schindelin via GitGitGadget wrote: > From: Johannes Schindelin <johannes.schindelin@gmx.de> > > Internally, the `git range-diff` command spawns a `git log` process and > parses its output for the given commit ranges. > > This works well when the patches that need to be compared are present in > the local repository in the form of commits. > > In scenarios where that is not the case, the `range-diff` command is > currently less helpful. > > The Git mailing list is such a scenario: Instead of using Git to > exchange commits, the patches are sent there as plain-text and no commit > range can be specified to let `range-diff` consume those patches. > > Instead, the expectation is to download the mails, apply them locally > and then use `range-diff`. This can be quite cumbersome e.g. when a > suitable base revision has to be found first where the patch applies > cleanly. That's a good motivation for this change. > Let's offer a way to read those patches from pre-prepared MBox files > instead when an argument "mbox:<filename>" is passed instead of a commit > range. > > For extra convenience, interpret the filename `-` as standard input. > This makes it easy to compare contributions on the mailing list with the > actual commits that were integrated into Git's main branch. Example: > > commit=5c4003ca3f0e9ac6d3c8aa3e387ff843bd440411 > mid=bdfa3845b81531863941e6a97c28eb1afa62dd2c.1489435755.git.johannes.schindelin@gmx.de > curl -s https://lore.kernel.org/git/$mid/raw | > git range-diff mbox:- $commit^! > > This addresses https://github.com/gitgitgadget/git/issues/207 > > Signed-off-by: Johannes Schindelin <johannes.schindelin@gmx.de> > --- > range-diff: support reading mbox files > > I frequently find myself wanting to look at the range-diff between some > local commits and the patches on the Git mailing list, but unwilling to > go through the process of finding an appropriate base revision to apply > the patches onto (just to throw the generated patches away afterwards, > anyway). > > So I came up with this patch. May it be helpful to other developers, > too. > > This patch contains a home-rolled mbox parser. Initially, I wrote a > really basic parser and it worked well enough, but, you know, as things > go it became more complex than that in order to provide actually useful > range-diffs for existing commits and their corresponding mails (because > of in-body From: headers, because of -- trailers and long subjects, just > to name a few reasons). In hindsight, it might have made sense to try to > to reuse the parser that is available in mailinfo.c, which I had > initially dismissed as overly complex and unnecessary for this use case. > If anyone feels up to it, I would invite them to adjust this code to > replace the mbox parser with one based on the mailinfo.c. Incrementally, > of course, because the perfect is the enemy of the good. > > Published-As: https://github.com/gitgitgadget/git/releases/tag/pr-1420%2Fdscho%2Frange-diff-from-mbox-v1 > Fetch-It-Via: git fetch https://github.com/gitgitgadget/git pr-1420/dscho/range-diff-from-mbox-v1 > Pull-Request: https://github.com/gitgitgadget/git/pull/1420 > > Documentation/git-range-diff.txt | 3 +- > range-diff.c | 317 ++++++++++++++++++++++++++++++- > t/t3206-range-diff.sh | 9 + > 3 files changed, 327 insertions(+), 2 deletions(-) > > diff --git a/Documentation/git-range-diff.txt b/Documentation/git-range-diff.txt > index 0b393715d70..e2c4661acde 100644 > --- a/Documentation/git-range-diff.txt > +++ b/Documentation/git-range-diff.txt > @@ -37,7 +37,8 @@ There are three ways to specify the commit ranges: > > - `<range1> <range2>`: Either commit range can be of the form > `<base>..<rev>`, `<rev>^!` or `<rev>^-<n>`. See `SPECIFYING RANGES` > - in linkgit:gitrevisions[7] for more details. > + in linkgit:gitrevisions[7] for more details. Alternatively, the > + patches can be provided as an mbox-formatted file via `mbox:<path>`. > > - `<rev1>...<rev2>`. This is equivalent to > `<rev2>..<rev1> <rev1>..<rev2>`. > diff --git a/range-diff.c b/range-diff.c > index 124dd678c38..7c84cdbeffa 100644 > --- a/range-diff.c > +++ b/range-diff.c > @@ -12,6 +12,7 @@ > #include "userdiff.h" > #include "apply.h" > #include "revision.h" > +#include "dir.h" > > struct patch_util { > /* For the search for an exact match */ > @@ -26,6 +27,293 @@ struct patch_util { > struct object_id oid; > }; > > +static inline int strtost(char const *s, size_t *result, const char **end) > +{ > + unsigned long u; > + char *p; > + > + errno = 0; > + /* negative values would be accepted by strtoul */ > + if (*s == '-') > + return -1; I think it is right to treat the input as untrusted and so look for malformed hunk headers. However This test is not sufficient for that, we expect a digit so I think if (!isdigit(*s)) return -1; would be safer. The use of strtoul() looks good as we set errno to zero before the call and check both errno and endp afterwards. > + u = strtoul(s, &p, 10); > + if (errno || p == s) > + return -1; > + if (result) > + *result = u; > + *end = p; > + > + return 0; > +} > + > +static int parse_hunk_header(const char *p, > + size_t *old_count, size_t *new_count, > + const char **end) > +{ > + size_t o = 1, n = 1; > + > + if (!skip_prefix(p, "@@ -", &p) || > + strtost(p, NULL, &p) || > + (*p != ' ' && (*p != ',' || strtost(p + 1, &o, &p))) || It took me a minute to understand the double negatives but it is correctly checking if we have ' ' or ',<digits>' > + !skip_prefix(p, " +", &p) || > + strtost(p, NULL, &p) || > + (*p != ' ' && (*p != ',' || strtost(p + 1, &n, &p))) || > + !skip_prefix(p, " @@", &p)) > + return -1; > + > + *old_count = o; > + *new_count = n; > + *end = p; > + > + return 0; > +} > + > +static inline int find_eol(const char *line, size_t size) > +{ > + char *eol; > + > + eol = memchr(line, '\n', size); > + if (!eol) > + return size; > + > + if (eol != line && eol[-1] == '\r') > + eol[-1] = '\0'; > + else > + *eol = '\0'; > + > + return eol + 1 - line; We return the offset to the start of the next line, not the length of the line. This will be important later. > +} > + > +static int read_mbox(const char *path, struct string_list *list) > +{ > + struct strbuf buf = STRBUF_INIT, contents = STRBUF_INIT; > + struct strbuf long_subject = STRBUF_INIT; > + struct patch_util *util = NULL; > + enum { > + MBOX_BEFORE_HEADER, > + MBOX_IN_HEADER, > + MBOX_IN_COMMIT_MESSAGE, > + MBOX_AFTER_TRIPLE_DASH, > + MBOX_IN_DIFF > + } state = MBOX_BEFORE_HEADER; > + char *line, *current_filename = NULL; > + int len; > + size_t size, old_count = 0, new_count = 0; > + const char *author = NULL, *subject = NULL; > + > + if (!strcmp(path, "-")) { > + if (strbuf_read(&contents, STDIN_FILENO, 0) < 0) > + return error_errno(_("could not read stdin")); > + } else if (strbuf_read_file(&contents, path, 0) < 0) > + return error_errno(_("could not read '%s'"), path); > + > + line = contents.buf; > + size = contents.len; > + for (; size > 0; size -= len, line += len) { size is unsigned so we're effectively testing 'size != 0' which means if we're off by one somewhere we'll have an undetected buffer overflow. Using a signed type wouldn't prevent the buffer overflow but it would limit its extent. > + const char *p; > + > + len = find_eol(line, size); Here len is not the length of line if it originally ended "\r\n". > + if (state == MBOX_BEFORE_HEADER) { > + if (!skip_prefix(line, "From ", &p)) > + continue; > + > + util = xcalloc(1, sizeof(*util)); > + if (get_oid_hex(p, &util->oid) < 0) > + oidcpy(&util->oid, null_oid()); > + util->matching = -1; > + author = subject = NULL; > + > + state = MBOX_IN_HEADER; I wondered if there should there be a `continue;` here but I think it probably needs to "fall-through" to the MBOX_IN_HEADER handling below. A comment to clarify that would be helpful. > + } > + > + if (starts_with(line, "diff --git ")) { > + struct patch patch = { 0 }; > + struct strbuf root = STRBUF_INIT; > + int linenr = 0; > + int orig_len; > + > + state = MBOX_IN_DIFF; > + old_count = new_count = 0; > + strbuf_addch(&buf, '\n'); > + if (!util->diff_offset) > + util->diff_offset = buf.len; > + line[len - 1] = '\n'; Here the line will still be NUL terminated if it originally ended "\r\n" which presumably messes up the call to parse_git_diff_header() below. I have not checked if parse_git_diff_header() can handle "\r\n" when it is parsing the rest of the diff header. > + orig_len = len; > + len = parse_git_diff_header(&root, &linenr, 1, line, > + len, size, &patch); > + if (len < 0) { > + error(_("could not parse git header '%.*s'"), > + orig_len, line); > + free(util); > + free(current_filename); > + string_list_clear(list, 1); > + strbuf_release(&buf); > + strbuf_release(&contents); > + strbuf_release(&long_subject); > + return -1; > + } > + > + if (patch.old_name) > + skip_prefix(patch.old_name, "a/", > + (const char **)&patch.old_name); > + if (patch.new_name) > + skip_prefix(patch.new_name, "b/", > + (const char **)&patch.new_name); I think this is fine for now but we might want to support other prefixes in the future. If it is not a copy or rename then the filename can be deduced by finding the common tail of patch.old_name and patch.new_name and stripping anything before the first '/'. If it is a copy or rename then I suspect there is no prefix (though I've not checked) > + strbuf_addstr(&buf, " ## "); > + if (patch.is_new > 0) `patch.is_now` and `patch.is_delete` are booleans like `patch.is_rename` so we don't need the '> 0' > + strbuf_addf(&buf, "%s (new)", patch.new_name); > + else if (patch.is_delete > 0) > + strbuf_addf(&buf, "%s (deleted)", patch.old_name); > + else if (patch.is_rename) > + strbuf_addf(&buf, "%s => %s", patch.old_name, patch.new_name); > + else > + strbuf_addstr(&buf, patch.new_name); > + > + free(current_filename); > + if (patch.is_delete > 0) > + current_filename = xstrdup(patch.old_name); > + else > + current_filename = xstrdup(patch.new_name); > + > + if (patch.new_mode && patch.old_mode && > + patch.old_mode != patch.new_mode) > + strbuf_addf(&buf, " (mode change %06o => %06o)", > + patch.old_mode, patch.new_mode); > + > + strbuf_addstr(&buf, " ##\n"); > + util->diffsize++; > + } else if (state == MBOX_IN_HEADER) { > + if (!line[0]) { > + state = MBOX_IN_COMMIT_MESSAGE; > + /* Look for an in-body From: */ > + if (size > 5 && skip_prefix(line + 1, "From: ", &p)) { The "size > 5" seems a bit unnecessary as we're using skip_prefix() > + size -= p - line; > + line += p - line; This is good, we're accounting for reading the next line. > + len = find_eol(line, size); > + > + while (isspace(*p)) > + p++; > + author = p; > + } > + strbuf_addstr(&buf, " ## Metadata ##\n"); > + if (author) > + strbuf_addf(&buf, "Author: %s\n", author); > + strbuf_addstr(&buf, "\n ## Commit message ##\n"); > + if (subject) > + strbuf_addf(&buf, " %s\n\n", subject); > + } else if (skip_prefix(line, "From: ", &p)) { > + while (isspace(*p)) > + p++; > + author = p; > + } else if (skip_prefix(line, "Subject: ", &p)) { > + const char *q; > + > + while (isspace(*p)) > + p++; > + subject = p; > + > + if (starts_with(p, "[PATCH") && > + (q = strchr(p, ']'))) { > + q++; > + while (isspace(*q)) > + q++; > + subject = q; > + } > + > + if (len < size && line[len] == ' ') { > + /* handle long subject */ > + strbuf_reset(&long_subject); > + strbuf_addstr(&long_subject, subject); > + while (len < size && line[len] == ' ') { > + line += len; > + size -= len; > + len = find_eol(line, size); > + strbuf_addstr(&long_subject, line); Looks good > + } > + subject = long_subject.buf; > + } > + } > + } else if (state == MBOX_IN_COMMIT_MESSAGE) { > + if (!*line) Not a big issue elsewhere you've used "!line[0]" Style: there should be braces on this branch. > + strbuf_addch(&buf, '\n'); > + else if (strcmp(line, "---")) { > + int tabs = 0; > + > + /* simulate tab expansion */ > + while (line[tabs] == '\t') > + tabs++; > + strbuf_addf(&buf, "%*s%s\n", > + 4 + 8 * tabs, "", line + tabs); > + } else { > + /* > + * Trim the trailing newline that is added > + * by `format-patch`. > + */ > + strbuf_trim_trailing_newline(&buf); > + state = MBOX_AFTER_TRIPLE_DASH; > + } > + } else if (state == MBOX_IN_DIFF) { > + switch (line[0]) { > + case '\0': > + continue; /* ignore empty lines after diff */ > + case '+': > + case '-': > + case ' ': > + if (!old_count && !new_count) > + break; This shouldn't happen in a well formed diff. Below we happily accept bad counts, is there a reason to reject them here? > + if (old_count && line[0] != '+') > + old_count--; > + if (new_count && line[0] != '-') > + new_count--; The diff is malformed if old_count == 0 and we see '-' or ' ' or new_count == 0 and we see '+' or ' '. The code is careful not to decrement the count in that case so I think it is harmless to accept diffs with bad line counts in the hunk header. > + /* fallthrough */ > + case '\\': > + strbuf_addstr(&buf, line); > + strbuf_addch(&buf, '\n'); > + util->diffsize++; I think this might be a better place to break if old_count and new_count are both zero. > + continue; > + case '@': > + if (parse_hunk_header(line, &old_count, > + &new_count, &p)) > + break; > + > + strbuf_addstr(&buf, "@@"); > + if (current_filename && *p) > + strbuf_addf(&buf, " %s:", > + current_filename); > + strbuf_addstr(&buf, p); > + strbuf_addch(&buf, '\n'); > + util->diffsize++; > + continue; > + } This is effectively the `default:` clause as it is executed when we don't handle the line above. We ignore the contents of this line which makes me wonder what happens if it is the start of another diff. Do we have tests that alter more than one file in a single commit? I think this is a useful addition, it could perhaps benefit from more tests though. Having tests for bad input, "\r\n" line endings and getting the author from a From: header as well as an in-body From: line would give a bit more reassurance about how robust the parser is. Best Wishes Phillip
Hi Dscho Having slept on it I think I misunderstood what was happening in the diff parsing yesterday On 16/11/2022 14:40, Phillip Wood wrote: >> + } else if (state == MBOX_IN_DIFF) { >> + switch (line[0]) { >> + case '\0': >> + continue; /* ignore empty lines after diff */ >> + case '+': >> + case '-': >> + case ' ': >> + if (!old_count && !new_count) >> + break; > > This shouldn't happen in a well formed diff. Below we happily accept bad > counts, is there a reason to reject them here? I think this might be picking up the "--" at the end of the patch as we don't want to break here at the end of a hunk. If so then a comment would be helpful. >> + if (old_count && line[0] != '+') >> + old_count--; >> + if (new_count && line[0] != '-') >> + new_count--; > > The diff is malformed if old_count == 0 and we see '-' or ' ' or > new_count == 0 and we see '+' or ' '. The code is careful not to > decrement the count in that case so I think it is harmless to accept > diffs with bad line counts in the hunk header. >> + /* fallthrough */ >> + case '\\': >> + strbuf_addstr(&buf, line); >> + strbuf_addch(&buf, '\n'); >> + util->diffsize++; > > I think this might be a better place to break if old_count and new_count > are both zero. It would be the right place to break at the end of each hunk, but I don't think we want to do that. >> + continue; >> + case '@': >> + if (parse_hunk_header(line, &old_count, >> + &new_count, &p)) >> + break; >> + >> + strbuf_addstr(&buf, "@@"); >> + if (current_filename && *p) >> + strbuf_addf(&buf, " %s:", >> + current_filename); >> + strbuf_addstr(&buf, p); >> + strbuf_addch(&buf, '\n'); >> + util->diffsize++; >> + continue; >> + } > > This is effectively the `default:` clause as it is executed when we > don't handle the line above. We ignore the contents of this line which > makes me wonder what happens if it is the start of another diff. We'll pick that up earlier with "if (starts_with(line, "diff --git"))" We only get here at the end of a patch (assuming it has the "--" line from format-patch) Best Wishes Phillip > Do we > have tests that alter more than one file in a single commit? > > I think this is a useful addition, it could perhaps benefit from more > tests though. Having tests for bad input, "\r\n" line endings and > getting the author from a From: header as well as an in-body From: line > would give a bit more reassurance about how robust the parser is. > > Best Wishes > > Phillip
On Tue, Nov 15 2022, Johannes Schindelin via GitGitGadget wrote: > + if (starts_with(line, "diff --git ")) { > + struct patch patch = { 0 }; > + struct strbuf root = STRBUF_INIT; > + int linenr = 0; > + int orig_len; > + > + state = MBOX_IN_DIFF; > + old_count = new_count = 0; > + strbuf_addch(&buf, '\n'); > + if (!util->diff_offset) > + util->diff_offset = buf.len; > + line[len - 1] = '\n'; > + orig_len = len; > + len = parse_git_diff_header(&root, &linenr, 1, line, > + len, size, &patch); Try this with SANITIZE=leak, e.g. this seems to fix 1/4 leaks that pop up if you try the command noted in the patch: diff --git a/range-diff.c b/range-diff.c index 77fa9b970b1..7ff33f92e39 100644 --- a/range-diff.c +++ b/range-diff.c @@ -142,6 +142,7 @@ static int read_mbox(const char *path, struct string_list *list) orig_len = len; len = parse_git_diff_header(&root, &linenr, 1, line, len, size, &patch); + free(patch.def_name); if (len < 0) { error(_("could not parse git header '%.*s'"), orig_len, line); Maybe it really should segfault, I didn't check carefully, but your test passes with SANITIZE=address with this, so if so it's missing coverage...
Hi Ævar, On Thu, 17 Nov 2022, Ævar Arnfjörð Bjarmason wrote: > > On Tue, Nov 15 2022, Johannes Schindelin via GitGitGadget wrote: > > > + if (starts_with(line, "diff --git ")) { > > + struct patch patch = { 0 }; > > + struct strbuf root = STRBUF_INIT; > > + int linenr = 0; > > + int orig_len; > > + > > + state = MBOX_IN_DIFF; > > + old_count = new_count = 0; > > + strbuf_addch(&buf, '\n'); > > + if (!util->diff_offset) > > + util->diff_offset = buf.len; > > + line[len - 1] = '\n'; > > + orig_len = len; > > + len = parse_git_diff_header(&root, &linenr, 1, line, > > + len, size, &patch); > > Try this with SANITIZE=leak, e.g. this seems to fix 1/4 leaks that pop > up if you try the command noted in the patch: > > diff --git a/range-diff.c b/range-diff.c > index 77fa9b970b1..7ff33f92e39 100644 > --- a/range-diff.c > +++ b/range-diff.c > @@ -142,6 +142,7 @@ static int read_mbox(const char *path, struct string_list *list) > orig_len = len; > len = parse_git_diff_header(&root, &linenr, 1, line, > len, size, &patch); > + free(patch.def_name); > if (len < 0) { > error(_("could not parse git header '%.*s'"), > orig_len, line); Thank you for keeping your feedback concise. Much appreciated. This will be addressed in the next iteration, Johannes
Hi Phillip, On Thu, 17 Nov 2022, Phillip Wood wrote: > On 16/11/2022 14:40, Phillip Wood wrote: > > > + } else if (state == MBOX_IN_DIFF) { > > > + switch (line[0]) { > > > + case '\0': > > > + continue; /* ignore empty lines after diff */ > > > + case '+': > > > + case '-': > > > + case ' ': > > > + if (!old_count && !new_count) > > > + break; > > > > This shouldn't happen in a well formed diff. Below we happily accept bad > > counts, is there a reason to reject them here? > > I think this might be picking up the "--" at the end of the patch as we don't > want to break here at the end of a hunk. If so then a comment would be > helpful. Agreed. And yes, it is picking up the "-- " line at the end of the patch. > > > + if (old_count && line[0] != '+') > > > + old_count--; > > > + if (new_count && line[0] != '-') > > > + new_count--; > > > > The diff is malformed if old_count == 0 and we see '-' or ' ' or new_count > > == 0 and we see '+' or ' '. The code is careful not to decrement the count > > in that case so I think it is harmless to accept diffs with bad line counts > > in the hunk header. I might be overly cautious here, but as you mentioned elsewhere, it is really bad if a `size_t` is decremented below 0, and `new_count`/`old_count` are of that type. > > > + /* fallthrough */ > > > + case '\\': > > > + strbuf_addstr(&buf, line); > > > + strbuf_addch(&buf, '\n'); > > > + util->diffsize++; > > > > I think this might be a better place to break if old_count and new_count are > > both zero. > > It would be the right place to break at the end of each hunk, but I don't > think we want to do that. It would not even be the right place to break here then: think of the `\ No newline at end of file` lines: they come after the preceding line decremented `old_count`/`new_count`, yet we still want them to be part of the diff. > > > > + continue; > > > + case '@': > > > + if (parse_hunk_header(line, &old_count, > > > + &new_count, &p)) > > > + break; > > > + > > > + strbuf_addstr(&buf, "@@"); > > > + if (current_filename && *p) > > > + strbuf_addf(&buf, " %s:", > > > + current_filename); > > > + strbuf_addstr(&buf, p); > > > + strbuf_addch(&buf, '\n'); > > > + util->diffsize++; > > > + continue; > > > + } > > > > This is effectively the `default:` clause as it is executed when we don't > > handle the line above. We ignore the contents of this line which makes me > > wonder what happens if it is the start of another diff. > > We'll pick that up earlier with "if (starts_with(line, "diff --git"))" > > We only get here at the end of a patch (assuming it has the "--" line from > format-patch) We also get here in case of garbage in the middle of a diff ;-) Thank you for setting a fantastic example how to review code in a constructive, helpful manner! Dscho
Hi Phillip, On Wed, 16 Nov 2022, Phillip Wood wrote: > On 15/11/2022 18:20, Johannes Schindelin via GitGitGadget wrote: > > From: Johannes Schindelin <johannes.schindelin@gmx.de> > > > > Internally, the `git range-diff` command spawns a `git log` process and > > parses its output for the given commit ranges. > > > > This works well when the patches that need to be compared are present in > > the local repository in the form of commits. > > > > In scenarios where that is not the case, the `range-diff` command is > > currently less helpful. > > > > The Git mailing list is such a scenario: Instead of using Git to > > exchange commits, the patches are sent there as plain-text and no commit > > range can be specified to let `range-diff` consume those patches. > > > > Instead, the expectation is to download the mails, apply them locally > > and then use `range-diff`. This can be quite cumbersome e.g. when a > > suitable base revision has to be found first where the patch applies > > cleanly. > > That's a good motivation for this change. Side note (potentially fun to know): A couple of weeks ago, I wanted to work on this because I was asked to review a new patch series iteration on the Git mailing list (I forgot which one it was, precisely), which came without a range-diff, and I wanted to download both iterations using a variation of https://github.com/git-for-windows/build-extra/blob/HEAD/apply-from-lore.sh and then use the mbox mode of `range-diff` on them. Unfortunately, I ran out of time back then (it's a common theme for me these days), and the patch series (rightfully) advanced to `next` without my review, and I let this here patch slide. > > +static inline int strtost(char const *s, size_t *result, const char **end) > > +{ > > + unsigned long u; > > + char *p; > > + > > + errno = 0; > > + /* negative values would be accepted by strtoul */ > > + if (*s == '-') > > + return -1; > > I think it is right to treat the input as untrusted and so look for malformed > hunk headers. However This test is not sufficient for that, we expect a digit > so I think > > if (!isdigit(*s)) > return -1; > > would be safer. The use of strtoul() looks good as we set errno to zero before > the call and check both errno and endp afterwards. Good point. As you might have guessed, this function is a copy/edit of existing code, in this instance `strtoul_ui()`. I have made the change you suggested. > > > + u = strtoul(s, &p, 10); > > + if (errno || p == s) > > + return -1; > > + if (result) > > + *result = u; > > + *end = p; > > + > > + return 0; > > +} > > + > > +static int parse_hunk_header(const char *p, > > + size_t *old_count, size_t *new_count, > > + const char **end) > > +{ > > + size_t o = 1, n = 1; > > + > > + if (!skip_prefix(p, "@@ -", &p) || > > + strtost(p, NULL, &p) || > > + (*p != ' ' && (*p != ',' || strtost(p + 1, &o, &p))) || > > It took me a minute to understand the double negatives but it is correctly > checking if we have ' ' or ',<digits>' I actually hesitated when I wrote this, and only kept the code as-is because I remembered how often Junio uses double negatives and thought it would be a good joke to do the same here. Joke's obviously on me, though. I've hence changed it to: /* The range is -<start>[,<count>], defaulting to count = 1 */ !(*p == ' ' || (*p == ',' && !strtost(p + 1, &o, &p))) || and /* The range is * +<start>[,<count>], * defaulting to count = 1 */ !(*p == ' ' || (*p == ',' && !strtost(p + 1, &n, &p))) || > > > + !skip_prefix(p, " +", &p) || > > + strtost(p, NULL, &p) || > > + (*p != ' ' && (*p != ',' || strtost(p + 1, &n, &p))) || > > + !skip_prefix(p, " @@", &p)) > > + return -1; > > + > > + *old_count = o; > > + *new_count = n; > > + *end = p; > > + > > + return 0; > > +} > > + > > +static inline int find_eol(const char *line, size_t size) > > +{ > > + char *eol; > > + > > + eol = memchr(line, '\n', size); > > + if (!eol) > > + return size; > > + > > + if (eol != line && eol[-1] == '\r') > > + eol[-1] = '\0'; > > + else > > + *eol = '\0'; > > + > > + return eol + 1 - line; > > We return the offset to the start of the next line, not the length of the > line. This will be important later. You're right. I changed the name to `find_next_line()` and added an informative comment on top of the function. > > > +} > > + > > +static int read_mbox(const char *path, struct string_list *list) > > +{ > > + struct strbuf buf = STRBUF_INIT, contents = STRBUF_INIT; > > + struct strbuf long_subject = STRBUF_INIT; > > + struct patch_util *util = NULL; > > + enum { > > + MBOX_BEFORE_HEADER, > > + MBOX_IN_HEADER, > > + MBOX_IN_COMMIT_MESSAGE, > > + MBOX_AFTER_TRIPLE_DASH, > > + MBOX_IN_DIFF > > + } state = MBOX_BEFORE_HEADER; > > + char *line, *current_filename = NULL; > > + int len; > > + size_t size, old_count = 0, new_count = 0; > > + const char *author = NULL, *subject = NULL; > > + > > + if (!strcmp(path, "-")) { > > + if (strbuf_read(&contents, STDIN_FILENO, 0) < 0) > > + return error_errno(_("could not read stdin")); > > + } else if (strbuf_read_file(&contents, path, 0) < 0) > > + return error_errno(_("could not read '%s'"), path); > > + > > + line = contents.buf; > > + size = contents.len; > > + for (; size > 0; size -= len, line += len) { > > size is unsigned so we're effectively testing 'size != 0' which means if we're > off by one somewhere we'll have an undetected buffer overflow. Using a signed > type wouldn't prevent the buffer overflow but it would limit its extent. `contents.len` is of type `size_t`, though, and I'd like to stay consistent. I did remove the misleading `> 0` from the condition, though. > > > + const char *p; > > + > > + len = find_eol(line, size); > > Here len is not the length of line if it originally ended "\r\n". > > > + if (state == MBOX_BEFORE_HEADER) { > > + if (!skip_prefix(line, "From ", &p)) > > + continue; > > + > > + util = xcalloc(1, sizeof(*util)); > > + if (get_oid_hex(p, &util->oid) < 0) > > + oidcpy(&util->oid, null_oid()); > > + util->matching = -1; > > + author = subject = NULL; > > + > > + state = MBOX_IN_HEADER; > > I wondered if there should there be a `continue;` here but I think it probably > needs to "fall-through" to the MBOX_IN_HEADER handling below. A comment to > clarify that would be helpful. You're absolutely correct, there totally should be a `continue` here: we saw a `From `, and in the `MBOX_IN_HEADER` we look for either an empty line, the `From:` header or the `Subject:` header, so we know that the current line cannot match, no need to fall through. > > > + } > > + > > + if (starts_with(line, "diff --git ")) { > > + struct patch patch = { 0 }; > > + struct strbuf root = STRBUF_INIT; > > + int linenr = 0; > > + int orig_len; > > + > > + state = MBOX_IN_DIFF; > > + old_count = new_count = 0; > > + strbuf_addch(&buf, '\n'); > > + if (!util->diff_offset) > > + util->diff_offset = buf.len; > > + line[len - 1] = '\n'; > > Here the line will still be NUL terminated if it originally ended "\r\n" which > presumably messes up the call to parse_git_diff_header() below. I have not > checked if parse_git_diff_header() can handle "\r\n" when it is parsing the > rest of the diff header. I changed it locally to reinstate the `\r\n`, only to figure out that almost the entire `apply.c` machinery totally falls over CR/LF line endings. I changed the code to detect a Carriage Return and error out if one is detected. > > > + orig_len = len; > > + len = parse_git_diff_header(&root, &linenr, 1, line, > > + len, size, &patch); > > + if (len < 0) { > > + error(_("could not parse git header '%.*s'"), > > + orig_len, line); > > + free(util); > > + free(current_filename); > > + string_list_clear(list, 1); > > + strbuf_release(&buf); > > + strbuf_release(&contents); > > + strbuf_release(&long_subject); > > + return -1; > > + } > > + > > + if (patch.old_name) > > + skip_prefix(patch.old_name, "a/", > > + (const char **)&patch.old_name); > > + if (patch.new_name) > > + skip_prefix(patch.new_name, "b/", > > + (const char **)&patch.new_name); > > I think this is fine for now but we might want to support other prefixes in > the future. If it is not a copy or rename then the filename can be deduced by > finding the common tail of patch.old_name and patch.new_name and stripping > anything before the first '/'. If it is a copy or rename then I suspect there > is no prefix (though I've not checked) Since `skip_prefix()` does not do anything if there is no match, I think it is sane to err by not stripping anything unless the expected `a/` and `b/` prefixes are seen. > > > + strbuf_addstr(&buf, " ## "); > > + if (patch.is_new > 0) > > `patch.is_now` and `patch.is_delete` are booleans like `patch.is_rename` so we > don't need the '> 0' Good catch. > > > + strbuf_addf(&buf, "%s (new)", patch.new_name); > > + else if (patch.is_delete > 0) > > + strbuf_addf(&buf, "%s (deleted)", > > patch.old_name); > > + else if (patch.is_rename) > > + strbuf_addf(&buf, "%s => %s", patch.old_name, > > patch.new_name); > > + else > > + strbuf_addstr(&buf, patch.new_name); > > + > > + free(current_filename); > > + if (patch.is_delete > 0) > > + current_filename = xstrdup(patch.old_name); > > + else > > + current_filename = xstrdup(patch.new_name); > > + > > + if (patch.new_mode && patch.old_mode && > > + patch.old_mode != patch.new_mode) > > + strbuf_addf(&buf, " (mode change %06o => > > %06o)", > > + patch.old_mode, patch.new_mode); > > + > > + strbuf_addstr(&buf, " ##\n"); > > + util->diffsize++; > > + } else if (state == MBOX_IN_HEADER) { > > + if (!line[0]) { > > + state = MBOX_IN_COMMIT_MESSAGE; > > + /* Look for an in-body From: */ > > + if (size > 5 && skip_prefix(line + 1, "From: > > ", &p)) { > > The "size > 5" seems a bit unnecessary as we're using skip_prefix() Right! > > > + size -= p - line; > > + line += p - line; > > This is good, we're accounting for reading the next line. > > > + len = find_eol(line, size); > > + > > + while (isspace(*p)) > > + p++; > > + author = p; > > + } > > + strbuf_addstr(&buf, " ## Metadata ##\n"); > > + if (author) > > + strbuf_addf(&buf, "Author: %s\n", > > author); > > + strbuf_addstr(&buf, "\n ## Commit message > > ##\n"); > > + if (subject) > > + strbuf_addf(&buf, " %s\n\n", > > subject); > > + } else if (skip_prefix(line, "From: ", &p)) { > > + while (isspace(*p)) > > + p++; > > + author = p; > > + } else if (skip_prefix(line, "Subject: ", &p)) { > > + const char *q; > > + > > + while (isspace(*p)) > > + p++; > > + subject = p; > > + > > + if (starts_with(p, "[PATCH") && > > + (q = strchr(p, ']'))) { > > + q++; > > + while (isspace(*q)) > > + q++; > > + subject = q; > > + } > > + > > + if (len < size && line[len] == ' ') { > > + /* handle long subject */ > > + strbuf_reset(&long_subject); > > + strbuf_addstr(&long_subject, subject); > > + while (len < size && line[len] == ' ') > > { > > + line += len; > > + size -= len; > > + len = find_eol(line, size); > > + strbuf_addstr(&long_subject, > > line); > > Looks good > > > + } > > + subject = long_subject.buf; > > + } > > + } > > + } else if (state == MBOX_IN_COMMIT_MESSAGE) { > > + if (!*line) > > Not a big issue elsewhere you've used "!line[0]" > Style: there should be braces on this branch. Fixed on both accounts. > [... addressed in follow-up mail...] > > I think this is a useful addition, it could perhaps benefit from more tests > though. Having tests for bad input, "\r\n" line endings and getting the author > from a From: header as well as an in-body From: line would give a bit more > reassurance about how robust the parser is. Good point, I've augmented the test case a bit more. Thank you again. It is a real pleasure to receive these constructive reviews from you that pay so much attention to detail, and always lead to a clear path forward. Thanks! Dscho
diff --git a/Documentation/git-range-diff.txt b/Documentation/git-range-diff.txt index 0b393715d70..e2c4661acde 100644 --- a/Documentation/git-range-diff.txt +++ b/Documentation/git-range-diff.txt @@ -37,7 +37,8 @@ There are three ways to specify the commit ranges: - `<range1> <range2>`: Either commit range can be of the form `<base>..<rev>`, `<rev>^!` or `<rev>^-<n>`. See `SPECIFYING RANGES` - in linkgit:gitrevisions[7] for more details. + in linkgit:gitrevisions[7] for more details. Alternatively, the + patches can be provided as an mbox-formatted file via `mbox:<path>`. - `<rev1>...<rev2>`. This is equivalent to `<rev2>..<rev1> <rev1>..<rev2>`. diff --git a/range-diff.c b/range-diff.c index 124dd678c38..7c84cdbeffa 100644 --- a/range-diff.c +++ b/range-diff.c @@ -12,6 +12,7 @@ #include "userdiff.h" #include "apply.h" #include "revision.h" +#include "dir.h" struct patch_util { /* For the search for an exact match */ @@ -26,6 +27,293 @@ struct patch_util { struct object_id oid; }; +static inline int strtost(char const *s, size_t *result, const char **end) +{ + unsigned long u; + char *p; + + errno = 0; + /* negative values would be accepted by strtoul */ + if (*s == '-') + return -1; + u = strtoul(s, &p, 10); + if (errno || p == s) + return -1; + if (result) + *result = u; + *end = p; + + return 0; +} + +static int parse_hunk_header(const char *p, + size_t *old_count, size_t *new_count, + const char **end) +{ + size_t o = 1, n = 1; + + if (!skip_prefix(p, "@@ -", &p) || + strtost(p, NULL, &p) || + (*p != ' ' && (*p != ',' || strtost(p + 1, &o, &p))) || + !skip_prefix(p, " +", &p) || + strtost(p, NULL, &p) || + (*p != ' ' && (*p != ',' || strtost(p + 1, &n, &p))) || + !skip_prefix(p, " @@", &p)) + return -1; + + *old_count = o; + *new_count = n; + *end = p; + + return 0; +} + +static inline int find_eol(const char *line, size_t size) +{ + char *eol; + + eol = memchr(line, '\n', size); + if (!eol) + return size; + + if (eol != line && eol[-1] == '\r') + eol[-1] = '\0'; + else + *eol = '\0'; + + return eol + 1 - line; +} + +static int read_mbox(const char *path, struct string_list *list) +{ + struct strbuf buf = STRBUF_INIT, contents = STRBUF_INIT; + struct strbuf long_subject = STRBUF_INIT; + struct patch_util *util = NULL; + enum { + MBOX_BEFORE_HEADER, + MBOX_IN_HEADER, + MBOX_IN_COMMIT_MESSAGE, + MBOX_AFTER_TRIPLE_DASH, + MBOX_IN_DIFF + } state = MBOX_BEFORE_HEADER; + char *line, *current_filename = NULL; + int len; + size_t size, old_count = 0, new_count = 0; + const char *author = NULL, *subject = NULL; + + if (!strcmp(path, "-")) { + if (strbuf_read(&contents, STDIN_FILENO, 0) < 0) + return error_errno(_("could not read stdin")); + } else if (strbuf_read_file(&contents, path, 0) < 0) + return error_errno(_("could not read '%s'"), path); + + line = contents.buf; + size = contents.len; + for (; size > 0; size -= len, line += len) { + const char *p; + + len = find_eol(line, size); + + if (state == MBOX_BEFORE_HEADER) { + if (!skip_prefix(line, "From ", &p)) + continue; + + util = xcalloc(1, sizeof(*util)); + if (get_oid_hex(p, &util->oid) < 0) + oidcpy(&util->oid, null_oid()); + util->matching = -1; + author = subject = NULL; + + state = MBOX_IN_HEADER; + } + + if (starts_with(line, "diff --git ")) { + struct patch patch = { 0 }; + struct strbuf root = STRBUF_INIT; + int linenr = 0; + int orig_len; + + state = MBOX_IN_DIFF; + old_count = new_count = 0; + strbuf_addch(&buf, '\n'); + if (!util->diff_offset) + util->diff_offset = buf.len; + line[len - 1] = '\n'; + orig_len = len; + len = parse_git_diff_header(&root, &linenr, 1, line, + len, size, &patch); + if (len < 0) { + error(_("could not parse git header '%.*s'"), + orig_len, line); + free(util); + free(current_filename); + string_list_clear(list, 1); + strbuf_release(&buf); + strbuf_release(&contents); + strbuf_release(&long_subject); + return -1; + } + + if (patch.old_name) + skip_prefix(patch.old_name, "a/", + (const char **)&patch.old_name); + if (patch.new_name) + skip_prefix(patch.new_name, "b/", + (const char **)&patch.new_name); + + strbuf_addstr(&buf, " ## "); + if (patch.is_new > 0) + strbuf_addf(&buf, "%s (new)", patch.new_name); + else if (patch.is_delete > 0) + strbuf_addf(&buf, "%s (deleted)", patch.old_name); + else if (patch.is_rename) + strbuf_addf(&buf, "%s => %s", patch.old_name, patch.new_name); + else + strbuf_addstr(&buf, patch.new_name); + + free(current_filename); + if (patch.is_delete > 0) + current_filename = xstrdup(patch.old_name); + else + current_filename = xstrdup(patch.new_name); + + if (patch.new_mode && patch.old_mode && + patch.old_mode != patch.new_mode) + strbuf_addf(&buf, " (mode change %06o => %06o)", + patch.old_mode, patch.new_mode); + + strbuf_addstr(&buf, " ##\n"); + util->diffsize++; + } else if (state == MBOX_IN_HEADER) { + if (!line[0]) { + state = MBOX_IN_COMMIT_MESSAGE; + /* Look for an in-body From: */ + if (size > 5 && skip_prefix(line + 1, "From: ", &p)) { + size -= p - line; + line += p - line; + len = find_eol(line, size); + + while (isspace(*p)) + p++; + author = p; + } + strbuf_addstr(&buf, " ## Metadata ##\n"); + if (author) + strbuf_addf(&buf, "Author: %s\n", author); + strbuf_addstr(&buf, "\n ## Commit message ##\n"); + if (subject) + strbuf_addf(&buf, " %s\n\n", subject); + } else if (skip_prefix(line, "From: ", &p)) { + while (isspace(*p)) + p++; + author = p; + } else if (skip_prefix(line, "Subject: ", &p)) { + const char *q; + + while (isspace(*p)) + p++; + subject = p; + + if (starts_with(p, "[PATCH") && + (q = strchr(p, ']'))) { + q++; + while (isspace(*q)) + q++; + subject = q; + } + + if (len < size && line[len] == ' ') { + /* handle long subject */ + strbuf_reset(&long_subject); + strbuf_addstr(&long_subject, subject); + while (len < size && line[len] == ' ') { + line += len; + size -= len; + len = find_eol(line, size); + strbuf_addstr(&long_subject, line); + } + subject = long_subject.buf; + } + } + } else if (state == MBOX_IN_COMMIT_MESSAGE) { + if (!*line) + strbuf_addch(&buf, '\n'); + else if (strcmp(line, "---")) { + int tabs = 0; + + /* simulate tab expansion */ + while (line[tabs] == '\t') + tabs++; + strbuf_addf(&buf, "%*s%s\n", + 4 + 8 * tabs, "", line + tabs); + } else { + /* + * Trim the trailing newline that is added + * by `format-patch`. + */ + strbuf_trim_trailing_newline(&buf); + state = MBOX_AFTER_TRIPLE_DASH; + } + } else if (state == MBOX_IN_DIFF) { + switch (line[0]) { + case '\0': + continue; /* ignore empty lines after diff */ + case '+': + case '-': + case ' ': + if (!old_count && !new_count) + break; + if (old_count && line[0] != '+') + old_count--; + if (new_count && line[0] != '-') + new_count--; + /* fallthrough */ + case '\\': + strbuf_addstr(&buf, line); + strbuf_addch(&buf, '\n'); + util->diffsize++; + continue; + case '@': + if (parse_hunk_header(line, &old_count, + &new_count, &p)) + break; + + strbuf_addstr(&buf, "@@"); + if (current_filename && *p) + strbuf_addf(&buf, " %s:", + current_filename); + strbuf_addstr(&buf, p); + strbuf_addch(&buf, '\n'); + util->diffsize++; + continue; + } + + if (util) { + string_list_append(list, buf.buf)->util = util; + strbuf_reset(&buf); + } + util = xcalloc(1, sizeof(*util)); + oidcpy(&util->oid, null_oid()); + util->matching = -1; + author = subject = NULL; + state = MBOX_BEFORE_HEADER; + } + } + strbuf_release(&contents); + + if (util) { + if (state == MBOX_IN_DIFF) + string_list_append(list, buf.buf)->util = util; + else + free(util); + } + strbuf_release(&buf); + strbuf_release(&long_subject); + free(current_filename); + + return 0; +} + /* * Reads the patches into a string list, with the `util` field being populated * as struct object_id (will need to be free()d). @@ -41,6 +329,10 @@ static int read_patches(const char *range, struct string_list *list, ssize_t len; size_t size; int ret = -1; + const char *path; + + if (skip_prefix(range, "mbox:", &path)) + return read_mbox(path, list); strvec_pushl(&cp.args, "log", "--no-color", "-p", "--no-merges", "--reverse", "--date-order", "--decorate=no", @@ -424,6 +716,19 @@ static void output_pair_header(struct diff_options *diffopt, strbuf_addch(buf, ' '); pp_commit_easy(CMIT_FMT_ONELINE, commit, buf); + } else { + struct patch_util *util = b_util ? b_util : a_util; + const char *needle = "\n ## Commit message ##\n"; + const char *p = !util || !util->patch ? + NULL : strstr(util->patch, needle); + if (p) { + if (status == '!') + strbuf_addf(buf, "%s%s", color_reset, color); + + strbuf_addch(buf, ' '); + p += strlen(needle); + strbuf_add(buf, p, strchrnul(p, '\n') - p); + } } strbuf_addf(buf, "%s\n", color_reset); @@ -554,6 +859,9 @@ int show_range_diff(const char *range1, const char *range2, if (range_diff_opts->left_only && range_diff_opts->right_only) res = error(_("options '%s' and '%s' cannot be used together"), "--left-only", "--right-only"); + if (!strcmp(range1, "mbox:-") && !strcmp(range2, "mbox:-")) + res = error(_("only one mbox can be read from stdin")); + if (!res && read_patches(range1, &branch1, range_diff_opts->other_arg)) res = error(_("could not parse log for '%s'"), range1); if (!res && read_patches(range2, &branch2, range_diff_opts->other_arg)) @@ -575,10 +883,17 @@ int show_range_diff(const char *range1, const char *range2, int is_range_diff_range(const char *arg) { char *copy = xstrdup(arg); /* setup_revisions() modifies it */ - const char *argv[] = { "", copy, "--", NULL }; + const char *argv[] = { "", copy, "--", NULL }, *path; int i, positive = 0, negative = 0; struct rev_info revs; + if (skip_prefix(arg, "mbox:", &path)) { + if (!strcmp(path, "-") || file_exists(path)) + return 1; + error_errno(_("not an mbox: '%s'"), path); + return 0; + } + init_revisions(&revs, NULL); if (setup_revisions(3, argv, &revs, NULL) == 1) { for (i = 0; i < revs.pending.nr; i++) diff --git a/t/t3206-range-diff.sh b/t/t3206-range-diff.sh index 459beaf7d9c..2b64d30e8e6 100755 --- a/t/t3206-range-diff.sh +++ b/t/t3206-range-diff.sh @@ -783,6 +783,15 @@ test_expect_success 'ranges with pathspecs' ' ! grep "$topic_oid" actual ' +test_expect_success 'compare range vs mbox' ' + git format-patch --stdout topic..mode-only-change >mbox && + git range-diff topic...mode-only-change >expect && + git range-diff mode-only-change..topic mbox:./mbox >actual && + test_cmp expect actual && + git range-diff mode-only-change..topic mbox:- <mbox >actual && + test_cmp expect actual +' + test_expect_success 'submodule changes are shown irrespective of diff.submodule' ' git init sub-repo && test_commit -C sub-repo sub-first &&