diff mbox series

mailsplit add option to include sanitized subject in filename

Message ID 20240409000546.3628898-1-jacob.e.keller@intel.com (mailing list archive)
State New, archived
Headers show
Series mailsplit add option to include sanitized subject in filename | expand

Commit Message

Jacob Keller April 9, 2024, 12:05 a.m. UTC
From: Jacob Keller <jacob.keller@gmail.com>

git-am makes use of git-mailsplit to split an mbox into individual files
before attempting to apply the patches. In most cases this works fine,
but it can fail to apply patches in cases where the mbox file is not
properly sorted. This can sometimes happen due to clock skew, or other
issues with the software which saved the mbox.

For example, if you download a t.mbox.gz from a public inbox server such
as lore.kernel.org it may sort the messages in the thread by arrival
time to the list. Due to clock skew or other issues this may not be the
correct order of the patches to apply.

A savvy user may then attempt to directly use git mailsplit to split the
mailbox, only to find that the files are unhelpfully named "0001",
"0002", etc. It requires further digging to figure out which message is
which patch.

Git has a format_sanitized_subject() function which is used by code to
generate a suitable filename from a subject. Add a new --name-by-subject
option to git mailsplit. If enabled, scan for lines beginning with the
"Subject:" header when splitting mail. If found, extract the subject and
pass it to format_sanitized_subject(). Use this to create a new filename
which appends the sanitized subject to the standard sequence number. A
savvy user can invoke git mailsplit with --name-by-subject to help
analyze why the mailbox was not split the intended way.

I originally wanted to avoid the need for an option, but git-am
currently depends on the strict sequence number filenames. It is unclear
how difficult it would be to refactor git-am to work with names that
include the extra subject data.

Suggested-by: Junio C Hamano <gitster@pobox.com>
Signed-off-by: Jacob Keller <jacob.keller@gmail.com>
---
 Documentation/git-mailsplit.txt |  5 +++++
 builtin/mailsplit.c             | 25 ++++++++++++++++++++++++-
 t/t5100-mailinfo.sh             | 25 +++++++++++++++++++++++++
 3 files changed, 54 insertions(+), 1 deletion(-)

Comments

Junio C Hamano April 9, 2024, 1:55 a.m. UTC | #1
Jacob Keller <jacob.e.keller@intel.com> writes:

> I originally wanted to avoid the need for an option, but git-am
> currently depends on the strict sequence number filenames.  It is
> unclear how difficult it would be to refactor git-am to work with
> names that include the extra subject data.

I am not sure if I follow.  Do you mean

	$ git am ./dir/0*.txt

in a directory where I already have these files

	$ ls dir/0*.txt
	dir/0001-Documentation-CodingGuidelines.txt
	dir/0002-quote-assigned-value.txt
	dir/0003-t-local-var.txt

that have one patch per message does not work?
Jacob Keller April 11, 2024, 3:22 a.m. UTC | #2
On Mon, Apr 8, 2024 at 6:55 PM Junio C Hamano <gitster@pobox.com> wrote:
>
> Jacob Keller <jacob.e.keller@intel.com> writes:
>
> > I originally wanted to avoid the need for an option, but git-am
> > currently depends on the strict sequence number filenames.  It is
> > unclear how difficult it would be to refactor git-am to work with
> > names that include the extra subject data.
>
> I am not sure if I follow.  Do you mean
>
>         $ git am ./dir/0*.txt
>

No, I mean git am invokes git mailsplit to split a mailbox file into a
temporary directory, and then expects to find exactly "0000", "0001",
"0002" etc, but not "0001-fix-bug" and "0002-implement-feature"
Junio C Hamano April 11, 2024, 5:29 a.m. UTC | #3
Jacob Keller <jacob.keller@gmail.com> writes:

> No, I mean git am invokes git mailsplit to split a mailbox file into a
> temporary directory, and then expects to find exactly "0000", "0001",
> "0002" etc, but not "0001-fix-bug" and "0002-implement-feature"

Ah, of course.  "am" invokes mailsplit with the understanding that
its external interface is that it will get the total number as
decimal number from its standard output, and the files are named as
just numbers in the specified directory, with specified precision.
If you are mucking with mailsplit to update its output, of course
you must update the expected way "am" receives its input.
Junio C Hamano April 11, 2024, 5:55 a.m. UTC | #4
Jacob Keller <jacob.keller@gmail.com> writes:

>> > I originally wanted to avoid the need for an option, but git-am
>> > currently depends on the strict sequence number filenames.  It is
>> > unclear how difficult it would be to refactor git-am to work with
>> > names that include the extra subject data.

The change may be a bit involved but depending on where you decide
to stop, it may not be too bad.

What is your design goal of this topic?  IOW, what is the maximum
corrupted ordering of patches in a single mailbox do you want to
recover from?

The easiest and cleanest would be if you assume that the messages
are in scrambled order, but are all from the same series, correctly
numbered, without anything missing.  A mbox may have 8 patches from
a 8-patch series, with their subject lines having [1/8] to [8/8]
without duplicates or droppages, without any other message that does
not belong to the series.  If that is where you are willing to stop,
then you can still name the individual messages with just numbers
(but taken out of the subject line, not the order the input was
splitted into).  "am" does not have to even know or care what you
are doing in mailsplit in this case.

THe next level would be to still assume that you stop at the same
place (i.e. you do not support patches from multiple series in the
same mailbox), but use the number-santized-subject format.  This
would be a bit more involved, but I think all you need to update on
the "am" side is where the am_run() assigns the message file to the
local variable "mail".  You know the temporary directory where you
told "mailsplit" to create these individual messages, so you should
be able to "opendir/readdir/closedir" and create a list of numbered
files in the directory very early in "git am".  Knowing msgnum(state)
at that point in the loop, it should be trivial to change the code
that currently assumes the 4-th file is named "0004" to check for
the file whose name begins with "0004-".

I personally am not at all interested in doing that myself, because
I do not see a reasonable way to lift the limitation of allowing a
mailbox holding patches from only one series, and if we assume that
a tool (i.e. "am" driving "mailsplit" in the new mode) with such a
limitation is still useful, the source of such a scrambled mailbox
must be quite a narrow and common one.  At that point, I suspect
that fixing the scrambling at that narrow and common source (e.g.
your "t.mbox.gz from public inbox server that cannot be told to sort
the messages in any order other than the arrival timestamp") would
be a much better use of our engineering resource.
Jacob Keller April 11, 2024, 6:52 p.m. UTC | #5
On 4/10/2024 10:55 PM, Junio C Hamano wrote:
> Jacob Keller <jacob.keller@gmail.com> writes:
> 
>>>> I originally wanted to avoid the need for an option, but git-am
>>>> currently depends on the strict sequence number filenames.  It is
>>>> unclear how difficult it would be to refactor git-am to work with
>>>> names that include the extra subject data.
> 
> The change may be a bit involved but depending on where you decide
> to stop, it may not be too bad.
> 
> What is your design goal of this topic?  IOW, what is the maximum
> corrupted ordering of patches in a single mailbox do you want to
> recover from?
> 
> The easiest and cleanest would be if you assume that the messages
> are in scrambled order, but are all from the same series, correctly
> numbered, without anything missing.  A mbox may have 8 patches from
> a 8-patch series, with their subject lines having [1/8] to [8/8]
> without duplicates or droppages, without any other message that does
> not belong to the series.  If that is where you are willing to stop,
> then you can still name the individual messages with just numbers
> (but taken out of the subject line, not the order the input was
> splitted into).  "am" does not have to even know or care what you
> are doing in mailsplit in this case.

This is the main problem I'd like to solve. My original proposal was to
try and do just this, but the logic for extracting the number was bad.
Maybe just directly using the subject-based name and sorting by that
using standard alpha-numeric sort would be sufficient?

> 
> THe next level would be to still assume that you stop at the same
> place (i.e. you do not support patches from multiple series in the
> same mailbox), but use the number-santized-subject format.  This
> would be a bit more involved, but I think all you need to update on
> the "am" side is where the am_run() assigns the message file to the
> local variable "mail".  You know the temporary directory where you
> told "mailsplit" to create these individual messages, so you should
> be able to "opendir/readdir/closedir" and create a list of numbered
> files in the directory very early in "git am".  Knowing msgnum(state)
> at that point in the loop, it should be trivial to change the code
> that currently assumes the 4-th file is named "0004" to check for
> the file whose name begins with "0004-".

Yea, we pretty much just have to get the git-am process to work with the
new names. I can look at using opendir/readdir here instead.
> 
> I personally am not at all interested in doing that myself, because
> I do not see a reasonable way to lift the limitation of allowing a
> mailbox holding patches from only one series, and if we assume that
> a tool (i.e. "am" driving "mailsplit" in the new mode) with such a
> limitation is still useful, the source of such a scrambled mailbox
> must be quite a narrow and common one.  At that point, I suspect
> that fixing the scrambling at that narrow and common source (e.g.
> your "t.mbox.gz from public inbox server that cannot be told to sort
> the messages in any order other than the arrival timestamp") would
> be a much better use of our engineering resource.
> 

Ya I don't care much about multiple series. I care more about making it
handle scrambled series better than it does now. I download series off
of lore.kernel.org (public-inbox based) and those seem to routinely have
series out-of-order. I suspect this is because it bases them on arrival
date and sometimes certain mailers get it out of order when sending.
Junio C Hamano April 11, 2024, 9:25 p.m. UTC | #6
Jacob Keller <jacob.e.keller@intel.com> writes:

>> THe next level would be to still assume that you stop at the same
>> place (i.e. you do not support patches from multiple series in the
>> same mailbox), but use the number-santized-subject format.  This
>> would be a bit more involved, but I think all you need to update on
>> the "am" side is where the am_run() assigns the message file to the
>> local variable "mail".  You know the temporary directory where you
>> told "mailsplit" to create these individual messages, so you should
>> be able to "opendir/readdir/closedir" and create a list of numbered
>> files in the directory very early in "git am".  Knowing msgnum(state)
>> at that point in the loop, it should be trivial to change the code
>> that currently assumes the 4-th file is named "0004" to check for
>> the file whose name begins with "0004-".
>
> Yea, we pretty much just have to get the git-am process to work with the
> new names. I can look at using opendir/readdir here instead.

Not "here", but probably just after you called "mailsplit" and saw
it return.  After that nobody should be adding more split mail
messages to the directory, so you do it once to grab all filenames.

> Ya I don't care much about multiple series. I care more about making it
> handle scrambled series better than it does now. I download series off
> of lore.kernel.org (public-inbox based) and those seem to routinely have
> series out-of-order. I suspect this is because it bases them on arrival
> date and sometimes certain mailers get it out of order when sending.

Yeah, and that is why I said it would be a better use of the
engineering resource to fix it at the source.  Such a fix will
benefit folks with existing versions of "git am", not needing to
wait for your improved version.

Thanks.
Jacob Keller April 13, 2024, 12:07 a.m. UTC | #7
On Thu, Apr 11, 2024 at 2:25 PM Junio C Hamano <gitster@pobox.com> wrote:
>
> Jacob Keller <jacob.e.keller@intel.com> writes:
>
> >> THe next level would be to still assume that you stop at the same
> >> place (i.e. you do not support patches from multiple series in the
> >> same mailbox), but use the number-santized-subject format.  This
> >> would be a bit more involved, but I think all you need to update on
> >> the "am" side is where the am_run() assigns the message file to the
> >> local variable "mail".  You know the temporary directory where you
> >> told "mailsplit" to create these individual messages, so you should
> >> be able to "opendir/readdir/closedir" and create a list of numbered
> >> files in the directory very early in "git am".  Knowing msgnum(state)
> >> at that point in the loop, it should be trivial to change the code
> >> that currently assumes the 4-th file is named "0004" to check for
> >> the file whose name begins with "0004-".
> >
> > Yea, we pretty much just have to get the git-am process to work with the
> > new names. I can look at using opendir/readdir here instead.
>
> Not "here", but probably just after you called "mailsplit" and saw
> it return.  After that nobody should be adding more split mail
> messages to the directory, so you do it once to grab all filenames.
>
> > Ya I don't care much about multiple series. I care more about making it
> > handle scrambled series better than it does now. I download series off
> > of lore.kernel.org (public-inbox based) and those seem to routinely have
> > series out-of-order. I suspect this is because it bases them on arrival
> > date and sometimes certain mailers get it out of order when sending.
>
> Yeah, and that is why I said it would be a better use of the
> engineering resource to fix it at the source.  Such a fix will
> benefit folks with existing versions of "git am", not needing to
> wait for your improved version.
>
> Thanks.

I went and talked to the public-inbox folks, and discovered that there
is a known problem and solution, with a utility called b4 intended for
downloading mbox files from the public-inbox

https://b4.docs.kernel.org/

Thought I'd mention that here if anyone else reading this thread was
curious about an ultimate solution.

b4 will find patches in the series, sort them, remove the replies and
can do some other common cleanup operations including things like
applying tags from other messages on the list.
diff mbox series

Patch

diff --git a/Documentation/git-mailsplit.txt b/Documentation/git-mailsplit.txt
index 3f0a6662c81e..2e5ba45e1988 100644
--- a/Documentation/git-mailsplit.txt
+++ b/Documentation/git-mailsplit.txt
@@ -9,6 +9,7 @@  SYNOPSIS
 --------
 [verse]
 'git mailsplit' [-b] [-f<nn>] [-d<prec>] [--keep-cr] [--mboxrd]
+		[--name-by-subject]
 		-o<directory> [--] [(<mbox>|<Maildir>)...]
 
 DESCRIPTION
@@ -52,6 +53,10 @@  OPTIONS
 	Input is of the "mboxrd" format and "^>+From " line escaping is
 	reversed.
 
+--name-by-subject::
+	Include the sanitized subject in the generated filenames, in
+	addition to the sequence number.
+
 GIT
 ---
 Part of the linkgit:git[1] suite
diff --git a/builtin/mailsplit.c b/builtin/mailsplit.c
index 3af9ddb8ae5c..df81782d05b3 100644
--- a/builtin/mailsplit.c
+++ b/builtin/mailsplit.c
@@ -8,9 +8,10 @@ 
 #include "gettext.h"
 #include "string-list.h"
 #include "strbuf.h"
+#include "pretty.h"
 
 static const char git_mailsplit_usage[] =
-"git mailsplit [-d<prec>] [-f<n>] [-b] [--keep-cr] -o<directory> [(<mbox>|<Maildir>)...]";
+"git mailsplit [-d<prec>] [-f<n>] [-b] [--keep-cr] [--name-by-subject] -o<directory> [(<mbox>|<Maildir>)...]";
 
 static int is_from_line(const char *line, int len)
 {
@@ -46,6 +47,7 @@  static int is_from_line(const char *line, int len)
 static struct strbuf buf = STRBUF_INIT;
 static int keep_cr;
 static int mboxrd;
+static int name_by_subject;
 
 static int is_gtfrom(const struct strbuf *buf)
 {
@@ -66,6 +68,9 @@  static int is_gtfrom(const struct strbuf *buf)
  */
 static int split_one(FILE *mbox, const char *name, int allow_bare)
 {
+	struct strbuf sanitized_filename = STRBUF_INIT;
+	const char *subject_start;
+	size_t subject_len;
 	FILE *output;
 	int fd;
 	int status = 0;
@@ -101,10 +106,26 @@  static int split_one(FILE *mbox, const char *name, int allow_bare)
 			}
 			die_errno("cannot read mbox");
 		}
+
+		/* Get a sanitized filename from the subject */
+		if (name_by_subject && !sanitized_filename.len &&
+		    skip_prefix_mem(buf.buf, buf.len, "Subject:",
+				    &subject_start, &subject_len)) {
+			strbuf_addf(&sanitized_filename, "%s-", name);
+			format_sanitized_subject(&sanitized_filename,
+						 subject_start,
+						 subject_len);
+		}
+
 		if (!is_bare && is_from_line(buf.buf, buf.len))
 			break; /* done with one message */
 	}
 	fclose(output);
+
+	if (name_by_subject && sanitized_filename.len)
+		rename(name, sanitized_filename.buf);
+	strbuf_release(&sanitized_filename);
+
 	return status;
 }
 
@@ -296,6 +317,8 @@  int cmd_mailsplit(int argc, const char **argv, const char *prefix)
 			usage(git_mailsplit_usage);
 		} else if ( arg[1] == 'b' && !arg[2] ) {
 			allow_bare = 1;
+		} else if (!strcmp(arg, "--name-by-subject")) {
+			name_by_subject = 1;
 		} else if (!strcmp(arg, "--keep-cr")) {
 			keep_cr = 1;
 		} else if ( arg[1] == 'o' && arg[2] ) {
diff --git a/t/t5100-mailinfo.sh b/t/t5100-mailinfo.sh
index c8d06554541c..4826735c6033 100755
--- a/t/t5100-mailinfo.sh
+++ b/t/t5100-mailinfo.sh
@@ -44,6 +44,31 @@  do
 	'
 done
 
+test_expect_success 'split sample box with --name-by-subject' '
+	mkdir name-by-subject &&
+	git mailsplit --name-by-subject -oname-by-subject "$DATA/sample.mbox" >last &&
+	last=$(cat last) &&
+	echo total is $last &&
+	test $(cat last) = 18
+'
+
+check_mailinfo_name_by_subject () {
+	mail=$1
+	mo="$(basename "$mail" | cut -c1-4)"
+	echo "$(basename "$mail")" >"sanitized$mo" &&
+	git mailinfo -u "msg$mo" "patch$mo" <"$mail" >"info$mo" &&
+	test_cmp "$DATA/msg$mo" "msg$mo" &&
+	test_cmp "$DATA/patch$mo" "patch$mo" &&
+	test_cmp "$DATA/info$mo" "info$mo" &&
+	test_cmp "$DATA/sanitized$mo" "sanitized$mo"
+}
+
+for mail in name-by-subject/00*
+do
+	test_expect_success "check --name-by-subject $mail" '
+		check_mailinfo_name_by_subject "$mail"
+	'
+done
 
 test_expect_success 'split box with rfc2047 samples' \
 	'mkdir rfc2047 &&