Message ID | 20221209150048.2400648-2-toon@iotcl.com (mailing list archive) |
---|---|
State | Superseded |
Headers | show |
Series | cat-file: quote-format name in error when using -z | expand |
Hi Toon On 09/12/2022 15:00, Toon Claes wrote: > Since it's supported to have NUL-delimited input, introduced in > db9d67f2e9 (builtin/cat-file.c: support NUL-delimited input with `-z`, > 2022-07-22), it's possible to pass paths that contain newlines. This > works great when the object is found, but when it's not, the input path > is returned in the error message. Because this can contain newlines, the > error message might get spread over multiple lines, making it harder to > machine-parse this error message. > > With this change, the input is quote-formatted in the error message, if > needed. This ensures the error message is always on a single line and > makes parsing the error more straightforward. Thanks for working on this. I'd previously suggested NUL terminating the output of "git cat-file -z" to avoid this problem [1] but quoting the object name is a better solution. The implementation looks good, thanks for adding a test - are you sure we need the FUNNYNAMES guard on the test even though it isn't creating files with funny names? Best Wishes Phillip [1] https://lore.kernel.org/git/66b71194-ad0e-18d0-e43b-71e5c47ba111@gmail.com/ > Signed-off-by: Toon Claes <toon@iotcl.com> > --- > builtin/cat-file.c | 10 ++++++++++ > t/t1006-cat-file.sh | 8 ++++++++ > 2 files changed, 18 insertions(+) > > diff --git a/builtin/cat-file.c b/builtin/cat-file.c > index b3be58b1fb..67dd81238d 100644 > --- a/builtin/cat-file.c > +++ b/builtin/cat-file.c > @@ -14,6 +14,7 @@ > #include "tree-walk.h" > #include "oid-array.h" > #include "packfile.h" > +#include "quote.h" > #include "object-store.h" > #include "promisor-remote.h" > #include "mailmap.h" > @@ -475,6 +476,13 @@ static void batch_one_object(const char *obj_name, > result = get_oid_with_context(the_repository, obj_name, > flags, &data->oid, &ctx); > if (result != FOUND) { > + struct strbuf quoted = STRBUF_INIT; > + > + if (opt->nul_terminated) { > + quote_c_style(obj_name, "ed, NULL, 0); > + obj_name = quoted.buf; > + } > + > switch (result) { > case MISSING_OBJECT: > printf("%s missing\n", obj_name); > @@ -499,6 +507,8 @@ static void batch_one_object(const char *obj_name, > result); > break; > } > + > + strbuf_release("ed); > fflush(stdout); > return; > } > diff --git a/t/t1006-cat-file.sh b/t/t1006-cat-file.sh > index 23b8942edb..232bfd1723 100755 > --- a/t/t1006-cat-file.sh > +++ b/t/t1006-cat-file.sh > @@ -447,6 +447,14 @@ test_expect_success FUNNYNAMES '--batch-check, -z with newline in input' ' > test_cmp expect actual > ' > > +test_expect_success FUNNYNAMES '--batch-check, -z with newline in non-existent named object' ' > + printf "HEAD:newline${LF}embedded" >in && > + git cat-file --batch-check -z <in >actual && > + > + printf "\"HEAD:newline\\\\nembedded\" missing\n" >expect && > + test_cmp expect actual > +' > + > batch_command_multiple_info="info $hello_sha1 > info $tree_sha1 > info $commit_sha1 > -- > 2.39.0.rc0.57
Phillip Wood <phillip.wood123@gmail.com> writes: > Hi Toon > > On 09/12/2022 15:00, Toon Claes wrote: >> Since it's supported to have NUL-delimited input, introduced in >> db9d67f2e9 (builtin/cat-file.c: support NUL-delimited input with `-z`, >> 2022-07-22), it's possible to pass paths that contain newlines. This >> works great when the object is found, but when it's not, the input path >> is returned in the error message. Because this can contain newlines, the >> error message might get spread over multiple lines, making it harder to >> machine-parse this error message. >> With this change, the input is quote-formatted in the error message, >> if >> needed. This ensures the error message is always on a single line and >> makes parsing the error more straightforward. > > Thanks for working on this. I'd previously suggested NUL terminating > the output of "git cat-file -z" to avoid this problem [1] but quoting > the object name is a better solution. Hmph. My knee-jerk reaction was that it is utterly disgusting if we quote when we do NUL-terminated. Is your "quoting is OK over NUL terminating" because "-z" applies only to the input? If so, then I would agree that is OK, but shouldn't the quoting apply regardless of how the input is formulated? Why do we call the cquote helper only under "-z"?
On 09/12/2022 23:58, Junio C Hamano wrote: > Phillip Wood <phillip.wood123@gmail.com> writes: > >> Hi Toon >> >> On 09/12/2022 15:00, Toon Claes wrote: >>> Since it's supported to have NUL-delimited input, introduced in >>> db9d67f2e9 (builtin/cat-file.c: support NUL-delimited input with `-z`, >>> 2022-07-22), it's possible to pass paths that contain newlines. This >>> works great when the object is found, but when it's not, the input path >>> is returned in the error message. Because this can contain newlines, the >>> error message might get spread over multiple lines, making it harder to >>> machine-parse this error message. >>> With this change, the input is quote-formatted in the error message, >>> if >>> needed. This ensures the error message is always on a single line and >>> makes parsing the error more straightforward. >> >> Thanks for working on this. I'd previously suggested NUL terminating >> the output of "git cat-file -z" to avoid this problem [1] but quoting >> the object name is a better solution. > > Hmph. My knee-jerk reaction was that it is utterly disgusting if we > quote when we do NUL-terminated. Is your "quoting is OK over NUL > terminating" because "-z" applies only to the input? Yes, if the object exists then delimiting the output with newlines is fine, it is only if the object is missing and its name contains a newline that there is a problem. It also makes adopting "-z" in existing scripts easier as there is no change required when parsing the output. As "-z" was added in 2.38 there is also a pragmatic reason to prefer quoting over NUL terminated output as it allows us to fix this issue without changing the output delimiter of an existing option. > If so, then I > would agree that is OK, but shouldn't the quoting apply regardless > of how the input is formulated? Why do we call the cquote helper > only under "-z"? Without "-z" you cannot pass object names that contain newlines so not quoting the output does not cause a problem. We could start quoting the object name without "-z" but we'd be changing the output without a huge benefit. Best Wishes Phillip
Phillip Wood <phillip.wood123@gmail.com> writes: > Without "-z" you cannot pass object names that contain newlines so not > quoting the output does not cause a problem. We could start quoting > the object name without "-z" but we'd be changing the output without a > huge benefit. That's fair. The next question is from a devil's advocate: is switching to the full cquote the best thing to do? If we were using the full cquote from the very beginning, of course it is, simply because that is what is used in all other places in Git. Using the full cquote does mean a LF byte will be protected (i.e. instead of shown literally in the middle of other letters around LF, "other\nletters around LF" would be shown), but pathnames with backslashes and double quotes in them that have been shown without problems would be shown differently and will break existing parsers, which are written lazily with the assumption that they are perfectly happy to be "simple" thans to not having to deal with LF (because in their environment a path with LF in it do not matter). A bit safer thing to do is to replace LF (and not any other bytes that would be quoted with full cquote) in the path given in these messages with something else (like NUL to truncate the output there). As these answers are given in order, the object names are not absolutely needed to identify and match up the input and the output, and properly written parsers would be prepared to see a response with an object name that it did not request and handle it sanely, such a change may not break such a parser for a path with any byte that are modified with full cquote. The above is with a devil's adovocate hat on, and I do not care too much, as I do not think butchering backslash with full cquote would not hurt even existing Windows users (if "HEAD:t\README.txt" named the same blob as "HEAD:t/README.txt" on a platform, doubling the backslashes in the output would have made quite a lot of damage, but I do not think we allow backslashes to name tree paths). By the way, there is another use of obj_name in batch_object_write() that can show whatever byte in it literally to the output. Thanks.
Junio C Hamano <gitster@pobox.com> writes: > Phillip Wood <phillip.wood123@gmail.com> writes: > >> Without "-z" you cannot pass object names that contain newlines so not >> quoting the output does not cause a problem. We could start quoting >> the object name without "-z" but we'd be changing the output without a >> huge benefit. > > That's fair. The next question is from a devil's advocate: > is switching to the full cquote the best thing to do? > > If we were using the full cquote from the very beginning, of course > it is, simply because that is what is used in all other places in > Git. Using the full cquote does mean a LF byte will be protected > (i.e. instead of shown literally in the middle of other letters > around LF, "other\nletters around LF" would be shown), but pathnames > with backslashes and double quotes in them that have been shown > without problems would be shown differently and will break existing > parsers, which are written lazily with the assumption that they are > perfectly happy to be "simple" thans to not having to deal with LF > (because in their environment a path with LF in it do not matter). > > A bit safer thing to do is to replace LF (and not any other bytes > that would be quoted with full cquote) in the path given in these > messages with something else (like NUL to truncate the output > there). So object name "HEAD:other\nletters around LF" would give the error message "other missing"? That error message would also occur when the user does not provide -z. I think that might be confusing. > As these answers are given in order, the object names are > not absolutely needed to identify and match up the input and the > output, and properly written parsers would be prepared to see a > response with an object name that it did not request and handle it > sanely, such a change may not break such a parser for a path with > any byte that are modified with full cquote. Yes, the answers are returned in order, so personally I don't care too much about the returned object name format. I even would be fine with a generic error message that omits the input name, for example "object missing". As long as it's clear that the requested object is not found. For your information, there is an extreme edge case a user could fake an object, and that's what we want to avoid as well. For example the command (line break included): printf "aabbccddeeff00112233445566778899aabbccdd blob 26 this object is not" | git cat-file --batch -z Would print: aabbccddeeff00112233445566778899aabbccdd blob 26 this object is not missing This is perfectly valid git-cat-file output. Luckily I don't see any way how this can be abused. Generally I think it's a good idea to not return the input as-is in any situation. We could only replace newlines, but cquoting already sanitizes the input, so why not use that? > The above is with a devil's adovocate hat on, and I do not care too > much, as I do not think butchering backslash with full cquote would > not hurt even existing Windows users (if "HEAD:t\README.txt" named > the same blob as "HEAD:t/README.txt" on a platform, doubling the > backslashes in the output would have made quite a lot of damage, but > I do not think we allow backslashes to name tree paths). > By the way, there is another use of obj_name in batch_object_write() > that can show whatever byte in it literally to the output. Ah thanks! I will include in the next version, when we reach a consensus on when or what to cquote. -- Toon
Toon Claes <toon@to1.studio> writes: > Ah thanks! I will include in the next version, when we reach a consensus > on when or what to cquote. I think we already have consensus on when---"-z" may benefit, otherwise we should not quote. I am not sure what you mean by "what to cquote"---are there things other than input object names that may be ambiguous and require quoting?
On 12/12/2022 00:11, Junio C Hamano wrote: > Phillip Wood <phillip.wood123@gmail.com> writes: > >> Without "-z" you cannot pass object names that contain newlines so not >> quoting the output does not cause a problem. We could start quoting >> the object name without "-z" but we'd be changing the output without a >> huge benefit. > > That's fair. The next question is from a devil's advocate: > is switching to the full cquote the best thing to do? > > If we were using the full cquote from the very beginning, of course > it is, simply because that is what is used in all other places in > Git. Using the full cquote does mean a LF byte will be protected > (i.e. instead of shown literally in the middle of other letters > around LF, "other\nletters around LF" would be shown), but pathnames > with backslashes and double quotes in them that have been shown > without problems would be shown differently and will break existing > parsers, which are written lazily with the assumption that they are > perfectly happy to be "simple" thans to not having to deal with LF > (because in their environment a path with LF in it do not matter). > > A bit safer thing to do is to replace LF (and not any other bytes > that would be quoted with full cquote) in the path given in these > messages with something else (like NUL to truncate the output > there). As these answers are given in order, the object names are > not absolutely needed to identify and match up the input and the > output, and properly written parsers would be prepared to see a > response with an object name that it did not request and handle it > sanely, such a change may not break such a parser for a path with > any byte that are modified with full cquote. > > The above is with a devil's adovocate hat on, and I do not care too > much, as I do not think butchering backslash with full cquote would > not hurt even existing Windows users (if "HEAD:t\README.txt" named > the same blob as "HEAD:t/README.txt" on a platform, doubling the > backslashes in the output would have made quite a lot of damage, but > I do not think we allow backslashes to name tree paths). By default quoting also affects names that contain non-ascii characters. If we're worried about that we could only quote names that contain LF but as you point out the answers are given in order so I don't think there should be a problem with calling cquote unconditionally. (As this option has only existed for one release I suspect there aren't that many users yet as well) Best Wishes Phillip > By the way, there is another use of obj_name in batch_object_write() > that can show whatever byte in it literally to the output. > > Thanks.
Phillip Wood <phillip.wood123@gmail.com> writes: > By default quoting also affects names that contain non-ascii > characters. If we're worried about that we could only quote names that > contain LF but as you point out the answers are given in order so I > don't think there should be a problem with calling cquote > unconditionally. (As this option has only existed for one release I > suspect there aren't that many users yet as well) Good point.
Phillip Wood <phillip.wood123@gmail.com> writes: > Thanks for working on this. I'd previously suggested NUL terminating the output > of "git cat-file -z" to avoid this problem [1] > > [1] > https://lore.kernel.org/git/66b71194-ad0e-18d0-e43b-71e5c47ba111@gmail.com/ What happened to this proposal? I don't see any replies to that. That's a bit sad, because it would have been nice to have it this behavior from the start. > but quoting the object name is a better solution. I would not say it's a better solution, but it's a less invasive solution that /minimizes/ breaking changes. Ideally I'd like to have NUL terminated output for "git cat-file -z". In a success situation I assume this would return: <oid> SP <type> SP <size> NUL <contents> NUL In a failure situation something like: <object> SP missing NUL So when you pass -z you can keep reading until the first NUL and then you'll know if you should read for contents as well. Would you consider change behavior to this now? -- Toon
Hi Toon On 20/12/2022 05:31, Toon Claes wrote: > > Phillip Wood <phillip.wood123@gmail.com> writes: > >> Thanks for working on this. I'd previously suggested NUL terminating the output >> of "git cat-file -z" to avoid this problem [1] >> >> [1] >> https://lore.kernel.org/git/66b71194-ad0e-18d0-e43b-71e5c47ba111@gmail.com/ > > What happened to this proposal? I don't see any replies to that. That's > a bit sad, because it would have been nice to have it this behavior from > the start. Yes it would have been nice, unfortunately it feel through the cracks due to a combination of unfortunate timing and lack of time. I think the patch was probably in next by the time I realized the problem and wrote that email. Taylor was on holiday at the time and then I went away around the time he got back. I know it was on his todo list but I think he was busy catching up from being away getting ready for git merge. I had other things I was working on and unfortunately didn't get round to following it up. >> but quoting the object name is a better solution. > > I would not say it's a better solution, but it's a less invasive > solution that /minimizes/ breaking changes. Ideally I'd like to have NUL > terminated output for "git cat-file -z". In a success situation I > assume this would return: > > <oid> SP <type> SP <size> NUL > <contents> NUL > > In a failure situation something like: > > <object> SP missing NUL > > So when you pass -z you can keep reading until the first NUL and then > you'll know if you should read for contents as well. > > Would you consider change behavior to this now? Hmm I'm not sure (and luckily it isn't up to me to take the final decision!). I really don't know how many people are using "-z" I suspect it is not many and so the fallout wouldn't be too bad but we'd still be inconveniencing some users. I theory we could so "sorry we made a mistake when implementing this feature and have decided to change it" but we have a solution in your patch which hopefully does not break any users (I say hopefully because think if one is careful and keep track of which responses you've read it is possible to detect missing objects now even if their name contains a new line but it will be messy and probably slightly inefficient but anyone doing that will notice the change in output). Overall I'd say it is tempting but risky and inconvenient to any existing users of "-z" (assuming someone else is actually using it). Quoting the object name is definitively the safer option at this point. Best Wishes Phillip
Phillip Wood <phillip.wood123@gmail.com> writes: > Yes it would have been nice, unfortunately it feel through the cracks due to a > combination of unfortunate timing and lack of time. I think the patch was > probably in next by the time I realized the problem and wrote that email. Taylor > was on holiday at the time and then I went away around the time he got back. I > know it was on his todo list but I think he was busy catching up from being away > getting ready for git merge. I had other things I was working on and > unfortunately didn't get round to following it up. Oh sorry, I didn't want to put blame on anyone. Things happen. I just wasn't sure I missed anything. > Overall I'd say it is tempting but risky and inconvenient to any existing users > of "-z" (assuming someone else is actually using it). Quoting the object name is > definitively the safer option at this point. I assume Taylor would be one of the consumers of output of "-z", so thanks for putting him in Cc to get his take on this. -- Toon
diff --git a/builtin/cat-file.c b/builtin/cat-file.c index b3be58b1fb..67dd81238d 100644 --- a/builtin/cat-file.c +++ b/builtin/cat-file.c @@ -14,6 +14,7 @@ #include "tree-walk.h" #include "oid-array.h" #include "packfile.h" +#include "quote.h" #include "object-store.h" #include "promisor-remote.h" #include "mailmap.h" @@ -475,6 +476,13 @@ static void batch_one_object(const char *obj_name, result = get_oid_with_context(the_repository, obj_name, flags, &data->oid, &ctx); if (result != FOUND) { + struct strbuf quoted = STRBUF_INIT; + + if (opt->nul_terminated) { + quote_c_style(obj_name, "ed, NULL, 0); + obj_name = quoted.buf; + } + switch (result) { case MISSING_OBJECT: printf("%s missing\n", obj_name); @@ -499,6 +507,8 @@ static void batch_one_object(const char *obj_name, result); break; } + + strbuf_release("ed); fflush(stdout); return; } diff --git a/t/t1006-cat-file.sh b/t/t1006-cat-file.sh index 23b8942edb..232bfd1723 100755 --- a/t/t1006-cat-file.sh +++ b/t/t1006-cat-file.sh @@ -447,6 +447,14 @@ test_expect_success FUNNYNAMES '--batch-check, -z with newline in input' ' test_cmp expect actual ' +test_expect_success FUNNYNAMES '--batch-check, -z with newline in non-existent named object' ' + printf "HEAD:newline${LF}embedded" >in && + git cat-file --batch-check -z <in >actual && + + printf "\"HEAD:newline\\\\nembedded\" missing\n" >expect && + test_cmp expect actual +' + batch_command_multiple_info="info $hello_sha1 info $tree_sha1 info $commit_sha1
Since it's supported to have NUL-delimited input, introduced in db9d67f2e9 (builtin/cat-file.c: support NUL-delimited input with `-z`, 2022-07-22), it's possible to pass paths that contain newlines. This works great when the object is found, but when it's not, the input path is returned in the error message. Because this can contain newlines, the error message might get spread over multiple lines, making it harder to machine-parse this error message. With this change, the input is quote-formatted in the error message, if needed. This ensures the error message is always on a single line and makes parsing the error more straightforward. Signed-off-by: Toon Claes <toon@iotcl.com> --- builtin/cat-file.c | 10 ++++++++++ t/t1006-cat-file.sh | 8 ++++++++ 2 files changed, 18 insertions(+) -- 2.39.0.rc0.57