Message ID | fe028981d353158e9840eb035194ca15e6a2c15e.1692165840.git.ps@pks.im (mailing list archive) |
---|---|
State | New, archived |
Headers | show |
Series | upload-pack: fix exit code when denying fetch of unreachable object ID | expand |
Patrick Steinhardt <ps@pks.im> writes: > In 7ba7c52d76 (upload-pack: fix race condition in error messages, > 2023-08-10), we have fixed a race in t5516-fetch-push.sh where sometimes > error messages got intermingled. This was done by splitting up the call > to `die()` such that we print the error message before writing to the > remote side, followed by a call to `exit(1)` afterwards. > ... > We have found this issue in our CI pipeline in Gitaly, which explicitly > checks for the error code. It is very much debatable whether we should > even be doing that in the first place given that error codes are not > even explicitly documented here. But I think this is worth fixing anyway > given that it seems like an unintended side effect to me and might be > biting others, as well. > > If you folks agree with my reasoning, then I think we should pick this > up before Git v2.42.0 is released. Thanks. The change to the code sounds sensible in that it is a move to restore the status quo, and we know that the original never intended to "fix" the exit status from 128 to 1. The test change etches the status quo in stone, which is a bit more than that and might be debatable, but when we someday formally declare that users should not be relying on the exit status that are not documented, we would hunt for the uses of test_expect_code in the tests and turn this one back, and probably do the same to others that are too strict on the exact exit status, so I think that half of the change is OK, at least for now. Comments? > t/t5703-upload-pack-ref-in-want.sh | 2 +- > upload-pack.c | 2 +- > 2 files changed, 2 insertions(+), 2 deletions(-) > > diff --git a/t/t5703-upload-pack-ref-in-want.sh b/t/t5703-upload-pack-ref-in-want.sh > index df74f80061..fe6e056700 100755 > --- a/t/t5703-upload-pack-ref-in-want.sh > +++ b/t/t5703-upload-pack-ref-in-want.sh > @@ -483,7 +483,7 @@ test_expect_success 'server is initially ahead - no ref in want' ' > rm -rf local && > cp -r "$LOCAL_PRISTINE" local && > inconsistency main $(test_oid numeric) && > - test_must_fail git -C local fetch 2>err && > + test_expect_code 128 git -C local fetch 2>err && > test_i18ngrep "fatal: remote error: upload-pack: not our ref" err > ' > > diff --git a/upload-pack.c b/upload-pack.c > index ece111c629..15f3318d6d 100644 > --- a/upload-pack.c > +++ b/upload-pack.c > @@ -782,7 +782,7 @@ static void check_non_tip(struct upload_pack_data *data) > packet_writer_error(&data->writer, > "upload-pack: not our ref %s", > oid_to_hex(&o->oid)); > - exit(1); > + exit(128); > } > } > }
Junio C Hamano <gitster@pobox.com> writes: > The change to the code sounds sensible in that it is a move to > restore the status quo, and we know that the original never intended > to "fix" the exit status from 128 to 1. The test change etches the > status quo in stone, which is a bit more than that and might be > debatable, but when we someday formally declare that users should > not be relying on the exit status that are not documented, we would > hunt for the uses of test_expect_code in the tests and turn this one > back, and probably do the same to others that are too strict on the > exact exit status, so I think that half of the change is OK, at > least for now. > > Comments? An alternative to make this "fix" without setting any policy is to do this. That is, to remove the change to the test part and then to rephrase the tail end of the proposed commit log message. I can go either way. I personally prefer our tests not to be overly strict about behaviors they test, especially the ones we do not document. 1: 77d0f01405 ! 1: 5f33a843de upload-pack: fix exit code when denying fetch of unreachable object ID @@ Commit message seems rather clear that this is an unintended side effect of the change given that this change in behaviour was not mentioned at all. - Fix this regression by exiting with 128 again and tighten one of our - tests to catch such unintended side effects. + Restore the status-quo by exiting with 128. The test in t5703 to + ensure that "git fetch" fails by using test_must_fail, which does + not care between exiting 1 and 128, so this changes will not affect + any test. Signed-off-by: Patrick Steinhardt <ps@pks.im> Signed-off-by: Junio C Hamano <gitster@pobox.com> - ## t/t5703-upload-pack-ref-in-want.sh ## -@@ t/t5703-upload-pack-ref-in-want.sh: test_expect_success 'server is initially ahead - no ref in want' ' - rm -rf local && - cp -r "$LOCAL_PRISTINE" local && - inconsistency main $(test_oid numeric) && -- test_must_fail git -C local fetch 2>err && -+ test_expect_code 128 git -C local fetch 2>err && - test_i18ngrep "fatal: remote error: upload-pack: not our ref" err - ' - - ## upload-pack.c ## @@ upload-pack.c: static void check_non_tip(struct upload_pack_data *data) packet_writer_error(&data->writer,
Patrick earlier found that Gitaly's CI pipeline was being overly picky and complained about the unintended change of the exit code of "git fetch" in the affected codepath from 128 to 1 in a recent change that went to 'next', made by 7ba7c52d (upload-pack: fix race condition in error messages, 2023-08-10). The thing is, we follow that exiting with 0 is success, and exiting with non-zero means failure, and we generally do not specify which non-zero value is used for the latter in our documentation. This particular case (i.e. "git fetch") certainly does not say anything about how failure is signaled to the calling program. "git help git" does not have "EXIT CODES" section, and it is assumed that the "common sense" of older generation (gee, this project is more than 18 years old) that exiting with 0 is success and non-zero is failure is shared among its users, which might not be warranted these days. We could either * Be more prescriptive and add "EXIT CODES" section to each and every document to describe how we fail in the current code. or * Describe "In general, 0 is success, non-zero is failure, but some commands may signal more than that with its non-zero exit codes" in "git help git", and add "EXIT CODES" section to the manual page of the commands whose exit codes matter (there are a handful, like "git diff --exit-code" that explicitly says "1" is the signal that it found difference as opposed to it failing). I'd prefer if community agrees that we should do the latter, but I am OK if the community consensus goes the other way. If we were to go the former, the task becomes larger but it would be embarrassingly parallelizable. Folks with extra time on their hand can lend their eyes to tackle each command, find what exit code we use in the current code, add "EXIT CODES" section to the documentation, and extend test coverage to ensure their findings will stay unless we deliberately change it in the future. If we were to go the latter, the task will be significantly smaller. We need to come up with a careful phrasing to add to "git help git" and/or "git help cli" to give the general principle of zero vs non-zero whose exact values are left unspecified and should not be depended upon. We also need to identify the special cases where the exact values have meanings (like the "git diff --exit-code" example above), describe them by adding "EXIT CODES" section to the manual pages of these selected commands, and extend test coverage to ensure these values are kept intact across future changes. Comments?
Derrick Stolee <derrickstolee@github.com> writes: [jc: the message I am responding to may not be on the list archive, as it was multipart/alternative with text/html in it, but I think the main point from you can be seen by others only from the parts I quoted here]. > While I don't think we should document that the exit code has > a special meaning for the builtin, adding the test will help > prevent another accidental change in the future. If the patch is > worth taking (to fix the accidental change) then I think the test > should stay, so we don't make this change accidentally again. I think my stance is a bit more nuanced, in that the first half of the patch to make us exit with 128 is worth taking, simply because we did not have to and did not intend to change the exit status, but the other half of the patch, using test_expect_code in the test suite, sends a wrong message that somehow exact value of non-zero exit status in this particular case matters. To put it another way, if your patch to shuffle the calls for two error messages, concluded with a call to exit(), were written in the ideal world, you would have passed 128 to exit(), *and* you wouldn't have added any test that says "fetch should exit with 128 and not 1 when it fails". I aimed to massage Patrick's patch so that the original patch from you will become that patch in the ideal world when it is squashed in. > To my view, test cases can change in the future as long as > there is good justification in doing so. Having existing tests > helps to demonstrate a change in behavior. I agree with that 100%, but the thing is that the error shuffling patch will not escape 'next' until the upcoming release, at which time we can rewind and redo 'next'. I think the first half of Patrick's fix would be a good material to squash into that patch, which would make the result identical to the one that would have been written in the ideal world I described above. And the other half would not have a place to be in that patch in the ideal world. IOW, there is no "change in behaviour" we want to demonstrate here, as we will pretend nothing bad happened after the upcoming release ;-) Thanks.
On Wed, Aug 16, 2023 at 09:44:04AM -0700, Junio C Hamano wrote: > Junio C Hamano <gitster@pobox.com> writes: > > > The change to the code sounds sensible in that it is a move to > > restore the status quo, and we know that the original never intended > > to "fix" the exit status from 128 to 1. The test change etches the > > status quo in stone, which is a bit more than that and might be > > debatable, but when we someday formally declare that users should > > not be relying on the exit status that are not documented, we would > > hunt for the uses of test_expect_code in the tests and turn this one > > back, and probably do the same to others that are too strict on the > > exact exit status, so I think that half of the change is OK, at > > least for now. > > > > Comments? > > An alternative to make this "fix" without setting any policy is to > do this. That is, to remove the change to the test part and then to > rephrase the tail end of the proposed commit log message. > > I can go either way. I personally prefer our tests not to be overly > strict about behaviors they test, especially the ones we do not > document. FWIW, my gut feeling agrees with you. I do not mind restoring the previous "128" exit code for consistency and continuity, and there is no reason to prefer "1" here instead. But I don't know that it's something we should be promising or keeping track of with a test. In fact, I would say that most of these die() calls in upload-pack are pointless. The stderr of the server-side process is frequently not even visible to the user (because it is on the other side of an http, etc, connection). So in practice the message is going to /dev/null or perhaps polluting some daemon log. From the perspective of a server operator, I would even go so far as to suggest that upload-pack should return "0" here. The client broke protocol, but we told them over the correct channel and then exited cleanly. There is no error on the server side. One could argue that a large-scale server operator may want to keep track of protocol breaks like this, because they could be a sign of a bug. But they'd probably be better off with instrumenting upload-pack (or any proxy that sits in front of it) to count messages coming over the ERR channel. I've never really made patches in that direction because for the most part the existing die() calls aren't hurting anything, and nobody cares much about the exit code either way. But adding a more specific test feels like going in the wrong direction (yes, I know t5516 is already checking for a failed code, but modulo the race problems we have with reading ERR packets, we really would be better off checking what the client sees, or talking directly to upload-pack itself, the way t5530 does). -Peff
On Wed, Aug 16, 2023 at 10:04:28AM -0700, Junio C Hamano wrote: > We could either > > * Be more prescriptive and add "EXIT CODES" section to each and > every document to describe how we fail in the current code. > > or > > * Describe "In general, 0 is success, non-zero is failure, but some > commands may signal more than that with its non-zero exit codes" > in "git help git", and add "EXIT CODES" section to the manual > page of the commands whose exit codes matter (there are a > handful, like "git diff --exit-code" that explicitly says "1" is > the signal that it found difference as opposed to it failing). > > I'd prefer if community agrees that we should do the latter, but I > am OK if the community consensus goes the other way. I left some notes on upload-pack specifically elsewhere in the thread, in which I argue that we should definitely not lock ourselves into its current behavior. But in the more general sense, yeah, I think that trying to document specific exit codes for each command is a mistake. It is not just "let's find which exit codes they use and document them". I suspect it is a rat's nest of unplanned behaviors that come from unexpected die() calls deep in the stack. We would not necessarily want to make promises about what is happening in those, nor do I think it would even be sensible to find every possible exit. We _could_ document "128 means something really unexpected happened and we called die() deep in the code". But even that seems misleading to me, as we also die() for everyday shallow things (like "the name you gave is not valid"). The value really means very little in practice, and the biggest reason not to change it is that we know it doesn't conflict with any codes that programs _do_ promise are meaningful (like "1" from "diff --exit-code"). So saying "0 is success, non-zero is failure, and some commands may document specific codes" is the closest thing to the reality of what we as developers know and have planned. (Of course another project is not just to figure out the possible situations/codes but to catalogue and organize them. But that seems like an order of magnitude more work, if not several orders). -Peff
On Wed, Aug 16, 2023 at 10:04:28AM -0700, Junio C Hamano wrote: >"git help git" does not have "EXIT CODES" section, and it is assumed >that the "common sense" of older generation [...] that exiting with 0 >is success and non-zero is failure is shared among its users, which >might not be warranted these days. > well, that's actually a standard (exit(3) has things to say about it), and given how shell scripts treat exit codes, there is no wiggle room here. just about every shell intro tutorial explains it. >We could either > > * Be more prescriptive and add "EXIT CODES" section to each and > every document to describe how we fail in the current code. > >or > > * Describe "In general, 0 is success, non-zero is failure, but some > commands may signal more than that with its non-zero exit codes" > in "git help git", and add "EXIT CODES" section to the manual > page of the commands whose exit codes matter (there are a > handful, like "git diff --exit-code" that explicitly says "1" is > the signal that it found difference as opposed to it failing). > i'd go with the second, with some minor modifications: - 1 is the by far most common non-zero error code (and it matches EXIT_FAILURE on all relevant systems), so it's ok to state that. it may be wise to actually check that commands don't deviate from it needlessly. - the canonical name of the section appears to be "EXIT STATUS" regards
On Wed, Aug 16, 2023 at 10:12:18PM -0700, Junio C Hamano wrote: > Derrick Stolee <derrickstolee@github.com> writes: > > [jc: the message I am responding to may not be on the list archive, > as it was multipart/alternative with text/html in it, but I think > the main point from you can be seen by others only from the parts > I quoted here]. > > > While I don't think we should document that the exit code has > > a special meaning for the builtin, adding the test will help > > prevent another accidental change in the future. If the patch is > > worth taking (to fix the accidental change) then I think the test > > should stay, so we don't make this change accidentally again. > > I think my stance is a bit more nuanced, in that the first half of > the patch to make us exit with 128 is worth taking, simply because > we did not have to and did not intend to change the exit status, but > the other half of the patch, using test_expect_code in the test > suite, sends a wrong message that somehow exact value of non-zero > exit status in this particular case matters. > > To put it another way, if your patch to shuffle the calls for two > error messages, concluded with a call to exit(), were written in the > ideal world, you would have passed 128 to exit(), *and* you wouldn't > have added any test that says "fetch should exit with 128 and not 1 > when it fails". I aimed to massage Patrick's patch so that the > original patch from you will become that patch in the ideal world > when it is squashed in. I tend to agree with Derrick -- if we think that it is important enough to restore the exit code, whether that change was intentional or not, then I think it makes sense to also add a test. The benefit of that test wouldn't be to say "This is cast into stone", but rather to indicate to the developer that a change that they have just been doing has an unintentional side effect. The problem I see with my own stance though is that if you extend it to the extreme, every single `test_must_fail` would need to do exact error code checking. The benefit of this would be kind of dubious though as long as we do not decide to attach meaning to specific error codes. In general I often wish that we had better ways to transport the circumstances of why a specific command has failed to the caller. In Gitaly, we often have to fall back to parsing the standard error stream of a command in order to figure out the failure cause, which does not exactly feel great given that these are rather intended to be consumed by a user rather than a program. Whether that information should be transported via exit codes though... I don't know. An exit code can only convey so much information and they often feel fragile to me. Documenting them explicitly would of course already go a long way, but that wouldn't quite help the fact that this mechanism still can't convey more information than "The command has failed because of a specific root cause". Many commands perform more than a single unit of work though, so even if we know the root cause we still wouldn't necessarily know where exactly it has failed. One way to fix this would be to give commands a way to return structured error data to the caller instead of relying on exit codes. But that is of course a bigger topic, and I feel like I'm digressing. Patrick > > To my view, test cases can change in the future as long as > > there is good justification in doing so. Having existing tests > > helps to demonstrate a change in behavior. > > I agree with that 100%, but the thing is that the error shuffling > patch will not escape 'next' until the upcoming release, at which > time we can rewind and redo 'next'. I think the first half of > Patrick's fix would be a good material to squash into that patch, > which would make the result identical to the one that would have > been written in the ideal world I described above. > > And the other half would not have a place to be in that patch in the > ideal world. IOW, there is no "change in behaviour" we want to > demonstrate here, as we will pretend nothing bad happened after the > upcoming release ;-) > > Thanks.
Jeff King <peff@peff.net> writes: > We _could_ document "128 means something really unexpected happened and > we called die() deep in the code". But even that seems misleading to me, > as we also die() for everyday shallow things (like "the name you gave is > not valid"). The value really means very little in practice, and the > biggest reason not to change it is that we know it doesn't conflict with > any codes that programs _do_ promise are meaningful (like "1" from "diff > --exit-code"). Yeah, I forgot to say that we should mention 128 to tell the users that it is a meaningless positive number chosen to signal a general error and it is set sufficiently high so that it won't conflict with a range of low positive numbers certain subcommands use to convey specific meaning in their errors. And you said it nicely above. With that clarified, my vote still goes to the "do not overly tied to what the current implementation happens to do" route. Thanks.
diff --git a/t/t5703-upload-pack-ref-in-want.sh b/t/t5703-upload-pack-ref-in-want.sh index df74f80061..fe6e056700 100755 --- a/t/t5703-upload-pack-ref-in-want.sh +++ b/t/t5703-upload-pack-ref-in-want.sh @@ -483,7 +483,7 @@ test_expect_success 'server is initially ahead - no ref in want' ' rm -rf local && cp -r "$LOCAL_PRISTINE" local && inconsistency main $(test_oid numeric) && - test_must_fail git -C local fetch 2>err && + test_expect_code 128 git -C local fetch 2>err && test_i18ngrep "fatal: remote error: upload-pack: not our ref" err ' diff --git a/upload-pack.c b/upload-pack.c index ece111c629..15f3318d6d 100644 --- a/upload-pack.c +++ b/upload-pack.c @@ -782,7 +782,7 @@ static void check_non_tip(struct upload_pack_data *data) packet_writer_error(&data->writer, "upload-pack: not our ref %s", oid_to_hex(&o->oid)); - exit(1); + exit(128); } } }
In 7ba7c52d76 (upload-pack: fix race condition in error messages, 2023-08-10), we have fixed a race in t5516-fetch-push.sh where sometimes error messages got intermingled. This was done by splitting up the call to `die()` such that we print the error message before writing to the remote side, followed by a call to `exit(1)` afterwards. This causes a subtle regression though as `die()` causes us to exit with exit code 128, whereas we now call `exit(1)`. It's not really clear whether we want to guarantee any specific error code in this case, and neither do we document anything like that. But on the other hand, it seems rather clear that this is an unintended side effect of the change given that this change in behaviour was not mentioned at all. Fix this regression by exiting with 128 again and tighten one of our tests to catch such unintended side effects. Signed-off-by: Patrick Steinhardt <ps@pks.im> --- We have found this issue in our CI pipeline in Gitaly, which explicitly checks for the error code. It is very much debatable whether we should even be doing that in the first place given that error codes are not even explicitly documented here. But I think this is worth fixing anyway given that it seems like an unintended side effect to me and might be biting others, as well. If you folks agree with my reasoning, then I think we should pick this up before Git v2.42.0 is released. Patrick t/t5703-upload-pack-ref-in-want.sh | 2 +- upload-pack.c | 2 +- 2 files changed, 2 insertions(+), 2 deletions(-)