Message ID | a4472d6d1551e7c25540c4c8361bcb6b1c9f92ff.1729084997.git.ps@pks.im (mailing list archive) |
---|---|
State | Superseded |
Headers | show |
Series | builtin/credential-cache--daemon: fix error when "exit"ing on Cygwin | expand |
On Wed, Oct 16, 2024 at 04:21:36PM +0200, Patrick Steinhardt wrote: > diff --git a/builtin/credential-cache--daemon.c b/builtin/credential-cache--daemon.c > index bc22f5c6d24..5a09df5c167 100644 > --- a/builtin/credential-cache--daemon.c > +++ b/builtin/credential-cache--daemon.c > @@ -156,13 +156,11 @@ static void serve_one_client(FILE *in, FILE *out) > } > else if (!strcmp(action.buf, "exit")) { > /* > - * It's important that we clean up our socket first, and then > - * signal the client only once we have finished the cleanup. > - * Calling exit() directly does this, because we clean up in > - * our atexit() handler, and then signal the client when our > - * process actually ends, which closes the socket and gives > - * them EOF. > + * We must close our file handles before we exit such that the > + * client will receive an EOF. > */ > + fclose(in); > + fclose(out); > exit(0); > } This breaks the thing the comment was trying to protect against. We want to unlink() the socket file before closing the descriptors. From 7d5e9c9849 (which you can find with blame or "git log --follow -Satexit builtin/credential-cache--daemon.c"): credential-cache--daemon: clarify "exit" action semantics When this code was originally written, there wasn't much thought given to the timing between a client asking for "exit", the daemon signaling that the action is done (with EOF), and the actual cleanup of the socket. However, we need to care about this so that our test scripts do not end up racy (e.g., by asking for an exit and checking that the socket was cleaned up). The code that is already there happens to behave very reasonably; let's add a comment to make it clear that any changes should retain the same behavior. So with the proposed change, t0301 will now fail racily. We need that unlink() to happen before the fclose(). Just calling exit() does things in the right order, but it should also be OK to do an explicit: delete_tempfile(&socket_file); fclose(in); fclose(out); That would probably require making socket_file a global variable. (You can't just return out of the serving loop, since that closes the sockets first before deleting the tempfile). > Clients can signal the git-credential-cache(1) daemon that it is > supposed to exit by sending it an "exit" command. The details around > how exactly the daemon exits seem to be rather intricate as spelt out by > a comment surrounding our call to exit(3p), as we need to be mindful > around closing the client streams before we signal the client. > > The logic is broken on Cygwin though: when a client asks the daemon to > exit, they won't see the EOF and will instead get an error message: > > fatal: read error from cache daemon: Software caused connection abort > > This issue is known in Cygwin, see for example [1], but the exact root > cause is not known. > [...] > [1]: https://github.com/cygporter/git/issues/51 I don't see any details at that issue, but I'm not sure how it would fix things. From the client's perspective, they are going to see the descriptor either way. Unless there is some magic that fclose() does that a normal descriptor-close-on-exit does not do. That "Software caused connection abort" thing seems like a weird not-standard-Unix errno value. Searching for it mostly yields people complaining about getting it from ssh under cygwin. :) If the magic that cygwin needs is actually "fclose before unlink", then that is in opposition to other platforms (and I suspect would make t0301 racy there). > As it turns out, we can avoid the issue by explicitly closing the client > streams via fclose(3p). I'm not sure at all where the claimed atexit(3p) > handler mentioned in the comment is supposed to live, but from all I can > see we do not have any installed that would close the sockets for us. So > this leaves me with a bit of a sour taste overall. The mention of atexit is a little oblique these days. We moved from a manual atexit handler to using the tempfile.c handler in 9e9033166b (credential-cache--daemon: use tempfile module, 2015-08-10). > That being said, I couldn't spot anything obviously wrong with closing > the streams ourselves, and it does fix the issue on Cygwin without any > regressions on other platforms as far as I can see. So let's go for this > fix, even though I cannot properly explain it. Running t0301 with --stress on Linux failed for me after a minute or so. -Peff
On Wed, Oct 16, 2024 at 10:55:40AM -0400, Jeff King wrote: > > The logic is broken on Cygwin though: when a client asks the daemon to > > exit, they won't see the EOF and will instead get an error message: > > > > fatal: read error from cache daemon: Software caused connection abort > > > > This issue is known in Cygwin, see for example [1], but the exact root > > cause is not known. > > [...] > > [1]: https://github.com/cygporter/git/issues/51 > > I don't see any details at that issue, but I'm not sure how it would fix > things. From the client's perspective, they are going to see the > descriptor either way. Unless there is some magic that fclose() does > that a normal descriptor-close-on-exit does not do. > > That "Software caused connection abort" thing seems like a weird > not-standard-Unix errno value. Searching for it mostly yields people > complaining about getting it from ssh under cygwin. :) > > If the magic that cygwin needs is actually "fclose before unlink", then > that is in opposition to other platforms (and I suspect would make t0301 > racy there). This all seemed eerily familiar. Try this thread: https://lore.kernel.org/git/9dc3e85f-a532-6cff-de11-1dfb2e4bc6b6@ramsayjones.plus.com/ It looks like the conclusion was that we should adjust errno handling on the client side, but nobody ever followed up with an actual patch. -Peff
On 16/10/2024 15:21, Patrick Steinhardt wrote: > Clients can signal the git-credential-cache(1) daemon that it is > supposed to exit by sending it an "exit" command. The details around > how exactly the daemon exits seem to be rather intricate as spelt out by > a comment surrounding our call to exit(3p), as we need to be mindful > around closing the client streams before we signal the client. > > The logic is broken on Cygwin though: when a client asks the daemon to > exit, they won't see the EOF and will instead get an error message: > > fatal: read error from cache daemon: Software caused connection abort > > This issue is known in Cygwin, see for example [1], but the exact root > cause is not known. > > As it turns out, we can avoid the issue by explicitly closing the client > streams via fclose(3p). I'm not sure at all where the claimed atexit(3p) > handler mentioned in the comment is supposed to live, but from all I can > see we do not have any installed that would close the sockets for us. So > this leaves me with a bit of a sour taste overall. > > That being said, I couldn't spot anything obviously wrong with closing > the streams ourselves, and it does fix the issue on Cygwin without any > regressions on other platforms as far as I can see. So let's go for this > fix, even though I cannot properly explain it. > > [1]: https://github.com/cygporter/git/issues/51 > > Signed-off-by: Patrick Steinhardt <ps@pks.im> > --- > > I've Cc'd Adam, who is the maintainer of the Git package in Cygwin, as > well as Peff, who is the original author of the below comment. I'd be > really happy if one of you could enlighten me here :) > > Patrick > > builtin/credential-cache--daemon.c | 10 ++++------ > 1 file changed, 4 insertions(+), 6 deletions(-) > > diff --git a/builtin/credential-cache--daemon.c b/builtin/credential-cache--daemon.c > index bc22f5c6d24..5a09df5c167 100644 > --- a/builtin/credential-cache--daemon.c > +++ b/builtin/credential-cache--daemon.c > @@ -156,13 +156,11 @@ static void serve_one_client(FILE *in, FILE *out) > } > else if (!strcmp(action.buf, "exit")) { > /* > - * It's important that we clean up our socket first, and then > - * signal the client only once we have finished the cleanup. > - * Calling exit() directly does this, because we clean up in > - * our atexit() handler, and then signal the client when our > - * process actually ends, which closes the socket and gives > - * them EOF. > + * We must close our file handles before we exit such that the > + * client will receive an EOF. > */ > + fclose(in); > + fclose(out); > exit(0); > } > else if (!strcmp(action.buf, "erase")) Heh, this is very familiar! :) See, for example, the mailing list discussion here[1]. If memory serves, Jeff was against this solution. For what it is worth, my recommended solution is '[RFC PATCH 1A] credential-cache: also handle ECONNABORTED connection closure'. I still have those patches in my local cygwin repo. Ah, let me just add the patch below (this was last rebased onto v2.46.0, but it should (hopefully) be fine to apply to master - famous last words). ATB, Ramsay Jones [1] https://lore.kernel.org/git/9dc3e85f-a532-6cff-de11-1dfb2e4bc6b6@ramsayjones.plus.com/ -------->8-------- From 89526b2b87e39985dc8cd80661662ff087dc1078 Mon Sep 17 00:00:00 2001 From: Ramsay Jones <ramsay@ramsayjones.plus.com> Date: Wed, 22 Jun 2022 19:24:44 +0100 Subject: [PATCH] credential-cache: also handle ECONNABORTED connection closure Signed-off-by: Ramsay Jones <ramsay@ramsayjones.plus.com> --- builtin/credential-cache.c | 2 +- 1 file changed, 1 insertion(+), 1 deletion(-) diff --git a/builtin/credential-cache.c b/builtin/credential-cache.c index 3db8df70a9..defbcc845c 100644 --- a/builtin/credential-cache.c +++ b/builtin/credential-cache.c @@ -30,7 +30,7 @@ static int connection_fatally_broken(int error) static int connection_closed(int error) { - return (error == ECONNRESET); + return (error == ECONNRESET) || (error == ECONNABORTED); } static int connection_fatally_broken(int error)
On 16/10/2024 16:09, Jeff King wrote: > On Wed, Oct 16, 2024 at 10:55:40AM -0400, Jeff King wrote: > >>> The logic is broken on Cygwin though: when a client asks the daemon to >>> exit, they won't see the EOF and will instead get an error message: >>> >>> fatal: read error from cache daemon: Software caused connection abort >>> >>> This issue is known in Cygwin, see for example [1], but the exact root >>> cause is not known. >>> [...] >>> [1]: https://github.com/cygporter/git/issues/51 >> >> I don't see any details at that issue, but I'm not sure how it would fix >> things. From the client's perspective, they are going to see the >> descriptor either way. Unless there is some magic that fclose() does >> that a normal descriptor-close-on-exit does not do. >> >> That "Software caused connection abort" thing seems like a weird >> not-standard-Unix errno value. Searching for it mostly yields people >> complaining about getting it from ssh under cygwin. :) >> >> If the magic that cygwin needs is actually "fclose before unlink", then >> that is in opposition to other platforms (and I suspect would make t0301 >> racy there). > > This all seemed eerily familiar. Try this thread: > > https://lore.kernel.org/git/9dc3e85f-a532-6cff-de11-1dfb2e4bc6b6@ramsayjones.plus.com/ > > It looks like the conclusion was that we should adjust errno handling on > the client side, but nobody ever followed up with an actual patch. Heh, our emails crossed. Yes, I was hoping that, given that Adam had identified the cygwin commit that caused the regression, some resolution on the cygwin side would fix things up. I waited ... and then put it on my TODO list! :) I did look at the cygwin commit and it wasn't at all obvious what happened. In fact there was a comment about making sure that errno values didn't change as a side-effect! Sorry for being tardy, again ... ATB, Ramsay Jones
On Wed, Oct 16, 2024 at 11:09:22AM -0400, Jeff King wrote: > On Wed, Oct 16, 2024 at 10:55:40AM -0400, Jeff King wrote: > > > > The logic is broken on Cygwin though: when a client asks the daemon to > > > exit, they won't see the EOF and will instead get an error message: > > > > > > fatal: read error from cache daemon: Software caused connection abort > > > > > > This issue is known in Cygwin, see for example [1], but the exact root > > > cause is not known. > > > [...] > > > [1]: https://github.com/cygporter/git/issues/51 > > > > I don't see any details at that issue, but I'm not sure how it would fix > > things. From the client's perspective, they are going to see the > > descriptor either way. Unless there is some magic that fclose() does > > that a normal descriptor-close-on-exit does not do. > > > > That "Software caused connection abort" thing seems like a weird > > not-standard-Unix errno value. Searching for it mostly yields people > > complaining about getting it from ssh under cygwin. :) > > > > If the magic that cygwin needs is actually "fclose before unlink", then > > that is in opposition to other platforms (and I suspect would make t0301 > > racy there). > > This all seemed eerily familiar. Try this thread: > > https://lore.kernel.org/git/9dc3e85f-a532-6cff-de11-1dfb2e4bc6b6@ramsayjones.plus.com/ > > It looks like the conclusion was that we should adjust errno handling on > the client side, but nobody ever followed up with an actual patch. Thanks for digging. It would be great if you both and Ramsay could unify on an agreeable path forward here. In the meantime, I picked this up as 'ps/credential-cache-exit-cygwin' in my tree, but let's hold it out from 'seen' as you note it racily fails t0301. Thanks, Taylor
On Wed, Oct 16, 2024 at 04:28:49PM -0400, Taylor Blau wrote: > > This all seemed eerily familiar. Try this thread: > > > > https://lore.kernel.org/git/9dc3e85f-a532-6cff-de11-1dfb2e4bc6b6@ramsayjones.plus.com/ > > > > It looks like the conclusion was that we should adjust errno handling on > > the client side, but nobody ever followed up with an actual patch. > > Thanks for digging. It would be great if you both and Ramsay could unify > on an agreeable path forward here. I think the patch Ramsay posted elsewhere is the right way forward. Hopefully he can fill out a commit message with the summary and then we can proceed. (I'm happy to help with explaining the credential-cache side, but I would just be hand-waving on the cygwin parts). -Peff
On Wed, Oct 16, 2024 at 10:33:19PM -0400, Jeff King wrote: > On Wed, Oct 16, 2024 at 04:28:49PM -0400, Taylor Blau wrote: > > > > This all seemed eerily familiar. Try this thread: > > > > > > https://lore.kernel.org/git/9dc3e85f-a532-6cff-de11-1dfb2e4bc6b6@ramsayjones.plus.com/ > > > > > > It looks like the conclusion was that we should adjust errno handling on > > > the client side, but nobody ever followed up with an actual patch. > > > > Thanks for digging. It would be great if you both and Ramsay could unify > > on an agreeable path forward here. > > I think the patch Ramsay posted elsewhere is the right way forward. > Hopefully he can fill out a commit message with the summary and then we > can proceed. > > (I'm happy to help with explaining the credential-cache side, but I > would just be hand-waving on the cygwin parts). Sounds great to me -- in that case, let's just drop my patch. I was basically just trying to get somebody to have a look at the issue, and it very much seems like I succeeded :) Ramsay, do you want to polish your patch with a commit message or shall I pick it up and iterate on it? Thanks all for chiming in! Patrick
On Wed, Oct 16, 2024 at 10:33:19PM -0400, Jeff King wrote: > On Wed, Oct 16, 2024 at 04:28:49PM -0400, Taylor Blau wrote: > > > > This all seemed eerily familiar. Try this thread: > > > > > > https://lore.kernel.org/git/9dc3e85f-a532-6cff-de11-1dfb2e4bc6b6@ramsayjones.plus.com/ > > > > > > It looks like the conclusion was that we should adjust errno handling on > > > the client side, but nobody ever followed up with an actual patch. > > > > Thanks for digging. It would be great if you both and Ramsay could unify > > on an agreeable path forward here. > > I think the patch Ramsay posted elsewhere is the right way forward. > Hopefully he can fill out a commit message with the summary and then we > can proceed. Yeah, that's exactly what I was hoping for ;-). Thanks, Taylor
On 17/10/2024 09:46, Patrick Steinhardt wrote: > On Wed, Oct 16, 2024 at 10:33:19PM -0400, Jeff King wrote: >> On Wed, Oct 16, 2024 at 04:28:49PM -0400, Taylor Blau wrote: >> >>>> This all seemed eerily familiar. Try this thread: >>>> >>>> https://lore.kernel.org/git/9dc3e85f-a532-6cff-de11-1dfb2e4bc6b6@ramsayjones.plus.com/ >>>> >>>> It looks like the conclusion was that we should adjust errno handling on >>>> the client side, but nobody ever followed up with an actual patch. >>> >>> Thanks for digging. It would be great if you both and Ramsay could unify >>> on an agreeable path forward here. >> >> I think the patch Ramsay posted elsewhere is the right way forward. >> Hopefully he can fill out a commit message with the summary and then we >> can proceed. >> >> (I'm happy to help with explaining the credential-cache side, but I >> would just be hand-waving on the cygwin parts). > > Sounds great to me -- in that case, let's just drop my patch. I was > basically just trying to get somebody to have a look at the issue, and > it very much seems like I succeeded :) > > Ramsay, do you want to polish your patch with a commit message or shall > I pick it up and iterate on it? I can't get to it before the weekend, at the earliest, sorry! :( If you fancy picking it up, please be my guest! :) ATB, Ramsay Jones
On Thu, Oct 17, 2024 at 11:58:33PM +0100, Ramsay Jones wrote: > > > On 17/10/2024 09:46, Patrick Steinhardt wrote: > > On Wed, Oct 16, 2024 at 10:33:19PM -0400, Jeff King wrote: > >> On Wed, Oct 16, 2024 at 04:28:49PM -0400, Taylor Blau wrote: > >> > >>>> This all seemed eerily familiar. Try this thread: > >>>> > >>>> https://lore.kernel.org/git/9dc3e85f-a532-6cff-de11-1dfb2e4bc6b6@ramsayjones.plus.com/ > >>>> > >>>> It looks like the conclusion was that we should adjust errno handling on > >>>> the client side, but nobody ever followed up with an actual patch. > >>> > >>> Thanks for digging. It would be great if you both and Ramsay could unify > >>> on an agreeable path forward here. > >> > >> I think the patch Ramsay posted elsewhere is the right way forward. > >> Hopefully he can fill out a commit message with the summary and then we > >> can proceed. > >> > >> (I'm happy to help with explaining the credential-cache side, but I > >> would just be hand-waving on the cygwin parts). > > > > Sounds great to me -- in that case, let's just drop my patch. I was > > basically just trying to get somebody to have a look at the issue, and > > it very much seems like I succeeded :) > > > > Ramsay, do you want to polish your patch with a commit message or shall > > I pick it up and iterate on it? > > I can't get to it before the weekend, at the earliest, sorry! :( > > If you fancy picking it up, please be my guest! :) I've sent v2 using that hack now, noting your original authorship. I want to finally get on top of all of those platform-specific failures :) Thanks! Patrick
diff --git a/builtin/credential-cache--daemon.c b/builtin/credential-cache--daemon.c index bc22f5c6d24..5a09df5c167 100644 --- a/builtin/credential-cache--daemon.c +++ b/builtin/credential-cache--daemon.c @@ -156,13 +156,11 @@ static void serve_one_client(FILE *in, FILE *out) } else if (!strcmp(action.buf, "exit")) { /* - * It's important that we clean up our socket first, and then - * signal the client only once we have finished the cleanup. - * Calling exit() directly does this, because we clean up in - * our atexit() handler, and then signal the client when our - * process actually ends, which closes the socket and gives - * them EOF. + * We must close our file handles before we exit such that the + * client will receive an EOF. */ + fclose(in); + fclose(out); exit(0); } else if (!strcmp(action.buf, "erase"))
Clients can signal the git-credential-cache(1) daemon that it is supposed to exit by sending it an "exit" command. The details around how exactly the daemon exits seem to be rather intricate as spelt out by a comment surrounding our call to exit(3p), as we need to be mindful around closing the client streams before we signal the client. The logic is broken on Cygwin though: when a client asks the daemon to exit, they won't see the EOF and will instead get an error message: fatal: read error from cache daemon: Software caused connection abort This issue is known in Cygwin, see for example [1], but the exact root cause is not known. As it turns out, we can avoid the issue by explicitly closing the client streams via fclose(3p). I'm not sure at all where the claimed atexit(3p) handler mentioned in the comment is supposed to live, but from all I can see we do not have any installed that would close the sockets for us. So this leaves me with a bit of a sour taste overall. That being said, I couldn't spot anything obviously wrong with closing the streams ourselves, and it does fix the issue on Cygwin without any regressions on other platforms as far as I can see. So let's go for this fix, even though I cannot properly explain it. [1]: https://github.com/cygporter/git/issues/51 Signed-off-by: Patrick Steinhardt <ps@pks.im> --- I've Cc'd Adam, who is the maintainer of the Git package in Cygwin, as well as Peff, who is the original author of the below comment. I'd be really happy if one of you could enlighten me here :) Patrick builtin/credential-cache--daemon.c | 10 ++++------ 1 file changed, 4 insertions(+), 6 deletions(-)