CIFS mount fails if I ctrl-c a long-running find process (Linux mounting Windows share)

Message ID 20121223091034.5e313ee1@tlielax.poochiereds.net (mailing list archive)
State New, archived

Commit Message

Jeff Layton Dec. 23, 2012, 2:10 p.m. UTC
On Thu, 20 Dec 2012 09:38:06 -0500
Jeff Layton <jlayton@redhat.com> wrote:

> On Wed, 19 Dec 2012 11:30:32 -0800 (PST)
> Tim Perry <tim.perry@lifetime.oregonstate.edu> wrote:
> 
> > Dear Jeff, et al.,
> > 
> > 
> > I can reproduce the problem by starting "find . -name \*.ext" and killing it when connected to either of our two Windows 2003 Servers. I can *not* reproduce it doing the same thing connected to a Windows 7 box.
> > 
> > $ uname -a
> > Linux servername 3.2.0-34-generic #53-Ubuntu SMP Thu Nov 15 10:49:02 UTC 2012 i686 i686 i386 GNU/Linux
> > $ cat /proc/version
> > 
> > Linux version 3.2.0-34-generic (buildd@roseapple) (gcc version 4.6.3 (Ubuntu/Linaro 4.6.3-1ubuntu5) ) #53-Ubuntu SMP Thu Nov 15 10:49:02 UTC 2012
> > $ lsb_release -a
> > No LSB modules are available.
> > Distributor ID: Ubuntu
> > Description:    Ubuntu 12.04.1 LTS
> > Release:        12.04
> > Codename:       precise
> > 
> > 
> > I tried using strace, but hitting ctrl-c killed strace (obviously, oops). Interestingly, this did *not* hang the file system. I will try to kill the find command (kill -9 perhaps?) and see if I can recreate the error that way.
> > 
> > CONTINUING HERE:
> > I don't think strace on the find command will help because it isn't making the network connections. CIFS is making the network connections. Maybe I can cause the mount to happen with an strace version of CIFS?  How would I do that?
> > 
> > Anyhow, I opened two terminal windows and proceeded as follows:
> > 
> > In terminal 1:
> > 
> > $ strace find . -name \*adzzz >& ~/straceFind.txt
> > 
> > 
> > In terminal 2:
> > $ ps aux | grep find | grep -v strace
> > perry     2583 12.6  0.0   4792  1088 pts/5    R+   11:27   0:00 find . -name *adzzz
> > perry     2585  0.0  0.0   4388   828 pts/2    S+   11:27   0:00 grep find
> > $ kill -9 2583
> > 
> > File system dies.
> > 
> > I've attached the straceFind.txt, but it just shows find walking the filesystem tree:
> > fstatat64(AT_FDCWD, "0010", {st_mode=S_IFDIR|0777, st_size=0, ...}, AT_SYMLINK_NOFOLLOW) = 0
> > openat(AT_FDCWD, "0010", O_RDONLY|O_NONBLOCK|O_LARGEFILE|O_DIRECTORY|O_CLOEXEC) = 5
> > fchdir(5)                               = 0
> > getdents64(5, /* 14 entries */, 32768)  = 448
> > getdents64(5, /* 0 entries */, 32768)   = 0
> > close(5)                                = 0
> > fstatat64(AT_FDCWD, "_vti_cnf", {st_mode=S_IFDIR|0777, st_size=0, ...}, AT_SYMLINK_NOFOLLOW) = 0
> > openat(AT_FDCWD, "_vti_cnf", O_RDONLY|O_NONBLOCK|O_LARGEFILE|O_DIRECTORY|O_CLOEXEC) = 5
> > fchdir(5)                               = 0
> > getdents64(5, /* 13 entries */, 32768)  = 416
> > getdents64(5, /* 0 entries */, 32768)   = 0
> > close(5)                                = 0
> > open("..", O_RDONLY|O_NOCTTY|O_NONBLOCK|O_LARGEFILE|O_DIRECTORY|O_NOFOLLOW) = 5
> > fstat64(5, {st_mode=S_IFDIR|0777, st_size=0, ...}) = 0
> > fchdir(
> > 
> > 
> > Ideas?
> > 
> 
> That kernel is pretty old, so you may want to try a more recent one.
> 
> You may first want to start by tracing with wireshark -- see what's
> happening on the wire before and after the signal is delivered.
> 
> If it works against win7 then it's likely that win7 disconnects the
> socket when the signatures are wrong. With that, we'd reestablish the
> connection and things would start working again. I suspect that win2k8
> just starts returning an error that we map to -EACCES.
> 
> It's possible that we should disconnect the client when the signatures
> start looking wrong, but I think we need to understand why signals are
> causing this issue in the first place.
> 
> There are some places where we do interruptible sleeps (vs. killable
> ones). It's possible that SIGINT (which is what ^c generally delivers)
> is causing havoc there.
> 

I had a look at the code today and suspect that I know what the problem
is. When the kernel goes to send a request, it first signs it and then
bumps the sequence numbers that it tracks. If the request doesn't
actually make it out onto the wire, like when the task catches a
signal, those sequence numbers remain high even though the request
didn't go out.

Here's an untested patch that might help tell whether this is the
case. You may want to try it and see if it does. Note that this fix is
a bit of a kludge and is not suitable for merging!

A better fix would involve changing when the sequence number gets
bumped in the first place. If this patch seems to help things, then
I'll look at coding that up.



--
To unsubscribe from this list: send the line "unsubscribe linux-cifs" in
the body of a message to majordomo@vger.kernel.org
More majordomo info at  http://vger.kernel.org/majordomo-info.html

Comments

Jeff Layton Dec. 24, 2012, 2:14 p.m. UTC | #1
On Sun, 23 Dec 2012 09:10:34 -0500
Jeff Layton <jlayton@redhat.com> wrote:

> [...]
> I had a look at the code today and suspect that I know what the problem
> is. When the kernel goes to send a request, it first signs it and then
> bumps the sequence numbers that it tracks. If the request doesn't
> actually make it out onto the wire, like when the task catches a
> signal, those sequence numbers remain high even though the request
> didn't go out.
> 
> Here's an untested patch that might help tell whether this is the
> case. You may want to try it and see if it does. Note that this fix is
> a bit of a kludge and is not suitable for merging!
> 
> A better fix would involve changing when the sequence number gets
> bumped in the first place. If this patch seems to help things, then
> I'll look at coding that up.
> 
> diff --git a/fs/cifs/transport.c b/fs/cifs/transport.c
> index 76d974c..4520234 100644
> --- a/fs/cifs/transport.c
> +++ b/fs/cifs/transport.c
> @@ -334,10 +334,14 @@ uncork:
>  		server->tcpStatus = CifsNeedReconnect;
>  	}
>  
> -	if (rc < 0 && rc != -EINTR)
> -		cERROR(1, "Error %d sending data on socket to server", rc);
> -	else
> +	if (rc < 0) {
> +		if (rc == -EINTR)
> +			server->sequence_number -= 2;
> +		else
> +			cERROR(1, "Error %d sending data on socket to server", rc);
> +	} else {
>  		rc = 0;
> +	}
>  
>  	return rc;
>  }
> 
> 

I was able to reproduce this, and I don't think the above patch will
fix it (at least not completely). The problem seems to be that the NT
cancel command is screwing up the sequence numbers. We'll have to do
some research to figure out why that's occurring.

Ben Hutchings Dec. 29, 2012, 12:24 a.m. UTC | #2
On Mon, 2012-12-24 at 09:14 -0500, Jeff Layton wrote:
> On Sun, 23 Dec 2012 09:10:34 -0500
> Jeff Layton <jlayton@redhat.com> wrote:
[...]
> > I had a look at the code today and suspect that I know what the problem
> > is. When the kernel goes to send a request, it first signs it and then
> > bumps the sequence numbers that it tracks. If the request doesn't
> > actually make it out onto the wire, like when the task catches a
> > signal, those sequence numbers remain high even though the request
> > didn't go out.
> > 
> > Here's an untested patch that might help tell whether this is the
> > case. You may want to try it and see if it does. Note that this fix is
> > a bit of a kludge and is not suitable for merging!
> > 
> > A better fix would involve changing when the sequence number gets
> > bumped in the first place. If this patch seems to help things, then
> > I'll look at coding that up.
[...]
> I was able to reproduce this, and I don't think the above patch will
> fix it (at least not completely). The problem seems to be that the NT
> cancel command is screwing up the sequence numbers. We'll have to do
> some research to figure out why that's occurring.

Jeff, we got a bug report in Debian which seems to be the same problem:
<http://bugs.debian.org/695492>.  Please cc John Darrah and the bug
address as above.

Ben.

Jeff Layton Dec. 29, 2012, 3:01 a.m. UTC | #3
On Sat, 29 Dec 2012 01:24:36 +0100
Ben Hutchings <ben@decadent.org.uk> wrote:

> On Mon, 2012-12-24 at 09:14 -0500, Jeff Layton wrote:
> > On Sun, 23 Dec 2012 09:10:34 -0500
> > Jeff Layton <jlayton@redhat.com> wrote:
> [...]
> > > I had a look at the code today and suspect that I know what the problem
> > > is. When the kernel goes to send a request, it first signs it and then
> > > bumps the sequence numbers that it tracks. If the request doesn't
> > > actually make it out onto the wire, like when the task catches a
> > > signal, those sequence numbers remain high even though the request
> > > didn't go out.
> > > 
> > > Here's an untested patch that might help tell whether this is the
> > > case. You may want to try it and see if it does. Note that this fix is
> > > a bit of a kludge and is not suitable for merging!
> > > 
> > > A better fix would involve changing when the sequence number gets
> > > bumped in the first place. If this patch seems to help things, then
> > > I'll look at coding that up.
> [...]
> > I was able to reproduce this, and I don't think the above patch will
> > fix it (at least not completely). The problem seems to be that the NT
> > cancel command is screwing up the sequence numbers. We'll have to do
> > some research to figure out why that's occurring.
> 
> Jeff, we got a bug report in Debian which seems to be the same problem:
> <http://bugs.debian.org/695492>.  Please cc John Darrah and the bug
> address as above.
> 
> Ben.
> 

You may want to try this patch. It seems to fix the problem for me, but
I think there is probably some more work to do in this area.

http://www.spinics.net/lists/linux-cifs/msg07576.html

Matthew M. DeLoera March 13, 2013, 2 p.m. UTC | #4
Hello,

I've been searching to find a list of return codes from mount.cifs, with no luck. They're not listed in the manpage. I can google individual errors, but I'm hoping for something like a list of all possible errors. Or at least the most common ones.

Short of looking at code, is there any possibility that someone's documented details on something like causes for error code 5? Or other return values?

Any suggestions would be most appreciated!

Best regards,
- Matthew

Jeff Layton March 13, 2013, 2:23 p.m. UTC | #5
On Wed, 13 Mar 2013 10:00:10 -0400
"Matthew M. DeLoera" <mdeloera@exacq.com> wrote:

> Hello,
> 
> I've been searching to find a list of return codes from mount.cifs, with no luck. They're not listed in the manpage. I can google individual errors, but I'm hoping for something like a list of all possible errors. Or at least the most common ones.
> 
> Short of looking at code, is there any possibility that someone's documented details on something like causes for error code 5? Or other return values?
> 
> Any suggestions would be most appreciated!
> 
> Best regards,
> - Matthew

Like most mount helpers, the return codes are typically what the kernel
returns on the mount() syscall. Error code '5' is likely EIO, which is
a very generic error.

Cranking up debugging may give you more info as to what went wrong:

    https://wiki.samba.org/index.php/LinuxCIFS_troubleshooting

Matthew M. DeLoera March 15, 2013, 12:44 a.m. UTC | #6
Jeff,

Thanks for the info. I should clarify that it's not that I've got an error. I'm trying to document expected error codes for my project.

I am doing some testing to assemble at least some error cases and their associated codes, though. Agreed about error 5 in this case, but I was hoping there might be a list out there somewhere of at least some of the various CIFS failures that get rolled together under error 5.

Best regards,
- Matthew


Patch

diff --git a/fs/cifs/transport.c b/fs/cifs/transport.c
index 76d974c..4520234 100644
--- a/fs/cifs/transport.c
+++ b/fs/cifs/transport.c
@@ -334,10 +334,14 @@  uncork:
 		server->tcpStatus = CifsNeedReconnect;
 	}
 
-	if (rc < 0 && rc != -EINTR)
-		cERROR(1, "Error %d sending data on socket to server", rc);
-	else
+	if (rc < 0) {
+		if (rc == -EINTR)
+			server->sequence_number -= 2;
+		else
+			cERROR(1, "Error %d sending data on socket to server", rc);
+	} else {
 		rc = 0;
+	}
 
 	return rc;
 }