Message ID | 20121223091034.5e313ee1@tlielax.poochiereds.net (mailing list archive) |
---|---|
State | New, archived |
On Sun, 23 Dec 2012 09:10:34 -0500
Jeff Layton <jlayton@redhat.com> wrote:
> On Thu, 20 Dec 2012 09:38:06 -0500
> Jeff Layton <jlayton@redhat.com> wrote:
>
> > On Wed, 19 Dec 2012 11:30:32 -0800 (PST)
> > Tim Perry <tim.perry@lifetime.oregonstate.edu> wrote:
> >
> > > Dear Jeff, et al.,
> > >
> > > I can reproduce the problem by starting "find . -name \*.ext" and killing it when connected to either of our two Windows 2003 Servers. I can *not* reproduce it doing the same thing connected to a Windows 7 box.
> > >
> > > $ uname -a
> > > Linux servername 3.2.0-34-generic #53-Ubuntu SMP Thu Nov 15 10:49:02 UTC 2012 i686 i686 i386 GNU/Linux
> > > $ cat /proc/version
> > > Linux version 3.2.0-34-generic (buildd@roseapple) (gcc version 4.6.3 (Ubuntu/Linaro 4.6.3-1ubuntu5) ) #53-Ubuntu SMP Thu Nov 15 10:49:02 UTC 2012
> > > $ lsb_release -a
> > > No LSB modules are available.
> > > Distributor ID: Ubuntu
> > > Description: Ubuntu 12.04.1 LTS
> > > Release: 12.04
> > > Codename: precise
> > >
> > > I tried using strace but hitting ctrl-c killed strace (obviously, oops), but interestingly, this did *not* hang the file system. I will try and kill the find command (kill -9 perhaps?) and see if I can recreate the error that way.
> > >
> > > CONTINUING HERE:
> > > I don't think strace on the find command will help because it isn't making the network connections. CIFS is making the network connections. Maybe I can cause the mount to happen with an strace version of CIFS? How would I do that?
> > >
> > > Anyhow, I opened two terminal windows and proceeded as follows:
> > >
> > > In terminal 1:
> > >
> > > $ strace find . -name \*adzzz >& ~/straceFind.txt
> > >
> > > In terminal 2:
> > > $ ps aux | grep find | grep -v strace
> > > perry 2583 12.6 0.0 4792 1088 pts/5 R+ 11:27 0:00 find . -name *adzzz
> > > perry 2585 0.0 0.0 4388 828 pts/2 S+ 11:27 0:00 grep find
> > > $ kill -9 2583
> > >
> > > File system dies.
> > >
> > > I've attached the straceFind.txt, but it just shows find walking the filesystem tree:
> > > fstatat64(AT_FDCWD, "0010", {st_mode=S_IFDIR|0777, st_size=0, ...}, AT_SYMLINK_NOFOLLOW) = 0
> > > openat(AT_FDCWD, "0010", O_RDONLY|O_NONBLOCK|O_LARGEFILE|O_DIRECTORY|O_CLOEXEC) = 5
> > > fchdir(5) = 0
> > > getdents64(5, /* 14 entries */, 32768) = 448
> > > getdents64(5, /* 0 entries */, 32768) = 0
> > > close(5) = 0
> > > fstatat64(AT_FDCWD, "_vti_cnf", {st_mode=S_IFDIR|0777, st_size=0, ...}, AT_SYMLINK_NOFOLLOW) = 0
> > > openat(AT_FDCWD, "_vti_cnf", O_RDONLY|O_NONBLOCK|O_LARGEFILE|O_DIRECTORY|O_CLOEXEC) = 5
> > > fchdir(5) = 0
> > > getdents64(5, /* 13 entries */, 32768) = 416
> > > getdents64(5, /* 0 entries */, 32768) = 0
> > > close(5) = 0
> > > open("..", O_RDONLY|O_NOCTTY|O_NONBLOCK|O_LARGEFILE|O_DIRECTORY|O_NOFOLLOW) = 5
> > > fstat64(5, {st_mode=S_IFDIR|0777, st_size=0, ...}) = 0
> > > fchdir(
> > >
> > > Ideas?
> > >
> >
> > That kernel is pretty old, so you may want to try a more recent one.
> >
> > You may first want to start by tracing with wireshark -- see what's
> > happening on the wire before and after the signal is delivered.
> >
> > If it works against win7 then it's likely that win7 disconnects the
> > socket when the signatures are wrong. With that, we'd reestablish the
> > connection and things would start working again. I suspect that win2k8
> > just starts returning an error that we map to -EACCES.
> >
> > It's possible that we should disconnect the client when the signatures
> > start looking wrong, but I think we need to understand why signals are
> > causing this issue in the first place.
> >
> > There are some places where we do interruptible sleeps (vs. killable
> > ones). It's possible that SIGINT (which is what ^c generally delivers)
> > is causing havoc there.
> >
>
> I had a look at the code today and suspect that I know what the problem
> is. When the kernel goes to send a request, it first signs it and then
> bumps the sequence numbers that it tracks. If the request doesn't
> actually make it out onto the wire, like when the task catches a
> signal, those sequence numbers remain high even though the request
> didn't go out.
>
> Here's an untested patch that might help tell whether this is the
> case. You may want to try it and see if it does. Note that this fix is
> a bit of a kludge and is not suitable for merging!
>
> A better fix would involve changing when the sequence number gets
> bumped in the first place. If this patch seems to help things, then
> I'll look at coding that up.
>
> diff --git a/fs/cifs/transport.c b/fs/cifs/transport.c
> index 76d974c..4520234 100644
> --- a/fs/cifs/transport.c
> +++ b/fs/cifs/transport.c
> @@ -334,10 +334,14 @@ uncork:
>  		server->tcpStatus = CifsNeedReconnect;
>  	}
>
> -	if (rc < 0 && rc != -EINTR)
> -		cERROR(1, "Error %d sending data on socket to server", rc);
> -	else
> +	if (rc < 0) {
> +		if (rc == -EINTR)
> +			server->sequence_number -= 2;
> +		else
> +			cERROR(1, "Error %d sending data on socket to server", rc);
> +	} else {
>  		rc = 0;
> +	}
>
>  	return rc;
>  }
>

I was able to reproduce this, and I don't think the above patch will
fix it (at least not completely). The problem seems to be that the NT
cancel command is screwing up the sequence numbers. We'll have to do
some research to figure out why that's occurring.
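
The hypothesis in the quoted mail (sign first, bump the sequence number, and never undo the bump if the send is interrupted) can be illustrated with a small userspace model. This is only a sketch under stated assumptions: `toy_conn`, `toy_send` and the `toy_result` values are hypothetical names, not kernel code, and the "server" is modelled simply as a counter that advances only when a request actually arrives. The step of 2 is taken from the `-= 2` in the patch. It does not model the NT cancel behaviour mentioned in the follow-up.

```c
#include <assert.h>
#include <stdbool.h>
#include <stdint.h>

/* Hypothetical toy model; none of these names come from fs/cifs. */
enum toy_result { TOY_OK, TOY_INTERRUPTED, TOY_BAD_SIGNATURE };

struct toy_conn {
	uint32_t client_seq;	/* number the client signs its next request with */
	uint32_t server_seq;	/* number the server expects next (assumed model) */
};

/*
 * Sign first, then try to send.
 * interrupted: a signal arrives before the request reaches the wire.
 * rollback:    apply the patch's "sequence_number -= 2" on interruption.
 */
enum toy_result toy_send(struct toy_conn *c, bool interrupted, bool rollback)
{
	uint32_t signed_with;

	assert(c);
	signed_with = c->client_seq;
	c->client_seq += 2;		/* bumped at signing time, step as in the patch */

	if (interrupted) {
		if (rollback)
			c->client_seq -= 2;	/* nothing was sent, so undo the bump */
		return TOY_INTERRUPTED;
	}

	if (signed_with != c->server_seq)
		return TOY_BAD_SIGNATURE;	/* the two sides now disagree */

	c->server_seq += 2;		/* server saw the request and moves on */
	return TOY_OK;
}
```

Starting both counters at 0, a normal send, an interrupted send and another normal send give `TOY_OK`, `TOY_INTERRUPTED`, `TOY_BAD_SIGNATURE` without the rollback, and `TOY_OK`, `TOY_INTERRUPTED`, `TOY_OK` with it. In this model every request after the interrupted one keeps failing until something resets the counters, which matches the description of the mount staying broken after the signal.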
On Mon, 2012-12-24 at 09:14 -0500, Jeff Layton wrote:
> On Sun, 23 Dec 2012 09:10:34 -0500
> Jeff Layton <jlayton@redhat.com> wrote:
[...]
> > I had a look at the code today and suspect that I know what the problem
> > is. When the kernel goes to send a request, it first signs it and then
> > bumps the sequence numbers that it tracks. If the request doesn't
> > actually make it out onto the wire, like when the task catches a
> > signal, those sequence numbers remain high even though the request
> > didn't go out.
> >
> > Here's an untested patch that might help tell whether this is the
> > case. You may want to try it and see if it does. Note that this fix is
> > a bit of a kludge and is not suitable for merging!
> >
> > A better fix would involve changing when the sequence number gets
> > bumped in the first place. If this patch seems to help things, then
> > I'll look at coding that up.
[...]
> I was able to reproduce this, and I don't think the above patch will
> fix it (at least not completely). The problem seems to be that the NT
> cancel command is screwing up the sequence numbers. We'll have to do
> some research to figure out why that's occurring.

Jeff, we got a bug report in Debian which seems to be the same problem:
<http://bugs.debian.org/695492>. Please cc John Darrah and the bug
address as above.

Ben.
On Sat, 29 Dec 2012 01:24:36 +0100
Ben Hutchings <ben@decadent.org.uk> wrote:
> On Mon, 2012-12-24 at 09:14 -0500, Jeff Layton wrote:
> > On Sun, 23 Dec 2012 09:10:34 -0500
> > Jeff Layton <jlayton@redhat.com> wrote:
> [...]
> > > I had a look at the code today and suspect that I know what the problem
> > > is. When the kernel goes to send a request, it first signs it and then
> > > bumps the sequence numbers that it tracks. If the request doesn't
> > > actually make it out onto the wire, like when the task catches a
> > > signal, those sequence numbers remain high even though the request
> > > didn't go out.
> > >
> > > Here's an untested patch that might help tell whether this is the
> > > case. You may want to try it and see if it does. Note that this fix is
> > > a bit of a kludge and is not suitable for merging!
> > >
> > > A better fix would involve changing when the sequence number gets
> > > bumped in the first place. If this patch seems to help things, then
> > > I'll look at coding that up.
> [...]
> > I was able to reproduce this, and I don't think the above patch will
> > fix it (at least not completely). The problem seems to be that the NT
> > cancel command is screwing up the sequence numbers. We'll have to do
> > some research to figure out why that's occurring.
>
> Jeff, we got a bug report in Debian which seems to be the same problem:
> <http://bugs.debian.org/695492>. Please cc John Darrah and the bug
> address as above.
>
> Ben.
>

You may want to try this patch. It seems to fix the problem for me, but
I think there is probably some more work to do in this area.

http://www.spinics.net/lists/linux-cifs/msg07576.html
Hello,

I've been searching to find a list of return codes from mount.cifs, with no luck. They're not listed in the manpage. I can google individual errors, but I'm hoping for something like a list of all possible errors. Or at least the most common ones.

Short of looking at code, is there any possibility that someone's documented details on something like causes for error code 5? Or other return values?

Any suggestions would be most appreciated!

Best regards,
- Matthew--
To unsubscribe from this list: send the line "unsubscribe linux-cifs" in
the body of a message to majordomo@vger.kernel.org
More majordomo info at http://vger.kernel.org/majordomo-info.html
On Wed, 13 Mar 2013 10:00:10 -0400
"Matthew M. DeLoera" <mdeloera@exacq.com> wrote:
> Hello,
>
> I've been searching to find a list of return codes from mount.cifs, with no luck. They're not listed in the manpage. I can google individual errors, but I'm hoping for something like a list of all possible errors. Or at least the most common ones.
>
> Short of looking at code, is there any possibility that someone's documented details on something like causes for error code 5? Or other return values?
>
> Any suggestions would be most appreciated!
>
> Best regards,
> - Matthew--

Like most mount helpers, the return codes are typically what the kernel
returns on the mount() syscall. Error code '5' is likely EIO, which is
a very generic error.

Cranking up debugging may give you more info as to what went wrong:

https://wiki.samba.org/index.php/LinuxCIFS_troubleshooting
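
Since the reply above says the codes are typically errno values from the mount() syscall, one way to document them in a project is to translate each number through the C library. The following is a minimal sketch, not a list of what the CIFS code can return: `errno_symbol` and `describe_errno` are hypothetical helper names, the output format is made up for illustration, and the three values covered are simply the ones that happen to be mentioned in this thread (EIO, EACCES, EINTR).

```c
#include <assert.h>
#include <errno.h>
#include <stdio.h>
#include <string.h>

/*
 * Map an errno value to its symbolic name. Only the values named in this
 * thread are listed; anything else is reported as unlisted.
 */
const char *errno_symbol(int err)
{
	switch (err) {
	case EIO:
		return "EIO";
	case EACCES:
		return "EACCES";
	case EINTR:
		return "EINTR";
	default:
		return "(unlisted)";
	}
}

/*
 * Write "errno N (NAME): description" into buf, where the description is
 * whatever strerror() provides on this system. Returns what snprintf
 * returns, i.e. the length the full string needs.
 */
int describe_errno(int err, char *buf, size_t len)
{
	assert(buf != NULL && len > 0);
	return snprintf(buf, len, "errno %d (%s): %s",
			err, errno_symbol(err), strerror(err));
}
```

Calling `describe_errno(EIO, buf, sizeof(buf))` fills `buf` with the number, the name `EIO` and the system's description text. The description wording depends on the C library, so a project that documents expected codes is probably better off keying on the symbolic name than on the message string.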
Jeff,

Thanks for the info. I should clarify that it's not that I've got an error. I'm trying to document expected error codes for my project. Though I am doing some testing to assemble at least some error cases and associated codes.

Agreed about error 5 in this case, though was hoping there might be some sort of list out there somewhere about at least some of the various CIFS failures that get rolled together under error 5.

Best regards,
- Matthew

On Mar 13, 2013, at 10:23 AM, Jeff Layton wrote:

> On Wed, 13 Mar 2013 10:00:10 -0400
> "Matthew M. DeLoera" <mdeloera@exacq.com> wrote:
>
>> Hello,
>>
>> I've been searching to find a list of return codes from mount.cifs, with no luck. They're not listed in the manpage. I can google individual errors, but I'm hoping for something like a list of all possible errors. Or at least the most common ones.
>>
>> Short of looking at code, is there any possibility that someone's documented details on something like causes for error code 5? Or other return values?
>>
>> Any suggestions would be most appreciated!
>>
>> Best regards,
>> - Matthew--
>
> Like most mount helpers, the return codes are typically what the kernel
> returns on the mount() syscall. Error code '5' is likely EIO, which is
> a very generic error.
>
> Cranking up debugging may give you more info as to what went wrong:
>
> https://wiki.samba.org/index.php/LinuxCIFS_troubleshooting
>
> --
> Jeff Layton <jlayton@redhat.com>
diff --git a/fs/cifs/transport.c b/fs/cifs/transport.c
index 76d974c..4520234 100644
--- a/fs/cifs/transport.c
+++ b/fs/cifs/transport.c
@@ -334,10 +334,14 @@ uncork:
 		server->tcpStatus = CifsNeedReconnect;
 	}
 
-	if (rc < 0 && rc != -EINTR)
-		cERROR(1, "Error %d sending data on socket to server", rc);
-	else
+	if (rc < 0) {
+		if (rc == -EINTR)
+			server->sequence_number -= 2;
+		else
+			cERROR(1, "Error %d sending data on socket to server", rc);
+	} else {
 		rc = 0;
+	}
 
 	return rc;
 }