Message ID | 70e91b0b-4bca-60ea-19cf-3df0f49d4e5a@citrix.com (mailing list archive) |
---|---|
State | New, archived |
Series | Failure to reconnect after cluster failvoer |
Couple quick thoughts.

Does this work on current kernels (5.0 for example).

Was thinking about patches that might affect this like:
- "cifs: connect to servername instead of IP for IPC$ share"
- "smb3: on reconnect set PreviousSessionId field"
- Paulo's patches (has cifs-utils coreq) to reconnect to new IP
  address if hostname's IP address changed and his add support for
  failover
- Paulo's patch to remove trailing slashes from server UNC name

On Thu, Feb 21, 2019 at 10:58 AM Ross Lagerwall <ross.lagerwall@citrix.com> wrote:
>
> Hi,
>
> I have an issue with SMB cluster failover. There are two Windows 2012 R2
> Datacenter servers in the cluster. If the primary server is turned off,
> then the secondary server becomes the primary. However, when this
> happens the kernel client is not able to recover the mount.
>
> Here is the reconnection network trace:
>
> Time Source Destination Protocol Length Info
> 16.640530 10.71.217.53 10.71.217.50 SMB2 172 Negotiate Protocol Request
> 16.641723 10.71.217.50 10.71.217.53 SMB2 318 Negotiate Protocol Response
> 16.641799 10.71.217.53 10.71.217.50 SMB2 190 Session Setup Request, NTLMSSP_NEGOTIATE
> 16.642148 10.71.217.50 10.71.217.53 SMB2 442 Session Setup Response, Error: STATUS_MORE_PROCESSING_REQUIRED, NTLMSSP_CHALLENGE
> 16.642201 10.71.217.53 10.71.217.50 SMB2 562 Session Setup Request, NTLMSSP_AUTH, User: clusterad.local7337\Administrator
> 16.656407 10.71.217.50 10.71.217.53 SMB2 142 Session Setup Response
> 16.656492 10.71.217.53 10.71.217.50 SMB2 190 Tree Connect Request Tree: \\10.71.217.50\smbshare
> 16.656916 10.71.217.50 10.71.217.53 SMB2 143 Tree Connect Response, Error: STATUS_BAD_NETWORK_NAME
> 16.659249 10.71.217.53 10.71.217.50 SMB2 190 Tree Connect Request Tree: \\10.71.217.50\smbshare
> 16.659635 10.71.217.50 10.71.217.53 SMB2 143 Tree Connect Response, Error: STATUS_BAD_NETWORK_NAME
> 20.224591 10.71.217.53 10.71.217.50 SMB2 182 Tree Connect Request Tree: \\10.71.217.50\IPC$
> 20.225344 10.71.217.50 10.71.217.53 SMB2 150 Tree Connect Response
> 20.225449 10.71.217.53 10.71.217.50 SMB2 216 Ioctl Request FSCTL_VALIDATE_NEGOTIATE_INFO
> 20.225934 10.71.217.50 10.71.217.53 SMB2 206 Ioctl Response FSCTL_VALIDATE_NEGOTIATE_INFO
> 20.225975 10.71.217.53 10.71.217.50 SMB2 190 Tree Connect Request Tree: \\10.71.217.50\smbshare
> 20.226355 10.71.217.50 10.71.217.53 SMB2 143 Tree Connect Response, Error: STATUS_BAD_NETWORK_NAME
> 22.240595 10.71.217.53 10.71.217.50 SMB2 190 Tree Connect Request Tree: \\10.71.217.50\smbshare
> 22.241159 10.71.217.50 10.71.217.53 SMB2 143 Tree Connect Response, Error: STATUS_BAD_NETWORK_NAME
> 24.256590 10.71.217.53 10.71.217.50 SMB2 190 Tree Connect Request Tree: \\10.71.217.50\smbshare
> 24.257380 10.71.217.50 10.71.217.53 SMB2 143 Tree Connect Response, Error: STATUS_BAD_NETWORK_NAME
> ...
> 40.384609 10.71.217.53 10.71.217.50 SMB2 190 Tree Connect Request Tree: \\10.71.217.50\smbshare
> 40.385135 10.71.217.50 10.71.217.53 SMB2 143 Tree Connect Response, Error: STATUS_BAD_NETWORK_NAME
> 41.772006 10.71.217.53 10.71.217.50 SMB2 190 Tree Connect Request Tree: \\10.71.217.50\smbshare
> 41.772562 10.71.217.50 10.71.217.53 SMB2 143 Tree Connect Response, Error: STATUS_NETWORK_NAME_DELETED
> 41.772641 10.71.217.53 10.71.217.50 SMB2 190 Tree Connect Request Tree: \\10.71.217.50\smbshare
> 41.773037 10.71.217.50 10.71.217.53 SMB2 143 Tree Connect Response, Error: STATUS_NETWORK_NAME_DELETED
> 42.400589 10.71.217.53 10.71.217.50 SMB2 190 Tree Connect Request Tree: \\10.71.217.50\smbshare
> ...
>
> After the secondary server takes over (presumably once it stops
> returning STATUS_BAD_NETWORK_NAME), it then returns
> STATUS_NETWORK_NAME_DELETED indefinitely.
>
> This can be fixed by delaying the tree connect to IPC$ until after the
> tree connect to the share succeeds. The server then no longer returns
> STATUS_NETWORK_NAME_DELETED and instead responds successfully. I'm not
> sure why the server behaves like this and I'm not sure if the client is
> doing something wrong. I found this out because it used to work on older
> kernels before b327a717e506 ("CIFS: make IPC a regular tcon").
>
> Here is the patch that makes it work:
>
> diff --git a/fs/cifs/smb2pdu.c b/fs/cifs/smb2pdu.c
> index dba986524917..1f97ed6459bf 100644
> --- a/fs/cifs/smb2pdu.c
> +++ b/fs/cifs/smb2pdu.c
> @@ -2864,7 +2864,14 @@ void smb2_reconnect_server(struct work_struct *work)
>
>         spin_unlock(&cifs_tcp_ses_lock);
>
> +       rc = 0;
>         list_for_each_entry_safe(tcon, tcon2, &tmp_list, rlist) {
> +               if (rc) {
> +                       list_del_init(&tcon->rlist);
> +                       cifs_put_tcon(tcon);
> +                       continue;
> +               }
> +
>                 rc = smb2_reconnect(SMB2_INTERNAL_CMD, tcon);
>                 if (!rc)
>                         cifs_reopen_persistent_handles(tcon);
>
> Can anyone give any more info on this oddity and whether this is a
> useful patch?
>
> Thanks,
> --
> Ross Lagerwall
The reconnect is apparently using a dotted-quad as the servername, and you can see the auth is forced to NTLM as a consequence. Is that the way you initially mounted the share (i.e. mount 10.71.217.50:/smbshare /mnt)?
On 2/21/19 5:59 PM, Tom Talpey wrote:
> The reconnect is apparently using a dotted-quad as the servername, and you
> can see the auth is forced to NTLM as a consequence. Is that the way you
> initially mounted the share (i.e. mount 10.71.217.50:/smbshare /mnt)?
>
> -----Original Message-----
> From: linux-cifs-owner@vger.kernel.org <linux-cifs-owner@vger.kernel.org> On Behalf Of Steve French
> Sent: Thursday, February 21, 2019 9:07 AM
> To: Ross Lagerwall <ross.lagerwall@citrix.com>
> Cc: CIFS <linux-cifs@vger.kernel.org>
> Subject: Re: Failure to reconnect after cluster failvoer
>
> Couple quick thoughts.
>
> Does this work on current kernels (5.0 for example).
>
> Was thinking about patches that might affect this like:
> - "cifs: connect to servername instead of IP for IPC$ share"
> - "smb3: on reconnect set PreviousSessionId field"
> - Paulo's patches (has cifs-utils coreq) to reconnect to new IP
>   address if hostname's IP address changed and his add support for
>   failover
> - Paulo's patch to remove trailing slashes from server UNC name
>

I've reproduced this with 5.0-rc7 and the latest cifs-utils from git.
The share was mounted as follows (yes, by IP):

mount.cifs -o vers=3.0,cache=loose,actimeo=0,username=x,domain=y,password=z '//10.71.217.31/smbshare' /mnt

Here is the tcpdump when it fails to reconnect properly:

http://s000.tinyupload.com/index.php?file_id=55518118986864684971

The initial connection is at timestamp 0s, reconnection at 13s,
STATUS_NETWORK_NAME_DELETED at 60s.

For comparison, here is a tcpdump using the "fix" from my previous mail:

http://s000.tinyupload.com/index.php?file_id=04243963024741599425

The initial connection is at timestamp 0s, reconnection at 34s,
successful read request at 215s.

Note that the tree connect for IPC$ only happens _after_ the tree
connect for the share succeeds.

Thanks,
> -----Original Message-----
> From: Ross Lagerwall <ross.lagerwall@citrix.com>
> Sent: Friday, February 22, 2019 9:17 AM
> To: Tom Talpey <ttalpey@microsoft.com>; Steve French <smfrench@gmail.com>
> Cc: CIFS <linux-cifs@vger.kernel.org>
> Subject: Re: Failure to reconnect after cluster failvoer
>
> On 2/21/19 5:59 PM, Tom Talpey wrote:
> > The reconnect is apparently using a dotted-quad as the servername, and you
> > can see the auth is forced to NTLM as a consequence. Is that the way you
> > initially mounted the share (i.e. mount 10.71.217.50:/smbshare /mnt)?
> >
> > -----Original Message-----
> > From: linux-cifs-owner@vger.kernel.org <linux-cifs-owner@vger.kernel.org> On Behalf Of Steve French
> > Sent: Thursday, February 21, 2019 9:07 AM
> > To: Ross Lagerwall <ross.lagerwall@citrix.com>
> > Cc: CIFS <linux-cifs@vger.kernel.org>
> > Subject: Re: Failure to reconnect after cluster failvoer
> >
> > Couple quick thoughts.
> >
> > Does this work on current kernels (5.0 for example).
> >
> > Was thinking about patches that might affect this like:
> > - "cifs: connect to servername instead of IP for IPC$ share"
> > - "smb3: on reconnect set PreviousSessionId field"
> > - Paulo's patches (has cifs-utils coreq) to reconnect to new IP
> >   address if hostname's IP address changed and his add support for
> >   failover
> > - Paulo's patch to remove trailing slashes from server UNC name
> >
> I've reproduced this with 5.0-rc7 and the latest cifs-utils from git.
> The share was mounted as follows (yes, by IP):
>
> mount.cifs -o vers=3.0,cache=loose,actimeo=0,username=x,domain=y,password=z '//10.71.217.31/smbshare' /mnt
>
> Here is the tcpdump when it fails to reconnect properly:
...
>
> The initial connection is at timestamp 0s, reconnection at 13s,
> STATUS_NETWORK_NAME_DELETED at 60s.
>
> For comparison, here is a tcpdump using the "fix" from my previous mail:
...
>
> The initial connection is at timestamp 0s, reconnection at 34s,
> successful read request at 215s.
>
> Note that the tree connect for IPC$ only happens _after_ the tree
> connect for the share succeeds.

Thanks for the full traces, they clarify the situation. But, I don’t see any
meaningful difference in the client behavior. The ordering of the two
treeconnects is the same between the two - initially, "IPC$" then
"smbshare", and on reconnect, the other way around. So, I'm unclear
whether your patch did anything.

The STATUS_NETWORK_NAME_DELETED is a consequence of the failed
re-establishment of the tree connect, and is not itself the problem. The
server is simply timing out the treeid, since the client did not successfully
reclaim it. The repeated STATUS_BAD_NETWORK_NAME is the issue.

Are you sure the clustered server is recovering properly when you are
forcing the failover? For example, if it's a two-node cluster, maybe node A
can take over node B, but node B has issues taking over node A. Is there
anything relevant in the server logs?

Tom.
On 2/22/19 11:25 PM, Tom Talpey wrote:
>> -----Original Message-----
>> From: Ross Lagerwall <ross.lagerwall@citrix.com>
>> Sent: Friday, February 22, 2019 9:17 AM
>> To: Tom Talpey <ttalpey@microsoft.com>; Steve French <smfrench@gmail.com>
>> Cc: CIFS <linux-cifs@vger.kernel.org>
>> Subject: Re: Failure to reconnect after cluster failvoer
>>
>> On 2/21/19 5:59 PM, Tom Talpey wrote:
>>> The reconnect is apparently using a dotted-quad as the servername, and you
>>> can see the auth is forced to NTLM as a consequence. Is that the way you
>>> initially mounted the share (i.e. mount 10.71.217.50:/smbshare /mnt)?
>>>
>>> -----Original Message-----
>>> From: linux-cifs-owner@vger.kernel.org <linux-cifs-owner@vger.kernel.org> On Behalf Of Steve French
>>> Sent: Thursday, February 21, 2019 9:07 AM
>>> To: Ross Lagerwall <ross.lagerwall@citrix.com>
>>> Cc: CIFS <linux-cifs@vger.kernel.org>
>>> Subject: Re: Failure to reconnect after cluster failvoer
>>>
>>> Couple quick thoughts.
>>>
>>> Does this work on current kernels (5.0 for example).
>>>
>>> Was thinking about patches that might affect this like:
>>> - "cifs: connect to servername instead of IP for IPC$ share"
>>> - "smb3: on reconnect set PreviousSessionId field"
>>> - Paulo's patches (has cifs-utils coreq) to reconnect to new IP
>>>   address if hostname's IP address changed and his add support for
>>>   failover
>>> - Paulo's patch to remove trailing slashes from server UNC name
>>>
>> I've reproduced this with 5.0-rc7 and the latest cifs-utils from git.
>> The share was mounted as follows (yes, by IP):
>>
>> mount.cifs -o vers=3.0,cache=loose,actimeo=0,username=x,domain=y,password=z '//10.71.217.31/smbshare' /mnt
>>
>> Here is the tcpdump when it fails to reconnect properly:
> ...
>>
>> The initial connection is at timestamp 0s, reconnection at 13s,
>> STATUS_NETWORK_NAME_DELETED at 60s.
>>
>> For comparison, here is a tcpdump using the "fix" from my previous mail:
> ...
>>
>> The initial connection is at timestamp 0s, reconnection at 34s,
>> successful read request at 215s.
>>
>> Note that the tree connect for IPC$ only happens _after_ the tree
>> connect for the share succeeds.
>
> Thanks for the full traces, they clarify the situation. But, I don’t see any
> meaningful difference in the client behavior. The ordering of the two
> treeconnects is the same between the two - initially, "IPC$" then
> "smbshare", and on reconnect, the other way around. So, I'm unclear
> whether your patch did anything.

There is definitely a difference. Before the patch, on reconnect the client:

* Connects to "smbshare" which fails
* Then connects to "IPC$" which succeeds
* Then tries again to connect to smbshare which fails repeatedly

After the patch, on reconnect the client:

* Connects to "smbshare" which fails
* Then tries again to connect to "smbshare" which succeeds after several retries
* Then tries to connect to "IPC$" which succeeds

This subtle reordering somehow makes it work. It may indeed be a server
bug rather than a client bug. I was hoping someone could shed some light
on this.

>
> The STATUS_NETWORK_NAME_DELETED is a consequence of the failed
> re-establishment of the tree connect, and is not itself the problem. The
> server is simply timing out the treeid, since the client did not successfully
> reclaim it. The repeated STATUS_BAD_NETWORK_NAME is the issue.
>
> Are you sure the clustered server is recovering properly when you are
> forcing the failover? For example, if it's a two-node cluster, maybe node A
> can take over node B, but node B has issues taking over node A. Is there
> anything relevant in the server logs?
>

It's a two node cluster. The behaviour happens reliably when failing
over either way. After failover, the server state is consistent. E.g.
after a failover from node A to node B, node B shows itself as the
primary server and the node A is marked as down. I couldn't find
anything interesting in the server logs.

Thanks,
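The two attempt orders described in this message can be reproduced with a small user-space model. This is only a sketch under stated assumptions: `sim_tcon`, `sim_tree_connect` and `sim_reconnect` are hypothetical names, not kernel code, and the server model (IPC$ always accepted, the share refused until a given pass) is invented for illustration. It shows the client-side ordering only and says nothing about why the real server returns STATUS_NETWORK_NAME_DELETED.

```c
#include <assert.h>
#include <stdbool.h>
#include <stddef.h>
#include <string.h>

/* Hypothetical stand-in for a tree connection; not the kernel's struct cifs_tcon. */
struct sim_tcon {
    const char *name;   /* share name, e.g. "smbshare" or "IPC$" */
    char tag;           /* one-letter label written to the attempt log */
    bool connected;
};

/*
 * Assumed server model: IPC$ always connects, any other share is refused
 * until reconnect pass number `share_ready_pass`. It does not model the
 * STATUS_NETWORK_NAME_DELETED behaviour seen in the trace.
 */
static int sim_tree_connect(const struct sim_tcon *t, int pass, int share_ready_pass)
{
    if (strcmp(t->name, "IPC$") == 0)
        return 0;
    return pass >= share_ready_pass ? 0 : -1;
}

/*
 * Run reconnect passes over tcons[0..n) until everything is connected or
 * max_passes is reached. With stop_on_error set, a failure skips the rest
 * of the list for that pass, as the patch does. Each attempt appends the
 * tcon's tag to `log`; returns the number of attempts logged.
 */
size_t sim_reconnect(struct sim_tcon *tcons, size_t n, bool stop_on_error,
                     int share_ready_pass, int max_passes,
                     char *log, size_t log_cap)
{
    size_t used = 0;

    assert(log != NULL && log_cap > 0);

    for (int pass = 1; pass <= max_passes; pass++) {
        int rc = 0;
        bool pending = false;

        for (size_t i = 0; i < n; i++) {
            struct sim_tcon *t = &tcons[i];

            if (t->connected)
                continue;
            if (stop_on_error && rc) {
                pending = true;     /* left for a later pass */
                continue;
            }
            rc = sim_tree_connect(t, pass, share_ready_pass);
            if (used + 1 < log_cap)
                log[used++] = t->tag;
            if (rc == 0)
                t->connected = true;
            else
                pending = true;
        }
        if (!pending)
            break;
    }
    log[used] = '\0';
    return used;
}
```

With the list ordered smbshare then IPC$ and the share assumed to come back on the third pass, the model logs smbshare, IPC$, smbshare, smbshare without the early exit, and smbshare, smbshare, smbshare, IPC$ with it.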
> -----Original Message-----
> From: Ross Lagerwall <ross.lagerwall@citrix.com>
> Sent: Monday, February 25, 2019 8:14 AM
> To: Tom Talpey <ttalpey@microsoft.com>; Steve French <smfrench@gmail.com>
> Cc: CIFS <linux-cifs@vger.kernel.org>
> Subject: Re: Failure to reconnect after cluster failvoer
>
> On 2/22/19 11:25 PM, Tom Talpey wrote:
> >> -----Original Message-----
> >> From: Ross Lagerwall <ross.lagerwall@citrix.com>
> >> Sent: Friday, February 22, 2019 9:17 AM
> >> To: Tom Talpey <ttalpey@microsoft.com>; Steve French <smfrench@gmail.com>
> >> Cc: CIFS <linux-cifs@vger.kernel.org>
> >> Subject: Re: Failure to reconnect after cluster failvoer
> >>
> >> On 2/21/19 5:59 PM, Tom Talpey wrote:
> >>> The reconnect is apparently using a dotted-quad as the servername, and you
> >>> can see the auth is forced to NTLM as a consequence. Is that the way you
> >>> initially mounted the share (i.e. mount 10.71.217.50:/smbshare /mnt)?
> >>>
> >>> -----Original Message-----
> >>> From: linux-cifs-owner@vger.kernel.org <linux-cifs-owner@vger.kernel.org> On Behalf Of Steve French
> >>> Sent: Thursday, February 21, 2019 9:07 AM
> >>> To: Ross Lagerwall <ross.lagerwall@citrix.com>
> >>> Cc: CIFS <linux-cifs@vger.kernel.org>
> >>> Subject: Re: Failure to reconnect after cluster failvoer
> >>>
> >>> Couple quick thoughts.
> >>>
> >>> Does this work on current kernels (5.0 for example).
> >>>
> >>> Was thinking about patches that might affect this like:
> >>> - "cifs: connect to servername instead of IP for IPC$ share"
> >>> - "smb3: on reconnect set PreviousSessionId field"
> >>> - Paulo's patches (has cifs-utils coreq) to reconnect to new IP
> >>>   address if hostname's IP address changed and his add support for
> >>>   failover
> >>> - Paulo's patch to remove trailing slashes from server UNC name
> >>>
> >> I've reproduced this with 5.0-rc7 and the latest cifs-utils from git.
> >> The share was mounted as follows (yes, by IP):
> >>
> >> mount.cifs -o vers=3.0,cache=loose,actimeo=0,username=x,domain=y,password=z '//10.71.217.31/smbshare' /mnt
> >>
> >> Here is the tcpdump when it fails to reconnect properly:
> > ...
> >>
> >> The initial connection is at timestamp 0s, reconnection at 13s,
> >> STATUS_NETWORK_NAME_DELETED at 60s.
> >>
> >> For comparison, here is a tcpdump using the "fix" from my previous mail:
> > ...
> >>
> >> The initial connection is at timestamp 0s, reconnection at 34s,
> >> successful read request at 215s.
> >>
> >> Note that the tree connect for IPC$ only happens _after_ the tree
> >> connect for the share succeeds.
> >
> > Thanks for the full traces, they clarify the situation. But, I don’t see any
> > meaningful difference in the client behavior. The ordering of the two
> > treeconnects is the same between the two - initially, "IPC$" then
> > "smbshare", and on reconnect, the other way around. So, I'm unclear
> > whether your patch did anything.
>
> There is definitely a difference. Before the patch, on reconnect the client:

I'm still not so sure the difference is relevant. The timing is a bit
different, but in itself the IPC$ treeconnect isn't actually used, and
in any case it succeeds in both scenarios. So, I'm thinking it's either
the timing, or coincidence.

> * Connects to "smbshare" which fails
> * Then connects to "IPC$" which succeeds
> * Then tries again to connect to smbshare which fails repeatedly

Here's what I see:

Event / timestamp / etc
Connection lost / 25.97 / Server sends many RST to client
Connection reestablished / 34.17
Treeconnect to smbshare / 34.17 / STATUS_B_N_N (retries with same result every 2 sec)
Treeconnect to IPC$ / 34.18 / success
Treeconnect to smbshare / 60.38 / STATUS_N_N_D (etc)

> After the patch, on reconnect the client:
>
> * Connects to "smbshare" which fails
> * Then tries again to connect to "smbshare" which succeeds after several
>   retries
> * Then tries to connect to "IPC$" which succeeds

This time:

Connection lost / 9.81 / Server sends RST
Connection reestablished / 9.82 / status 0xc0000466 (some weird disk hardware status)
Connection lost / 13.53 / Server sends RST
Connection reestablished / 13.53
Treeconnect to smbshare / 13.63 / STATUS_B_N_N (retries with same result every 2 sec)
Treeconnect to smbshare / 43.90 / success (about 30 secs, 17 retries elapsed)
Treeconnect to IPC$ / 43.90 / success

So, the main effect of your patch is that the IPC$ attempt happens a lot
*later*, it certainly didn't affect the success of the smbshare
treeconnect - it happened only after that succeeded! And I don't see how
deferring an unrelated treeconnect would help that. I bet it would have
the same result if the IPC$ didn't happen at all.

I really think there's something wrong with your server, and not because
of a bug. Unfortunately both Steve and I are at FAST'19 and Vault here in
Boston, so we're not able to get much done. I'd love to understand this
better, though...

Tom.

> This subtle reordering somehow makes it work. It may indeed be a server
> bug rather than a client bug. I was hoping someone could shed some light
> on this.
>
> >
> > The STATUS_NETWORK_NAME_DELETED is a consequence of the failed
> > re-establishment of the tree connect, and is not itself the problem. The
> > server is simply timing out the treeid, since the client did not successfully
> > reclaim it. The repeated STATUS_BAD_NETWORK_NAME is the issue.
> >
> > Are you sure the clustered server is recovering properly when you are
> > forcing the failover? For example, if it's a two-node cluster, maybe node A
> > can take over node B, but node B has issues taking over node A. Is there
> > anything relevant in the server logs?
> >
>
> It's a two node cluster. The behaviour happens reliably when failing
> over either way. After failover, the server state is consistent. E.g.
> after a failover from node A to node B, node B shows itself as the
> primary server and the node A is marked as down. I couldn't find
> anything interesting in the server logs.
>
> Thanks,
> --
> Ross Lagerwall
diff --git a/fs/cifs/smb2pdu.c b/fs/cifs/smb2pdu.c
index dba986524917..1f97ed6459bf 100644
--- a/fs/cifs/smb2pdu.c
+++ b/fs/cifs/smb2pdu.c
@@ -2864,7 +2864,14 @@ void smb2_reconnect_server(struct work_struct *work)
 
 	spin_unlock(&cifs_tcp_ses_lock);
 
+	rc = 0;
 	list_for_each_entry_safe(tcon, tcon2, &tmp_list, rlist) {
+		if (rc) {
+			list_del_init(&tcon->rlist);
+			cifs_put_tcon(tcon);
+			continue;
+		}
+
 		rc = smb2_reconnect(SMB2_INTERNAL_CMD, tcon);
 		if (!rc)
 			cifs_reopen_persistent_handles(tcon);