From patchwork Sat Dec 14 11:50:54 2013
Content-Type: text/plain; charset="utf-8"
MIME-Version: 1.0
Content-Transfer-Encoding: 7bit
X-Patchwork-Submitter: Jeff Layton <jlayton@redhat.com>
X-Patchwork-Id: 3348301
Return-Path: <linux-cifs-owner@kernel.org>
X-Original-To: patchwork-cifs-client@patchwork.kernel.org
Delivered-To: patchwork-parsemail@patchwork2.web.kernel.org
Received: from mail.kernel.org (mail.kernel.org [198.145.19.201])
	by patchwork2.web.kernel.org (Postfix) with ESMTP id A513CC0D4A
	for <patchwork-cifs-client@patchwork.kernel.org>;
	Sat, 14 Dec 2013 11:51:07 +0000 (UTC)
Received: from mail.kernel.org (localhost [127.0.0.1])
	by mail.kernel.org (Postfix) with ESMTP id DC6C620764
	for <patchwork-cifs-client@patchwork.kernel.org>;
	Sat, 14 Dec 2013 11:51:02 +0000 (UTC)
Received: from vger.kernel.org (vger.kernel.org [209.132.180.67])
	by mail.kernel.org (Postfix) with ESMTP id C920D20762
	for <patchwork-cifs-client@patchwork.kernel.org>;
	Sat, 14 Dec 2013 11:51:01 +0000 (UTC)
Received: (majordomo@vger.kernel.org) by vger.kernel.org via listexpand
	id S1753115Ab3LNLvA (ORCPT
	<rfc822;patchwork-cifs-client@patchwork.kernel.org>);
	Sat, 14 Dec 2013 06:51:00 -0500
Received: from mx1.redhat.com ([209.132.183.28]:26738 "EHLO mx1.redhat.com"
	rhost-flags-OK-OK-OK-OK) by vger.kernel.org with ESMTP
	id S1752794Ab3LNLu7 (ORCPT <rfc822;linux-cifs@vger.kernel.org>);
	Sat, 14 Dec 2013 06:50:59 -0500
Received: from int-mx02.intmail.prod.int.phx2.redhat.com
	(int-mx02.intmail.prod.int.phx2.redhat.com [10.5.11.12])
	by mx1.redhat.com (8.14.4/8.14.4) with ESMTP id rBEBovYs002211
	(version=TLSv1/SSLv3 cipher=DHE-RSA-AES256-SHA bits=256 verify=OK);
	Sat, 14 Dec 2013 06:50:58 -0500
Received: from tlielax.poochiereds.net (vpn-60-170.rdu2.redhat.com
	[10.10.60.170])
	by int-mx02.intmail.prod.int.phx2.redhat.com (8.13.8/8.13.8) with
	ESMTP id rBEBovvC013972; Sat, 14 Dec 2013 06:50:57 -0500
Date: Sat, 14 Dec 2013 06:50:54 -0500
From: Jeff Layton <jlayton@redhat.com>
To: Patrick Noffke <patrick@noffke.me>
Cc: linux-cifs@vger.kernel.org
Subject: Re: hung system with cifsd, cannot reduce timeout
Message-ID: <20131214065054.421eddbc@tlielax.poochiereds.net>
In-Reply-To: <3540393.5GuVZWBWRs@dingofarm.qg.com>
References: <52A0EA24.9000600@noffke.me> <52A0F1CB.5060709@noffke.me>
	<1774674.CzhmRQMMgZ@dingofarm.qg.com>
	<3540393.5GuVZWBWRs@dingofarm.qg.com>
Mime-Version: 1.0
X-Scanned-By: MIMEDefang 2.67 on 10.5.11.12
Sender: linux-cifs-owner@vger.kernel.org
Precedence: bulk
List-ID: <linux-cifs.vger.kernel.org>
X-Mailing-List: linux-cifs@vger.kernel.org
X-Spam-Status: No, score=-7.3 required=5.0 tests=BAYES_00, RCVD_IN_DNSWL_HI,
	RP_MATCHES_RCVD,
	UNPARSEABLE_RELAY autolearn=unavailable version=3.3.1
X-Spam-Checker-Version: SpamAssassin 3.3.1 (2010-03-16) on mail.kernel.org
X-Virus-Scanned: ClamAV using ClamSMTP

On Tue, 10 Dec 2013 09:15:41 -0600
Patrick Noffke <patrick@noffke.me> wrote:

> On Friday, December 06, 2013 09:50:27 AM Patrick Noffke wrote:
> > 
> > 
> > It seems like the "Getting info on" log message is missing a filename.  Would 
> > this explain why it's hung until the server sends a RST?
> > 
> > There are also logs like the following before the cable is reattached:
> > fs/cifs/misc.c: Null buffer passed to cifs_small_buf_release
> > 
> 
> In inode.c, in the cifs_revalidate_dentry_attr function, if I add a check for an empty full_path (returning -ENOENT if empty), then the system recovers much faster, but my process gets an error opening the file it was trying to open when the cable was disconnected.  I noticed cifs logs following trying to call cifs_revalidate_dentry_attr with the last file path, but there is a NULL inode for several such calls.
> 
> I added this right after the check for a NULL full_path:
> 
> 	if (0 != strcmp(full_path, "")) {
> 		cFYI(1, "Got empty full_path");
> 		rc = -ENOENT;
> 		goto out;
> 	}
> 
> I don't propose this as a patch -- I'm just curious if I'm on the right track.  Could I return something else, or is there some other way to help the file system recover faster?
> 

Complex problem...

I don't think you're on the right track there. An empty path is likely
just an indicator that it's trying to fetch info from the root of the
share. I think that's just happenstance. cifs.ko does a lot of dentry
revalidation by default, and that's probably where the syscalls are
when this occurs.

The problem you're describing seems more tied up with how the transport
layer and socket handling are working. I'd suggest focusing your
efforts there.

Basically, cifsd will just try to reconnect the socket if the
server->lstrp value is too far in the last. That's just a timestamp
that tells us when we last got a response from the server. The async
echoes just kick off to try and keep that going. That code needs
some work.

One thing that looks wrong to me immediately is that lstrp is reset to
jiffies before we reconnect to the server. If the reconnect takes a
while, then that value could easily be too far in the past by the time
the connect succeeds. I'll attach an untested patch. It probably won't
make much difference for this test, but looks like a reasonable thing
to fix while we're in here.

As to why the client would try to send an echo on the old socket after
connecting the new one, I'm not sure. I'd suspect that has more to do
with the socket layer. Just because we shut down the socket doesn't
mean we purge the data that we attempted to send on the old one.

Note too that there are a number of hardcoded timeouts in this code
(e.g., the msleep(3000) in cifs_reconnect, and the socket timeouts).
Reducing SMB_ECHO_TIMEOUT to 5s is likely to mean that you end up
overlapping with some of them. That's may give you odd results.

Also, 3.9.0 is pretty old at this point I'd suggest working with
something closer to mainline while troubleshooting this so you're not
treading over problems that have already been fixed.

Patch for lstrp problem follows. It compiles, but is otherwise untested.

-----------------------[snip]--------------------

[PATCH] cifs: reset server->lstrp after reconnecting

Currently we reset it before trying to reconnect. If the reconnect
attempt takes a long time, then this value could be too far in the past
by the time the connect succeeds. Reset it only after the connect
succeeds.

Cc: Patrick Noffke <patrick@noffke.me>
Signed-off-by: Jeff Layton <jlayton@redhat.com>
---
 fs/cifs/connect.c | 2 +-
 1 file changed, 1 insertion(+), 1 deletion(-)

diff --git a/fs/cifs/connect.c b/fs/cifs/connect.c
index 8813ff7..6e83e8e 100644
--- a/fs/cifs/connect.c
+++ b/fs/cifs/connect.c
@@ -353,7 +353,6 @@ cifs_reconnect(struct TCP_Server_Info *server)
 	kfree(server->session_key.response);
 	server->session_key.response = NULL;
 	server->session_key.len = 0;
-	server->lstrp = jiffies;
 	mutex_unlock(&server->srv_mutex);
 
 	/* mark submitted MIDs for retry and issue callback */
@@ -390,6 +389,7 @@ cifs_reconnect(struct TCP_Server_Info *server)
 			if (server->tcpStatus != CifsExiting)
 				server->tcpStatus = CifsNeedNegotiate;
 			spin_unlock(&GlobalMid_Lock);
+			server->lstrp = jiffies;
 		}
 		mutex_unlock(&server->srv_mutex);
 	} while (server->tcpStatus == CifsNeedReconnect);