From patchwork Tue May 17 23:27:19 2011
Content-Type: text/plain; charset="utf-8"
MIME-Version: 1.0
Content-Transfer-Encoding: 7bit
X-Patchwork-Submitter: Sage Weil <sage@newdream.net>
X-Patchwork-Id: 792702
Received: from vger.kernel.org (vger.kernel.org [209.132.180.67])
	by demeter2.kernel.org (8.14.4/8.14.3) with ESMTP id p4HNQHrZ023225
	for <patchwork-ceph-devel@patchwork.kernel.org>;
	Tue, 17 May 2011 23:26:18 GMT
Received: (majordomo@vger.kernel.org) by vger.kernel.org via listexpand
	id S1756346Ab1EQX0Q (ORCPT
	<rfc822;patchwork-ceph-devel@patchwork.kernel.org>);
	Tue, 17 May 2011 19:26:16 -0400
Received: from cobra.newdream.net ([66.33.216.30]:47549 "EHLO
	cobra.newdream.net" rhost-flags-OK-OK-OK-OK) by vger.kernel.org
	with ESMTP id S1751984Ab1EQX0Q (ORCPT
	<rfc822; ceph-devel@vger.kernel.org>); Tue, 17 May 2011 19:26:16 -0400
Received: from cobra.newdream.net (localhost [127.0.0.1])
	by cobra.newdream.net (Postfix) with ESMTP id 737ECBC6C4;
	Tue, 17 May 2011 16:27:20 -0700 (PDT)
DomainKey-Signature: a=rsa-sha1; c=nofws; d=newdream.net; h=date:from:to:cc
	:subject:in-reply-to:message-id:references:mime-version:
	content-type; q=dns; s=newdream.net; b=xhA1zm0Y0e8IhZSm7P8FS1A6m
	yv6tTX3KNqBENNQOhyR/eYnCQIOfetn+EqMGNiJ8cBOWKXOh5zM5rRlHExeGAqxs
	FvDJSB6MhXHNwMm0Rl9XY+MgnH0h+2LFgRROu475jkpRMoshdGvn7UCAwpZOp9X2
	RPQE4iTeif7POcEYfQ=
DKIM-Signature: v=1; a=rsa-sha1; c=relaxed; d=newdream.net; h=date:from
	:to:cc:subject:in-reply-to:message-id:references:mime-version:
	content-type; s=newdream.net; bh=8PoeJt/P7hC6tfZBj2G1tlUxyfQ=;
	b=Jd2p3mP6y0W7hYlJ6CHaBmE2W/Esye3yhCBNXeE78JqE82pu20hYFmy8kj6QA
	4RdMuKEbLHdlgUKu3scmA/eTuOMfATc3JLVknocSzvyDGq0Cn860ajksDHDivGxk
	z6ZwtlXVckyEO5iQuFDc9qKNDdrTwDTM5hj3f0F8+BSNAs=
Received: by cobra.newdream.net (Postfix, from userid 1031)
	id 4A1BEBC7E4; Tue, 17 May 2011 16:27:20 -0700 (PDT)
Received: from localhost (localhost [127.0.0.1])
	by cobra.newdream.net (Postfix) with ESMTP id 38241BC6C4;
	Tue, 17 May 2011 16:27:20 -0700 (PDT)
Date: Tue, 17 May 2011 16:27:19 -0700 (PDT)
From: Sage Weil <sage@newdream.net>
To: Jim Schutt <jaschut@sandia.gov>
cc: ceph-devel@vger.kernel.org
Subject: Re: [PATCH 2/2] libceph: fix handle_timeout() racing with
	con_work()/try_write()
In-Reply-To: <4DD2F773.30508@sandia.gov>
Message-ID: <Pine.LNX.4.64.1105171613120.30781@cobra.newdream.net>
References: <Pine.LNX.4.64.1105051215050.26811@cobra.newdream.net>
	<1305235954-9860-2-git-send-email-jaschut@sandia.gov>
	<Pine.LNX.4.64.1105161033150.12367@cobra.newdream.net>
	<4DD2F773.30508@sandia.gov>
MIME-Version: 1.0
Sender: ceph-devel-owner@vger.kernel.org
Precedence: bulk
List-ID: <ceph-devel.vger.kernel.org>
X-Mailing-List: ceph-devel@vger.kernel.org
X-Greylist: IP, sender and recipient auto-whitelisted, not delayed by
	milter-greylist-4.2.6 (demeter2.kernel.org [140.211.167.43]);
	Tue, 17 May 2011 23:26:18 +0000 (UTC)

On Tue, 17 May 2011, Jim Schutt wrote:
> Hi Sage,
> 
> Sage Weil wrote:
> > Hi Jim,
> 
> > This picks up the second resent message.  But and OPENING is set.. so how is
> > it that we were in try_write but we didn't pass through the OPENING test in
> > con_work?
> > 
> > > > > con_work()/try_write() start again on new message.
> > > > > Note that con_work() did not notice that the connection was marked to
> > > > > be closed/reopened!  This is where things went bad.
> > > > > Note the connection state: 8220 = 0x201c, and 0x2000 is OPENING.
> > > [ 2412.643178] libceph:   messenger.c:1969 : con_work ffff8801955c0030
> > > OPENING
> > > [ 2412.643182] libceph:   messenger.c:292  : con_close_socket on
> > > ffff8801955c0030 sock ffff8801825a8c00
> > > [ 2412.643197] libceph:   messenger.c:162  : ceph_state_change
> > > ffff8801955c0030 state = 2076 sk_state = 4
> > 
> > Ok, now we see it, but this should have executed before we ever got to
> > try_write!
> > 
> > There are a handful of places that drop and re-take con->mutex... I think it
> > has to be one of those.  They are:
> > 
> > - prepare_connect_authorizer (before calling ->prepare_authorizer callback)
> > - fail_protocol (before calling ->bad_proto callback)
> > - process_connect RESETSESSION case (before calling ->reset callback)
> > - process_message (before calling ->dispatch callback)
> > 
> > I think after retaking the lock in each of these cases we need to test for
> > the CLOSED or OPENING bits and, if set, bail out so that con_work can
> > restart.
> > 
> > Does that make sense?  Can you confirm which case of lock droppage inside
> > con_work is at fault?  (Maybe adding a printk at the top of con_work after
> > taking the lock too would make it more clear what's going on; currently it
> > only prints in all the random test_bit cases).
> 
> So I've got logs now that show that connection state changes
> across a call to process_message.  That seems like it proves
> the concept.
> 
> > 
> > I'm thinking something like the below... does that seem reasonable?  I only
> > did the prepare_authorizer case to start, but the others should be fixed up
> > too.  Compile tested only.
> > 
> 
> Is there any reason not to fix all the cases at once, in try_read?
> It's less code, but far away from where the mutex dropping occurs,
> so we add a comment:

That covers the process_message and process_connect cases, provided you 
also set ret properly, and do the con_close_socket work that the con_work 
checks would have done.  e.g.,


It's a little bit unweildly because we repeat the checks at the top of 
con_work, though.  It is probably cleaner to always fall back to that case 
by doing CLOSED|OPENING bit checks and bailing with -EAGAIN any time we 
retake con->mutex, and then checking try_read/write return values like the 
patch from yesterday does.  That way we only react to the bits in one 
place.

Have you verified that this fixes the problem for you?  It makes sense the 
race is over ->dispatch, since that is a somewhat long-running call.

sage


> 
> diff --git a/net/ceph/messenger.c b/net/ceph/messenger.c
> index a9bd424..11d6a61 100644
> --- a/net/ceph/messenger.c
> +++ b/net/ceph/messenger.c
> @@ -1816,10 +1820,19 @@ static int try_read(struct ceph_connection *con)
>  	dout("try_read start on %p state %lu\n", con, con->state);
> 
>  more:
>  	dout("try_read %p state %lu tag %d in_base_pos %d\n", con,
>  	     con->state, (int)con->in_tag, con->in_base_pos);
> +
> +	/* process_connect, process_message need to drop con->mutex,
> +	 * so connection state might have changed on us.  Check it,
> +	 * leave if necessary so new state can be processed.
> +	 */
> +	if (test_bit(CLOSED, &con->state) ||
> +	    test_bit(OPENING, &con->state))
> +		goto out;
> +
>  	if (test_bit(CONNECTING, &con->state)) {
>  		if (!test_bit(NEGOTIATING, &con->state)) {
>  			dout("try_read %p connecting\n", con);
>  			ret = read_partial_banner(con);
>  			if (ret <= 0)
> 
> 
> What am I missing?
> 
> -- Jim
> 
> 
> --
> To unsubscribe from this list: send the line "unsubscribe ceph-devel" in
> the body of a message to majordomo@vger.kernel.org
> More majordomo info at  http://vger.kernel.org/majordomo-info.html
> 
>
---
To unsubscribe from this list: send the line "unsubscribe ceph-devel" in
the body of a message to majordomo@vger.kernel.org
More majordomo info at  http://vger.kernel.org/majordomo-info.html

diff --git a/net/ceph/messenger.c b/net/ceph/messenger.c
index e15a82c..7ef9eb0 100644
--- a/net/ceph/messenger.c
+++ b/net/ceph/messenger.c
@@ -1810,6 +1810,21 @@ static int try_read(struct ceph_connection *con)
 more:
        dout("try_read tag %d in_base_pos %d\n", (int)con->in_tag,
             con->in_base_pos);
+
+       /*
+        * process_connect and process_message drop and re-take
+        * con->mutex.  make sure we handle a racing close or reopen.
+        */
+       if (test_bit(CLOSED, &con->state)) {
+               ret = -EAGAIN;
+               goto out;
+       }
+       if (test_bit(OPENING, &con->state)) {
+               con_close_socket(con);
+               ret = 0;  /* so we continue to try_write */
+               goto out;
+       }
+
        if (test_bit(CONNECTING, &con->state)) {
                if (!test_bit(NEGOTIATING, &con->state)) {
                        dout("try_read connecting\n");