diff mbox

[2/2] libceph: fix handle_timeout() racing with con_work()/try_write()

Message ID Pine.LNX.4.64.1105171613120.30781@cobra.newdream.net (mailing list archive)
State New, archived
Headers show

Commit Message

Sage Weil May 17, 2011, 11:27 p.m. UTC
On Tue, 17 May 2011, Jim Schutt wrote:
> Hi Sage,
> 
> Sage Weil wrote:
> > Hi Jim,
> 
> > This picks up the second resent message.  But and OPENING is set.. so how is
> > it that we were in try_write but we didn't pass through the OPENING test in
> > con_work?
> > 
> > > > > con_work()/try_write() start again on new message.
> > > > > Note that con_work() did not notice that the connection was marked to
> > > > > be closed/reopened!  This is where things went bad.
> > > > > Note the connection state: 8220 = 0x201c, and 0x2000 is OPENING.
> > > [ 2412.643178] libceph:   messenger.c:1969 : con_work ffff8801955c0030
> > > OPENING
> > > [ 2412.643182] libceph:   messenger.c:292  : con_close_socket on
> > > ffff8801955c0030 sock ffff8801825a8c00
> > > [ 2412.643197] libceph:   messenger.c:162  : ceph_state_change
> > > ffff8801955c0030 state = 2076 sk_state = 4
> > 
> > Ok, now we see it, but this should have executed before we ever got to
> > try_write!
> > 
> > There are a handful of places that drop and re-take con->mutex... I think it
> > has to be one of those.  They are:
> > 
> > - prepare_connect_authorizer (before calling ->prepare_authorizer callback)
> > - fail_protocol (before calling ->bad_proto callback)
> > - process_connect RESETSESSION case (before calling ->reset callback)
> > - process_message (before calling ->dispatch callback)
> > 
> > I think after retaking the lock in each of these cases we need to test for
> > the CLOSED or OPENING bits and, if set, bail out so that con_work can
> > restart.
> > 
> > Does that make sense?  Can you confirm which case of lock droppage inside
> > con_work is at fault?  (Maybe adding a printk at the top of con_work after
> > taking the lock too would make it more clear what's going on; currently it
> > only prints in all the random test_bit cases).
> 
> So I've got logs now that show that connection state changes
> across a call to process_message.  That seems like it proves
> the concept.
> 
> > 
> > I'm thinking something like the below... does that seem reasonable?  I only
> > did the prepare_authorizer case to start, but the others should be fixed up
> > too.  Compile tested only.
> > 
> 
> Is there any reason not to fix all the cases at once, in try_read?
> It's less code, but far away from where the mutex dropping occurs,
> so we add a comment:

That covers the process_message and process_connect cases, provided you 
also set ret properly, and do the con_close_socket work that the con_work 
checks would have done.  e.g.,


It's a little bit unweildly because we repeat the checks at the top of 
con_work, though.  It is probably cleaner to always fall back to that case 
by doing CLOSED|OPENING bit checks and bailing with -EAGAIN any time we 
retake con->mutex, and then checking try_read/write return values like the 
patch from yesterday does.  That way we only react to the bits in one 
place.

Have you verified that this fixes the problem for you?  It makes sense the 
race is over ->dispatch, since that is a somewhat long-running call.

sage



> 
> diff --git a/net/ceph/messenger.c b/net/ceph/messenger.c
> index a9bd424..11d6a61 100644
> --- a/net/ceph/messenger.c
> +++ b/net/ceph/messenger.c
> @@ -1816,10 +1820,19 @@ static int try_read(struct ceph_connection *con)
>  	dout("try_read start on %p state %lu\n", con, con->state);
> 
>  more:
>  	dout("try_read %p state %lu tag %d in_base_pos %d\n", con,
>  	     con->state, (int)con->in_tag, con->in_base_pos);
> +
> +	/* process_connect, process_message need to drop con->mutex,
> +	 * so connection state might have changed on us.  Check it,
> +	 * leave if necessary so new state can be processed.
> +	 */
> +	if (test_bit(CLOSED, &con->state) ||
> +	    test_bit(OPENING, &con->state))
> +		goto out;
> +
>  	if (test_bit(CONNECTING, &con->state)) {
>  		if (!test_bit(NEGOTIATING, &con->state)) {
>  			dout("try_read %p connecting\n", con);
>  			ret = read_partial_banner(con);
>  			if (ret <= 0)
> 
> 
> What am I missing?
> 
> -- Jim
> 
> 
> --
> To unsubscribe from this list: send the line "unsubscribe ceph-devel" in
> the body of a message to majordomo@vger.kernel.org
> More majordomo info at  http://vger.kernel.org/majordomo-info.html
> 
> 
--
To unsubscribe from this list: send the line "unsubscribe ceph-devel" in
the body of a message to majordomo@vger.kernel.org
More majordomo info at  http://vger.kernel.org/majordomo-info.html

Comments

Sage Weil May 17, 2011, 11:38 p.m. UTC | #1
On Tue, 17 May 2011, Sage Weil wrote:
> On Tue, 17 May 2011, Jim Schutt wrote:
> > Hi Sage,
> > 
> > Sage Weil wrote:
> > > Hi Jim,
> > 
> > > This picks up the second resent message.  But and OPENING is set.. so how is
> > > it that we were in try_write but we didn't pass through the OPENING test in
> > > con_work?
> > > 
> > > > > > con_work()/try_write() start again on new message.
> > > > > > Note that con_work() did not notice that the connection was marked to
> > > > > > be closed/reopened!  This is where things went bad.
> > > > > > Note the connection state: 8220 = 0x201c, and 0x2000 is OPENING.
> > > > [ 2412.643178] libceph:   messenger.c:1969 : con_work ffff8801955c0030
> > > > OPENING
> > > > [ 2412.643182] libceph:   messenger.c:292  : con_close_socket on
> > > > ffff8801955c0030 sock ffff8801825a8c00
> > > > [ 2412.643197] libceph:   messenger.c:162  : ceph_state_change
> > > > ffff8801955c0030 state = 2076 sk_state = 4
> > > 
> > > Ok, now we see it, but this should have executed before we ever got to
> > > try_write!
> > > 
> > > There are a handful of places that drop and re-take con->mutex... I think it
> > > has to be one of those.  They are:
> > > 
> > > - prepare_connect_authorizer (before calling ->prepare_authorizer callback)
> > > - fail_protocol (before calling ->bad_proto callback)
> > > - process_connect RESETSESSION case (before calling ->reset callback)
> > > - process_message (before calling ->dispatch callback)
> > > 
> > > I think after retaking the lock in each of these cases we need to test for
> > > the CLOSED or OPENING bits and, if set, bail out so that con_work can
> > > restart.
> > > 
> > > Does that make sense?  Can you confirm which case of lock droppage inside
> > > con_work is at fault?  (Maybe adding a printk at the top of con_work after
> > > taking the lock too would make it more clear what's going on; currently it
> > > only prints in all the random test_bit cases).
> > 
> > So I've got logs now that show that connection state changes
> > across a call to process_message.  That seems like it proves
> > the concept.
> > 
> > > 
> > > I'm thinking something like the below... does that seem reasonable?  I only
> > > did the prepare_authorizer case to start, but the others should be fixed up
> > > too.  Compile tested only.
> > > 
> > 
> > Is there any reason not to fix all the cases at once, in try_read?
> > It's less code, but far away from where the mutex dropping occurs,
> > so we add a comment:
> 
> That covers the process_message and process_connect cases, provided you 
> also set ret properly, and do the con_close_socket work that the con_work 
> checks would have done.  e.g.,

I pushed a patch to the msgr_race branch that catches all four cases (I 
think).  Does the fix make sense given what you saw?

Thanks!
sage



> 
> diff --git a/net/ceph/messenger.c b/net/ceph/messenger.c
> index e15a82c..7ef9eb0 100644
> --- a/net/ceph/messenger.c
> +++ b/net/ceph/messenger.c
> @@ -1810,6 +1810,21 @@ static int try_read(struct ceph_connection *con)
>  more:
>         dout("try_read tag %d in_base_pos %d\n", (int)con->in_tag,
>              con->in_base_pos);
> +
> +       /*
> +        * process_connect and process_message drop and re-take
> +        * con->mutex.  make sure we handle a racing close or reopen.
> +        */
> +       if (test_bit(CLOSED, &con->state)) {
> +               ret = -EAGAIN;
> +               goto out;
> +       }
> +       if (test_bit(OPENING, &con->state)) {
> +               con_close_socket(con);
> +               ret = 0;  /* so we continue to try_write */
> +               goto out;
> +       }
> +
>         if (test_bit(CONNECTING, &con->state)) {
>                 if (!test_bit(NEGOTIATING, &con->state)) {
>                         dout("try_read connecting\n");
> 
> It's a little bit unweildly because we repeat the checks at the top of 
> con_work, though.  It is probably cleaner to always fall back to that case 
> by doing CLOSED|OPENING bit checks and bailing with -EAGAIN any time we 
> retake con->mutex, and then checking try_read/write return values like the 
> patch from yesterday does.  That way we only react to the bits in one 
> place.
> 
> Have you verified that this fixes the problem for you?  It makes sense the 
> race is over ->dispatch, since that is a somewhat long-running call.
> 
> sage
> 
> 
> 
> > 
> > diff --git a/net/ceph/messenger.c b/net/ceph/messenger.c
> > index a9bd424..11d6a61 100644
> > --- a/net/ceph/messenger.c
> > +++ b/net/ceph/messenger.c
> > @@ -1816,10 +1820,19 @@ static int try_read(struct ceph_connection *con)
> >  	dout("try_read start on %p state %lu\n", con, con->state);
> > 
> >  more:
> >  	dout("try_read %p state %lu tag %d in_base_pos %d\n", con,
> >  	     con->state, (int)con->in_tag, con->in_base_pos);
> > +
> > +	/* process_connect, process_message need to drop con->mutex,
> > +	 * so connection state might have changed on us.  Check it,
> > +	 * leave if necessary so new state can be processed.
> > +	 */
> > +	if (test_bit(CLOSED, &con->state) ||
> > +	    test_bit(OPENING, &con->state))
> > +		goto out;
> > +
> >  	if (test_bit(CONNECTING, &con->state)) {
> >  		if (!test_bit(NEGOTIATING, &con->state)) {
> >  			dout("try_read %p connecting\n", con);
> >  			ret = read_partial_banner(con);
> >  			if (ret <= 0)
> > 
> > 
> > What am I missing?
> > 
> > -- Jim
> > 
> > 
> > --
> > To unsubscribe from this list: send the line "unsubscribe ceph-devel" in
> > the body of a message to majordomo@vger.kernel.org
> > More majordomo info at  http://vger.kernel.org/majordomo-info.html
> > 
> > 
> --
> To unsubscribe from this list: send the line "unsubscribe ceph-devel" in
> the body of a message to majordomo@vger.kernel.org
> More majordomo info at  http://vger.kernel.org/majordomo-info.html
> 
> 
--
To unsubscribe from this list: send the line "unsubscribe ceph-devel" in
the body of a message to majordomo@vger.kernel.org
More majordomo info at  http://vger.kernel.org/majordomo-info.html
Jim Schutt May 18, 2011, 2:34 p.m. UTC | #2
Sage Weil wrote:
> On Tue, 17 May 2011, Sage Weil wrote:
>> On Tue, 17 May 2011, Jim Schutt wrote:
>>> Hi Sage,
>>>
>>> Sage Weil wrote:
>>>> Hi Jim,
>>>> This picks up the second resent message.  But and OPENING is set.. so how is
>>>> it that we were in try_write but we didn't pass through the OPENING test in
>>>> con_work?
>>>>
>>>>>>> con_work()/try_write() start again on new message.
>>>>>>> Note that con_work() did not notice that the connection was marked to
>>>>>>> be closed/reopened!  This is where things went bad.
>>>>>>> Note the connection state: 8220 = 0x201c, and 0x2000 is OPENING.
>>>>> [ 2412.643178] libceph:   messenger.c:1969 : con_work ffff8801955c0030
>>>>> OPENING
>>>>> [ 2412.643182] libceph:   messenger.c:292  : con_close_socket on
>>>>> ffff8801955c0030 sock ffff8801825a8c00
>>>>> [ 2412.643197] libceph:   messenger.c:162  : ceph_state_change
>>>>> ffff8801955c0030 state = 2076 sk_state = 4
>>>> Ok, now we see it, but this should have executed before we ever got to
>>>> try_write!
>>>>
>>>> There are a handful of places that drop and re-take con->mutex... I think it
>>>> has to be one of those.  They are:
>>>>
>>>> - prepare_connect_authorizer (before calling ->prepare_authorizer callback)
>>>> - fail_protocol (before calling ->bad_proto callback)
>>>> - process_connect RESETSESSION case (before calling ->reset callback)
>>>> - process_message (before calling ->dispatch callback)
>>>>
>>>> I think after retaking the lock in each of these cases we need to test for
>>>> the CLOSED or OPENING bits and, if set, bail out so that con_work can
>>>> restart.
>>>>
>>>> Does that make sense?  Can you confirm which case of lock droppage inside
>>>> con_work is at fault?  (Maybe adding a printk at the top of con_work after
>>>> taking the lock too would make it more clear what's going on; currently it
>>>> only prints in all the random test_bit cases).
>>> So I've got logs now that show that connection state changes
>>> across a call to process_message.  That seems like it proves
>>> the concept.
>>>
>>>> I'm thinking something like the below... does that seem reasonable?  I only
>>>> did the prepare_authorizer case to start, but the others should be fixed up
>>>> too.  Compile tested only.
>>>>
>>> Is there any reason not to fix all the cases at once, in try_read?
>>> It's less code, but far away from where the mutex dropping occurs,
>>> so we add a comment:
>> That covers the process_message and process_connect cases, provided you 
>> also set ret properly, and do the con_close_socket work that the con_work 
>> checks would have done.  e.g.,
> 
> I pushed a patch to the msgr_race branch that catches all four cases (I 
> think).  Does the fix make sense given what you saw?

Yes, I think so.

In every case I logged I saw that prepare_write_message was
getting called at the wrong time - I finally am understanding
that happened because reset_connection was able to run when
con->mutex was dropped, and it thus resets con->out_msg at
the wrong time.

I'll test your fix out today and let you know if I see
any more of this.

-- Jim

> 
> Thanks!
> sage
> 

--
To unsubscribe from this list: send the line "unsubscribe ceph-devel" in
the body of a message to majordomo@vger.kernel.org
More majordomo info at  http://vger.kernel.org/majordomo-info.html
diff mbox

Patch

diff --git a/net/ceph/messenger.c b/net/ceph/messenger.c
index e15a82c..7ef9eb0 100644
--- a/net/ceph/messenger.c
+++ b/net/ceph/messenger.c
@@ -1810,6 +1810,21 @@  static int try_read(struct ceph_connection *con)
 more:
        dout("try_read tag %d in_base_pos %d\n", (int)con->in_tag,
             con->in_base_pos);
+
+       /*
+        * process_connect and process_message drop and re-take
+        * con->mutex.  make sure we handle a racing close or reopen.
+        */
+       if (test_bit(CLOSED, &con->state)) {
+               ret = -EAGAIN;
+               goto out;
+       }
+       if (test_bit(OPENING, &con->state)) {
+               con_close_socket(con);
+               ret = 0;  /* so we continue to try_write */
+               goto out;
+       }
+
        if (test_bit(CONNECTING, &con->state)) {
                if (!test_bit(NEGOTIATING, &con->state)) {
                        dout("try_read connecting\n");