
RPC Race Condition

Message ID 4FA345DA4F4AE44899BD2B03EEEC2FA90928C74A@SACEXCMBX04-PRD.hq.netapp.com (mailing list archive)
State New, archived

Commit Message

Trond Myklebust Oct. 23, 2012, 4:02 p.m. UTC
On Tue, 2012-10-23 at 09:45 -0400, Chris Perl wrote:
> On Mon, Oct 22, 2012 at 09:20:34PM +0000, Myklebust, Trond wrote:
> > > This is what I changed:
> > > 
> > > --- a/net/sunrpc/xprtsock.c     2012-08-14 08:47:16.000000000 -0400
> > > +++ b/net/sunrpc/xprtsock.c     2012-10-22 14:50:09.237725498 -0400
> > > @@ -1571,7 +1571,8 @@
> > >         dprintk("RPC:       %s client %p...\n"
> > >                         "RPC:       error %d\n",
> > >                         __func__, xprt, sk->sk_err);
> > > -       xprt_wake_pending_tasks(xprt, -EAGAIN);
> > > +       if (!xprt_connecting(xprt))
> > > +               xprt_wake_pending_tasks(xprt, -EAGAIN);
> > >  out:
> > >         read_unlock_bh(&sk->sk_callback_lock);
> > >  }
> > > 
> > > Unfortunately it does not resolve it.  It appears that it does stop the wakeup
> > > from happening where I detailed it before, but we still wake up too early.  The
> > > following is the call stack from SystemTap catching the call to `rpc_make_runnable':
> > > 
> > >  0xffffffffa0434840 : rpc_make_runnable+0x0/0x80 [sunrpc]
> > >  0xffffffffa0434b5e : rpc_wake_up_task_queue_locked+0x18e/0x270 [sunrpc]
> > >  0xffffffffa0434c8e : rpc_wake_up_status+0x4e/0x80 [sunrpc]
> > >  0xffffffffa042d8ec : xprt_wake_pending_tasks+0x2c/0x30 [sunrpc]
> > >  0xffffffffa042da56 : xprt_disconnect_done+0x46/0x70 [sunrpc]
> > >  0xffffffffa04305b0 : xs_tcp_setup_socket+0x330/0x4c0 [sunrpc]
> > >  0xffffffff8108c760 : worker_thread+0x170/0x2a0 [kernel]
> > >  0xffffffff81091d66 : kthread+0x96/0xa0 [kernel]
> > >  0xffffffff8100c14a : child_rip+0xa/0x20 [kernel]
> > > 
> > > However, it's mildly confusing to me since I cannot follow that exact
> > > call stack in the code.  The nearest I can find to that is the
> > > following:
> > > 
> > >   rpc_make_runnable
> > >   rpc_wake_up_task_queue_locked
> > >   rpc_wake_up_status
> > >   xprt_wake_pending_tasks
> > >   xprt_disconnect_done
> > >   xs_sock_mark_closed
> > >   xs_abort_connection
> > >   xs_tcp_reuse_connection
> > >   xs_tcp_setup_socket
> > > 
> > > As far as I can tell, we get here because the call to `kernel_connect'
> > > from within `xs_abort_connection' by rpciod is returning 0 sometimes
> > > (and EINPROGRESS others).  When it returns 0, we enter this block:
> > > 
> > >   if (!result)
> > >           xs_sock_mark_closed(&transport->xprt);
> > >   else
> > >     dprintk("RPC:       AF_UNSPEC connect return code %d\n", result);
> > > 
> > > and that eventually gets us woken up via the call stack above.  The end
> > > result is the same: we wake up and move on to `call_transmit' before
> > > rpciod is done connecting the socket.
> > > 
> > > I'll do some more tracing and see what else I can come up with.
> > > 
> > > Chris
> > 
> > OK, then let's just get rid of xs_error_report(). All the cases that we
> > care about will call tcp_done(), which will induce a state change when
> > needed.
> > 
> > Cheers,
> >   Trond
> > 8<---------------------------------------------------------------------
> > From 047ee9ea072698f1d35573c1f26dc02d13a4af7f Mon Sep 17 00:00:00 2001
> > From: Trond Myklebust <Trond.Myklebust@netapp.com>
> > Date: Mon, 22 Oct 2012 17:14:36 -0400
> > Subject: [PATCH] SUNRPC: Get rid of the xs_error_report socket callback
> > 
> > Chris Perl reports that we're seeing races between the wakeup call in
> > xs_error_report and the connect attempts. Basically, Chris has shown
> > that in certain circumstances, the call to xs_error_report causes the
> > rpc_task that is responsible for reconnecting to wake up early, thus
> > triggering a disconnect and retry.
> > 
> > Since the sk->sk_error_report() calls in the socket layer are always
> > followed by a tcp_done() in the cases where we care about waking up
> > the rpc_tasks, just let the state_change callbacks take responsibility
> > for those wake ups.
> > 
> > Reported-by: Chris Perl <chris.perl@gmail.com>
> > Signed-off-by: Trond Myklebust <Trond.Myklebust@netapp.com>
> > ---
> >  net/sunrpc/xprtsock.c | 25 -------------------------
> >  1 file changed, 25 deletions(-)
> > 
> > diff --git a/net/sunrpc/xprtsock.c b/net/sunrpc/xprtsock.c
> > index aaaadfb..89bde21 100644
> > --- a/net/sunrpc/xprtsock.c
> > +++ b/net/sunrpc/xprtsock.c
> > @@ -254,7 +254,6 @@ struct sock_xprt {
> >  	void			(*old_data_ready)(struct sock *, int);
> >  	void			(*old_state_change)(struct sock *);
> >  	void			(*old_write_space)(struct sock *);
> > -	void			(*old_error_report)(struct sock *);
> >  };
> >  
> >  /*
> > @@ -781,7 +780,6 @@ static void xs_save_old_callbacks(struct sock_xprt *transport, struct sock *sk)
> >  	transport->old_data_ready = sk->sk_data_ready;
> >  	transport->old_state_change = sk->sk_state_change;
> >  	transport->old_write_space = sk->sk_write_space;
> > -	transport->old_error_report = sk->sk_error_report;
> >  }
> >  
> >  static void xs_restore_old_callbacks(struct sock_xprt *transport, struct sock *sk)
> > @@ -789,7 +787,6 @@ static void xs_restore_old_callbacks(struct sock_xprt *transport, struct sock *s
> >  	sk->sk_data_ready = transport->old_data_ready;
> >  	sk->sk_state_change = transport->old_state_change;
> >  	sk->sk_write_space = transport->old_write_space;
> > -	sk->sk_error_report = transport->old_error_report;
> >  }
> >  
> >  static void xs_reset_transport(struct sock_xprt *transport)
> > @@ -1540,25 +1537,6 @@ static void xs_tcp_state_change(struct sock *sk)
> >  	read_unlock_bh(&sk->sk_callback_lock);
> >  }
> >  
> > -/**
> > - * xs_error_report - callback mainly for catching socket errors
> > - * @sk: socket
> > - */
> > -static void xs_error_report(struct sock *sk)
> > -{
> > -	struct rpc_xprt *xprt;
> > -
> > -	read_lock_bh(&sk->sk_callback_lock);
> > -	if (!(xprt = xprt_from_sock(sk)))
> > -		goto out;
> > -	dprintk("RPC:       %s client %p...\n"
> > -			"RPC:       error %d\n",
> > -			__func__, xprt, sk->sk_err);
> > -	xprt_wake_pending_tasks(xprt, -EAGAIN);
> > -out:
> > -	read_unlock_bh(&sk->sk_callback_lock);
> > -}
> > -
> >  static void xs_write_space(struct sock *sk)
> >  {
> >  	struct socket *sock;
> > @@ -1858,7 +1836,6 @@ static int xs_local_finish_connecting(struct rpc_xprt *xprt,
> >  		sk->sk_user_data = xprt;
> >  		sk->sk_data_ready = xs_local_data_ready;
> >  		sk->sk_write_space = xs_udp_write_space;
> > -		sk->sk_error_report = xs_error_report;
> >  		sk->sk_allocation = GFP_ATOMIC;
> >  
> >  		xprt_clear_connected(xprt);
> > @@ -1983,7 +1960,6 @@ static void xs_udp_finish_connecting(struct rpc_xprt *xprt, struct socket *sock)
> >  		sk->sk_user_data = xprt;
> >  		sk->sk_data_ready = xs_udp_data_ready;
> >  		sk->sk_write_space = xs_udp_write_space;
> > -		sk->sk_error_report = xs_error_report;
> >  		sk->sk_no_check = UDP_CSUM_NORCV;
> >  		sk->sk_allocation = GFP_ATOMIC;
> >  
> > @@ -2098,7 +2074,6 @@ static int xs_tcp_finish_connecting(struct rpc_xprt *xprt, struct socket *sock)
> >  		sk->sk_data_ready = xs_tcp_data_ready;
> >  		sk->sk_state_change = xs_tcp_state_change;
> >  		sk->sk_write_space = xs_tcp_write_space;
> > -		sk->sk_error_report = xs_error_report;
> >  		sk->sk_allocation = GFP_ATOMIC;
> >  
> >  		/* socket options */
> 
> Hi Trond,
> 
> First, thanks for being so responsive!  I never expected to get such
> quick feedback and help.
> 
> I think perhaps you misread my last email.  Adding the check for
> `xprt_connecting(xprt)' to `xs_error_report' _did_ stop the early wake
> up from happening at that point in the code.  Removing the
> entire callback with your patch works just as well.

OK. I did misunderstand that.

> I have confirmed that I can still reproduce the problem with this
> patch.
> 
> Now that we're not waking up early because of `xs_error_report', we're
> waking up early in another part of the code as I mentioned above.  I'm
> not entirely clear about all the details here yet.
> 
> What I do know is that we're woken up with `tk_status' set to EAGAIN, and
> since `call_connect_status' doesn't do anything special for that, we move
> on to `call_transmit'.  Meanwhile, rpciod is still trying to connect the
> socket (eventually succeeding) and so we're racing.
> 
> Question: should `call_connect_status' treat EAGAIN specially and reset
> `tk_action' to `call_connect' again (rather than just moving on to
> `call_transmit')?  I don't think this alone would fix things properly,
> but it seems logical to me.  Perhaps I'm missing something.

call_transmit() should normally just cause the rpc_task to detect that
the connection is down, and should cause it to go to call_connect in
order to fix everything.

If it is causing a race, then I'm thinking that is probably due to the
call to xs_tcp_shutdown() when we get an EPIPE error in
xs_tcp_send_request(). That shutdown seems to be needed only in the case
when we're in TCP_CLOSE_WAIT, in which case what we really want is to
force the rpc_task to go to xprt_connect, which means clearing
xprt_connected().

Hmm.... Can you try applying the attached 2 patches?

Cheers
  Trond

-- 
Trond Myklebust
Linux NFS client maintainer

NetApp
Trond.Myklebust@netapp.com
www.netapp.com
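
Only the second of the two patches Trond attached appears in the Patch section below. Judging from its changelog ("Now that we clear the connected flag when entering TCP_CLOSE_WAIT"), the companion patch makes xs_tcp_state_change() drop the connected bit when the server closes its end of the connection. A rough sketch of that change, for orientation only and not the verbatim patch:

	case TCP_CLOSE_WAIT:
		/* The server initiated a shutdown of the socket */
		xprt->connect_cookie++;
		/* Sketch: clear the connected bit so the rpc_task is forced
		 * back through call_connect/xprt_connect instead of racing
		 * call_transmit against the reconnect running in rpciod. */
		clear_bit(XPRT_CONNECTED, &xprt->state);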

Comments

Chris Perl Oct. 23, 2012, 8:08 p.m. UTC | #1
On Tue, Oct 23, 2012 at 04:02:27PM +0000, Myklebust, Trond wrote:
> call_transmit() should normally just cause the rpc_task to detect that
> the connection is down, and should cause it to go to call_connect in
> order to fix everything.
> 
> If it is causing a race, then I'm thinking that is probably due to the
> call to xs_tcp_shutdown() when we get an EPIPE error in
> xs_tcp_send_request(). That shutdown seems to be needed only in the case
> when we're in TCP_CLOSE_WAIT, in which case what we really want is to
> force the rpc_task to go to xprt_connect, which means clearing
> xprt_connected().
> 
> Hmm.... Can you try applying the attached 2 patches?

I rebuilt with those two patches as well as the first one that removed
the `xs_error_report' callback.

Good news, it works!

However, looking at the interactions, it still seems a little bit racy.
Clearly you know more about this than I do, so I'll defer to your better
judgement, but let me lay out what I believe is happening:

            Stat Thread                                rpciod
           -------------                              --------
  - woken up with tk_status set to
    EAGAIN via xs_tcp_state_change

  - tries call_transmit again, which
    calls xs_tcp_send_request which
    comes back with ECONNRESET (b/c this
    is what sk_err is)

  - call_status sees this and goes to
    sleep for 3 seconds and then sets
    tk_action to call_bind

  - call_bind

  - call_connect, queues execution of
    xs_tcp_setup_socket with rpciod and
    goes to sleep
    			
					   - starts doing its thing, but wakes
					     up our other thread before its done
					     trying to connect the socket (or
					     failing to do so) via the call
					     stack mentioned before (via
					     xprt_disconnect_done)
					     
  - tk_status is set to EAGAIN, so
    call_connect_status moves on to
    call_transmit

  - call_transmit calls
    xs_tcp_send_request again which
    finds the socket with sk_shutdown !=
    0 and so returns EPIPE

  - call_status sees the EPIPE and sets
    tk_action back to call_bind

					   - continues on and updates the state
					     of the socket to TCP_SYN_SENT

					   - When the connection is established,
					     xs_tcp_state_change takes care of
					     marking the xprt connected

  - call_bind

  - call_connect, however this time it
    schedules no work because it finds
    that the transport is already
    connected

  - call_transmit, succeeds, we recover

This is how the recovery went when I was testing, but I could imagine a
situation where different timing comes into play and we wind up asking
rpciod to execute xs_tcp_setup_socket twice (because the first connect
hasn't been established yet).  I haven't thought too hard about what
that would mean, but I imagine it probably wouldn't be good.

Like I said, you know way more about this than I do; I just wanted to point
this out.  It's entirely possible that I'm missing or misunderstanding something.

Thanks!

Chris
--
To unsubscribe from this list: send the line "unsubscribe linux-nfs" in
the body of a message to majordomo@vger.kernel.org
More majordomo info at  http://vger.kernel.org/majordomo-info.html
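
For readers who want to reproduce the SystemTap traces quoted at the top of the thread, a minimal script along these lines prints a kernel backtrace each time rpc_make_runnable() fires. It assumes debuginfo for the sunrpc module is installed; the probe point may need adjusting for a particular kernel build, and the script name is arbitrary:

	# rpc_wakeup.stp
	probe module("sunrpc").function("rpc_make_runnable") {
		printf("%d (%s) entered rpc_make_runnable\n", pid(), execname())
		print_backtrace()
	}

Running `stap rpc_wakeup.stp` while exercising the mount shows which path (xs_error_report, xs_tcp_state_change, or xprt_disconnect_done) is waking the task.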

Patch

From ed42390cf42a0ea483d7a8d7e2e2b397ec0a8f00 Mon Sep 17 00:00:00 2001
From: Trond Myklebust <Trond.Myklebust@netapp.com>
Date: Tue, 23 Oct 2012 11:40:02 -0400
Subject: [PATCH 2/2] Revert "SUNRPC: Ensure we close the socket on EPIPE
 errors too..."

This reverts commit 55420c24a0d4d1fce70ca713f84aa00b6b74a70e.
Now that we clear the connected flag when entering TCP_CLOSE_WAIT,
the deadlock described in this commit is no longer possible.
Instead, the resulting call to xs_tcp_shutdown() can interfere
with pending reconnection attempts.

Reported-by: Chris Perl <chris.perl@gmail.com>
Signed-off-by: Trond Myklebust <Trond.Myklebust@netapp.com>
Cc: stable@vger.kernel.org
---
 net/sunrpc/xprtsock.c | 2 +-
 1 file changed, 1 insertion(+), 1 deletion(-)

diff --git a/net/sunrpc/xprtsock.c b/net/sunrpc/xprtsock.c
index 6e6967d..7e2dd0d 100644
--- a/net/sunrpc/xprtsock.c
+++ b/net/sunrpc/xprtsock.c
@@ -737,10 +737,10 @@  static int xs_tcp_send_request(struct rpc_task *task)
 		dprintk("RPC:       sendmsg returned unrecognized error %d\n",
 			-status);
 	case -ECONNRESET:
-	case -EPIPE:
 		xs_tcp_shutdown(xprt);
 	case -ECONNREFUSED:
 	case -ENOTCONN:
+	case -EPIPE:
 		clear_bit(SOCK_ASYNC_NOSPACE, &transport->sock->flags);
 	}
 
-- 
1.7.11.7
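
After this revert, the tail of the error switch in xs_tcp_send_request() reads as below (reconstructed from the hunk above). The fall-throughs are deliberate: only -ECONNRESET still triggers xs_tcp_shutdown(), while -EPIPE now just clears the nospace flag and leaves it to the RPC state machine (call_status, then call_bind and call_connect, as in the timeline above) to reconnect:

	case -ECONNRESET:
		xs_tcp_shutdown(xprt);
	case -ECONNREFUSED:
	case -ENOTCONN:
	case -EPIPE:
		clear_bit(SOCK_ASYNC_NOSPACE, &transport->sock->flags);
	}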