
[10/19] NET: set PF_FSTRANS while holding sk_lock

Message ID: 20140416040336.10604.96000.stgit@notabene.brown
State: New, archived

Commit Message

NeilBrown April 16, 2014, 4:03 a.m. UTC
sk_lock can be taken while reclaiming memory (in nfsd for loop-back
NFS mounts, and presumably in nfs), and memory can be allocated while
holding sk_lock, at least via:

 inet_listen -> inet_csk_listen_start -> reqsk_queue_alloc

So to avoid deadlocks, always set PF_FSTRANS while holding sk_lock.

This deadlock was found by lockdep.

Signed-off-by: NeilBrown <neilb@suse.de>
---
 include/net/sock.h |    1 +
 net/core/sock.c    |    2 ++
 2 files changed, 3 insertions(+)
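
The current_set_flags_nested()/current_restore_flags_nested() helpers used
in the patch below are presumably introduced elsewhere in this series and are
not shown here; they are assumed to follow the long-standing XFS macros of
the same name, roughly:

	/* Assumed semantics (modelled on fs/xfs/xfs_linux.h): save the
	 * caller's process flags into *sp, then set the requested bits. */
	#define current_set_flags_nested(sp, f)		\
		(*(sp) = current->flags, current->flags |= (f))

	/* Restore only the bits in f to the state the matching _set_ call
	 * saved; any other flag changes made in between are preserved. */
	#define current_restore_flags_nested(sp, f)	\
		(current->flags = ((current->flags & ~(f)) | (*(sp) & (f))))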




Comments

Eric Dumazet April 16, 2014, 5:13 a.m. UTC | #1
On Wed, 2014-04-16 at 14:03 +1000, NeilBrown wrote:
> sk_lock can be taken while reclaiming memory (in nfsd for loop-back
> NFS mounts, and presumably in nfs), and memory can be allocated while
> holding sk_lock, at least via:
> 
>  inet_listen -> inet_csk_listen_start -> reqsk_queue_alloc
> 
> So to avoid deadlocks, always set PF_FSTRANS while holding sk_lock.
> 
> This deadlock was found by lockdep.

Wow, this is adding expensive stuff in fast path, only for nfsd :(

BTW, why should the current->flags be saved in a socket field, and not in a
current->save_flags? This really looks like a thread property, not a socket
one.

Why could nfsd not have PF_FSTRANS in its current->flags?

For applications handling millions of sockets, this makes a difference.
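
For illustration, the alternative being suggested would keep the saved copy
in the task that takes the lock rather than in every socket, so struct sock
would not grow at all.  A rough sketch only; current->save_flags is the
hypothetical per-task field named above, not an existing member:

	/* Hypothetical sketch: save/restore the flags in the locking task.
	 * 'save_flags' stands in for a task_struct field that a separate
	 * patch would have to add; it does not exist today. */
	void lock_sock_nested(struct sock *sk, int subclass)
	{
		/* ... existing slock/owned handling ... */
		current->save_flags = current->flags;
		current->flags |= PF_FSTRANS;
	}

	void release_sock(struct sock *sk)
	{
		current->flags = (current->flags & ~PF_FSTRANS) |
				 (current->save_flags & PF_FSTRANS);
		/* ... existing backlog processing and unlock ... */
	}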


NeilBrown April 16, 2014, 5:47 a.m. UTC | #2
On Tue, 15 Apr 2014 22:13:46 -0700 Eric Dumazet <eric.dumazet@gmail.com>
wrote:

> On Wed, 2014-04-16 at 14:03 +1000, NeilBrown wrote:
> > sk_lock can be taken while reclaiming memory (in nfsd for loop-back
> > NFS mounts, and presumably in nfs), and memory can be allocated while
> > holding sk_lock, at least via:
> > 
> >  inet_listen -> inet_csk_listen_start -> reqsk_queue_alloc
> > 
> > So to avoid deadlocks, always set PF_FSTRANS while holding sk_lock.
> > 
> > This deadlock was found by lockdep.
> 
> Wow, this is adding expensive stuff in fast path, only for nfsd :(

Yes, this was probably one part that I was least comfortable about.

> 
> BTW, why should the current->flags be saved in a socket field, and not in a
> current->save_flags? This really looks like a thread property, not a socket
> one.
> 
> Why could nfsd not have PF_FSTRANS in its current->flags?

nfsd does have PF_FSTRANS set in current->flags.  But some other processes
might not.

If any process takes sk_lock, allocates memory, and then blocks in reclaim, it
could be waiting for nfsd.  If nfsd then waits for that sk_lock, the result is
a deadlock.

Thinking a bit more carefully .... I suspect that any socket that nfsd
created would only ever be locked by nfsd.  If that is the case then the
problem can be resolved entirely within nfsd.  We would need to tell lockdep
that there are two sorts of sk_locks, those which nfsd uses and all the
rest.  That might get a little messy, but wouldn't impact performance.
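
Concretely, lockdep could be told about a separate class for these sockets
with the existing sock_lock_init_class_and_name() helper, the same pattern
sunrpc already uses to reclassify its socket locks.  A sketch only; the hook
point (somewhere in the svc accept/setup path) is an assumption:

	/* Sketch: give sockets created by nfsd/sunrpc their own lockdep
	 * class so lockdep no longer lumps them in with ordinary user
	 * sockets.  Where to call this from is an assumption. */
	static struct lock_class_key svc_slock_key;
	static struct lock_class_key svc_key;

	static void svc_reclassify_socket(struct socket *sock)
	{
		struct sock *sk = sock->sk;

		if (WARN_ON_ONCE(sock_owned_by_user(sk)))
			return;

		sock_lock_init_class_and_name(sk,
				"slock-AF_INET-NFSD", &svc_slock_key,
				"sk_lock-AF_INET-NFSD", &svc_key);
	}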

Is it justified to assume that sockets created by nfsd threads would only
ever be locked by nfsd threads (and interrupts, which won't be allocating
memory so don't matter), or might they be locked by other threads, e.g. for
'netstat -a', etc.?


> 
> For applications handling millions of sockets, this makes a difference.
> 

Thanks,
NeilBrown
David Miller April 16, 2014, 1 p.m. UTC | #3
From: Eric Dumazet <eric.dumazet@gmail.com>
Date: Tue, 15 Apr 2014 22:13:46 -0700

> For applications handling millions of sockets, this makes a difference.

Indeed, this really is not acceptable.
NeilBrown April 17, 2014, 2:38 a.m. UTC | #4
On Wed, 16 Apr 2014 09:00:02 -0400 (EDT) David Miller <davem@davemloft.net>
wrote:

> From: Eric Dumazet <eric.dumazet@gmail.com>
> Date: Tue, 15 Apr 2014 22:13:46 -0700
> 
> > For applications handling millions of sockets, this makes a difference.
> 
> Indeed, this really is not acceptable.

As you say...
I've just discovered that I can get rid of the lockdep message (and hence,
presumably, the deadlock risk) with a well-placed:

		newsock->sk->sk_allocation = GFP_NOFS;

which surprised me, as it seemed to be an explicit GFP_KERNEL allocation that
was mentioned in the lockdep trace.  Obviously these traces require quite some
sophistication to understand.
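
For context, sk_allocation is the per-socket GFP mask that the networking
core passes to most allocations it makes on behalf of that socket, so
flipping it to GFP_NOFS keeps those allocations out of filesystem reclaim.
A simplified illustration (not the exact call site from the lockdep trace):

	/* Illustrative only: allocations done for a socket generally take
	 * their GFP mask from sk->sk_allocation rather than hard-coding
	 * GFP_KERNEL, so marking an nfsd socket GFP_NOFS covers them all. */
	static struct sk_buff *example_alloc_for_sock(struct sock *sk,
						      unsigned int size)
	{
		return alloc_skb(size, sk->sk_allocation);
	}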

So - thanks for the feedback; the patch can be ignored.

Thanks,
NeilBrown

Patch

diff --git a/include/net/sock.h b/include/net/sock.h
index b9586a137cad..27c355637e44 100644
--- a/include/net/sock.h
+++ b/include/net/sock.h
@@ -324,6 +324,7 @@ struct sock {
 #define sk_v6_rcv_saddr	__sk_common.skc_v6_rcv_saddr
 
 	socket_lock_t		sk_lock;
+	unsigned int		sk_pflags; /* process flags before taking lock */
 	struct sk_buff_head	sk_receive_queue;
 	/*
 	 * The backlog queue is special, it is always used with
diff --git a/net/core/sock.c b/net/core/sock.c
index cf9bd24e4099..8bc677ef072e 100644
--- a/net/core/sock.c
+++ b/net/core/sock.c
@@ -2341,6 +2341,7 @@ void lock_sock_nested(struct sock *sk, int subclass)
 	/*
 	 * The sk_lock has mutex_lock() semantics here:
 	 */
+	current_set_flags_nested(&sk->sk_pflags, PF_FSTRANS);
 	mutex_acquire(&sk->sk_lock.dep_map, subclass, 0, _RET_IP_);
 	local_bh_enable();
 }
@@ -2352,6 +2353,7 @@ void release_sock(struct sock *sk)
 	 * The sk_lock has mutex_unlock() semantics:
 	 */
 	mutex_release(&sk->sk_lock.dep_map, 1, _RET_IP_);
+	current_restore_flags_nested(&sk->sk_pflags, PF_FSTRANS);
 
 	spin_lock_bh(&sk->sk_lock.slock);
 	if (sk->sk_backlog.tail)