Message ID | 20240610223501.73191-2-kuniyu@amazon.com (mailing list archive) |
---|---|
State | Superseded |
Delegated to: | Netdev Maintainers |
Headers | show |
Series | af_unix: Remove spin_lock_nested() and convert to lock_cmp_fn. | expand |
On Mon, Jun 10, 2024 at 03:34:51PM -0700, Kuniyuki Iwashima wrote: > When created, AF_UNIX socket is put into net->unx.table.buckets[], > and the hash is stored in sk->sk_hash. > > * unbound socket : 0 <= sk_hash <= UNIX_HASH_MOD > > When bind() is called, the socket could be moved to another bucket. > > * pathname socket : 0 <= sk_hash <= UNIX_HASH_MOD > * abstract socket : UNIX_HASH_MOD + 1 <= sk_hash <= UNIX_HASH_MOD * 2 + 1 > > Then, we call unix_table_double_lock() which locks a single bucket > or two. > > Let's define the order as unix_table_lock_cmp_fn() instead of using > spin_lock_nested(). > > The locking is always done in ascending order of sk->sk_hash, which > is the index of buckets/locks array allocated by kvmalloc_array(). > > sk_hash_A < sk_hash_B > <=> &locks[sk_hash_A].dep_map < &locks[sk_hash_B].dep_map > > So, the relation of two sk->sk_hash can be derived from the addresses > of dep_map in the array of locks. > > Signed-off-by: Kuniyuki Iwashima <kuniyu@amazon.com> > --- > net/unix/af_unix.c | 10 +++++++++- > 1 file changed, 9 insertions(+), 1 deletion(-) > > diff --git a/net/unix/af_unix.c b/net/unix/af_unix.c > index 3821f8945b1e..b0a9891c0384 100644 > --- a/net/unix/af_unix.c > +++ b/net/unix/af_unix.c > @@ -126,6 +126,13 @@ static spinlock_t bsd_socket_locks[UNIX_HASH_SIZE / 2]; > * hash table is protected with spinlock. > * each socket state is protected by separate spinlock. > */ > +#ifdef CONFIG_PROVE_LOCKING > +static int unix_table_lock_cmp_fn(const struct lockdep_map *a, > + const struct lockdep_map *b) > +{ > + return a < b ? -1 : 0; > +} > +#endif This should be a proper comparison function: -1 for less than, 0 for equal, 1 for greater than. I've got a cmp_int() macro in bcachefs that does this nicely.
From: Kent Overstreet <kent.overstreet@linux.dev> Date: Mon, 10 Jun 2024 18:43:44 -0400 > On Mon, Jun 10, 2024 at 03:34:51PM -0700, Kuniyuki Iwashima wrote: > > When created, AF_UNIX socket is put into net->unx.table.buckets[], > > and the hash is stored in sk->sk_hash. > > > > * unbound socket : 0 <= sk_hash <= UNIX_HASH_MOD > > > > When bind() is called, the socket could be moved to another bucket. > > > > * pathname socket : 0 <= sk_hash <= UNIX_HASH_MOD > > * abstract socket : UNIX_HASH_MOD + 1 <= sk_hash <= UNIX_HASH_MOD * 2 + 1 > > > > Then, we call unix_table_double_lock() which locks a single bucket > > or two. > > > > Let's define the order as unix_table_lock_cmp_fn() instead of using > > spin_lock_nested(). > > > > The locking is always done in ascending order of sk->sk_hash, which > > is the index of buckets/locks array allocated by kvmalloc_array(). > > > > sk_hash_A < sk_hash_B > > <=> &locks[sk_hash_A].dep_map < &locks[sk_hash_B].dep_map > > > > So, the relation of two sk->sk_hash can be derived from the addresses > > of dep_map in the array of locks. > > > > Signed-off-by: Kuniyuki Iwashima <kuniyu@amazon.com> > > --- > > net/unix/af_unix.c | 10 +++++++++- > > 1 file changed, 9 insertions(+), 1 deletion(-) > > > > diff --git a/net/unix/af_unix.c b/net/unix/af_unix.c > > index 3821f8945b1e..b0a9891c0384 100644 > > --- a/net/unix/af_unix.c > > +++ b/net/unix/af_unix.c > > @@ -126,6 +126,13 @@ static spinlock_t bsd_socket_locks[UNIX_HASH_SIZE / 2]; > > * hash table is protected with spinlock. > > * each socket state is protected by separate spinlock. > > */ > > +#ifdef CONFIG_PROVE_LOCKING > > +static int unix_table_lock_cmp_fn(const struct lockdep_map *a, > > + const struct lockdep_map *b) > > +{ > > + return a < b ? -1 : 0; > > +} > > +#endif > > This should be a proper comparison function: -1 for less than, 0 for > equal, 1 for greater than. > > I've got a cmp_int() macro in bcachefs that does this nicely. So, should this be : a < b ? -1 : 1 ? or ((a > b) - (b < a)) ? I think most double_lock functions eliminate the a == b case beforehand, and even ->cmp_fn() is not called for such a recursive case because debug_spin_lock_before() triggers BUG() then. Initially I added the same macro, but checkpatch complains about it, and I thought the current form is easier to understand because it's the actual comparison used in the double lock part. Also, there is a case, where we just want to return an error without classifying it into 0 or 1. I rather think we should define something like this on the lockdep side. enum lockdep_cmp_result { LOCKDEP_CMP_SAFE = -1, LOCKDEP_CMP_DEADLOCK, }; What do you think ?
On Mon, Jun 10, 2024 at 04:38:57PM -0700, Kuniyuki Iwashima wrote: > From: Kent Overstreet <kent.overstreet@linux.dev> > Date: Mon, 10 Jun 2024 18:43:44 -0400 > > On Mon, Jun 10, 2024 at 03:34:51PM -0700, Kuniyuki Iwashima wrote: > > > When created, AF_UNIX socket is put into net->unx.table.buckets[], > > > and the hash is stored in sk->sk_hash. > > > > > > * unbound socket : 0 <= sk_hash <= UNIX_HASH_MOD > > > > > > When bind() is called, the socket could be moved to another bucket. > > > > > > * pathname socket : 0 <= sk_hash <= UNIX_HASH_MOD > > > * abstract socket : UNIX_HASH_MOD + 1 <= sk_hash <= UNIX_HASH_MOD * 2 + 1 > > > > > > Then, we call unix_table_double_lock() which locks a single bucket > > > or two. > > > > > > Let's define the order as unix_table_lock_cmp_fn() instead of using > > > spin_lock_nested(). > > > > > > The locking is always done in ascending order of sk->sk_hash, which > > > is the index of buckets/locks array allocated by kvmalloc_array(). > > > > > > sk_hash_A < sk_hash_B > > > <=> &locks[sk_hash_A].dep_map < &locks[sk_hash_B].dep_map > > > > > > So, the relation of two sk->sk_hash can be derived from the addresses > > > of dep_map in the array of locks. > > > > > > Signed-off-by: Kuniyuki Iwashima <kuniyu@amazon.com> > > > --- > > > net/unix/af_unix.c | 10 +++++++++- > > > 1 file changed, 9 insertions(+), 1 deletion(-) > > > > > > diff --git a/net/unix/af_unix.c b/net/unix/af_unix.c > > > index 3821f8945b1e..b0a9891c0384 100644 > > > --- a/net/unix/af_unix.c > > > +++ b/net/unix/af_unix.c > > > @@ -126,6 +126,13 @@ static spinlock_t bsd_socket_locks[UNIX_HASH_SIZE / 2]; > > > * hash table is protected with spinlock. > > > * each socket state is protected by separate spinlock. > > > */ > > > +#ifdef CONFIG_PROVE_LOCKING > > > +static int unix_table_lock_cmp_fn(const struct lockdep_map *a, > > > + const struct lockdep_map *b) > > > +{ > > > + return a < b ? -1 : 0; > > > +} > > > +#endif > > > > This should be a proper comparison function: -1 for less than, 0 for > > equal, 1 for greater than. > > > > I've got a cmp_int() macro in bcachefs that does this nicely. > > So, should this be : > > a < b ? -1 : 1 > > ? > > or > > ((a > b) - (b < a)) > > ? > > I think most double_lock functions eliminate the a == b case beforehand, > and even ->cmp_fn() is not called for such a recursive case because > debug_spin_lock_before() triggers BUG() then. > > Initially I added the same macro, but checkpatch complains about it, > and I thought the current form is easier to understand because it's > the actual comparison used in the double lock part. > > Also, there is a case, where we just want to return an error without > classifying it into 0 or 1. > > I rather think we should define something like this on the lockdep side. > > enum lockdep_cmp_result { > LOCKDEP_CMP_SAFE = -1, > LOCKDEP_CMP_DEADLOCK, > }; > > What do you think ? No, we're defining an ordering, there's no need for an enum - this should work exactly the same as a comparison function that you pass to sort(). Comparison functions are no place to get fancy, they should be as standard as possible: you can get _crazy_ bugs resulting from buggy comparison functions that don't actually define a total ordering.
From: Kent Overstreet <kent.overstreet@linux.dev> Date: Mon, 10 Jun 2024 19:50:27 -0400 > On Mon, Jun 10, 2024 at 04:38:57PM -0700, Kuniyuki Iwashima wrote: > > From: Kent Overstreet <kent.overstreet@linux.dev> > > Date: Mon, 10 Jun 2024 18:43:44 -0400 > > > On Mon, Jun 10, 2024 at 03:34:51PM -0700, Kuniyuki Iwashima wrote: > > > > When created, AF_UNIX socket is put into net->unx.table.buckets[], > > > > and the hash is stored in sk->sk_hash. > > > > > > > > * unbound socket : 0 <= sk_hash <= UNIX_HASH_MOD > > > > > > > > When bind() is called, the socket could be moved to another bucket. > > > > > > > > * pathname socket : 0 <= sk_hash <= UNIX_HASH_MOD > > > > * abstract socket : UNIX_HASH_MOD + 1 <= sk_hash <= UNIX_HASH_MOD * 2 + 1 > > > > > > > > Then, we call unix_table_double_lock() which locks a single bucket > > > > or two. > > > > > > > > Let's define the order as unix_table_lock_cmp_fn() instead of using > > > > spin_lock_nested(). > > > > > > > > The locking is always done in ascending order of sk->sk_hash, which > > > > is the index of buckets/locks array allocated by kvmalloc_array(). > > > > > > > > sk_hash_A < sk_hash_B > > > > <=> &locks[sk_hash_A].dep_map < &locks[sk_hash_B].dep_map > > > > > > > > So, the relation of two sk->sk_hash can be derived from the addresses > > > > of dep_map in the array of locks. > > > > > > > > Signed-off-by: Kuniyuki Iwashima <kuniyu@amazon.com> > > > > --- > > > > net/unix/af_unix.c | 10 +++++++++- > > > > 1 file changed, 9 insertions(+), 1 deletion(-) > > > > > > > > diff --git a/net/unix/af_unix.c b/net/unix/af_unix.c > > > > index 3821f8945b1e..b0a9891c0384 100644 > > > > --- a/net/unix/af_unix.c > > > > +++ b/net/unix/af_unix.c > > > > @@ -126,6 +126,13 @@ static spinlock_t bsd_socket_locks[UNIX_HASH_SIZE / 2]; > > > > * hash table is protected with spinlock. > > > > * each socket state is protected by separate spinlock. > > > > */ > > > > +#ifdef CONFIG_PROVE_LOCKING > > > > +static int unix_table_lock_cmp_fn(const struct lockdep_map *a, > > > > + const struct lockdep_map *b) > > > > +{ > > > > + return a < b ? -1 : 0; > > > > +} > > > > +#endif > > > > > > This should be a proper comparison function: -1 for less than, 0 for > > > equal, 1 for greater than. > > > > > > I've got a cmp_int() macro in bcachefs that does this nicely. > > > > So, should this be : > > > > a < b ? -1 : 1 > > > > ? > > > > or > > > > ((a > b) - (b < a)) > > > > ? > > > > I think most double_lock functions eliminate the a == b case beforehand, > > and even ->cmp_fn() is not called for such a recursive case because > > debug_spin_lock_before() triggers BUG() then. > > > > Initially I added the same macro, but checkpatch complains about it, > > and I thought the current form is easier to understand because it's > > the actual comparison used in the double lock part. > > > > Also, there is a case, where we just want to return an error without > > classifying it into 0 or 1. > > > > I rather think we should define something like this on the lockdep side. > > > > enum lockdep_cmp_result { > > LOCKDEP_CMP_SAFE = -1, > > LOCKDEP_CMP_DEADLOCK, > > }; > > > > What do you think ? > > No, we're defining an ordering, there's no need for an enum - this > should work exactly the same as a comparison function that you pass to > sort(). > > Comparison functions are no place to get fancy, they should be as > standard as possible: you can get _crazy_ bugs resulting from buggy > comparison functions that don't actually define a total ordering. What should it return if we cannot define the total ordering like when we only define the allowed list of ordering ? See patch 8, the rule there is if the nested order is listening socket -> child socket, then ok, and otherwise, not. So we don't know the clear ordering, equal or greater, but we know it's actually illegal. https://lore.kernel.org/netdev/20240610223501.73191-9-kuniyu@amazon.com/
On Mon, Jun 10, 2024 at 04:58:36PM -0700, Kuniyuki Iwashima wrote: > > No, we're defining an ordering, there's no need for an enum - this > > should work exactly the same as a comparison function that you pass to > > sort(). > > > > Comparison functions are no place to get fancy, they should be as > > standard as possible: you can get _crazy_ bugs resulting from buggy > > comparison functions that don't actually define a total ordering. > > What should it return if we cannot define the total ordering like > when we only define the allowed list of ordering ? > > See patch 8, the rule there is > > if the nested order is listening socket -> child socket, then ok, > and otherwise, not. > > So we don't know the clear ordering, equal or greater, but we know > it's actually illegal. > > https://lore.kernel.org/netdev/20240610223501.73191-9-kuniyu@amazon.com/ Ok yeah, that's a tricky one, and it does come up elsewhere. I think we can allow comparison functions to return "undefined", and define 0 == undefined for lockdep. The important thing I want to maintain is that comparison functions be symmetric.
From: Kent Overstreet <kent.overstreet@linux.dev> Date: Mon, 10 Jun 2024 20:30:58 -0400 > On Mon, Jun 10, 2024 at 04:58:36PM -0700, Kuniyuki Iwashima wrote: > > > No, we're defining an ordering, there's no need for an enum - this > > > should work exactly the same as a comparison function that you pass to > > > sort(). > > > > > > Comparison functions are no place to get fancy, they should be as > > > standard as possible: you can get _crazy_ bugs resulting from buggy > > > comparison functions that don't actually define a total ordering. > > > > What should it return if we cannot define the total ordering like > > when we only define the allowed list of ordering ? > > > > See patch 8, the rule there is > > > > if the nested order is listening socket -> child socket, then ok, > > and otherwise, not. > > > > So we don't know the clear ordering, equal or greater, but we know > > it's actually illegal. > > > > https://lore.kernel.org/netdev/20240610223501.73191-9-kuniyu@amazon.com/ > > Ok yeah, that's a tricky one, and it does come up elsewhere. Actually patch 3 & 4 is another example where we cannot define the ordering, and only one necessary rule out of three is defined. > I think we > can allow comparison functions to return "undefined", and define 0 == > undefined for lockdep. I agree. > > The important thing I want to maintain is that comparison functions be > symmetric. Ok, I'll use ((a > b) - (b < a)) for patch 1 & 2 where the odering can be well defined as numeric ascending order. Thanks!
diff --git a/net/unix/af_unix.c b/net/unix/af_unix.c index 3821f8945b1e..b0a9891c0384 100644 --- a/net/unix/af_unix.c +++ b/net/unix/af_unix.c @@ -126,6 +126,13 @@ static spinlock_t bsd_socket_locks[UNIX_HASH_SIZE / 2]; * hash table is protected with spinlock. * each socket state is protected by separate spinlock. */ +#ifdef CONFIG_PROVE_LOCKING +static int unix_table_lock_cmp_fn(const struct lockdep_map *a, + const struct lockdep_map *b) +{ + return a < b ? -1 : 0; +} +#endif static unsigned int unix_unbound_hash(struct sock *sk) { @@ -168,7 +175,7 @@ static void unix_table_double_lock(struct net *net, swap(hash1, hash2); spin_lock(&net->unx.table.locks[hash1]); - spin_lock_nested(&net->unx.table.locks[hash2], SINGLE_DEPTH_NESTING); + spin_lock(&net->unx.table.locks[hash2]); } static void unix_table_double_unlock(struct net *net, @@ -3578,6 +3585,7 @@ static int __net_init unix_net_init(struct net *net) for (i = 0; i < UNIX_HASH_SIZE; i++) { spin_lock_init(&net->unx.table.locks[i]); + lock_set_cmp_fn(&net->unx.table.locks[i], unix_table_lock_cmp_fn, NULL); INIT_HLIST_HEAD(&net->unx.table.buckets[i]); }
When created, AF_UNIX socket is put into net->unx.table.buckets[], and the hash is stored in sk->sk_hash. * unbound socket : 0 <= sk_hash <= UNIX_HASH_MOD When bind() is called, the socket could be moved to another bucket. * pathname socket : 0 <= sk_hash <= UNIX_HASH_MOD * abstract socket : UNIX_HASH_MOD + 1 <= sk_hash <= UNIX_HASH_MOD * 2 + 1 Then, we call unix_table_double_lock() which locks a single bucket or two. Let's define the order as unix_table_lock_cmp_fn() instead of using spin_lock_nested(). The locking is always done in ascending order of sk->sk_hash, which is the index of buckets/locks array allocated by kvmalloc_array(). sk_hash_A < sk_hash_B <=> &locks[sk_hash_A].dep_map < &locks[sk_hash_B].dep_map So, the relation of two sk->sk_hash can be derived from the addresses of dep_map in the array of locks. Signed-off-by: Kuniyuki Iwashima <kuniyu@amazon.com> --- net/unix/af_unix.c | 10 +++++++++- 1 file changed, 9 insertions(+), 1 deletion(-)