Message ID | 20220616234714.4291-5-kuniyu@amazon.com (mailing list archive) |
---|---|
State | Superseded |
Delegated to: | Netdev Maintainers |
Series | af_unix: Introduce per-netns socket hash table. |
Greeting,

FYI, we noticed the following commit (built with gcc-11):

commit: b4813d591454d771b5aaf33a6252b214648c430f ("[PATCH v1 net-next 4/6] af_unix: Acquire/Release per-netns hash table's locks.")
url: https://github.com/intel-lab-lkp/linux/commits/Kuniyuki-Iwashima/af_unix-Introduce-per-netns-socket-hash-table/20220617-075046
base: https://git.kernel.org/cgit/linux/kernel/git/davem/net-next.git 5dcb50c009c9f8ec1cfca6a81a05c0060a5bbf68
patch link: https://lore.kernel.org/netdev/20220616234714.4291-5-kuniyu@amazon.com

in testcase: boot

on test machine: qemu-system-x86_64 -enable-kvm -cpu SandyBridge -smp 2 -m 16G

caused below changes (please refer to attached dmesg/kmsg for entire log/backtrace):


If you fix the issue, kindly add following tag
Reported-by: kernel test robot <oliver.sang@intel.com>


[ 113.085258][ T1] WARNING: possible recursive locking detected
[ 113.085261][ T1] 5.19.0-rc1-00408-gb4813d591454 #1 Not tainted
[ 113.085264][ T1] --------------------------------------------
[ 113.085265][ T1] systemd/1 is trying to acquire lock:
[ 113.085270][ T1] ffff888167ee6c18 (&net->unx.hash[i].lock){+.+.}-{2:2}, at: unix_bind_bsd (net/unix/af_unix.c:1200)
[ 113.085313][ T1]
[ 113.085313][ T1] but task is already holding lock:
[ 113.085314][ T1] ffff888167ee0918 (&net->unx.hash[i].lock){+.+.}-{2:2}, at: unix_bind_bsd (net/unix/af_unix.c:175 net/unix/af_unix.c:1199)
[ 113.085321][ T1]
[ 113.085321][ T1] other info that might help us debug this:
[ 113.085323][ T1] Possible unsafe locking scenario:
[ 113.085323][ T1]
[ 113.085324][ T1] CPU0
[ 113.085325][ T1] ----
[ 113.085325][ T1] lock(&net->unx.hash[i].lock);
[ 113.085328][ T1] lock(&net->unx.hash[i].lock);
[ 113.085330][ T1]
[ 113.085330][ T1] *** DEADLOCK ***
[ 113.085330][ T1]
[ 113.085331][ T1] May be due to missing lock nesting notation
[ 113.085331][ T1]
[ 113.085333][ T1] 6 locks held by systemd/1:
[ 113.085335][ T1] #0: ffff88815da40448 (sb_writers#6){.+.+}-{0:0}, at: filename_create (fs/namei.c:3744)
[ 113.085351][ T1] #1: ffff88815bffec40 (&type->i_mutex_dir_key#4/1){+.+.}-{3:3}, at: filename_create (fs/namei.c:3747)
[ OK ] Started Forward Password Requests to Wall Directory Watch.
[ OK ] Started Dispatch Password Requests to Console Directory Watch.
[ OK ] Reached target Paths.
[ OK ] Listening on udev Control Socket.
[ 113.085359][ T1] #2: ffff88815d974e18 (&u->bindlock){+.+.}-{3:3}, at: unix_bind_bsd (net/unix/af_unix.c:1192)
[ 113.085370][ T1] #3: ffffffffb0eec038 (&unix_table_locks[i]){+.+.}-{2:2}, at: unix_bind_bsd (net/unix/af_unix.c:172 net/unix/af_unix.c:1199)
[ 113.085377][ T1] #4: ffffffffb0ef1838 (&unix_table_locks[i]/1){+.+.}-{2:2}, at: unix_bind_bsd (net/unix/af_unix.c:174 net/unix/af_unix.c:1199)
[ 113.085384][ T1] #5: ffff888167ee0918 (&net->unx.hash[i].lock){+.+.}-{2:2}, at: unix_bind_bsd (net/unix/af_unix.c:175 net/unix/af_unix.c:1199)
[ 113.085391][ T1]
[ 113.085391][ T1] stack backtrace:
[ 113.085395][ T1] CPU: 1 PID: 1 Comm: systemd Not tainted 5.19.0-rc1-00408-gb4813d591454 #1
[ 113.085401][ T1] Hardware name: QEMU Standard PC (i440FX + PIIX, 1996), BIOS 1.16.0-debian-1.16.0-4 04/01/2014
[ 113.085408][ T1] Call Trace:
[ 113.085419][ T1] <TASK>
[ 113.085421][ T1] dump_stack_lvl (lib/dump_stack.c:107 (discriminator 4))
[ 113.085453][ T1] validate_chain.cold (kernel/locking/lockdep.c:2988 kernel/locking/lockdep.c:3031 kernel/locking/lockdep.c:3816)
[ 113.085473][ T1] ? check_prev_add (kernel/locking/lockdep.c:3785)
[ 113.085483][ T1] ? rcu_read_unlock (include/linux/rcupdate.h:724 (discriminator 5))
[ 113.085489][ T1] __lock_acquire (kernel/locking/lockdep.c:5053)
[ OK ] Listening on Journal Socket (/dev/log).
[ OK ] Listening on Journal Socket.
[ OK ] Reached target Encrypted Volumes.
[ OK ] Listening on /dev/initctl Compatibility Named Pipe.
[ 113.085497][ T1] ? rcu_read_unlock (include/linux/rcupdate.h:724 (discriminator 5))
[ 113.085501][ T1] lock_acquire (kernel/locking/lockdep.c:466 kernel/locking/lockdep.c:5667 kernel/locking/lockdep.c:5630)
[ 113.085504][ T1] ? unix_bind_bsd (net/unix/af_unix.c:1200)
[ 113.085509][ T1] ? rcu_read_unlock (include/linux/rcupdate.h:724 (discriminator 5))
[ 113.085513][ T1] ? do_raw_spin_lock (arch/x86/include/asm/atomic.h:202 include/linux/atomic/atomic-instrumented.h:543 include/asm-generic/qspinlock.h:111 kernel/locking/spinlock_debug.c:115)
[ 113.085519][ T1] ? rwlock_bug+0xc0/0xc0
[ OK ] Created slice User and Session Slice.
[ 113.085524][ T1] _raw_spin_lock (include/linux/spinlock_api_smp.h:134 kernel/locking/spinlock.c:154)
[ 113.085539][ T1] ? unix_bind_bsd (net/unix/af_unix.c:1200)
[ 113.085543][ T1] unix_bind_bsd (net/unix/af_unix.c:1200)
[ 113.085548][ T1] ? __might_fault (mm/memory.c:5566 mm/memory.c:5559)
[ 113.085557][ T1] ? unix_stream_sendmsg (net/unix/af_unix.c:1153)
[ OK ] Created slice System Slice.
[ 113.085560][ T1] ? lock_release (kernel/locking/lockdep.c:466 kernel/locking/lockdep.c:5687)
[ 113.085563][ T1] ? _copy_from_user (arch/x86/include/asm/uaccess_64.h:46 arch/x86/include/asm/uaccess_64.h:52 lib/usercopy.c:16)
[ 113.085580][ T1] __sys_bind (net/socket.c:1776)
[ 113.085589][ T1] ? __ia32_sys_socketpair (net/socket.c:1763)
[ 113.085592][ T1] ? __lock_release (kernel/locking/lockdep.c:5341)
[ 113.085597][ T1] ? lock_is_held_type (kernel/locking/lockdep.c:5406 kernel/locking/lockdep.c:5708)
[ 113.085606][ T1] ? __might_fault (mm/memory.c:5566 mm/memory.c:5559)
[ 113.085610][ T1] ? lock_release (kernel/locking/lockdep.c:466 kernel/locking/lockdep.c:5687)
[ 113.085614][ T1] __do_compat_sys_socketcall (net/compat.c:453)
[ 113.085627][ T1] ? __x64_sys_rmdir (fs/namei.c:4221)
[ 113.085631][ T1] ? __ia32_compat_sys_recvmmsg_time32 (net/compat.c:425)
[ 113.085637][ T1] ? syscall_exit_to_user_mode (kernel/entry/common.c:129 kernel/entry/common.c:296)
[ 113.085642][ T1] ? lockdep_hardirqs_on_prepare (kernel/locking/lockdep.c:4526)
[ 113.085646][ T1] __do_fast_syscall_32 (arch/x86/entry/common.c:112 arch/x86/entry/common.c:178)
[ 113.085652][ T1] ? __do_fast_syscall_32 (arch/x86/entry/common.c:183)
Mounting Debug File System...
[ 113.085656][ T1] do_fast_syscall_32 (arch/x86/entry/common.c:203)
[ 113.085660][ T1] entry_SYSENTER_compat_after_hwframe (arch/x86/entry/entry_64_compat.S:117)
[ 113.085669][ T1] RIP: 0023:0xf7f70549
[ 113.085673][ T1] Code: 03 74 c0 01 10 05 03 74 b8 01 10 06 03 74 b4 01 10 07 03 74 b0 01 10 08 03 74 d8 01 00 00 00 00 00 51 52 55 89 e5 0f 34 cd 80 <5d> 5a 59 c3 90 90 90 90 8d b4 26 00 00 00 00 8d b4 26 00 00 00 00
All code
========
   0:   03 74 c0 01             add    0x1(%rax,%rax,8),%esi
   4:   10 05 03 74 b8 01       adc    %al,0x1b87403(%rip)        # 0x1b8740d
   a:   10 06                   adc    %al,(%rsi)
   c:   03 74 b4 01             add    0x1(%rsp,%rsi,4),%esi
  10:   10 07                   adc    %al,(%rdi)
  12:   03 74 b0 01             add    0x1(%rax,%rsi,4),%esi
  16:   10 08                   adc    %cl,(%rax)
  18:   03 74 d8 01             add    0x1(%rax,%rbx,8),%esi
  1c:   00 00                   add    %al,(%rax)
  1e:   00 00                   add    %al,(%rax)
  20:   00 51 52                add    %dl,0x52(%rcx)
  23:   55                      push   %rbp
  24:   89 e5                   mov    %esp,%ebp
  26:   0f 34                   sysenter
  28:   cd 80                   int    $0x80
  2a:*  5d                      pop    %rbp             <-- trapping instruction
  2b:   5a                      pop    %rdx
  2c:   59                      pop    %rcx
  2d:   c3                      retq
  2e:   90                      nop
  2f:   90                      nop
  30:   90                      nop
  31:   90                      nop
  32:   8d b4 26 00 00 00 00    lea    0x0(%rsi,%riz,1),%esi
  39:   8d b4 26 00 00 00 00    lea    0x0(%rsi,%riz,1),%esi

Code starting with the faulting instruction
===========================================
   0:   5d                      pop    %rbp
   1:   5a                      pop    %rdx
   2:   59                      pop    %rcx
   3:   c3                      retq
   4:   90                      nop
   5:   90                      nop
   6:   90                      nop
   7:   90                      nop
   8:   8d b4 26 00 00 00 00    lea    0x0(%rsi,%riz,1),%esi
   f:   8d b4 26 00 00 00 00    lea    0x0(%rsi,%riz,1),%esi


To reproduce:

        # build kernel
        cd linux
        cp config-5.19.0-rc1-00408-gb4813d591454 .config
        make HOSTCC=gcc-11 CC=gcc-11 ARCH=x86_64 olddefconfig prepare modules_prepare bzImage modules
        make HOSTCC=gcc-11 CC=gcc-11 ARCH=x86_64 INSTALL_MOD_PATH=<mod-install-dir> modules_install
        cd <mod-install-dir>
        find lib/ | cpio -o -H newc --quiet | gzip > modules.cgz


        git clone https://github.com/intel/lkp-tests.git
        cd lkp-tests
        bin/lkp qemu -k <bzImage> -m modules.cgz job-script # job-script is attached in this email

        # if come across any failure that blocks the test,
        # please remove ~/.lkp and /lkp dir to run from a clean state.
From: kernel test robot <oliver.sang@intel.com>
Date: Mon, 20 Jun 2022 14:10:53 +0800
> Greeting,
>
> FYI, we noticed the following commit (built with gcc-11):
>
> commit: b4813d591454d771b5aaf33a6252b214648c430f ("[PATCH v1 net-next 4/6] af_unix: Acquire/Release per-netns hash table's locks.")
> url: https://github.com/intel-lab-lkp/linux/commits/Kuniyuki-Iwashima/af_unix-Introduce-per-netns-socket-hash-table/20220617-075046
> base: https://git.kernel.org/cgit/linux/kernel/git/davem/net-next.git 5dcb50c009c9f8ec1cfca6a81a05c0060a5bbf68
> patch link: https://lore.kernel.org/netdev/20220616234714.4291-5-kuniyu@amazon.com
>
> in testcase: boot
>
> on test machine: qemu-system-x86_64 -enable-kvm -cpu SandyBridge -smp 2 -m 16G
>
> caused below changes (please refer to attached dmesg/kmsg for entire log/backtrace):
>
>
>
> If you fix the issue, kindly add following tag
> Reported-by: kernel test robot <oliver.sang@intel.com>
>
>
> [ 113.085258][ T1] WARNING: possible recursive locking detected
> [ 113.085261][ T1] 5.19.0-rc1-00408-gb4813d591454 #1 Not tainted
> [ 113.085264][ T1] --------------------------------------------
> [ 113.085265][ T1] systemd/1 is trying to acquire lock:
> [ 113.085270][ T1] ffff888167ee6c18 (&net->unx.hash[i].lock){+.+.}-{2:2}, at: unix_bind_bsd (net/unix/af_unix.c:1200)
> [ 113.085313][ T1]
> [ 113.085313][ T1] but task is already holding lock:
> [ 113.085314][ T1] ffff888167ee0918 (&net->unx.hash[i].lock){+.+.}-{2:2}, at: unix_bind_bsd (net/unix/af_unix.c:175 net/unix/af_unix.c:1199)
> [ 113.085321][ T1]
> [ 113.085321][ T1] other info that might help us debug this:
> [ 113.085323][ T1] Possible unsafe locking scenario:
> [ 113.085323][ T1]
> [ 113.085324][ T1] CPU0
> [ 113.085325][ T1] ----
> [ 113.085325][ T1] lock(&net->unx.hash[i].lock);
> [ 113.085328][ T1] lock(&net->unx.hash[i].lock);
> [ 113.085330][ T1]
> [ 113.085330][ T1] *** DEADLOCK ***
> [ 113.085330][ T1]
> [ 113.085331][ T1] May be due to missing lock nesting notation

Sorry, I did a wrong copy-and-paste.
I'll use spin_lock_nested() in unix_table_double_lock().

> [...]
> --
> 0-DAY CI Kernel Test Service
> https://01.org/lkp
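For readers following the exchange above, here is a rough userspace sketch of the two-bucket locking pattern that the lockdep report is about. It is not the kernel code and not the actual v2 patch: pthread mutexes stand in for spinlocks, and TABLE_SIZE, netns_locks, netns_double_lock(), netns_double_unlock() and is_locked() are hypothetical names used only for illustration. It also assumes that the two bucket indices are sorted before locking, so that every caller takes the pair in the same order. pthreads has no equivalent of spin_lock_nested(), so the place where that annotation would go is only marked in a comment.

```c
#include <assert.h>
#include <errno.h>
#include <pthread.h>

#define TABLE_SIZE 8	/* hypothetical; stands in for UNIX_HASH_SIZE */

/* Hypothetical stand-in for one netns' per-bucket locks. */
static pthread_mutex_t netns_locks[TABLE_SIZE];

static void netns_locks_init(void)
{
	for (int i = 0; i < TABLE_SIZE; i++)
		pthread_mutex_init(&netns_locks[i], NULL);
}

/* Returns 1 if bucket h is currently locked, 0 otherwise. */
static int is_locked(unsigned int h)
{
	if (pthread_mutex_trylock(&netns_locks[h]) == EBUSY)
		return 1;

	pthread_mutex_unlock(&netns_locks[h]);
	return 0;
}

/* Take two bucket locks of the same lock class in a fixed order. */
static void netns_double_lock(unsigned int h1, unsigned int h2)
{
	/* Assumption: the two buckets are never the same. */
	assert(h1 != h2);

	/* Assumption: lower index first, so all callers agree on the order. */
	if (h1 > h2) {
		unsigned int tmp = h1;

		h1 = h2;
		h2 = tmp;
	}

	pthread_mutex_lock(&netns_locks[h1]);
	/* In the kernel, this second acquisition of the same lock class is
	 * where spin_lock_nested(..., SINGLE_DEPTH_NESTING) would be used to
	 * tell lockdep that the nesting is intentional.
	 */
	pthread_mutex_lock(&netns_locks[h2]);
}

static void netns_double_unlock(unsigned int h1, unsigned int h2)
{
	pthread_mutex_unlock(&netns_locks[h1]);
	pthread_mutex_unlock(&netns_locks[h2]);
}
```

A caller would run netns_locks_init() once, then bracket the two-bucket update with netns_double_lock(old, new) and netns_double_unlock(old, new). The fixed ordering is what makes the nesting safe; the lockdep annotation only documents it.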
diff --git a/net/unix/af_unix.c b/net/unix/af_unix.c
index 3c07702e2349..ae21e3fb86da 100644
--- a/net/unix/af_unix.c
+++ b/net/unix/af_unix.c
@@ -158,7 +158,8 @@ static unsigned int unix_abstract_hash(struct sockaddr_un *sunaddr,
 	return hash & UNIX_HASH_MOD;
 }
 
-static void unix_table_double_lock(unsigned int hash1, unsigned int hash2)
+static void unix_table_double_lock(struct net *net,
+				   unsigned int hash1, unsigned int hash2)
 {
 	/* hash1 and hash2 is never the same because
 	 * one is between 0 and UNIX_HASH_MOD, and
@@ -169,10 +170,17 @@ static void unix_table_double_lock(unsigned int hash1, unsigned int hash2)
 
 	spin_lock(&unix_table_locks[hash1]);
 	spin_lock_nested(&unix_table_locks[hash2], SINGLE_DEPTH_NESTING);
+
+	spin_lock(&net->unx.hash[hash1].lock);
+	spin_lock(&net->unx.hash[hash2].lock);
 }
 
-static void unix_table_double_unlock(unsigned int hash1, unsigned int hash2)
+static void unix_table_double_unlock(struct net *net,
+				     unsigned int hash1, unsigned int hash2)
 {
+	spin_unlock(&net->unx.hash[hash1].lock);
+	spin_unlock(&net->unx.hash[hash2].lock);
+
 	spin_unlock(&unix_table_locks[hash1]);
 	spin_unlock(&unix_table_locks[hash2]);
 }
@@ -316,17 +324,21 @@ static void __unix_set_addr_hash(struct sock *sk, struct unix_address *addr,
 	__unix_insert_socket(sk);
 }
 
-static void unix_remove_socket(struct sock *sk)
+static void unix_remove_socket(struct net *net, struct sock *sk)
 {
 	spin_lock(&unix_table_locks[sk->sk_hash]);
+	spin_lock(&net->unx.hash[sk->sk_hash].lock);
 	__unix_remove_socket(sk);
+	spin_unlock(&net->unx.hash[sk->sk_hash].lock);
 	spin_unlock(&unix_table_locks[sk->sk_hash]);
 }
 
-static void unix_insert_unbound_socket(struct sock *sk)
+static void unix_insert_unbound_socket(struct net *net, struct sock *sk)
 {
 	spin_lock(&unix_table_locks[sk->sk_hash]);
+	spin_lock(&net->unx.hash[sk->sk_hash].lock);
 	__unix_insert_socket(sk);
+	spin_unlock(&net->unx.hash[sk->sk_hash].lock);
 	spin_unlock(&unix_table_locks[sk->sk_hash]);
 }
 
@@ -356,28 +368,33 @@ static inline struct sock *unix_find_socket_byname(struct net *net,
 	struct sock *s;
 
 	spin_lock(&unix_table_locks[hash]);
+	spin_lock(&net->unx.hash[hash].lock);
 	s = __unix_find_socket_byname(net, sunname, len, hash);
 	if (s)
 		sock_hold(s);
+	spin_unlock(&net->unx.hash[hash].lock);
 	spin_unlock(&unix_table_locks[hash]);
 	return s;
 }
 
-static struct sock *unix_find_socket_byinode(struct inode *i)
+static struct sock *unix_find_socket_byinode(struct net *net, struct inode *i)
 {
 	unsigned int hash = unix_bsd_hash(i);
 	struct sock *s;
 
 	spin_lock(&unix_table_locks[hash]);
+	spin_lock(&net->unx.hash[hash].lock);
 	sk_for_each(s, &unix_socket_table[hash]) {
 		struct dentry *dentry = unix_sk(s)->path.dentry;
 
 		if (dentry && d_backing_inode(dentry) == i) {
 			sock_hold(s);
+			spin_unlock(&net->unx.hash[hash].lock);
 			spin_unlock(&unix_table_locks[hash]);
 			return s;
 		}
 	}
+	spin_unlock(&net->unx.hash[hash].lock);
 	spin_unlock(&unix_table_locks[hash]);
 	return NULL;
 }
@@ -576,12 +593,12 @@ static void unix_sock_destructor(struct sock *sk)
 static void unix_release_sock(struct sock *sk, int embrion)
 {
 	struct unix_sock *u = unix_sk(sk);
-	struct path path;
 	struct sock *skpair;
 	struct sk_buff *skb;
+	struct path path;
 	int state;
 
-	unix_remove_socket(sk);
+	unix_remove_socket(sock_net(sk), sk);
 
 	/* Clear state */
 	unix_state_lock(sk);
@@ -930,7 +947,7 @@ static struct sock *unix_create1(struct net *net, struct socket *sock, int kern,
 	init_waitqueue_head(&u->peer_wait);
 	init_waitqueue_func_entry(&u->peer_wake, unix_dgram_peer_wake_relay);
 	memset(&u->scm_stat, 0, sizeof(struct scm_stat));
-	unix_insert_unbound_socket(sk);
+	unix_insert_unbound_socket(net, sk);
 
 	sock_prot_inuse_add(net, sk->sk_prot, 1);
 
@@ -1015,7 +1032,7 @@ static struct sock *unix_find_bsd(struct net *net, struct sockaddr_un *sunaddr,
 	if (!S_ISSOCK(inode->i_mode))
 		goto path_put;
 
-	sk = unix_find_socket_byinode(inode);
+	sk = unix_find_socket_byinode(net, inode);
 	if (!sk)
 		goto path_put;
 
@@ -1074,6 +1091,7 @@ static int unix_autobind(struct sock *sk)
 {
 	unsigned int new_hash, old_hash = sk->sk_hash;
 	struct unix_sock *u = unix_sk(sk);
+	struct net *net = sock_net(sk);
 	struct unix_address *addr;
 	u32 lastnum, ordernum;
 	int err;
@@ -1102,11 +1120,10 @@ static int unix_autobind(struct sock *sk)
 	sprintf(addr->name->sun_path + 1, "%05x", ordernum);
 
 	new_hash = unix_abstract_hash(addr->name, addr->len, sk->sk_type);
-	unix_table_double_lock(old_hash, new_hash);
+	unix_table_double_lock(net, old_hash, new_hash);
 
-	if (__unix_find_socket_byname(sock_net(sk), addr->name, addr->len,
-				      new_hash)) {
-		unix_table_double_unlock(old_hash, new_hash);
+	if (__unix_find_socket_byname(net, addr->name, addr->len, new_hash)) {
+		unix_table_double_unlock(net, old_hash, new_hash);
 
 		/* __unix_find_socket_byname() may take long time if many names
 		 * are already in use.
@@ -1124,7 +1141,7 @@ static int unix_autobind(struct sock *sk)
 	}
 
 	__unix_set_addr_hash(sk, addr, new_hash);
-	unix_table_double_unlock(old_hash, new_hash);
+	unix_table_double_unlock(net, old_hash, new_hash);
 	err = 0;
 
 out:	mutex_unlock(&u->bindlock);
@@ -1138,6 +1155,7 @@ static int unix_bind_bsd(struct sock *sk, struct sockaddr_un *sunaddr,
 	       (SOCK_INODE(sk->sk_socket)->i_mode & ~current_umask());
 	unsigned int new_hash, old_hash = sk->sk_hash;
 	struct unix_sock *u = unix_sk(sk);
+	struct net *net = sock_net(sk);
 	struct user_namespace *ns; // barf...
 	struct unix_address *addr;
 	struct dentry *dentry;
@@ -1178,11 +1196,11 @@ static int unix_bind_bsd(struct sock *sk, struct sockaddr_un *sunaddr,
 		goto out_unlock;
 
 	new_hash = unix_bsd_hash(d_backing_inode(dentry));
-	unix_table_double_lock(old_hash, new_hash);
+	unix_table_double_lock(net, old_hash, new_hash);
 	u->path.mnt = mntget(parent.mnt);
 	u->path.dentry = dget(dentry);
 	__unix_set_addr_hash(sk, addr, new_hash);
-	unix_table_double_unlock(old_hash, new_hash);
+	unix_table_double_unlock(net, old_hash, new_hash);
 	mutex_unlock(&u->bindlock);
 	done_path_create(&parent, dentry);
 	return 0;
@@ -1205,6 +1223,7 @@ static int unix_bind_abstract(struct sock *sk, struct sockaddr_un *sunaddr,
 {
 	unsigned int new_hash, old_hash = sk->sk_hash;
 	struct unix_sock *u = unix_sk(sk);
+	struct net *net = sock_net(sk);
 	struct unix_address *addr;
 	int err;
 
@@ -1222,19 +1241,18 @@ static int unix_bind_abstract(struct sock *sk, struct sockaddr_un *sunaddr,
 	}
 
 	new_hash = unix_abstract_hash(addr->name, addr->len, sk->sk_type);
-	unix_table_double_lock(old_hash, new_hash);
+	unix_table_double_lock(net, old_hash, new_hash);
 
-	if (__unix_find_socket_byname(sock_net(sk), addr->name, addr->len,
-				      new_hash))
+	if (__unix_find_socket_byname(net, addr->name, addr->len, new_hash))
 		goto out_spin;
 
 	__unix_set_addr_hash(sk, addr, new_hash);
-	unix_table_double_unlock(old_hash, new_hash);
+	unix_table_double_unlock(net, old_hash, new_hash);
 	mutex_unlock(&u->bindlock);
 	return 0;
 
 out_spin:
-	unix_table_double_unlock(old_hash, new_hash);
+	unix_table_double_unlock(net, old_hash, new_hash);
 	err = -EADDRINUSE;
 out_mutex:
 	mutex_unlock(&u->bindlock);
@@ -3237,15 +3255,18 @@ static struct sock *unix_from_bucket(struct seq_file *seq, loff_t *pos)
 static struct sock *unix_get_first(struct seq_file *seq, loff_t *pos)
 {
 	unsigned long bucket = get_bucket(*pos);
+	struct net *net = seq_file_net(seq);
 	struct sock *sk;
 
 	while (bucket < UNIX_HASH_SIZE) {
 		spin_lock(&unix_table_locks[bucket]);
+		spin_lock(&net->unx.hash[bucket].lock);
 
 		sk = unix_from_bucket(seq, pos);
 		if (sk)
 			return sk;
 
+		spin_unlock(&net->unx.hash[bucket].lock);
 		spin_unlock(&unix_table_locks[bucket]);
 
 		*pos = set_bucket_offset(++bucket, 1);
@@ -3258,11 +3279,13 @@ static struct sock *unix_get_next(struct seq_file *seq, struct sock *sk,
 				  loff_t *pos)
 {
 	unsigned long bucket = get_bucket(*pos);
+	struct net *net = seq_file_net(seq);
 
 	for (sk = sk_next(sk); sk; sk = sk_next(sk))
-		if (sock_net(sk) == seq_file_net(seq))
+		if (sock_net(sk) == net)
 			return sk;
 
+	spin_unlock(&net->unx.hash[bucket].lock);
 	spin_unlock(&unix_table_locks[bucket]);
 
 	*pos = set_bucket_offset(++bucket, 1);
@@ -3292,8 +3315,10 @@ static void unix_seq_stop(struct seq_file *seq, void *v)
 {
 	struct sock *sk = v;
 
-	if (sk)
+	if (sk) {
+		spin_unlock(&seq_file_net(seq)->unx.hash[sk->sk_hash].lock);
 		spin_unlock(&unix_table_locks[sk->sk_hash]);
+	}
 }
 
 static int unix_seq_show(struct seq_file *seq, void *v)
@@ -3381,6 +3406,7 @@ static int bpf_iter_unix_hold_batch(struct seq_file *seq, struct sock *start_sk)
 
 {
 	struct bpf_unix_iter_state *iter = seq->private;
+	struct net *net = seq_file_net(seq);
 	unsigned int expected = 1;
 	struct sock *sk;
 
@@ -3388,7 +3414,7 @@ static int bpf_iter_unix_hold_batch(struct seq_file *seq, struct sock *start_sk)
 	iter->batch[iter->end_sk++] = start_sk;
 
 	for (sk = sk_next(start_sk); sk; sk = sk_next(sk)) {
-		if (sock_net(sk) != seq_file_net(seq))
+		if (sock_net(sk) != net)
 			continue;
 
 		if (iter->end_sk < iter->max_sk) {
@@ -3399,6 +3425,7 @@ static int bpf_iter_unix_hold_batch(struct seq_file *seq, struct sock *start_sk)
 		expected++;
 	}
 
+	spin_unlock(&net->unx.hash[start_sk->sk_hash].lock);
 	spin_unlock(&unix_table_locks[start_sk->sk_hash]);
 
 	return expected;
diff --git a/net/unix/diag.c b/net/unix/diag.c
index c5d1cca72aa5..41b67b82f51f 100644
--- a/net/unix/diag.c
+++ b/net/unix/diag.c
@@ -195,9 +195,9 @@ static int sk_diag_dump(struct sock *sk, struct sk_buff *skb, struct unix_diag_r
 
 static int unix_diag_dump(struct sk_buff *skb, struct netlink_callback *cb)
 {
-	struct unix_diag_req *req;
-	int num, s_num, slot, s_slot;
 	struct net *net = sock_net(skb->sk);
+	int num, s_num, slot, s_slot;
+	struct unix_diag_req *req;
 
 	req = nlmsg_data(cb->nlh);
 
@@ -209,6 +209,7 @@ static int unix_diag_dump(struct sk_buff *skb, struct netlink_callback *cb)
 
 		num = 0;
 		spin_lock(&unix_table_locks[slot]);
+		spin_lock(&net->unx.hash[slot].lock);
 		sk_for_each(sk, &unix_socket_table[slot]) {
 			if (!net_eq(sock_net(sk), net))
 				continue;
@@ -220,12 +221,14 @@ static int unix_diag_dump(struct sk_buff *skb, struct netlink_callback *cb)
 					 NETLINK_CB(cb->skb).portid,
 					 cb->nlh->nlmsg_seq,
 					 NLM_F_MULTI) < 0) {
+				spin_unlock(&net->unx.hash[slot].lock);
 				spin_unlock(&unix_table_locks[slot]);
 				goto done;
 			}
 next:
 			num++;
 		}
+		spin_unlock(&net->unx.hash[slot].lock);
 		spin_unlock(&unix_table_locks[slot]);
 	}
 done:
@@ -235,19 +238,22 @@ static int unix_diag_dump(struct sk_buff *skb, struct netlink_callback *cb)
 	return skb->len;
 }
 
-static struct sock *unix_lookup_by_ino(unsigned int ino)
+static struct sock *unix_lookup_by_ino(struct net *net, unsigned int ino)
 {
 	struct sock *sk;
 	int i;
 
 	for (i = 0; i < UNIX_HASH_SIZE; i++) {
 		spin_lock(&unix_table_locks[i]);
+		spin_lock(&net->unx.hash[i].lock);
 		sk_for_each(sk, &unix_socket_table[i])
 			if (ino == sock_i_ino(sk)) {
 				sock_hold(sk);
+				spin_unlock(&net->unx.hash[i].lock);
 				spin_unlock(&unix_table_locks[i]);
 				return sk;
 			}
+		spin_unlock(&net->unx.hash[i].lock);
 		spin_unlock(&unix_table_locks[i]);
 	}
 	return NULL;
@@ -257,16 +263,17 @@ static int unix_diag_get_exact(struct sk_buff *in_skb,
 			       const struct nlmsghdr *nlh,
 			       struct unix_diag_req *req)
 {
-	int err = -EINVAL;
-	struct sock *sk;
-	struct sk_buff *rep;
-	unsigned int extra_len;
 	struct net *net = sock_net(in_skb->sk);
+	unsigned int extra_len;
+	struct sk_buff *rep;
+	struct sock *sk;
+	int err;
 
+	err = -EINVAL;
 	if (req->udiag_ino == 0)
 		goto out_nosk;
 
-	sk = unix_lookup_by_ino(req->udiag_ino);
+	sk = unix_lookup_by_ino(net, req->udiag_ino);
 	err = -ENOENT;
 	if (sk == NULL)
 		goto out_nosk;
This commit adds extra spin_lock/spin_unlock() for a per-netns
hash table inside the existing ones for unix_table_locks.

As of this commit, sockets are still linked in the global hash
table.  After putting sockets in a per-netns hash table in the next
patch, we remove the global hash table in the last patch of this
series.

Signed-off-by: Kuniyuki Iwashima <kuniyu@amazon.com>
---
 net/unix/af_unix.c | 75 +++++++++++++++++++++++++++++++---------------
 net/unix/diag.c    | 23 +++++++++-----
 2 files changed, 66 insertions(+), 32 deletions(-)
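
As an illustration of the transitional locking described in the commit message, the following userspace sketch nests a per-netns bucket lock inside the existing global bucket lock and releases the two in reverse order. It is only a sketch under stated assumptions, not the kernel implementation: pthread mutexes replace spinlocks, and TABLE_SIZE, global_locks, netns_locks, bucket_count, locks_init(), bucket_lock(), bucket_unlock() and bucket_insert() are hypothetical names.

```c
#include <pthread.h>

#define TABLE_SIZE 8	/* hypothetical; stands in for UNIX_HASH_SIZE */

/* Hypothetical stand-ins for unix_table_locks[] and net->unx.hash[].lock. */
static pthread_mutex_t global_locks[TABLE_SIZE];
static pthread_mutex_t netns_locks[TABLE_SIZE];

/* Hypothetical per-bucket state protected by both locks. */
static int bucket_count[TABLE_SIZE];

static void locks_init(void)
{
	for (int i = 0; i < TABLE_SIZE; i++) {
		pthread_mutex_init(&global_locks[i], NULL);
		pthread_mutex_init(&netns_locks[i], NULL);
	}
}

/* Outer (global) lock first, inner (per-netns) lock second. */
static void bucket_lock(unsigned int h)
{
	pthread_mutex_lock(&global_locks[h]);
	pthread_mutex_lock(&netns_locks[h]);
}

/* Release in the reverse order: inner lock first, outer lock last. */
static void bucket_unlock(unsigned int h)
{
	pthread_mutex_unlock(&netns_locks[h]);
	pthread_mutex_unlock(&global_locks[h]);
}

/* Example critical section: returns the bucket's new element count. */
static int bucket_insert(unsigned int h)
{
	int n;

	bucket_lock(h);
	n = ++bucket_count[h];
	bucket_unlock(h);

	return n;
}
```

Because the inner lock is only ever taken while the outer one is held, dropping the outer lock later in a series like this one would leave the inner lock protecting the same critical sections on its own.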