[RFC,net,v2,1/2] net/smc: Resolve the race between link group access and termination

We encountered some crashes caused by a race between link group
access and link group termination.

Here are some of the panic stacks we hit:

1) Race between smc_clc_wait_msg() and __smc_lgr_terminate()

 BUG: kernel NULL pointer dereference, address: 00000000000002f0
 Workqueue: smc_hs_wq smc_listen_work [smc]
 RIP: 0010:smc_clc_wait_msg+0x3eb/0x5c0 [smc]
 Call Trace:
  <TASK>
  ? smc_clc_send_accept+0x45/0xa0 [smc]
  ? smc_clc_send_accept+0x45/0xa0 [smc]
  smc_listen_work+0x783/0x1220 [smc]
  ? finish_task_switch+0xc4/0x2e0
  ? process_one_work+0x1ad/0x3c0
  process_one_work+0x1ad/0x3c0
  worker_thread+0x4c/0x390
  ? rescuer_thread+0x320/0x320
  kthread+0x149/0x190
  ? set_kthread_struct+0x40/0x40
  ret_from_fork+0x1f/0x30
  </TASK>

smc_listen_work()                abnormal case like port error
---------------------------------------------------------------
                                | __smc_lgr_terminate()
                                |     |- smc_conn_kill()
                                |            |- smc_lgr_unregister_conn()
                                |                   |- set conn->lgr = NULL
smc_clc_wait_msg()              |
    |- access conn->lgr (panic) |

2) Race between smc_setsockopt() and __smc_lgr_terminate()

 BUG: kernel NULL pointer dereference, address: 00000000000002e8
 RIP: 0010:smc_setsockopt+0x17a/0x280 [smc]
 Call Trace:
  <TASK>
  __sys_setsockopt+0xfc/0x190
  __x64_sys_setsockopt+0x20/0x30
  do_syscall_64+0x34/0x90
  entry_SYSCALL_64_after_hwframe+0x44/0xae
  </TASK>

smc_setsockopt()                 abnormal case like port error
--------------------------------------------------------------
                                | __smc_lgr_terminate()
                                |     |- smc_conn_kill()
                                |            |- smc_lgr_unregister_conn()
                                |                   |- set conn->lgr = NULL
mod_delayed_work()              |
    |- access conn->lgr (panic) |

There are other panic points with a similar cause: the link
group is accessed after its termination, which yields a NULL
pointer or an invalid resource.
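
The interleaving in the diagrams above can be modelled in plain userspace
C. This is only a sketch: `struct lgr_model`, `struct conn_model`,
`model_terminate()` and `model_access()` are hypothetical stand-ins, not
kernel code, and the accessor reports the failure instead of crashing.

```c
#include <stddef.h>

/* Hypothetical stand-ins for the link group and the connection. */
struct lgr_model {
	int value;                 /* any field a reader might want */
};

struct conn_model {
	struct lgr_model *lgr;     /* back-pointer, like conn->lgr */
};

/* Models the termination path clearing the back-pointer. */
static void model_terminate(struct conn_model *conn)
{
	conn->lgr = NULL;
}

/*
 * Models a reader such as smc_clc_wait_msg() or smc_setsockopt().
 * Returns 0 and stores the field on success, or -1 at the point
 * where the kernel code would have dereferenced a NULL pointer.
 */
static int model_access(const struct conn_model *conn, int *out)
{
	if (conn->lgr == NULL)
		return -1;
	*out = conn->lgr->value;
	return 0;
}
```

Calling `model_access()` before `model_terminate()` succeeds, and calling
it afterwards returns -1. Note that a NULL check like this one does not
close the race in concurrent code, because termination can still run
between the check and the dereference.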

Currently, there seems to be no synchronization between link
group access and a sudden termination of the link group. This
patch tries to fix this by introducing a reference count for the
link group and not freeing the link group until the count
reaches zero.

A link group may be referenced by links or by SMC connections,
so the operations on the link group reference count can be
summarized as follows:

object          [hold or initialized as 1]         [put]
--------------------------------------------------------------------
link group      smc_lgr_create()                   smc_lgr_free()
connections     smc_lgr_register_conn()            smc_conn_free()
links           smcr_link_init()                   smcr_link_clear()

In this way, we extend the life cycle of the link group and
ensure it outlasts the connections and links above it, which
avoids invalid access to the link group after its termination.
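
The hold/put pairing in the table above can be sketched as a small
userspace model. This is an illustration under stated assumptions, not
the code in this patch: `struct lgr_model`, `lgr_create()`, `lgr_hold()`,
`lgr_put()` and `lgr_model_demo()` are hypothetical names, C11 atomics
stand in for the kernel's refcount_t, and a `freed` flag replaces the
real release so the result can be inspected.

```c
#include <assert.h>
#include <stdatomic.h>

/* Hypothetical userspace stand-in for a link group. */
struct lgr_model {
	atomic_int refcnt;   /* the kernel would use refcount_t */
	int freed;           /* set where the real code would release memory */
};

/* Creation initializes the count to 1 (the link group's own reference). */
static void lgr_create(struct lgr_model *lgr)
{
	atomic_init(&lgr->refcnt, 1);
	lgr->freed = 0;
}

/* Taken by each link and each connection that refers to the link group. */
static void lgr_hold(struct lgr_model *lgr)
{
	/* Holding an object whose count already hit zero would be a bug. */
	assert(atomic_load(&lgr->refcnt) > 0);
	atomic_fetch_add(&lgr->refcnt, 1);
}

/* Dropped on the matching teardown path; the last put releases the group. */
static void lgr_put(struct lgr_model *lgr)
{
	/* fetch_sub returns the previous value: 1 means this was the last ref. */
	if (atomic_fetch_sub(&lgr->refcnt, 1) == 1)
		lgr->freed = 1;
}

/* Returns 1 if the group is released only after the final put. */
static int lgr_model_demo(void)
{
	struct lgr_model lgr;
	int early;

	lgr_create(&lgr);   /* like smc_lgr_create(): count = 1 */
	lgr_hold(&lgr);     /* like smcr_link_init(): count = 2 */
	lgr_hold(&lgr);     /* like smc_lgr_register_conn(): count = 3 */

	lgr_put(&lgr);      /* like smc_lgr_free() on termination: count = 2 */
	early = lgr.freed;  /* still referenced by the link and the connection */

	lgr_put(&lgr);      /* like smcr_link_clear(): count = 1 */
	lgr_put(&lgr);      /* like smc_conn_free(): count = 0, released */

	return !early && lgr.freed;
}
```

In this model, termination only drops the link group's own reference, so
a connection or link that is still active keeps the object valid until it
performs its own put.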

Signed-off-by: Wen Gu <guwen@linux.alibaba.com>
---
 net/smc/smc.h      |  1 +
 net/smc/smc_core.c | 45 ++++++++++++++++++++++++++++++++++++++++-----
 net/smc/smc_core.h |  3 +++
 3 files changed, 44 insertions(+), 5 deletions(-)

Message ID	1640704432-76825-2-git-send-email-guwen@linux.alibaba.com (mailing list archive)
State	RFC
Delegated to:	Netdev Maintainers
Headers	From: Wen Gu <guwen@linux.alibaba.com>
	To: kgraul@linux.ibm.com, davem@davemloft.net, kuba@kernel.org
	Cc: linux-s390@vger.kernel.org, netdev@vger.kernel.org, linux-kernel@vger.kernel.org, dust.li@linux.alibaba.com, tonylu@linux.alibaba.com
	Subject: [RFC PATCH net v2 1/2] net/smc: Resolve the race between link group access and termination
	Date: Tue, 28 Dec 2021 23:13:51 +0800
	In-Reply-To: <1640704432-76825-1-git-send-email-guwen@linux.alibaba.com>
Series	net/smc: Fix for race in smc link group termination
	[RFC,net,v2,0/2] net/smc: Fix for race in smc link group termination
	[RFC,net,v2,1/2] net/smc: Resolve the race between link group access and termination
	[RFC,net,v2,2/2] net/smc: Resolve the race between SMC-R link access and clear

Context	Check	Description
netdev/tree_selection	success	Clearly marked for net
netdev/fixes_present	fail	Series targets non-next tree, but doesn't contain any Fixes tags
netdev/subject_prefix	success	Link
netdev/cover_letter	success	Series has a cover letter
netdev/patch_count	success	Link
netdev/header_inline	success	No static functions without inline keyword in header files
netdev/build_32bit	success	Errors and warnings before: 0 this patch: 0
netdev/cc_maintainers	success	CCed 5 of 5 maintainers
netdev/build_clang	success	Errors and warnings before: 0 this patch: 0
netdev/module_param	success	Was 0 now: 0
netdev/verify_signedoff	success	Signed-off-by tag matches author and committer
netdev/verify_fixes	success	No Fixes tag
netdev/build_allmodconfig_warn	success	Errors and warnings before: 0 this patch: 0
netdev/checkpatch	success	total: 0 errors, 0 warnings, 0 checks, 131 lines checked
netdev/kdoc	success	Errors and warnings before: 0 this patch: 0
netdev/source_inline	success	Was 0 now: 0
