From patchwork Wed Apr 18 13:31:43 2018
Content-Type: text/plain; charset="utf-8"
MIME-Version: 1.0
Content-Transfer-Encoding: 8bit
X-Patchwork-Submitter: Zhu Yanjun
X-Patchwork-Id: 10348153
From: Zhu Yanjun
To: tariqt@mellanox.com, netdev@vger.kernel.org, linux-rdma@vger.kernel.org
Subject: [PATCHv2 1/1] net/mlx4_core: avoid resetting HCA when accessing an offline device
Date: Wed, 18 Apr 2018 09:31:43 -0400
Message-Id: <1524058303-379-1-git-send-email-yanjun.zhu@oracle.com>
X-Mailer: git-send-email 2.7.4
Sender: linux-rdma-owner@vger.kernel.org
Precedence: bulk
X-Mailing-List: linux-rdma@vger.kernel.org

When a faulty cable is used or an HCA firmware error occurs, the HCA device will be
offline. When the driver accesses this offline device, the following
call trace appears.

"
...
[] dump_stack+0x63/0x81
[] panic+0xcc/0x21b
[] mlx4_enter_error_state+0xba/0xf0 [mlx4_core]
[] mlx4_cmd_reset_flow+0x38/0x60 [mlx4_core]
[] mlx4_cmd_poll+0xc1/0x2e0 [mlx4_core]
[] __mlx4_cmd+0xb0/0x160 [mlx4_core]
[] mlx4_SENSE_PORT+0x54/0xd0 [mlx4_core]
[] mlx4_dev_cap+0x4a4/0xb50 [mlx4_core]
...
"

In the above call trace, the function mlx4_cmd_poll calls the function
mlx4_cmd_post to access the HCA while the HCA is offline. mlx4_cmd_post
then returns the error -EIO. On -EIO, mlx4_cmd_poll calls
mlx4_cmd_reset_flow to reset the HCA, which produces the call trace
above.

This is not reasonable. Since the HCA device is already offline when it
is accessed, it should not be reset again.

With this patch, when the HCA is offline, the function mlx4_cmd_post
returns the error -EINVAL. On -EINVAL, the function mlx4_cmd_poll
returns directly instead of resetting the HCA.

CC: Srinivas Eeda
CC: Junxiao Bi
Suggested-by: Håkon Bugge
Suggested-by: Tariq Toukan
Signed-off-by: Zhu Yanjun
Reviewed-by: Tariq Toukan
---
V1->V2: Follow Tariq's advice and avoid interference from other
        returned errors. The function mlx4_cmd_post returns either
        -EIO or -EINVAL. On -EIO, the HCA device should be reset.
        -EINVAL means that mlx4_cmd_post is accessing an offline
        device, so it is not necessary to reset the HCA; go to label
        out directly.
---
 drivers/net/ethernet/mellanox/mlx4/cmd.c | 12 ++++++++++--
 1 file changed, 10 insertions(+), 2 deletions(-)

diff --git a/drivers/net/ethernet/mellanox/mlx4/cmd.c b/drivers/net/ethernet/mellanox/mlx4/cmd.c
index 6a9086d..df735b8 100644
--- a/drivers/net/ethernet/mellanox/mlx4/cmd.c
+++ b/drivers/net/ethernet/mellanox/mlx4/cmd.c
@@ -451,6 +451,8 @@ static int mlx4_cmd_post(struct mlx4_dev *dev, u64 in_param, u64 out_param,
 		 * Device is going through error recovery
 		 * and cannot accept commands.
 		 */
+		mlx4_err(dev, "%s : Device is in error recovery.\n", __func__);
+		ret = -EINVAL;
 		goto out;
 	}
 
@@ -610,8 +612,11 @@ static int mlx4_cmd_poll(struct mlx4_dev *dev, u64 in_param, u64 *out_param,
 
 	err = mlx4_cmd_post(dev, in_param, out_param ? *out_param : 0,
 			    in_modifier, op_modifier, op, CMD_POLL_TOKEN, 0);
-	if (err)
+	if (err) {
+		if (err == -EINVAL)
+			goto out;
 		goto out_reset;
+	}
 
 	end = msecs_to_jiffies(timeout) + jiffies;
 	while (cmd_pending(dev) && time_before(jiffies, end)) {
@@ -710,8 +715,11 @@ static int mlx4_cmd_wait(struct mlx4_dev *dev, u64 in_param, u64 *out_param,
 
 	err = mlx4_cmd_post(dev, in_param, out_param ? *out_param : 0,
 			    in_modifier, op_modifier, op, context->token, 1);
-	if (err)
+	if (err) {
+		if (err == -EINVAL)
+			goto out;
 		goto out_reset;
+	}
 
 	if (op == MLX4_CMD_SENSE_PORT) {
 		ret_wait =