From patchwork Thu Feb 27 21:09:16 2020
Content-Type: text/plain; charset="utf-8"
MIME-Version: 1.0
Content-Transfer-Encoding: 7bit
X-Patchwork-Submitter: James Simmons
X-Patchwork-Id: 11409843
From: James Simmons
To: Andreas Dilger, Oleg Drokin, NeilBrown
Date: Thu, 27 Feb 2020 16:09:16 -0500
Message-Id: <1582838290-17243-89-git-send-email-jsimmons@infradead.org>
X-Mailer: git-send-email 1.8.3.1
In-Reply-To: <1582838290-17243-1-git-send-email-jsimmons@infradead.org>
References: <1582838290-17243-1-git-send-email-jsimmons@infradead.org>
Subject: [lustre-devel] [PATCH 088/622] lnet: handle fatal device error
List-Id: "For discussing Lustre software development."
Cc: Amir Shehata, Lustre Development List

From: Amir Shehata

The o2iblnd can receive device status events on the QP event handler.
This patch handles three of them:

  IB_EVENT_DEVICE_FATAL
  IB_EVENT_PORT_ERR
  IB_EVENT_PORT_ACTIVE

For DEVICE_FATAL and PORT_ERR, the NI associated with the QP is put
into fatal error mode and is no longer selected when sending messages.

When PORT_ACTIVE is received, the fatal error is cleared on the NI
associated with the QP, and future messages can use that NI again.

WC-bug-id: https://jira.whamcloud.com/browse/LU-9120
Lustre-commit: 6b1571209a99 ("LU-9120 lnet: handle fatal device error")
Signed-off-by: Amir Shehata
Reviewed-on: https://review.whamcloud.com/32772
Reviewed-by: Sonia Sharma
Reviewed-by: Olaf Weber
Signed-off-by: James Simmons
---
 include/linux/lnet/lib-types.h      |  7 +++++++
 net/lnet/klnds/o2iblnd/o2iblnd_cb.c | 13 +++++++++++++
 net/lnet/lnet/lib-move.c            |  6 +++++-
 3 files changed, 25 insertions(+), 1 deletion(-)

diff --git a/include/linux/lnet/lib-types.h b/include/linux/lnet/lib-types.h
index d815a87..2b3e76a 100644
--- a/include/linux/lnet/lib-types.h
+++ b/include/linux/lnet/lib-types.h
@@ -443,6 +443,13 @@ struct lnet_ni {
 	atomic_t		ni_healthv;
 
 	/*
+	 * Set to 1 by the LND when it receives an event telling it the device
+	 * has gone into a fatal state. Set to 0 when the LND receives an
+	 * event telling it the device is back online.
+	 */
+	atomic_t		ni_fatal_error_on;
+
+	/*
 	 * equivalent interfaces to use
 	 * This is an array because socklnd bonding can still be configured
 	 */
diff --git a/net/lnet/klnds/o2iblnd/o2iblnd_cb.c b/net/lnet/klnds/o2iblnd/o2iblnd_cb.c
index c6e8e73..293a859 100644
--- a/net/lnet/klnds/o2iblnd/o2iblnd_cb.c
+++ b/net/lnet/klnds/o2iblnd/o2iblnd_cb.c
@@ -3567,6 +3567,19 @@ static int kiblnd_resolve_addr(struct rdma_cm_id *cmid,
 		rdma_notify(conn->ibc_cmid, IB_EVENT_COMM_EST);
 		return;
 
+	case IB_EVENT_PORT_ERR:
+	case IB_EVENT_DEVICE_FATAL:
+		CERROR("Fatal device error for NI %s\n",
+		       libcfs_nid2str(conn->ibc_peer->ibp_ni->ni_nid));
+		atomic_set(&conn->ibc_peer->ibp_ni->ni_fatal_error_on, 1);
+		return;
+
+	case IB_EVENT_PORT_ACTIVE:
+		CERROR("Port reactivated for NI %s\n",
+		       libcfs_nid2str(conn->ibc_peer->ibp_ni->ni_nid));
+		atomic_set(&conn->ibc_peer->ibp_ni->ni_fatal_error_on, 0);
+		return;
+
 	default:
 		CERROR("%s: Async QP event type %d\n",
 		       libcfs_nid2str(conn->ibc_peer->ibp_nid), event->event);
diff --git a/net/lnet/lnet/lib-move.c b/net/lnet/lnet/lib-move.c
index 55cbf57..8d5f1e5 100644
--- a/net/lnet/lnet/lib-move.c
+++ b/net/lnet/lnet/lib-move.c
@@ -1303,9 +1303,11 @@ void lnet_usr_translate_stats(struct lnet_ioctl_element_msg_stats *msg_stats,
 		unsigned int distance;
 		int ni_credits;
 		int ni_healthv;
+		int ni_fatal;
 
 		ni_credits = atomic_read(&ni->ni_tx_credits);
 		ni_healthv = atomic_read(&ni->ni_healthv);
+		ni_fatal = atomic_read(&ni->ni_fatal_error_on);
 
 		/*
 		 * calculate the distance from the CPT on which
@@ -1334,7 +1336,9 @@ void lnet_usr_translate_stats(struct lnet_ioctl_element_msg_stats *msg_stats,
 		 * Select on health, shorter distance, available
 		 * credits, then round-robin.
 		 */
-		if (ni_healthv < best_healthv) {
+		if (ni_fatal) {
+			continue;
+		} else if (ni_healthv < best_healthv) {
 			continue;
 		} else if (ni_healthv > best_healthv) {
 			best_healthv = ni_healthv;
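
For readers who want to try the flag's behaviour outside the kernel, the following is a minimal userspace sketch of the mechanism described in the commit message. It is not the LNet code: `struct sketch_ni`, `enum sketch_event`, `sketch_handle_event()` and `sketch_select_ni()` are hypothetical names, C11 `<stdatomic.h>` stands in for the kernel's `atomic_t`, and the selection loop considers only the fatal flag and the health value (the real selection also weighs distance, credits and round-robin).

```c
#include <assert.h>
#include <stdatomic.h>
#include <stddef.h>

/* Hypothetical stand-ins for the IB_EVENT_* values named in the patch. */
enum sketch_event {
	SKETCH_PORT_ERR,
	SKETCH_DEVICE_FATAL,
	SKETCH_PORT_ACTIVE,
	SKETCH_OTHER,
};

/* Simplified model of an NI: only the two fields this sketch needs. */
struct sketch_ni {
	atomic_int healthv;		/* models ni_healthv */
	atomic_int fatal_error_on;	/* models ni_fatal_error_on */
};

/* Set the flag on a port or device error, clear it on port recovery. */
void sketch_handle_event(struct sketch_ni *ni, enum sketch_event ev)
{
	assert(ni != NULL);

	switch (ev) {
	case SKETCH_PORT_ERR:
	case SKETCH_DEVICE_FATAL:
		atomic_store(&ni->fatal_error_on, 1);
		break;
	case SKETCH_PORT_ACTIVE:
		atomic_store(&ni->fatal_error_on, 0);
		break;
	default:
		/* Other events leave the flag untouched. */
		break;
	}
}

/*
 * Return the index of the healthiest NI that is not in fatal error
 * mode, or -1 if every NI is currently flagged as fatal.
 */
int sketch_select_ni(struct sketch_ni *nis, size_t count)
{
	int best = -1;
	int best_healthv = 0;

	for (size_t i = 0; i < count; i++) {
		int fatal = atomic_load(&nis[i].fatal_error_on);
		int healthv = atomic_load(&nis[i].healthv);

		/* A fatal NI is skipped before health is even compared. */
		if (fatal)
			continue;
		if (best >= 0 && healthv <= best_healthv)
			continue;

		best = (int)i;
		best_healthv = healthv;
	}
	return best;
}
```

With two NIs of health 1000 and 500, the sketch picks the first; after `SKETCH_DEVICE_FATAL` on the first it picks the second; after `SKETCH_PORT_ACTIVE` on the first it picks the first again. Checking the flag ahead of the health comparison matches the ordering in the lib-move.c hunk, where `ni_fatal` is tested before `ni_healthv`.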