diff mbox series

[08/25] lnet: Protect lpni deref in lnet_health_check

Message ID 1627933851-7603-9-git-send-email-jsimmons@infradead.org (mailing list archive)
State New, archived
Headers show
Series Sync to OpenSFS tree as of Aug 2, 2021 | expand

Commit Message

James Simmons Aug. 2, 2021, 7:50 p.m. UTC
From: Chris Horn <chris.horn@hpe.com>

Discovery thread can modify peer NI/peer net/peer relationship
so we need to be careful when dereferencing the peer NI pointer in
lnet_health_check(). Discovery thread operations under net lock, so
move the peer NI dereference under the net lock which is taken for
incrementing the health stats.

Move some of the other code that is only relevant for messages with a
health status != LNET_MSG_STATUS_OK under the appropriate condition.

HPE-bug-id: LUS-9962
WC-bug-id: https://jira.whamcloud.com/browse/LU-14655
Lustre-commit: d87af24452a2e883 ("LU-14655 lnet: Protect lpni deref in lnet_health_check")
Signed-off-by: Chris Horn <chris.horn@hpe.com>
Reviewed-on: https://review.whamcloud.com/43503
Reviewed-by: Alexander Boyko <alexander.boyko@hpe.com>
Reviewed-by: Serguei Smirnov <ssmirnov@whamcloud.com>
Reviewed-by: Oleg Drokin <green@whamcloud.com>
Signed-off-by: James Simmons <jsimmons@infradead.org>
---
 net/lnet/lnet/lib-msg.c | 71 ++++++++++++++++++++++++++-----------------------
 1 file changed, 38 insertions(+), 33 deletions(-)
diff mbox series

Patch

diff --git a/net/lnet/lnet/lib-msg.c b/net/lnet/lnet/lib-msg.c
index 580ddf6..e471848 100644
--- a/net/lnet/lnet/lib-msg.c
+++ b/net/lnet/lnet/lib-msg.c
@@ -821,38 +821,6 @@ 
 		attempt_remote_resend = false;
 	}
 
-	/* Don't further decrement the health value if a recovery message
-	 * failed.
-	 */
-	if (msg->msg_recovery) {
-		handle_local_health = false;
-		handle_remote_health = false;
-	} else {
-		handle_local_health = false;
-		handle_remote_health = true;
-	}
-
-	/* For local failures, health/recovery/resends are not needed if I only
-	 * have a single (non-lolnd) interface. NB: pb_nnis includes the lolnd
-	 * interface, so a single-rail node would have pb_nnis == 2.
-	 */
-	if (the_lnet.ln_ping_target->pb_nnis <= 2) {
-		handle_local_health = false;
-		attempt_local_resend = false;
-	}
-
-	/* For remote failures, health/recovery/resends are not needed if the
-	 * peer only has a single interface. Special case for routers where we
-	 * rely on health feature to manage route aliveness. NB: unlike pb_nnis
-	 * above, lp_nnis does _not_ include the lolnd, so a single-rail node
-	 * would have lp_nnis == 1.
-	 */
-	if (lpni && lpni->lpni_peer_net->lpn_peer->lp_nnis <= 1) {
-		attempt_remote_resend = false;
-		if (!lnet_isrouter(lpni))
-			handle_remote_health = false;
-	}
-
 	if (!lo)
 		LASSERT(ni && lpni);
 	else
@@ -865,11 +833,48 @@ 
 	       lnet_health_error2str(hstatus));
 
 	/* stats are only incremented for errors so avoid wasting time
-	 * incrementing statistics if there is no error.
+	 * incrementing statistics if there is no error. Similarly, whether to
+	 * update health values or perform resends is only applicable for
+	 * messages with a health status != OK.
 	 */
 	if (hstatus != LNET_MSG_STATUS_OK) {
+		/* Don't further decrement the health value if a recovery
+		 * message failed.
+		 */
+		if (msg->msg_recovery) {
+			handle_local_health = false;
+			handle_remote_health = false;
+		} else {
+			handle_local_health = true;
+			handle_remote_health = true;
+		}
+
+		/* For local failures, health/recovery/resends are not needed if
+		 * I only have a single (non-lolnd) interface. NB: pb_nnis
+		 * includes the lolnd interface, so a single-rail node would
+		 * have pb_nnis == 2.
+		 */
+		if (the_lnet.ln_ping_target->pb_nnis <= 2) {
+			handle_local_health = false;
+			attempt_local_resend = false;
+		}
+
 		lnet_net_lock(0);
 		lnet_incr_hstats(ni, lpni, hstatus);
+		/* For remote failures, health/recovery/resends are not needed
+		 * if the peer only has a single interface. Special case for
+		 * routers where we rely on health feature to manage route
+		 * aliveness. NB: unlike pb_nnis above, lp_nnis does _not_
+		 * include the lolnd, so a single-rail node would have
+		 * lp_nnis == 1.
+		 */
+		if (lpni && lpni->lpni_peer_net &&
+		    lpni->lpni_peer_net->lpn_peer &&
+		    lpni->lpni_peer_net->lpn_peer->lp_nnis <= 1) {
+			attempt_remote_resend = false;
+			if (!lnet_isrouter(lpni))
+				handle_remote_health = false;
+		}
 		lnet_net_unlock(0);
 	}