From patchwork Mon Apr 17 13:47:13 2023 Content-Type: text/plain; charset="utf-8" MIME-Version: 1.0 Content-Transfer-Encoding: 7bit X-Patchwork-Submitter: James Simmons X-Patchwork-Id: 13214135 Return-Path: X-Spam-Checker-Version: SpamAssassin 3.4.0 (2014-02-07) on aws-us-west-2-korg-lkml-1.web.codeaurora.org Received: from pdx1-mailman-customer002.dreamhost.com (listserver-buz.dreamhost.com [69.163.136.29]) (using TLSv1.2 with cipher ECDHE-RSA-AES256-GCM-SHA384 (256/256 bits)) (No client certificate requested) by smtp.lore.kernel.org (Postfix) with ESMTPS id 70097C77B76 for ; Mon, 17 Apr 2023 14:07:21 +0000 (UTC) Received: from pdx1-mailman-customer002.dreamhost.com (localhost [127.0.0.1]) by pdx1-mailman-customer002.dreamhost.com (Postfix) with ESMTP id 4Q0T574889z22Rs; Mon, 17 Apr 2023 06:52:27 -0700 (PDT) Received: from smtp4.ccs.ornl.gov (smtp4.ccs.ornl.gov [160.91.203.40]) (using TLSv1.2 with cipher AECDH-AES256-SHA (256/256 bits)) (No client certificate requested) by pdx1-mailman-customer002.dreamhost.com (Postfix) with ESMTPS id 4Q0T1P5J66z21BZ for ; Mon, 17 Apr 2023 06:49:13 -0700 (PDT) Received: from star.ccs.ornl.gov (star.ccs.ornl.gov [160.91.202.134]) by smtp4.ccs.ornl.gov (Postfix) with ESMTP id 9B0D61008493; Mon, 17 Apr 2023 09:47:24 -0400 (EDT) Received: by star.ccs.ornl.gov (Postfix, from userid 2004) id 9971A375; Mon, 17 Apr 2023 09:47:24 -0400 (EDT) From: James Simmons To: Andreas Dilger , Oleg Drokin , NeilBrown Date: Mon, 17 Apr 2023 09:47:13 -0400 Message-Id: <1681739243-29375-18-git-send-email-jsimmons@infradead.org> X-Mailer: git-send-email 1.8.3.1 In-Reply-To: <1681739243-29375-1-git-send-email-jsimmons@infradead.org> References: <1681739243-29375-1-git-send-email-jsimmons@infradead.org> Subject: [lustre-devel] [PATCH 17/27] lnet: Health logging improvements X-BeenThere: lustre-devel@lists.lustre.org X-Mailman-Version: 2.1.39 Precedence: list List-Id: "For discussing Lustre software development." List-Unsubscribe: , List-Archive: List-Post: List-Help: List-Subscribe: , Cc: Chris Horn , Lustre Development List MIME-Version: 1.0 Errors-To: lustre-devel-bounces@lists.lustre.org Sender: "lustre-devel" From: Chris Horn LNet health activity can generate noise in console logs. The NI/Peer NI recovery pings could be expected to fail and the related messages from lnet_handle_recovery_reply() are generally redundant. Improve this logging by having the lnet_monitor_thread() provide a summary of NIs in recovery. Another useful metric in spotting network trouble is if we have messages exceeding their deadline. We do not currently log this information. Keep a count of messages that have exceeded their deadline and track the total excess time. The lnet_monitor_thread() will then provide a summary of the number of messages and their average excess time at a regular interval. These stats are then reset when the monitor thread prints this information to the console. Because NIs can be in recovery for extended periods of time, the interval of console updates will increase from 1 to 5 minutes. The interval is reset when it is detected that there are no longer any NIs in recovery and there haven't been any messages past their deadline since the last console update. HPE-bug-id: LUS-11500 WC-bug-id: https://jira.whamcloud.com/browse/LU-16643 Lustre-commit: 0cb3d86c4004d7581 ("LU-16643 lnet: Health logging improvements") Signed-off-by: Chris Horn Reviewed-on: https://review.whamcloud.com/c/fs/lustre-release/+/50305 Reviewed-by: Andreas Dilger Reviewed-by: Cyril Bordage Reviewed-by: Oleg Drokin Signed-off-by: James Simmons --- include/linux/lnet/lib-types.h | 5 ++ net/lnet/lnet/api-ni.c | 2 + net/lnet/lnet/lib-move.c | 165 +++++++++++++++++++++++++++++++++++++++-- net/lnet/lnet/lib-msg.c | 16 +++- 4 files changed, 176 insertions(+), 12 deletions(-) diff --git a/include/linux/lnet/lib-types.h b/include/linux/lnet/lib-types.h index eb54e75..1ae4530 100644 --- a/include/linux/lnet/lib-types.h +++ b/include/linux/lnet/lib-types.h @@ -1529,6 +1529,11 @@ struct lnet { struct completion ln_started; /* UDSP list */ struct list_head ln_udsp_list; + + /* Number of messages that have exceeded their message deadline */ + atomic_t ln_late_msg_count; + /* Total amount of time past their deadline for all late ^ messages */ + atomic64_t ln_late_msg_nsecs; }; struct genl_filter_list { diff --git a/net/lnet/lnet/api-ni.c b/net/lnet/lnet/api-ni.c index fb596ed..f3f9aee 100644 --- a/net/lnet/lnet/api-ni.c +++ b/net/lnet/lnet/api-ni.c @@ -1295,6 +1295,8 @@ struct list_head ** init_waitqueue_head(&the_lnet.ln_dc_waitq); the_lnet.ln_mt_handler = NULL; init_completion(&the_lnet.ln_started); + atomic_set(&the_lnet.ln_late_msg_count, 0); + atomic64_set(&the_lnet.ln_late_msg_nsecs, 0); rc = lnet_slab_setup(); if (rc != 0) diff --git a/net/lnet/lnet/lib-move.c b/net/lnet/lnet/lib-move.c index 95abe4f1..9d50260 100644 --- a/net/lnet/lnet/lib-move.c +++ b/net/lnet/lnet/lib-move.c @@ -3237,8 +3237,11 @@ struct lnet_mt_event_info { lnet_ni_lock(ni); } -static void -lnet_recover_local_nis(void) +/* Returns the total number of local NIs in recovery. + * Records up to @arrsz of the associated NIDs in the @nidarr array + */ +static int +lnet_recover_local_nis(struct lnet_nid *nidarr, unsigned int arrsz) { struct lnet_mt_event_info *ev_info; LIST_HEAD(processed_list); @@ -3250,6 +3253,7 @@ struct lnet_mt_event_info { int healthv; int rc; time64_t now; + unsigned int nnis = 0; /* splice the recovery queue on a local queue. We will iterate * through the local queue and update it as needed. Once we're @@ -3286,6 +3290,10 @@ struct lnet_mt_event_info { continue; } + if (nnis < arrsz) + nidarr[nnis] = ni->ni_nid; + nnis++; + /* if the local NI failed recovery we must unlink the md. * But we want to keep the local_ni on the recovery queue * so we can continue the attempts to recover it. @@ -3391,6 +3399,8 @@ struct lnet_mt_event_info { lnet_net_lock(0); list_splice(&local_queue, &the_lnet.ln_mt_localNIRecovq); lnet_net_unlock(0); + + return nnis; } static int @@ -3490,12 +3500,16 @@ struct lnet_mt_event_info { cfs_percpt_free(the_lnet.ln_mt_resendqs); } -static void -lnet_recover_peer_nis(void) +/* Returns the total number of peer NIs in recovery. + * Records up to @arrsz of the associated NIDs in the @nidarr array + */ +static unsigned int +lnet_recover_peer_nis(struct lnet_nid *nidarr, unsigned int arrsz) { struct lnet_mt_event_info *ev_info; LIST_HEAD(processed_list); LIST_HEAD(local_queue); + unsigned int nlpnis = 0; struct lnet_handle_md mdh; struct lnet_peer_ni *lpni; struct lnet_peer_ni *tmp; @@ -3532,6 +3546,10 @@ struct lnet_mt_event_info { continue; } + if (nlpnis < arrsz) + nidarr[nlpnis] = lpni->lpni_nid; + nlpnis++; + /* If the peer NI has failed recovery we must unlink the * md. But we want to keep the peer ni on the recovery * queue so we can try to continue recovering it @@ -3621,6 +3639,131 @@ struct lnet_mt_event_info { lnet_net_lock(0); list_splice(&local_queue, &the_lnet.ln_mt_peerNIRecovq); lnet_net_unlock(0); + + return nlpnis; +} + +#define LNET_MAX_NNIDS 20 +/* @nids is array of nids that are in recovery. It has max size of + * LNET_MAX_NNIDS. + * @nnids is the total number of nids that are in recovery. It can be + * larger than LNET_MAX_NNIDS. + * @local tells us whether these are local or peer NIs in recovery. + */ +static void +lnet_print_recovery_list(struct lnet_nid *nids, unsigned int nnids, + bool local) +{ + static bool printed; + char *buf = NULL; + char *tmp; + int i; + unsigned int arrsz; + unsigned int bufsz; + + if (!nnids) + return; + + arrsz = nnids < LNET_MAX_NNIDS ? nnids : LNET_MAX_NNIDS; + + /* Printing arrsz NIDs, each has max size LNET_NIDSTR_SIZE, a comma + * and space for each nid after the first (2 * (arrsz - 1)), + * + 1 for terminating null byte + */ + bufsz = (arrsz * LNET_NIDSTR_SIZE) + (2 * (arrsz - 1)) + 1; + buf = kzalloc(bufsz, GFP_KERNEL); + if (!buf) { + LCONSOLE(D_INFO, "%u %s NIs in recovery\n", + nnids, local ? "local" : "peer"); + return; + } + + tmp = buf; + tmp += sprintf(tmp, "%s", libcfs_nidstr(&nids[0])); + for (i = 1; i < arrsz; i++) + tmp += sprintf(tmp, ", %s", libcfs_nidstr(&nids[i])); + + /* LCONSOLE() used to avoid rate limiting when we have both local + * and peer NIs in recovery + */ + LCONSOLE(D_INFO, "%u %s NIs in recovery (showing %u): %s\n", + nnids, local ? "local" : "peer", arrsz, buf); + + kfree(buf); + + if (!printed && nnids > LNET_MAX_NNIDS) { + LCONSOLE(D_INFO, "See full list with 'lnetctl debug recovery -(p|l)'\n"); + printed = true; + } +} + +static void +lnet_health_update_console(struct lnet_nid *lnids, unsigned int nnis, + struct lnet_nid *rnids, unsigned int nlpnis, + time64_t now) +{ + static time64_t next_ni_update; + static time64_t next_lpni_update; + static time64_t next_msg_update; + static unsigned int num_ni_updates; + static unsigned int num_lpni_updates; + static unsigned int num_msg_updates = 1; + int late_count; + + if (now >= next_ni_update) { + if (nnis) { + lnet_print_recovery_list(lnids, nnis, true); + if (num_ni_updates < 5) + num_ni_updates++; + next_ni_update = now + (60 * num_ni_updates); + } else { + next_ni_update = 0; + num_ni_updates = 0; + } + } + + if (now >= next_lpni_update) { + if (nlpnis) { + lnet_print_recovery_list(rnids, nlpnis, false); + if (num_lpni_updates < 5) + num_lpni_updates++; + next_lpni_update = now + (60 * num_lpni_updates); + } else { + next_lpni_update = 0; + num_lpni_updates = 0; + } + } + + /* Let late_count accumulate for 60 seconds */ + if (unlikely(!next_msg_update)) + next_msg_update = now + 60; + + if (now >= next_msg_update) { + late_count = atomic_read(&the_lnet.ln_late_msg_count); + + if (late_count) { + s64 avg = atomic64_xchg(&the_lnet.ln_late_msg_nsecs, 0) / + atomic_xchg(&the_lnet.ln_late_msg_count, 0); + + if (avg > NSEC_PER_SEC) { + unsigned int avg_msec; + + avg_msec = do_div(avg, NSEC_PER_SEC) / + NSEC_PER_MSEC; + LCONSOLE_INFO("%u messages in past %us over their deadline by avg %lld.%03us\n", + late_count, + (60 * num_msg_updates), avg, + avg_msec); + + if (num_msg_updates < 5) + num_msg_updates++; + next_msg_update = now + (60 * num_msg_updates); + } + } else { + next_msg_update = now + 60; + num_msg_updates = 1; + } + } } static int @@ -3628,6 +3771,10 @@ struct lnet_mt_event_info { { time64_t rsp_timeout = 0; time64_t now; + unsigned int nnis; + unsigned int nlpnis; + struct lnet_nid local_nids[LNET_MAX_NNIDS]; + struct lnet_nid peer_nids[LNET_MAX_NNIDS]; wait_for_completion(&the_lnet.ln_started); @@ -3653,8 +3800,10 @@ struct lnet_mt_event_info { rsp_timeout = now + (lnet_transaction_timeout / 2); } - lnet_recover_local_nis(); - lnet_recover_peer_nis(); + nnis = lnet_recover_local_nis(local_nids, LNET_MAX_NNIDS); + nlpnis = lnet_recover_peer_nis(peer_nids, LNET_MAX_NNIDS); + lnet_health_update_console(local_nids, nnis, peer_nids, nlpnis, + now); /* TODO do we need to check if we should sleep without * timeout? Technically, an active system will always @@ -3768,7 +3917,7 @@ struct lnet_mt_event_info { lnet_net_unlock(0); if (status != 0) { - CERROR("local NI (%s) recovery failed with %d\n", + CDEBUG(D_NET, "local NI (%s) recovery failed with %d\n", libcfs_nidstr(nid), status); return; } @@ -3800,7 +3949,7 @@ struct lnet_mt_event_info { lnet_net_unlock(cpt); if (status != 0) - CERROR("peer NI (%s) recovery failed with %d\n", + CDEBUG(D_NET, "peer NI (%s) recovery failed with %d\n", libcfs_nidstr(nid), status); } } diff --git a/net/lnet/lnet/lib-msg.c b/net/lnet/lnet/lib-msg.c index 82d117d..420236d 100644 --- a/net/lnet/lnet/lib-msg.c +++ b/net/lnet/lnet/lib-msg.c @@ -761,6 +761,7 @@ bool attempt_remote_resend; bool handle_local_health; bool handle_remote_health; + ktime_t now; /* if we're shutting down no point in handling health. */ if (the_lnet.ln_mt_state != LNET_MT_STATE_RUNNING) @@ -778,10 +779,6 @@ nid_is_lo0(&msg->msg_txni->ni_nid)) lo = true; - if (hstatus != LNET_MSG_STATUS_OK && - ktime_after(ktime_get(), msg->msg_deadline)) - return -1; - /* always prefer txni/txpeer if they message is committed for both * directions. */ @@ -802,6 +799,17 @@ else LASSERT(ni); + now = ktime_get(); + if (ktime_after(now, msg->msg_deadline)) { + s64 time = ktime_to_ns(ktime_sub(now, msg->msg_deadline)); + + atomic64_add(time, &the_lnet.ln_late_msg_nsecs); + atomic_inc(&the_lnet.ln_late_msg_count); + + if (hstatus != LNET_MSG_STATUS_OK) + return -1; + } + CDEBUG(D_NET, "health check: %s->%s: %s: %s\n", libcfs_nidstr(&ni->ni_nid), (lo) ? "self" : libcfs_nidstr(&lpni->lpni_nid),