From patchwork Mon Mar 8 16:57:26 2021
Content-Type: text/plain; charset="utf-8"
MIME-Version: 1.0
Content-Transfer-Encoding: 7bit
X-Patchwork-Submitter: Shiju Jose
X-Patchwork-Id: 12122941
Return-Path:
X-Spam-Checker-Version: SpamAssassin 3.4.0 (2014-02-07) on aws-us-west-2-korg-lkml-1.web.codeaurora.org
X-Spam-Level:
X-Spam-Status: No, score=-16.8 required=3.0 tests=BAYES_00, HEADER_FROM_DIFFERENT_DOMAINS,INCLUDES_CR_TRAILER,INCLUDES_PATCH, MAILING_LIST_MULTI,SPF_HELO_NONE,SPF_PASS,URIBL_BLOCKED,USER_AGENT_GIT autolearn=ham autolearn_force=no version=3.4.0
Received: from mail.kernel.org (mail.kernel.org [198.145.29.99]) by smtp.lore.kernel.org (Postfix) with ESMTP id C8C4FC433DB for ; Mon, 8 Mar 2021 17:01:48 +0000 (UTC)
Received: from vger.kernel.org (vger.kernel.org [23.128.96.18]) by mail.kernel.org (Postfix) with ESMTP id 733866522C for ; Mon, 8 Mar 2021 17:01:48 +0000 (UTC)
Received: (majordomo@vger.kernel.org) by vger.kernel.org via listexpand id S230126AbhCHRBQ (ORCPT ); Mon, 8 Mar 2021 12:01:16 -0500
Received: from frasgout.his.huawei.com ([185.176.79.56]:2658 "EHLO frasgout.his.huawei.com" rhost-flags-OK-OK-OK-OK) by vger.kernel.org with ESMTP id S229469AbhCHRAq (ORCPT ); Mon, 8 Mar 2021 12:00:46 -0500
Received: from fraeml741-chm.china.huawei.com (unknown [172.18.147.226]) by frasgout.his.huawei.com (SkyGuard) with ESMTP id 4DvPZz0cSqz67wj5; Tue, 9 Mar 2021 00:54:51 +0800 (CST)
Received: from lhreml715-chm.china.huawei.com (10.201.108.66) by fraeml741-chm.china.huawei.com (10.206.15.222) with Microsoft SMTP Server (version=TLS1_2, cipher=TLS_ECDHE_RSA_WITH_AES_128_GCM_SHA256) id 15.1.2106.2; Mon, 8 Mar 2021 18:00:45 +0100
Received: from DESKTOP-6T4S3DQ.china.huawei.com (10.47.25.24) by lhreml715-chm.china.huawei.com (10.201.108.66) with Microsoft SMTP Server (version=TLS1_2, cipher=TLS_ECDHE_RSA_WITH_AES_128_GCM_SHA256) id 15.1.2106.2; Mon, 8 Mar 2021 17:00:44 +0000
From: Shiju Jose
To: ,
CC: , ,
Subject: [PATCH v3 1/7] rasdaemon: add support for
memory_failure events
Date: Mon, 8 Mar 2021 16:57:26 +0000
Message-ID: <20210308165732.273-2-shiju.jose@huawei.com>
X-Mailer: git-send-email 2.26.0.windows.1
In-Reply-To: <20210308165732.273-1-shiju.jose@huawei.com>
References: <20210308165732.273-1-shiju.jose@huawei.com>
MIME-Version: 1.0
X-Originating-IP: [10.47.25.24]
X-ClientProxiedBy: lhreml747-chm.china.huawei.com (10.201.108.197) To lhreml715-chm.china.huawei.com (10.201.108.66)
X-CFilter-Loop: Reflected
Precedence: bulk
List-ID:
X-Mailing-List: linux-edac@vger.kernel.org

Add support to log the memory_failure kernel trace events.

Example rasdaemon log and SQLite DB output for the memory_failure event,
=================================================
rasdaemon: memory_failure_event store: 0x126ce8f8
rasdaemon: register inserted at db
<...>-785 [000] 0.000024: memory_failure_event: 2020-10-02 13:27:13 -0400 pfn=0x204000000 page_type=free buddy page action_result=Delayed

CREATE TABLE memory_failure_event (id INTEGER PRIMARY KEY, timestamp TEXT, pfn TEXT, page_type TEXT, action_result TEXT);
INSERT INTO memory_failure_event VALUES(1,'2020-10-02 13:27:13 -0400','0x204000000','free buddy page','Delayed');
==================================================

Signed-off-by: Shiju Jose
---
 .travis.yml                  |   2 +-
 Makefile.am                  |   7 +-
 configure.ac                 |  11 +++
 ras-events.c                 |  15 +++
 ras-events.h                 |   1 +
 ras-memory-failure-handler.c | 179 +++++++++++++++++++++++++++++++++++
 ras-memory-failure-handler.h |  25 +++++
 ras-record.c                 |  70 ++++++++++++++
 ras-record.h                 |  13 +++
 ras-report.c                 |  68 +++++++++++++
 ras-report.h                 |   2 +
 11 files changed, 390 insertions(+), 3 deletions(-)
 create mode 100644 ras-memory-failure-handler.c
 create mode 100644 ras-memory-failure-handler.h

diff --git a/.travis.yml b/.travis.yml
index 41d716d..7855b8e 100644
--- a/.travis.yml
+++ b/.travis.yml
@@ -26,7 +26,7 @@ before_install:
 - sudo apt-get install -y sqlite3
 install:
 - autoreconf -vfi
-- ./configure --enable-sqlite3 --enable-aer --enable-non-standard
--enable-arm --enable-mce --enable-extlog --enable-devlink --enable-diskerror --enable-abrt-report --enable-hisi-ns-decode --enable-memory-ce-pfa
+- ./configure --enable-sqlite3 --enable-aer --enable-non-standard --enable-arm --enable-mce --enable-extlog --enable-devlink --enable-diskerror --enable-memory-failure --enable-abrt-report --enable-hisi-ns-decode --enable-memory-ce-pfa
 
 script:
 - make && sudo make install
diff --git a/Makefile.am b/Makefile.am
index de01098..7c1c027 100644
--- a/Makefile.am
+++ b/Makefile.am
@@ -48,6 +48,9 @@ endif
 if WITH_DISKERROR
    rasdaemon_SOURCES += ras-diskerror-handler.c
 endif
+if WITH_MEMORY_FAILURE
+   rasdaemon_SOURCES += ras-memory-failure-handler.c
+endif
 if WITH_ABRT_REPORT
    rasdaemon_SOURCES += ras-report.c
 endif
@@ -62,8 +65,8 @@ rasdaemon_LDADD = -lpthread $(SQLITE3_LIBS) libtrace/libtrace.a
 include_HEADERS = config.h ras-events.h ras-logger.h ras-mc-handler.h \
 		  ras-aer-handler.h ras-mce-handler.h ras-record.h bitfield.h ras-report.h \
 		  ras-extlog-handler.h ras-arm-handler.h ras-non-standard-handler.h \
-		  ras-devlink-handler.h ras-diskerror-handler.h rbtree.h ras-page-isolation.h \
-		  non-standard-hisilicon.h
+		  ras-devlink-handler.h ras-diskerror-handler.h ras-memory-failure-handler.h \
+		  rbtree.h ras-page-isolation.h non-standard-hisilicon.h
 
 # This rule can't be called with more than one Makefile job (like make -j8)
 # I can't figure out a way to fix that
diff --git a/configure.ac b/configure.ac
index e276c84..a6251d4 100644
--- a/configure.ac
+++ b/configure.ac
@@ -111,6 +111,16 @@ AS_IF([test "x$enable_diskerror" = "xyes" || test "x$enable_all" == "xyes"], [
 AM_CONDITIONAL([WITH_DISKERROR], [test x$enable_diskerror = xyes || test x$enable_all == xyes])
 AM_COND_IF([WITH_DISKERROR], [USE_DISKERROR="yes"], [USE_DISKERROR="no"])
 
+AC_ARG_ENABLE([memory_failure],
+    AS_HELP_STRING([--enable-memory-failure], [enable memory failure events (currently experimental)]))
+
+AS_IF([test "x$enable_memory_failure" = "xyes" || test
"x$enable_all" == "xyes"], [ + AC_DEFINE(HAVE_MEMORY_FAILURE,1,"have memory failure events collect") + AC_SUBST([WITH_MEMORY_FAILURE]) +]) +AM_CONDITIONAL([WITH_MEMORY_FAILURE], [test x$enable_memory_failure = xyes || test x$enable_all == xyes]) +AM_COND_IF([WITH_MEMORY_FAILURE], [USE_MEMORY_FAILURE="yes"], [USE_MEMORY_FAILURE="no"]) + AC_ARG_ENABLE([abrt_report], AS_HELP_STRING([--enable-abrt-report], [enable report event to ABRT (currently experimental)])) @@ -178,5 +188,6 @@ compile time options summary ARM events : $USE_ARM DEVLINK : $USE_DEVLINK Disk I/O errors : $USE_DISKERROR + Memory Failure : $USE_MEMORY_FAILURE Memory CE PFA : $USE_MEMORY_CE_PFA EOF diff --git a/ras-events.c b/ras-events.c index 4509f56..fe4bd26 100644 --- a/ras-events.c +++ b/ras-events.c @@ -37,6 +37,7 @@ #include "ras-extlog-handler.h" #include "ras-devlink-handler.h" #include "ras-diskerror-handler.h" +#include "ras-memory-failure-handler.h" #include "ras-record.h" #include "ras-logger.h" #include "ras-page-isolation.h" @@ -231,6 +232,10 @@ int toggle_ras_mc_event(int enable) rc |= __toggle_ras_mc_event(ras, "block", "block_rq_complete", enable); #endif +#ifdef HAVE_MEMORY_FAILURE + rc |= __toggle_ras_mc_event(ras, "ras", "memory_failure_event", enable); +#endif + free_ras: free(ras); return rc; @@ -909,6 +914,16 @@ int handle_ras_events(int record_events) } #endif +#ifdef HAVE_MEMORY_FAILURE + rc = add_event_handler(ras, pevent, page_size, "ras", "memory_failure_event", + ras_memory_failure_event_handler, NULL, MF_EVENT); + if (!rc) + num_events++; + else + log(ALL, LOG_ERR, "Can't get traces from %s:%s\n", + "ras", "memory_failure_event"); +#endif + if (!num_events) { log(ALL, LOG_INFO, "Failed to trace all supported RAS events. 
Aborting.\n");
diff --git a/ras-events.h b/ras-events.h
index f028741..dfd690c 100644
--- a/ras-events.h
+++ b/ras-events.h
@@ -38,6 +38,7 @@ enum {
 	EXTLOG_EVENT,
 	DEVLINK_EVENT,
 	DISKERROR_EVENT,
+	MF_EVENT,
 
 	NR_EVENTS
 };
diff --git a/ras-memory-failure-handler.c b/ras-memory-failure-handler.c
new file mode 100644
index 0000000..9941e68
--- /dev/null
+++ b/ras-memory-failure-handler.c
@@ -0,0 +1,179 @@
+/*
+ * Copyright (c) Huawei Technologies Co., Ltd. 2020. All rights reserved.
+ *
+ * This program is free software; you can redistribute it and/or modify
+ * it under the terms of the GNU General Public License as published by
+ * the Free Software Foundation; either version 2 of the License, or
+ * (at your option) any later version.
+ *
+ * This program is distributed in the hope that it will be useful,
+ * but WITHOUT ANY WARRANTY; without even the implied warranty of
+ * MERCHANTABILITY or FITNESS FOR A PARTICULAR PURPOSE. See the
+ * GNU General Public License for more details.
+ */
+
+#include
+#include
+#include
+#include "libtrace/kbuffer.h"
+#include "ras-memory-failure-handler.h"
+#include "ras-record.h"
+#include "ras-logger.h"
+#include "ras-report.h"
+
+/* Memory failure - various types of pages */
+enum mf_action_page_type {
+	MF_MSG_KERNEL,
+	MF_MSG_KERNEL_HIGH_ORDER,
+	MF_MSG_SLAB,
+	MF_MSG_DIFFERENT_COMPOUND,
+	MF_MSG_POISONED_HUGE,
+	MF_MSG_HUGE,
+	MF_MSG_FREE_HUGE,
+	MF_MSG_NON_PMD_HUGE,
+	MF_MSG_UNMAP_FAILED,
+	MF_MSG_DIRTY_SWAPCACHE,
+	MF_MSG_CLEAN_SWAPCACHE,
+	MF_MSG_DIRTY_MLOCKED_LRU,
+	MF_MSG_CLEAN_MLOCKED_LRU,
+	MF_MSG_DIRTY_UNEVICTABLE_LRU,
+	MF_MSG_CLEAN_UNEVICTABLE_LRU,
+	MF_MSG_DIRTY_LRU,
+	MF_MSG_CLEAN_LRU,
+	MF_MSG_TRUNCATED_LRU,
+	MF_MSG_BUDDY,
+	MF_MSG_BUDDY_2ND,
+	MF_MSG_DAX,
+	MF_MSG_UNSPLIT_THP,
+	MF_MSG_UNKNOWN,
+};
+
+/* Action results for various types of pages */
+enum mf_action_result {
+	MF_IGNORED,	/* Error: cannot be handled */
+	MF_FAILED,	/* Error: handling failed */
+	MF_DELAYED,	/* Will be handled later */
+
	MF_RECOVERED,	/* Successfully recovered */
+};
+
+/* memory failure page types */
+static const struct {
+	int type;
+	const char *page_type;
+} mf_page_type[] = {
+	{ MF_MSG_KERNEL, "reserved kernel page" },
+	{ MF_MSG_KERNEL_HIGH_ORDER, "high-order kernel page"},
+	{ MF_MSG_SLAB, "kernel slab page"},
+	{ MF_MSG_DIFFERENT_COMPOUND, "different compound page after locking"},
+	{ MF_MSG_POISONED_HUGE, "huge page already hardware poisoned"},
+	{ MF_MSG_HUGE, "huge page"},
+	{ MF_MSG_FREE_HUGE, "free huge page"},
+	{ MF_MSG_NON_PMD_HUGE, "non-pmd-sized huge page"},
+	{ MF_MSG_UNMAP_FAILED, "unmapping failed page"},
+	{ MF_MSG_DIRTY_SWAPCACHE, "dirty swapcache page"},
+	{ MF_MSG_CLEAN_SWAPCACHE, "clean swapcache page"},
+	{ MF_MSG_DIRTY_MLOCKED_LRU, "dirty mlocked LRU page"},
+	{ MF_MSG_CLEAN_MLOCKED_LRU, "clean mlocked LRU page"},
+	{ MF_MSG_DIRTY_UNEVICTABLE_LRU, "dirty unevictable LRU page"},
+	{ MF_MSG_CLEAN_UNEVICTABLE_LRU, "clean unevictable LRU page"},
+	{ MF_MSG_DIRTY_LRU, "dirty LRU page"},
+	{ MF_MSG_CLEAN_LRU, "clean LRU page"},
+	{ MF_MSG_TRUNCATED_LRU, "already truncated LRU page"},
+	{ MF_MSG_BUDDY, "free buddy page"},
+	{ MF_MSG_BUDDY_2ND, "free buddy page (2nd try)"},
+	{ MF_MSG_DAX, "dax page"},
+	{ MF_MSG_UNSPLIT_THP, "unsplit thp"},
+	{ MF_MSG_UNKNOWN, "unknown page"},
+};
+
+/* memory failure action results */
+static const struct {
+	int result;
+	const char *action_result;
+} mf_action_result[] = {
+	{ MF_IGNORED, "Ignored" },
+	{ MF_FAILED, "Failed" },
+	{ MF_DELAYED, "Delayed" },
+	{ MF_RECOVERED, "Recovered" },
+};
+
+static const char *get_page_type(int page_type)
+{
+	int i;
+
+	for (i = 0; i < ARRAY_SIZE(mf_page_type); i++)
+		if (mf_page_type[i].type == page_type)
+			return mf_page_type[i].page_type;
+
+	return "unknown page";
+}
+
+static const char *get_action_result(int result)
+{
+	int i;
+
+	for (i = 0; i < ARRAY_SIZE(mf_action_result); i++)
+		if (mf_action_result[i].result == result)
+			return mf_action_result[i].action_result;
+
+	return
"unknown"; +} + + +int ras_memory_failure_event_handler(struct trace_seq *s, + struct pevent_record *record, + struct event_format *event, void *context) +{ + unsigned long long val; + struct ras_events *ras = context; + time_t now; + struct tm *tm; + struct ras_mf_event ev; + + /* + * Newer kernels (3.10-rc1 or upper) provide an uptime clock. + * On previous kernels, the way to properly generate an event would + * be to inject a fake one, measure its timestamp and diff it against + * gettimeofday. We won't do it here. Instead, let's use uptime, + * falling-back to the event report's time, if "uptime" clock is + * not available (legacy kernels). + */ + + if (ras->use_uptime) + now = record->ts/user_hz + ras->uptime_diff; + else + now = time(NULL); + + tm = localtime(&now); + if (tm) + strftime(ev.timestamp, sizeof(ev.timestamp), + "%Y-%m-%d %H:%M:%S %z", tm); + trace_seq_printf(s, "%s ", ev.timestamp); + + if (pevent_get_field_val(s, event, "pfn", record, &val, 1) < 0) + return -1; + sprintf(ev.pfn, "0x%llx", val); + trace_seq_printf(s, "pfn=0x%llx ", val); + + if (pevent_get_field_val(s, event, "type", record, &val, 1) < 0) + return -1; + ev.page_type = get_page_type(val); + trace_seq_printf(s, "page_type=%s ", ev.page_type); + + if (pevent_get_field_val(s, event, "result", record, &val, 1) < 0) + return -1; + ev.action_result = get_action_result(val); + trace_seq_printf(s, "action_result=%s ", ev.action_result); + + /* Store data into the SQLite DB */ +#ifdef HAVE_SQLITE3 + ras_store_mf_event(ras, &ev); +#endif + +#ifdef HAVE_ABRT_REPORT + /* Report event to ABRT */ + ras_report_mf_event(ras, &ev); +#endif + + return 0; +} diff --git a/ras-memory-failure-handler.h b/ras-memory-failure-handler.h new file mode 100644 index 0000000..b9e9971 --- /dev/null +++ b/ras-memory-failure-handler.h @@ -0,0 +1,25 @@ +/* + * Copyright (c) Huawei Technologies Co., Ltd. 2020. All rights reserved. 
+ *
+ * This program is free software; you can redistribute it and/or modify
+ * it under the terms of the GNU General Public License as published by
+ * the Free Software Foundation; either version 2 of the License, or
+ * (at your option) any later version.
+ *
+ * This program is distributed in the hope that it will be useful,
+ * but WITHOUT ANY WARRANTY; without even the implied warranty of
+ * MERCHANTABILITY or FITNESS FOR A PARTICULAR PURPOSE. See the
+ * GNU General Public License for more details.
+*/
+
+#ifndef __RAS_MEMORY_FAILURE_HANDLER_H
+#define __RAS_MEMORY_FAILURE_HANDLER_H
+
+#include "ras-events.h"
+#include "libtrace/event-parse.h"
+
+int ras_memory_failure_event_handler(struct trace_seq *s,
+				     struct pevent_record *record,
+				     struct event_format *event, void *context);
+
+#endif
diff --git a/ras-record.c b/ras-record.c
index 549c494..1a2ea06 100644
--- a/ras-record.c
+++ b/ras-record.c
@@ -498,6 +498,56 @@ int ras_store_diskerror_event(struct ras_events *ras, struct diskerror_event *ev
 }
 #endif
 
+/*
+ * Table and functions to handle ras:memory_failure
+ */
+
+#ifdef HAVE_MEMORY_FAILURE
+static const struct db_fields mf_event_fields[] = {
+	{ .name="id", .type="INTEGER PRIMARY KEY" },
+	{ .name="timestamp", .type="TEXT" },
+	{ .name="pfn", .type="TEXT" },
+	{ .name="page_type", .type="TEXT" },
+	{ .name="action_result", .type="TEXT" },
+};
+
+static const struct db_table_descriptor mf_event_tab = {
+	.name = "memory_failure_event",
+	.fields = mf_event_fields,
+	.num_fields = ARRAY_SIZE(mf_event_fields),
+};
+
+int ras_store_mf_event(struct ras_events *ras, struct ras_mf_event *ev)
+{
+	int rc;
+	struct sqlite3_priv *priv = ras->db_priv;
+
+	if (!priv || !priv->stmt_mf_event)
+		return 0;
+	log(TERM, LOG_INFO, "memory_failure_event store: %p\n", priv->stmt_mf_event);
+
+	sqlite3_bind_text(priv->stmt_mf_event, 1, ev->timestamp, -1, NULL);
+	sqlite3_bind_text(priv->stmt_mf_event, 2, ev->pfn, -1, NULL);
+	sqlite3_bind_text(priv->stmt_mf_event, 3,
ev->page_type, -1, NULL);
+	sqlite3_bind_text(priv->stmt_mf_event, 4, ev->action_result, -1, NULL);
+
+	rc = sqlite3_step(priv->stmt_mf_event);
+	if (rc != SQLITE_OK && rc != SQLITE_DONE)
+		log(TERM, LOG_ERR,
+		    "Failed to do memory_failure_event step on sqlite: error = %d\n", rc);
+
+	rc = sqlite3_reset(priv->stmt_mf_event);
+	if (rc != SQLITE_OK && rc != SQLITE_DONE)
+		log(TERM, LOG_ERR,
+		    "Failed reset memory_failure_event on sqlite: error = %d\n",
+		    rc);
+
+	log(TERM, LOG_INFO, "register inserted at db\n");
+
+	return rc;
+}
+#endif
+
 /*
  * Generic code
  */
@@ -810,6 +860,16 @@ int ras_mc_event_opendb(unsigned cpu, struct ras_events *ras)
 	}
 #endif
 
+#ifdef HAVE_MEMORY_FAILURE
+	rc = ras_mc_create_table(priv, &mf_event_tab);
+	if (rc == SQLITE_OK) {
+		rc = ras_mc_prepare_stmt(priv, &priv->stmt_mf_event,
+					 &mf_event_tab);
+		if (rc != SQLITE_OK)
+			goto error;
+	}
+#endif
+
 	ras->db_priv = priv;
 	return 0;
 
@@ -912,6 +972,16 @@ int ras_mc_event_closedb(unsigned int cpu, struct ras_events *ras)
 	}
 #endif
 
+#ifdef HAVE_MEMORY_FAILURE
+	if (priv->stmt_mf_event) {
+		rc = sqlite3_finalize(priv->stmt_mf_event);
+		if (rc != SQLITE_OK)
+			log(TERM, LOG_ERR,
+			    "cpu %u: Failed to finalize mf_event sqlite: error = %d\n",
+			    cpu, rc);
+	}
+#endif
+
 	rc = sqlite3_close_v2(db);
 	if (rc != SQLITE_OK)
 		log(TERM, LOG_ERR,
diff --git a/ras-record.h b/ras-record.h
index cc217a9..4bbeb0c 100644
--- a/ras-record.h
+++ b/ras-record.h
@@ -98,6 +98,13 @@ struct diskerror_event {
 	const char *cmd;
 };
 
+struct ras_mf_event {
+	char timestamp[64];
+	char pfn[30];
+	const char *page_type;
+	const char *action_result;
+};
+
 struct ras_mc_event;
 struct ras_aer_event;
 struct ras_extlog_event;
@@ -106,6 +113,7 @@ struct ras_arm_event;
 struct mce_event;
 struct devlink_event;
 struct diskerror_event;
+struct ras_mf_event;
 
 #ifdef HAVE_SQLITE3
 
@@ -135,6 +143,9 @@ struct sqlite3_priv {
 #ifdef HAVE_DISKERROR
 	sqlite3_stmt *stmt_diskerror_event;
 #endif
+#ifdef HAVE_MEMORY_FAILURE
+	sqlite3_stmt *stmt_mf_event;
+#endif
 };
 
 struct db_fields {
@@ -161,6 +172,7 @@ int ras_store_non_standard_record(struct ras_events *ras, struct ras_non_standar
 int ras_store_arm_record(struct ras_events *ras, struct ras_arm_event *ev);
 int ras_store_devlink_event(struct ras_events *ras, struct devlink_event *ev);
 int ras_store_diskerror_event(struct ras_events *ras, struct diskerror_event *ev);
+int ras_store_mf_event(struct ras_events *ras, struct ras_mf_event *ev);
 
 #else
 static inline int ras_mc_event_opendb(unsigned cpu, struct ras_events *ras) { return 0; };
@@ -173,6 +185,7 @@ static inline int ras_store_non_standard_record(struct ras_events *ras, struct r
 static inline int ras_store_arm_record(struct ras_events *ras, struct ras_arm_event *ev) { return 0; };
 static inline int ras_store_devlink_event(struct ras_events *ras, struct devlink_event *ev) { return 0; };
 static inline int ras_store_diskerror_event(struct ras_events *ras, struct diskerror_event *ev) { return 0; };
+static inline int ras_store_mf_event(struct ras_events *ras, struct ras_mf_event *ev) { return 0; };
 
 #endif
 
diff --git a/ras-report.c b/ras-report.c
index 2710eac..ea3a9b6 100644
--- a/ras-report.c
+++ b/ras-report.c
@@ -309,6 +309,28 @@ static int set_diskerror_event_backtrace(char *buf, struct diskerror_event *ev)
 	return 0;
 }
 
+static int set_mf_event_backtrace(char *buf, struct ras_mf_event *ev)
+{
+	char bt_buf[MAX_BACKTRACE_SIZE];
+
+	if (!buf || !ev)
+		return -1;
+
+	sprintf(bt_buf, "BACKTRACE=" \
+			"timestamp=%s\n" \
+			"pfn=%s\n" \
+			"page_type=%s\n" \
+			"action_result=%s\n", \
+			ev->timestamp, \
+			ev->pfn, \
+			ev->page_type, \
+			ev->action_result);
+
+	strcat(buf, bt_buf);
+
+	return 0;
+}
+
 static int commit_report_backtrace(int sockfd, int type, void *ev){
 	char buf[MAX_BACKTRACE_SIZE];
 	char *pbuf = buf;
@@ -343,6 +365,9 @@ static int commit_report_backtrace(int sockfd, int type, void *ev){
 	case DISKERROR_EVENT:
 		rc = set_diskerror_event_backtrace(buf, (struct diskerror_event *)ev);
 		break;
+	case MF_EVENT:
+		rc =
set_mf_event_backtrace(buf, (struct ras_mf_event *)ev);
+		break;
 	default:
 		return -1;
 	}
@@ -708,3 +733,46 @@ diskerror_fail:
 		return -1;
 	}
 }
+
+int ras_report_mf_event(struct ras_events *ras, struct ras_mf_event *ev)
+{
+	char buf[MAX_MESSAGE_SIZE];
+	int sockfd = 0;
+	int done = 0;
+	int rc = -1;
+
+	memset(buf, 0, sizeof(buf));
+
+	sockfd = setup_report_socket();
+	if (sockfd < 0)
+		return -1;
+
+	rc = commit_report_basic(sockfd);
+	if (rc < 0)
+		goto mf_fail;
+
+	rc = commit_report_backtrace(sockfd, MF_EVENT, ev);
+	if (rc < 0)
+		goto mf_fail;
+
+	sprintf(buf, "ANALYZER=%s", "rasdaemon-memory_failure");
+	rc = write(sockfd, buf, strlen(buf) + 1);
+	if (rc < strlen(buf) + 1)
+		goto mf_fail;
+
+	sprintf(buf, "REASON=%s", "memory failure problem");
+	rc = write(sockfd, buf, strlen(buf) + 1);
+	if (rc < strlen(buf) + 1)
+		goto mf_fail;
+
+	done = 1;
+
+mf_fail:
+	if (sockfd > 0)
+		close(sockfd);
+
+	if (done)
+		return 0;
+	else
+		return -1;
+}
diff --git a/ras-report.h b/ras-report.h
index 1d911de..e605eb1 100644
--- a/ras-report.h
+++ b/ras-report.h
@@ -38,6 +38,7 @@ int ras_report_non_standard_event(struct ras_events *ras, struct ras_non_standar
 int ras_report_arm_event(struct ras_events *ras, struct ras_arm_event *ev);
 int ras_report_devlink_event(struct ras_events *ras, struct devlink_event *ev);
 int ras_report_diskerror_event(struct ras_events *ras, struct diskerror_event *ev);
+int ras_report_mf_event(struct ras_events *ras, struct ras_mf_event *ev);
 
 #else
 
@@ -48,6 +49,7 @@ static inline int ras_report_non_standard_event(struct ras_events *ras, struct r
 static inline int ras_report_arm_event(struct ras_events *ras, struct ras_arm_event *ev) { return 0; };
 static inline int ras_report_devlink_event(struct ras_events *ras, struct devlink_event *ev) { return 0; };
 static inline int ras_report_diskerror_event(struct ras_events *ras, struct diskerror_event *ev) { return 0; };
+static inline int ras_report_mf_event(struct ras_events *ras, struct ras_mf_event *ev) {
return 0; };
 
 #endif
 
From patchwork Mon Mar 8 16:57:27 2021
Content-Type: text/plain; charset="utf-8"
MIME-Version: 1.0
Content-Transfer-Encoding: 7bit
X-Patchwork-Submitter: Shiju Jose
X-Patchwork-Id: 12122943
Return-Path:
X-Spam-Checker-Version: SpamAssassin 3.4.0 (2014-02-07) on aws-us-west-2-korg-lkml-1.web.codeaurora.org
X-Spam-Level:
X-Spam-Status: No, score=-16.8 required=3.0 tests=BAYES_00, HEADER_FROM_DIFFERENT_DOMAINS,INCLUDES_CR_TRAILER,INCLUDES_PATCH, MAILING_LIST_MULTI,SPF_HELO_NONE,SPF_PASS,USER_AGENT_GIT autolearn=ham autolearn_force=no version=3.4.0
Received: from mail.kernel.org (mail.kernel.org [198.145.29.99]) by smtp.lore.kernel.org (Postfix) with ESMTP id 149BCC433E0 for ; Mon, 8 Mar 2021 17:02:21 +0000 (UTC)
Received: from vger.kernel.org (vger.kernel.org [23.128.96.18]) by mail.kernel.org (Postfix) with ESMTP id E123D64FB8 for ; Mon, 8 Mar 2021 17:02:20 +0000 (UTC)
Received: (majordomo@vger.kernel.org) by vger.kernel.org via listexpand id S229730AbhCHRBs (ORCPT ); Mon, 8 Mar 2021 12:01:48 -0500
Received: from frasgout.his.huawei.com ([185.176.79.56]:2659 "EHLO frasgout.his.huawei.com" rhost-flags-OK-OK-OK-OK) by vger.kernel.org with ESMTP id S229813AbhCHRBS (ORCPT ); Mon, 8 Mar 2021 12:01:18 -0500
Received: from fraeml740-chm.china.huawei.com (unknown [172.18.147.206]) by frasgout.his.huawei.com (SkyGuard) with ESMTP id 4DvPdM4gvxz67wch; Tue, 9 Mar 2021 00:56:55 +0800 (CST)
Received: from lhreml715-chm.china.huawei.com (10.201.108.66) by fraeml740-chm.china.huawei.com (10.206.15.221) with Microsoft SMTP Server (version=TLS1_2, cipher=TLS_ECDHE_RSA_WITH_AES_128_GCM_SHA256) id 15.1.2106.2; Mon, 8 Mar 2021 18:01:16 +0100
Received: from DESKTOP-6T4S3DQ.china.huawei.com (10.47.25.24) by lhreml715-chm.china.huawei.com (10.201.108.66) with Microsoft SMTP Server (version=TLS1_2, cipher=TLS_ECDHE_RSA_WITH_AES_128_GCM_SHA256) id 15.1.2106.2; Mon, 8 Mar 2021 17:01:16 +0000
From: Shiju Jose
To: ,
CC: , ,
Subject: [PATCH v3 2/7] rasdaemon: ras-mc-ctl:
Modify ARM processor error summary log
Date: Mon, 8 Mar 2021 16:57:27 +0000
Message-ID: <20210308165732.273-3-shiju.jose@huawei.com>
X-Mailer: git-send-email 2.26.0.windows.1
In-Reply-To: <20210308165732.273-1-shiju.jose@huawei.com>
References: <20210308165732.273-1-shiju.jose@huawei.com>
MIME-Version: 1.0
X-Originating-IP: [10.47.25.24]
X-ClientProxiedBy: lhreml747-chm.china.huawei.com (10.201.108.197) To lhreml715-chm.china.huawei.com (10.201.108.66)
X-CFilter-Loop: Reflected
Precedence: bulk
List-ID:
X-Mailing-List: linux-edac@vger.kernel.org

Add CPU's mpidr information to the ARM processor error summary log.

Signed-off-by: Shiju Jose
---
 util/ras-mc-ctl.in | 8 ++++----
 1 file changed, 4 insertions(+), 4 deletions(-)

diff --git a/util/ras-mc-ctl.in b/util/ras-mc-ctl.in
index 1fbeb63..21e9d29 100755
--- a/util/ras-mc-ctl.in
+++ b/util/ras-mc-ctl.in
@@ -1137,7 +1137,7 @@ sub summary
     my ($err_type, $label, $mc, $top, $mid, $low, $count, $msg);
     my ($etype, $severity, $etype_string, $severity_string);
     my ($dev_name, $dev);
-    my ($affinity, $mpidr);
+    my ($mpidr);
 
     my $dbh = DBI->connect("dbi:SQLite:dbname=$dbname", "", "", {});
 
@@ -1177,13 +1177,13 @@ sub summary
 
     # ARM processor arm_event errors
     if ($has_arm == 1) {
-        $query = "select affinity, mpidr, count(*) from arm_event group by affinity, mpidr";
+        $query = "select mpidr, count(*) from arm_event group by mpidr";
         $query_handle = $dbh->prepare($query);
         $query_handle->execute();
-        $query_handle->bind_columns(\($affinity, $mpidr, $count));
+        $query_handle->bind_columns(\($mpidr, $count));
         $out = "";
         while($query_handle->fetch()) {
-            $out .= "\t$count errors\n";
+            $out .= sprintf "\tCPU(mpidr=0x%x) has %d errors\n", $mpidr, $count;
         }
         if ($out ne "") {
             print "ARM processor events summary:\n$out\n";

From patchwork Mon Mar 8 16:57:28 2021
Content-Type: text/plain; charset="utf-8"
MIME-Version: 1.0
Content-Transfer-Encoding: 7bit
X-Patchwork-Submitter: Shiju Jose
X-Patchwork-Id: 12122945
Return-Path:
X-Spam-Checker-Version: SpamAssassin 3.4.0 (2014-02-07) on aws-us-west-2-korg-lkml-1.web.codeaurora.org
X-Spam-Level:
X-Spam-Status: No, score=-16.8 required=3.0 tests=BAYES_00, HEADER_FROM_DIFFERENT_DOMAINS,INCLUDES_CR_TRAILER,INCLUDES_PATCH, MAILING_LIST_MULTI,SPF_HELO_NONE,SPF_PASS,URIBL_BLOCKED,USER_AGENT_GIT autolearn=ham autolearn_force=no version=3.4.0
Received: from mail.kernel.org (mail.kernel.org [198.145.29.99]) by smtp.lore.kernel.org (Postfix) with ESMTP id B056DC433E0 for ; Mon, 8 Mar 2021 17:02:52 +0000 (UTC)
Received: from vger.kernel.org (vger.kernel.org [23.128.96.18]) by mail.kernel.org (Postfix) with ESMTP id 8568C6522C for ; Mon, 8 Mar 2021 17:02:52 +0000 (UTC)
Received: (majordomo@vger.kernel.org) by vger.kernel.org via listexpand id S229459AbhCHRCU (ORCPT ); Mon, 8 Mar 2021 12:02:20 -0500
Received: from frasgout.his.huawei.com ([185.176.79.56]:2660 "EHLO frasgout.his.huawei.com" rhost-flags-OK-OK-OK-OK) by vger.kernel.org with ESMTP id S230075AbhCHRCP (ORCPT ); Mon, 8 Mar 2021 12:02:15 -0500
Received: from fraeml736-chm.china.huawei.com (unknown [172.18.147.226]) by frasgout.his.huawei.com (SkyGuard) with ESMTP id 4DvPch1018z67wp1; Tue, 9 Mar 2021 00:56:20 +0800 (CST)
Received: from lhreml715-chm.china.huawei.com (10.201.108.66) by fraeml736-chm.china.huawei.com (10.206.15.217) with Microsoft SMTP Server (version=TLS1_2, cipher=TLS_ECDHE_RSA_WITH_AES_128_GCM_SHA256) id 15.1.2106.2; Mon, 8 Mar 2021 18:02:14 +0100
Received: from DESKTOP-6T4S3DQ.china.huawei.com (10.47.25.24) by lhreml715-chm.china.huawei.com (10.201.108.66) with Microsoft SMTP Server (version=TLS1_2, cipher=TLS_ECDHE_RSA_WITH_AES_128_GCM_SHA256) id 15.1.2106.2; Mon, 8 Mar 2021 17:02:12 +0000
From: Shiju Jose
To: ,
CC: , ,
Subject: [PATCH v3 3/7] rasdaemon: ras-mc-ctl: Add memory failure events
Date: Mon, 8 Mar 2021 16:57:28 +0000
Message-ID: <20210308165732.273-4-shiju.jose@huawei.com>
X-Mailer: git-send-email 2.26.0.windows.1
In-Reply-To:
<20210308165732.273-1-shiju.jose@huawei.com>
References: <20210308165732.273-1-shiju.jose@huawei.com>
MIME-Version: 1.0
X-Originating-IP: [10.47.25.24]
X-ClientProxiedBy: lhreml747-chm.china.huawei.com (10.201.108.197) To lhreml715-chm.china.huawei.com (10.201.108.66)
X-CFilter-Loop: Reflected
Precedence: bulk
List-ID:
X-Mailing-List: linux-edac@vger.kernel.org

Add support for memory failure errors (memory_failure_event) to the
ras-mc-ctl tool.

Sample log:
ras-mc-ctl --summary
...
Memory failure events summary:
	Delayed errors: 4
	Failed errors: 1
...

ras-mc-ctl --errors
...
Memory failure events:
1 2020-10-28 23:20:41 -0800 error: pfn=0x204000000, page_type=free buddy page, action_result=Delayed
2 2020-10-28 23:31:38 -0800 error: pfn=0x204000000, page_type=free buddy page, action_result=Delayed
3 2020-10-28 23:54:54 -0800 error: pfn=0x205000000, page_type=free buddy page, action_result=Delayed
4 2020-10-29 00:12:25 -0800 error: pfn=0x204000000, page_type=free buddy page, action_result=Delayed
5 2020-10-29 00:26:36 -0800 error: pfn=0x204000000, page_type=free buddy page, action_result=Failed

Signed-off-by: Shiju Jose
---
 util/ras-mc-ctl.in | 42 +++++++++++++++++++++++++++++++++++++++++-
 1 file changed, 41 insertions(+), 1 deletion(-)

diff --git a/util/ras-mc-ctl.in b/util/ras-mc-ctl.in
index 21e9d29..44847d2 100755
--- a/util/ras-mc-ctl.in
+++ b/util/ras-mc-ctl.in
@@ -46,6 +46,7 @@ my $has_arm = 0;
 my $has_devlink = 0;
 my $has_disk_errors = 0;
 my $has_extlog = 0;
+my $has_mem_failure = 0;
 my $has_mce = 0;
 
 @WITH_AER_TRUE@$has_aer = 1;
@@ -53,6 +54,7 @@ my $has_mce = 0;
 @WITH_DEVLINK_TRUE@$has_devlink = 1;
 @WITH_DISKERROR_TRUE@$has_disk_errors = 1;
 @WITH_EXTLOG_TRUE@$has_extlog = 1;
+@WITH_MEMORY_FAILURE_TRUE@$has_mem_failure = 1;
 @WITH_MCE_TRUE@$has_mce = 1;
 
 my %conf = ();
@@ -1134,7 +1136,7 @@ sub summary
 {
     require DBI;
     my ($query, $query_handle, $out);
-    my ($err_type, $label, $mc, $top, $mid, $low, $count, $msg);
+    my ($err_type, $label, $mc, $top, $mid, $low,
$count, $msg, $action_result);
     my ($etype, $severity, $etype_string, $severity_string);
     my ($dev_name, $dev);
     my ($mpidr);
@@ -1249,6 +1251,24 @@ sub summary
         $query_handle->finish;
     }
 
+    # Memory failure errors
+    if ($has_mem_failure == 1) {
+        $query = "select action_result, count(*) from memory_failure_event group by action_result";
+        $query_handle = $dbh->prepare($query);
+        $query_handle->execute();
+        $query_handle->bind_columns(\($action_result, $count));
+        $out = "";
+        while($query_handle->fetch()) {
+            $out .= "\t$action_result errors: $count\n";
+        }
+        if ($out ne "") {
+            print "Memory failure events summary:\n$out\n";
+        } else {
+            print "No Memory failure errors.\n\n";
+        }
+        $query_handle->finish;
+    }
+
     # MCE mce_record errors
     if ($has_mce == 1) {
         $query = "select error_msg, count(*) from mce_record group by error_msg";
@@ -1279,6 +1299,7 @@ sub errors
     my ($bus_name, $dev_name, $driver_name, $reporter_name);
     my ($dev, $sector, $nr_sector, $error, $rwbs, $cmd);
     my ($error_count, $affinity, $mpidr, $r_state, $psci_state);
+    my ($pfn, $page_type, $action_result);
 
     my $dbh = DBI->connect("dbi:SQLite:dbname=$dbname", "", "", {});
 
@@ -1420,6 +1441,25 @@ sub errors
         $query_handle->finish;
     }
 
+    # Memory failure errors
+    if ($has_mem_failure == 1) {
+        $query = "select id, timestamp, pfn, page_type, action_result from memory_failure_event order by id";
+        $query_handle = $dbh->prepare($query);
+        $query_handle->execute();
+        $query_handle->bind_columns(\($id, $timestamp, $pfn, $page_type, $action_result));
+        $out = "";
+        while($query_handle->fetch()) {
+            $out .= "$id $timestamp error: ";
+            $out .= "pfn=$pfn, page_type=$page_type, action_result=$action_result\n";
+        }
+        if ($out ne "") {
+            print "Memory failure events:\n$out\n";
+        } else {
+            print "No Memory failure errors.\n\n";
+        }
+        $query_handle->finish;
+    }
+
     # MCE mce_record errors
     if ($has_mce == 1) {
         $query = "select id, timestamp, mcgcap, mcgstatus, status, addr, misc, ip, tsc, walltime, cpu, cpuid, apicid, socketid, cs,
bank, cpuvendor, bank_name, error_msg, mcgstatus_msg, mcistatus_msg, user_action, mc_location from mce_record order by id";

From patchwork Mon Mar 8 16:57:29 2021
Content-Type: text/plain; charset="utf-8"
MIME-Version: 1.0
Content-Transfer-Encoding: 7bit
X-Patchwork-Submitter: Shiju Jose
X-Patchwork-Id: 12122947
Return-Path:
X-Spam-Checker-Version: SpamAssassin 3.4.0 (2014-02-07) on aws-us-west-2-korg-lkml-1.web.codeaurora.org
X-Spam-Level:
X-Spam-Status: No, score=-16.8 required=3.0 tests=BAYES_00, HEADER_FROM_DIFFERENT_DOMAINS,INCLUDES_CR_TRAILER,INCLUDES_PATCH, MAILING_LIST_MULTI,SPF_HELO_NONE,SPF_PASS,URIBL_BLOCKED,USER_AGENT_GIT autolearn=ham autolearn_force=no version=3.4.0
Received: from mail.kernel.org (mail.kernel.org [198.145.29.99]) by smtp.lore.kernel.org (Postfix) with ESMTP id AF4A9C433E6 for ; Mon, 8 Mar 2021 17:03:26 +0000 (UTC)
Received: from vger.kernel.org (vger.kernel.org [23.128.96.18]) by mail.kernel.org (Postfix) with ESMTP id 7AB0364FB8 for ; Mon, 8 Mar 2021 17:03:26 +0000 (UTC)
Received: (majordomo@vger.kernel.org) by vger.kernel.org via listexpand id S229955AbhCHRCw (ORCPT ); Mon, 8 Mar 2021 12:02:52 -0500
Received: from frasgout.his.huawei.com ([185.176.79.56]:2661 "EHLO frasgout.his.huawei.com" rhost-flags-OK-OK-OK-OK) by vger.kernel.org with ESMTP id S229463AbhCHRCq (ORCPT ); Mon, 8 Mar 2021 12:02:46 -0500
Received: from fraeml737-chm.china.huawei.com (unknown [172.18.147.226]) by frasgout.his.huawei.com (SkyGuard) with ESMTP id 4DvPZv5cwPz67tX9; Tue, 9 Mar 2021 00:54:47 +0800 (CST)
Received: from lhreml715-chm.china.huawei.com (10.201.108.66) by fraeml737-chm.china.huawei.com (10.206.15.218) with Microsoft SMTP Server (version=TLS1_2, cipher=TLS_ECDHE_RSA_WITH_AES_128_GCM_SHA256) id 15.1.2106.2; Mon, 8 Mar 2021 18:02:45 +0100
Received: from DESKTOP-6T4S3DQ.china.huawei.com (10.47.25.24) by lhreml715-chm.china.huawei.com (10.201.108.66) with Microsoft SMTP Server (version=TLS1_2, cipher=TLS_ECDHE_RSA_WITH_AES_128_GCM_SHA256) id
15.1.2106.2; Mon, 8 Mar 2021 17:02:44 +0000 From: Shiju Jose To: , CC: , , Subject: [PATCH v3 4/7] rasdaemon: ras-mc-ctl: Add support for the vendor-specific errors Date: Mon, 8 Mar 2021 16:57:29 +0000 Message-ID: <20210308165732.273-5-shiju.jose@huawei.com> X-Mailer: git-send-email 2.26.0.windows.1 In-Reply-To: <20210308165732.273-1-shiju.jose@huawei.com> References: <20210308165732.273-1-shiju.jose@huawei.com> MIME-Version: 1.0 X-Originating-IP: [10.47.25.24] X-ClientProxiedBy: lhreml747-chm.china.huawei.com (10.201.108.197) To lhreml715-chm.china.huawei.com (10.201.108.66) X-CFilter-Loop: Reflected Precedence: bulk List-ID: X-Mailing-List: linux-edac@vger.kernel.org Add ras-mc-ctl options to report the logged vendor-specific error information. Signed-off-by: Shiju Jose Reviewed-by: Xiaofei Tan --- util/ras-mc-ctl.in | 64 +++++++++++++++++++++++++++++++++++++++++++++- 1 file changed, 63 insertions(+), 1 deletion(-) diff --git a/util/ras-mc-ctl.in b/util/ras-mc-ctl.in index 44847d2..f282eff 100755 --- a/util/ras-mc-ctl.in +++ b/util/ras-mc-ctl.in @@ -95,6 +95,9 @@ Usage: $prog [OPTIONS...] --summary Presents a summary of the logged errors. --errors Shows the errors stored at the error database. --error-count Shows the corrected and uncorrected error counts using sysfs. + --vendor-errors-summary Presents a summary of the vendor-specific logged errors. + --vendor-errors Shows the vendor-specific errors stored in the error database. + --vendor-platforms Shows the supported platforms with platform-ids for the vendor-specific errors. --help This help message. 
EOF @@ -142,6 +145,18 @@ if ($conf{opt}{errors}) { errors (); } +if ($conf{opt}{vendor_errors_summary}) { + vendor_errors_summary (); +} + +if ($conf{opt}{vendor_errors}) { + vendor_errors (); +} + +if ($conf{opt}{vendor_platforms}) { + vendor_platforms (); +} + exit (0); sub parse_cmdline @@ -157,6 +172,9 @@ sub parse_cmdline $conf{opt}{summary} = 0; $conf{opt}{errors} = 0; $conf{opt}{error_count} = 0; + $conf{opt}{vendor_errors_summary} = 0; + $conf{opt}{vendor_errors} = 0; + $conf{opt}{vendor_platforms} = 0; my $rref = \$conf{opt}{report}; my $mref = \$conf{opt}{mainboard}; @@ -174,7 +192,10 @@ sub parse_cmdline "layout" => \$conf{opt}{display_memory_layout}, "summary" => \$conf{opt}{summary}, "errors" => \$conf{opt}{errors}, - "error-count" => \$conf{opt}{error_count} + "error-count" => \$conf{opt}{error_count}, + "vendor-errors-summary" => \$conf{opt}{vendor_errors_summary}, + "vendor-errors" => \$conf{opt}{vendor_errors}, + "vendor-platforms" => \$conf{opt}{vendor_platforms}, ); usage(1) if !$rc; @@ -1503,6 +1524,47 @@ sub errors undef($dbh); } +sub vendor_errors_summary +{ + require DBI; + my ($num_args, $platform_id); + + $num_args = $#ARGV + 1; + $platform_id = 0; + if ($num_args ne 0) { + $platform_id = $ARGV[0]; + } else { + return; + } + + my $dbh = DBI->connect("dbi:SQLite:dbname=$dbname", "", "", {}); + + undef($dbh); +} + +sub vendor_errors +{ + require DBI; + my ($num_args, $platform_id); + + $num_args = $#ARGV + 1; + $platform_id = 0; + if ($num_args ne 0) { + $platform_id = $ARGV[0]; + } else { + return; + } + + my $dbh = DBI->connect("dbi:SQLite:dbname=$dbname", "", "", {}); + + undef($dbh); +} + +sub vendor_platforms +{ + print "\nSupported platforms for the vendor-specific errors:\n"; +} + sub log_msg { print STDERR "$prog: ", @_ unless $conf{opt}{quiet}; } sub log_error { log_msg ("Error: @_"); } From patchwork Mon Mar 8 16:57:30 2021 Content-Type: text/plain; charset="utf-8" MIME-Version: 1.0 Content-Transfer-Encoding: 7bit 
X-Patchwork-Submitter: Shiju Jose X-Patchwork-Id: 12122949 Return-Path: X-Spam-Checker-Version: SpamAssassin 3.4.0 (2014-02-07) on aws-us-west-2-korg-lkml-1.web.codeaurora.org X-Spam-Level: X-Spam-Status: No, score=-16.8 required=3.0 tests=BAYES_00, HEADER_FROM_DIFFERENT_DOMAINS,INCLUDES_CR_TRAILER,INCLUDES_PATCH, MAILING_LIST_MULTI,SPF_HELO_NONE,SPF_PASS,URIBL_BLOCKED,USER_AGENT_GIT autolearn=ham autolearn_force=no version=3.4.0 Received: from mail.kernel.org (mail.kernel.org [198.145.29.99]) by smtp.lore.kernel.org (Postfix) with ESMTP id 33C65C433DB for ; Mon, 8 Mar 2021 17:04:34 +0000 (UTC) Received: from vger.kernel.org (vger.kernel.org [23.128.96.18]) by mail.kernel.org (Postfix) with ESMTP id F0B0965220 for ; Mon, 8 Mar 2021 17:04:33 +0000 (UTC) Received: (majordomo@vger.kernel.org) by vger.kernel.org via listexpand id S229459AbhCHREA (ORCPT ); Mon, 8 Mar 2021 12:04:00 -0500 Received: from frasgout.his.huawei.com ([185.176.79.56]:2662 "EHLO frasgout.his.huawei.com" rhost-flags-OK-OK-OK-OK) by vger.kernel.org with ESMTP id S230287AbhCHRDe (ORCPT ); Mon, 8 Mar 2021 12:03:34 -0500 Received: from fraeml734-chm.china.huawei.com (unknown [172.18.147.201]) by frasgout.his.huawei.com (SkyGuard) with ESMTP id 4DvPfB5dkjz67wNW; Tue, 9 Mar 2021 00:57:38 +0800 (CST) Received: from lhreml715-chm.china.huawei.com (10.201.108.66) by fraeml734-chm.china.huawei.com (10.206.15.215) with Microsoft SMTP Server (version=TLS1_2, cipher=TLS_ECDHE_RSA_WITH_AES_128_GCM_SHA256) id 15.1.2106.2; Mon, 8 Mar 2021 18:03:32 +0100 Received: from DESKTOP-6T4S3DQ.china.huawei.com (10.47.25.24) by lhreml715-chm.china.huawei.com (10.201.108.66) with Microsoft SMTP Server (version=TLS1_2, cipher=TLS_ECDHE_RSA_WITH_AES_128_GCM_SHA256) id 15.1.2106.2; Mon, 8 Mar 2021 17:03:32 +0000 From: Shiju Jose To: , CC: , , Subject: [PATCH v3 5/7] rasdaemon: ras-mc-ctl: Add support for HiSilicon Kunpeng920 errors Date: Mon, 8 Mar 2021 16:57:30 +0000 Message-ID: <20210308165732.273-6-shiju.jose@huawei.com> 
X-Mailer: git-send-email 2.26.0.windows.1 In-Reply-To: <20210308165732.273-1-shiju.jose@huawei.com> References: <20210308165732.273-1-shiju.jose@huawei.com> MIME-Version: 1.0 X-Originating-IP: [10.47.25.24] X-ClientProxiedBy: lhreml747-chm.china.huawei.com (10.201.108.197) To lhreml715-chm.china.huawei.com (10.201.108.66) X-CFilter-Loop: Reflected Precedence: bulk List-ID: X-Mailing-List: linux-edac@vger.kernel.org Add support for the HiSilicon Kunpeng920 errors. Supported error formats: OEM type 1, OEM type 2 and PCIe controller error formats. Signed-off-by: Shiju Jose Reviewed-by: Xiaofei Tan --- util/ras-mc-ctl.in | 149 +++++++++++++++++++++++++++++++++++++++++++++ 1 file changed, 149 insertions(+) diff --git a/util/ras-mc-ctl.in b/util/ras-mc-ctl.in index f282eff..407bf3c 100755 --- a/util/ras-mc-ctl.in +++ b/util/ras-mc-ctl.in @@ -1524,10 +1524,17 @@ sub errors undef($dbh); } +# Definitions of the vendor platform IDs. +use constant { + HISILICON_KUNPENG_920 => "Kunpeng920", +}; + sub vendor_errors_summary { require DBI; my ($num_args, $platform_id); + my ($query, $query_handle, $count, $out); + my ($module_id, $sub_module_id, $err_severity, $err_sev); $num_args = $#ARGV + 1; $platform_id = 0; @@ -1539,6 +1546,69 @@ sub vendor_errors_summary my $dbh = DBI->connect("dbi:SQLite:dbname=$dbname", "", "", {}); + # HiSilicon Kunpeng920 errors + if ($platform_id eq HISILICON_KUNPENG_920) { + $query = "select err_severity, module_id, count(*) from hip08_oem_type1_event_v2 group by err_severity, module_id"; + $query_handle = $dbh->prepare($query); + $query_handle->execute(); + $query_handle->bind_columns(\($err_severity, $module_id, $count)); + $out = ""; + $err_sev = ""; + while($query_handle->fetch()) { + if ($err_severity ne $err_sev) { + $out .= "$err_severity errors:\n"; + $err_sev = $err_severity; + } + $out .= "\t$module_id: $count\n"; + } + if ($out ne "") { + print "HiSilicon Kunpeng920 OEM type1 error events summary:\n$out\n"; + } else { + print "No HiSilicon 
Kunpeng920 OEM type1 errors.\n\n"; + } + $query_handle->finish; + + $query = "select err_severity, module_id, count(*) from hip08_oem_type2_event_v2 group by err_severity, module_id"; + $query_handle = $dbh->prepare($query); + $query_handle->execute(); + $query_handle->bind_columns(\($err_severity, $module_id, $count)); + $out = ""; + $err_sev = ""; + while($query_handle->fetch()) { + if ($err_severity ne $err_sev) { + $out .= "$err_severity errors:\n"; + $err_sev = $err_severity; + } + $out .= "\t$module_id: $count\n"; + } + if ($out ne "") { + print "HiSilicon Kunpeng920 OEM type2 error events summary:\n$out\n"; + } else { + print "No HiSilicon Kunpeng920 OEM type2 errors.\n\n"; + } + $query_handle->finish; + + $query = "select err_severity, sub_module_id, count(*) from hip08_pcie_local_event_v2 group by err_severity, sub_module_id"; + $query_handle = $dbh->prepare($query); + $query_handle->execute(); + $query_handle->bind_columns(\($err_severity, $sub_module_id, $count)); + $out = ""; + $err_sev = ""; + while($query_handle->fetch()) { + if ($err_severity ne $err_sev) { + $out .= "$err_severity errors:\n"; + $err_sev = $err_severity; + } + $out .= "\t$sub_module_id: $count\n"; + } + if ($out ne "") { + print "HiSilicon Kunpeng920 PCIe controller error events summary:\n$out\n"; + } else { + print "No HiSilicon Kunpeng920 PCIe controller errors.\n\n"; + } + $query_handle->finish; + } + undef($dbh); } @@ -1546,6 +1616,9 @@ sub vendor_errors { require DBI; my ($num_args, $platform_id); + my ($query, $query_handle, $id, $timestamp, $out); + my ($version, $soc_id, $socket_id, $nimbus_id, $core_id, $port_id); + my ($module_id, $sub_module_id, $err_severity, $err_type, $regs); $num_args = $#ARGV + 1; $platform_id = 0; @@ -1557,12 +1630,88 @@ sub vendor_errors my $dbh = DBI->connect("dbi:SQLite:dbname=$dbname", "", "", {}); + # HiSilicon Kunpeng920 errors + if ($platform_id eq HISILICON_KUNPENG_920) { + $query = "select id, timestamp, version, soc_id, socket_id, 
nimbus_id, module_id, sub_module_id, err_severity, regs_dump from hip08_oem_type1_event_v2 order by id, module_id, err_severity"; + $query_handle = $dbh->prepare($query); + $query_handle->execute(); + $query_handle->bind_columns(\($id, $timestamp, $version, $soc_id, $socket_id, $nimbus_id, $module_id, $sub_module_id, $err_severity, $regs)); + $out = ""; + while($query_handle->fetch()) { + $out .= "$id. $timestamp Error Info: "; + $out .= "version=$version, "; + $out .= "soc_id=$soc_id, " if ($soc_id); + $out .= "socket_id=$socket_id, " if ($socket_id); + $out .= "nimbus_id=$nimbus_id, " if ($nimbus_id); + $out .= "module_id=$module_id, " if ($module_id); + $out .= "sub_module_id=$sub_module_id, " if ($sub_module_id); + $out .= "err_severity=$err_severity, \n" if ($err_severity); + $out .= "Error Registers: $regs\n\n" if ($regs); + } + if ($out ne "") { + print "HiSilicon Kunpeng920 OEM type1 error events:\n$out\n"; + } else { + print "No HiSilicon Kunpeng920 OEM type1 errors.\n"; + } + $query_handle->finish; + + $query = "select id, timestamp, version, soc_id, socket_id, nimbus_id, module_id, sub_module_id, err_severity, regs_dump from hip08_oem_type2_event_v2 order by id, module_id, err_severity"; + $query_handle = $dbh->prepare($query); + $query_handle->execute(); + $query_handle->bind_columns(\($id, $timestamp, $version, $soc_id, $socket_id, $nimbus_id, $module_id, $sub_module_id, $err_severity, $regs)); + $out = ""; + while($query_handle->fetch()) { + $out .= "$id. 
$timestamp Error Info: "; + $out .= "version=$version, "; + $out .= "soc_id=$soc_id, " if ($soc_id); + $out .= "socket_id=$socket_id, " if ($socket_id); + $out .= "nimbus_id=$nimbus_id, " if ($nimbus_id); + $out .= "module_id=$module_id, " if ($module_id); + $out .= "sub_module_id=$sub_module_id, " if ($sub_module_id); + $out .= "err_severity=$err_severity, \n" if ($err_severity); + $out .= "Error Registers: $regs\n\n" if ($regs); + } + if ($out ne "") { + print "HiSilicon Kunpeng920 OEM type2 error events:\n$out\n"; + } else { + print "No HiSilicon Kunpeng920 OEM type2 errors.\n"; + } + $query_handle->finish; + + $query = "select id, timestamp, version, soc_id, socket_id, nimbus_id, sub_module_id, core_id, port_id, err_severity, err_type, regs_dump from hip08_pcie_local_event_v2 order by id, sub_module_id, err_severity"; + $query_handle = $dbh->prepare($query); + $query_handle->execute(); + $query_handle->bind_columns(\($id, $timestamp, $version, $soc_id, $socket_id, $nimbus_id, $sub_module_id, $core_id, $port_id, $err_severity, $err_type, $regs)); + $out = ""; + while($query_handle->fetch()) { + $out .= "$id. 
$timestamp Error Info: "; + $out .= "version=$version, "; + $out .= "soc_id=$soc_id, " if ($soc_id); + $out .= "socket_id=$socket_id, " if ($socket_id); + $out .= "nimbus_id=$nimbus_id, " if ($nimbus_id); + $out .= "sub_module_id=$sub_module_id, " if ($sub_module_id); + $out .= "core_id=$core_id, " if ($core_id); + $out .= "port_id=$port_id, " if ($port_id); + $out .= "err_severity=$err_severity, " if ($err_severity); + $out .= "err_type=$err_type, \n" if ($err_type); + $out .= "Error Registers: $regs\n\n" if ($regs); + } + if ($out ne "") { + print "HiSilicon Kunpeng920 PCIe controller error events:\n$out\n"; + } else { + print "No HiSilicon Kunpeng920 PCIe controller errors.\n"; + } + $query_handle->finish; + } + undef($dbh); } sub vendor_platforms { print "\nSupported platforms for the vendor-specific errors:\n"; + print "\tHiSilicon Kunpeng920, platform-id=\"", HISILICON_KUNPENG_920, "\"\n"; + print "\n"; } sub log_msg { print STDERR "$prog: ", @_ unless $conf{opt}{quiet}; } From patchwork Mon Mar 8 16:57:31 2021 Content-Type: text/plain; charset="utf-8" MIME-Version: 1.0 Content-Transfer-Encoding: 7bit X-Patchwork-Submitter: Shiju Jose X-Patchwork-Id: 12122951 Return-Path: X-Spam-Checker-Version: SpamAssassin 3.4.0 (2014-02-07) on aws-us-west-2-korg-lkml-1.web.codeaurora.org X-Spam-Level: X-Spam-Status: No, score=-16.8 required=3.0 tests=BAYES_00, HEADER_FROM_DIFFERENT_DOMAINS,INCLUDES_CR_TRAILER,INCLUDES_PATCH, MAILING_LIST_MULTI,SPF_HELO_NONE,SPF_PASS,URIBL_BLOCKED,USER_AGENT_GIT autolearn=ham autolearn_force=no version=3.4.0 Received: from mail.kernel.org (mail.kernel.org [198.145.29.99]) by smtp.lore.kernel.org (Postfix) with ESMTP id 5377FC433E9 for ; Mon, 8 Mar 2021 17:05:08 +0000 (UTC) Received: from vger.kernel.org (vger.kernel.org [23.128.96.18]) by mail.kernel.org (Postfix) with ESMTP id 18AD26522F for ; Mon, 8 Mar 2021 17:05:08 +0000 (UTC) Received: (majordomo@vger.kernel.org) by vger.kernel.org via listexpand id S229646AbhCHREd (ORCPT ); Mon, 8 Mar 
2021 12:04:33 -0500 Received: from frasgout.his.huawei.com ([185.176.79.56]:2663 "EHLO frasgout.his.huawei.com" rhost-flags-OK-OK-OK-OK) by vger.kernel.org with ESMTP id S229580AbhCHREG (ORCPT ); Mon, 8 Mar 2021 12:04:06 -0500 Received: from fraeml714-chm.china.huawei.com (unknown [172.18.147.200]) by frasgout.his.huawei.com (SkyGuard) with ESMTP id 4DvPcR3gpTz67x43; Tue, 9 Mar 2021 00:56:07 +0800 (CST) Received: from lhreml715-chm.china.huawei.com (10.201.108.66) by fraeml714-chm.china.huawei.com (10.206.15.33) with Microsoft SMTP Server (version=TLS1_2, cipher=TLS_ECDHE_RSA_WITH_AES_128_GCM_SHA256) id 15.1.2106.2; Mon, 8 Mar 2021 18:04:04 +0100 Received: from DESKTOP-6T4S3DQ.china.huawei.com (10.47.25.24) by lhreml715-chm.china.huawei.com (10.201.108.66) with Microsoft SMTP Server (version=TLS1_2, cipher=TLS_ECDHE_RSA_WITH_AES_128_GCM_SHA256) id 15.1.2106.2; Mon, 8 Mar 2021 17:04:03 +0000 From: Shiju Jose To: , CC: , , Subject: [PATCH v3 6/7] rasdaemon: ras-mc-ctl: Add support for HiSilicon Kunpeng9xx common errors Date: Mon, 8 Mar 2021 16:57:31 +0000 Message-ID: <20210308165732.273-7-shiju.jose@huawei.com> X-Mailer: git-send-email 2.26.0.windows.1 In-Reply-To: <20210308165732.273-1-shiju.jose@huawei.com> References: <20210308165732.273-1-shiju.jose@huawei.com> MIME-Version: 1.0 X-Originating-IP: [10.47.25.24] X-ClientProxiedBy: lhreml747-chm.china.huawei.com (10.201.108.197) To lhreml715-chm.china.huawei.com (10.201.108.66) X-CFilter-Loop: Reflected Precedence: bulk List-ID: X-Mailing-List: linux-edac@vger.kernel.org Add support for the common errors of the HiSilicon Kunpeng9xx platforms. Signed-off-by: Shiju Jose Reviewed-by: Xiaofei Tan --- util/ras-mc-ctl.in | 44 ++++++++++++++++++++++++++++++++++++++++++-- 1 file changed, 42 insertions(+), 2 deletions(-) diff --git a/util/ras-mc-ctl.in b/util/ras-mc-ctl.in index 407bf3c..1e3aeb7 100755 --- a/util/ras-mc-ctl.in +++ b/util/ras-mc-ctl.in @@ -1527,6 +1527,7 @@ sub errors # Definitions of the vendor platform IDs. 
use constant { HISILICON_KUNPENG_920 => "Kunpeng920", + HISILICON_KUNPENG_9XX => "Kunpeng9xx", }; sub vendor_errors_summary @@ -1534,7 +1535,7 @@ sub vendor_errors_summary require DBI; my ($num_args, $platform_id); my ($query, $query_handle, $count, $out); - my ($module_id, $sub_module_id, $err_severity, $err_sev); + my ($module_id, $sub_module_id, $err_severity, $err_sev, $err_info); $num_args = $#ARGV + 1; $platform_id = 0; @@ -1609,6 +1610,24 @@ sub vendor_errors_summary $query_handle->finish; } + # HiSilicon Kunpeng9xx common errors + if ($platform_id eq HISILICON_KUNPENG_9XX) { + $query = "select err_info, count(*) from hisi_common_section"; + $query_handle = $dbh->prepare($query); + $query_handle->execute(); + $query_handle->bind_columns(\($err_info, $count)); + $out = ""; + while($query_handle->fetch()) { + $out .= "\terrors: $count\n"; + } + if ($out ne "") { + print "HiSilicon Kunpeng9xx common error events summary:\n$out\n"; + } else { + print "No HiSilicon Kunpeng9xx common errors.\n\n"; + } + $query_handle->finish; + } + undef($dbh); } @@ -1618,7 +1637,7 @@ sub vendor_errors my ($num_args, $platform_id); my ($query, $query_handle, $id, $timestamp, $out); my ($version, $soc_id, $socket_id, $nimbus_id, $core_id, $port_id); - my ($module_id, $sub_module_id, $err_severity, $err_type, $regs); + my ($module_id, $sub_module_id, $err_severity, $err_type, $err_info, $regs); $num_args = $#ARGV + 1; $platform_id = 0; @@ -1704,6 +1723,26 @@ sub vendor_errors $query_handle->finish; } + # HiSilicon Kunpeng9xx common errors + if ($platform_id eq HISILICON_KUNPENG_9XX) { + $query = "select id, timestamp, err_info, regs_dump from hisi_common_section order by id"; + $query_handle = $dbh->prepare($query); + $query_handle->execute(); + $query_handle->bind_columns(\($id, $timestamp, $err_info, $regs)); + $out = ""; + while($query_handle->fetch()) { + $out .= "$id. 
$timestamp "; + $out .= "Error Info:$err_info \n" if ($err_info); + $out .= "Error Registers: $regs\n\n" if ($regs); + } + if ($out ne "") { + print "HiSilicon Kunpeng9xx common error events:\n$out\n"; + } else { + print "No HiSilicon Kunpeng9xx common errors.\n"; + } + $query_handle->finish; + } + undef($dbh); } @@ -1711,6 +1750,7 @@ sub vendor_platforms { print "\nSupported platforms for the vendor-specific errors:\n"; print "\tHiSilicon Kunpeng920, platform-id=\"", HISILICON_KUNPENG_920, "\"\n"; + print "\tHiSilicon Kunpeng9xx, platform-id=\"", HISILICON_KUNPENG_9XX, "\"\n"; print "\n"; } From patchwork Mon Mar 8 16:57:32 2021 Content-Type: text/plain; charset="utf-8" MIME-Version: 1.0 Content-Transfer-Encoding: 7bit X-Patchwork-Submitter: Shiju Jose X-Patchwork-Id: 12122953 Return-Path: X-Spam-Checker-Version: SpamAssassin 3.4.0 (2014-02-07) on aws-us-west-2-korg-lkml-1.web.codeaurora.org X-Spam-Level: X-Spam-Status: No, score=-16.8 required=3.0 tests=BAYES_00, HEADER_FROM_DIFFERENT_DOMAINS,INCLUDES_CR_TRAILER,INCLUDES_PATCH, MAILING_LIST_MULTI,SPF_HELO_NONE,SPF_PASS,URIBL_BLOCKED,USER_AGENT_GIT autolearn=ham autolearn_force=no version=3.4.0 Received: from mail.kernel.org (mail.kernel.org [198.145.29.99]) by smtp.lore.kernel.org (Postfix) with ESMTP id 2EC2DC433DB for ; Mon, 8 Mar 2021 17:05:40 +0000 (UTC) Received: from vger.kernel.org (vger.kernel.org [23.128.96.18]) by mail.kernel.org (Postfix) with ESMTP id 067F2650E5 for ; Mon, 8 Mar 2021 17:05:40 +0000 (UTC) Received: (majordomo@vger.kernel.org) by vger.kernel.org via listexpand id S230050AbhCHRFH (ORCPT ); Mon, 8 Mar 2021 12:05:07 -0500 Received: from frasgout.his.huawei.com ([185.176.79.56]:2664 "EHLO frasgout.his.huawei.com" rhost-flags-OK-OK-OK-OK) by vger.kernel.org with ESMTP id S230301AbhCHREy (ORCPT ); Mon, 8 Mar 2021 12:04:54 -0500 Received: from fraeml713-chm.china.huawei.com (unknown [172.18.147.200]) by frasgout.his.huawei.com (SkyGuard) with ESMTP id 4DvPgl1qVHz67w9Y; Tue, 9 Mar 2021 00:58:59 
+0800 (CST) Received: from lhreml715-chm.china.huawei.com (10.201.108.66) by fraeml713-chm.china.huawei.com (10.206.15.32) with Microsoft SMTP Server (version=TLS1_2, cipher=TLS_ECDHE_RSA_WITH_AES_128_GCM_SHA256) id 15.1.2106.2; Mon, 8 Mar 2021 18:04:53 +0100 Received: from DESKTOP-6T4S3DQ.china.huawei.com (10.47.25.24) by lhreml715-chm.china.huawei.com (10.201.108.66) with Microsoft SMTP Server (version=TLS1_2, cipher=TLS_ECDHE_RSA_WITH_AES_128_GCM_SHA256) id 15.1.2106.2; Mon, 8 Mar 2021 17:04:52 +0000 From: Shiju Jose To: , CC: , , Subject: [PATCH v3 7/7] rasdaemon: Modify configure.ac for HiSilicon Kunpeng errors Date: Mon, 8 Mar 2021 16:57:32 +0000 Message-ID: <20210308165732.273-8-shiju.jose@huawei.com> X-Mailer: git-send-email 2.26.0.windows.1 In-Reply-To: <20210308165732.273-1-shiju.jose@huawei.com> References: <20210308165732.273-1-shiju.jose@huawei.com> MIME-Version: 1.0 X-Originating-IP: [10.47.25.24] X-ClientProxiedBy: lhreml747-chm.china.huawei.com (10.201.108.197) To lhreml715-chm.china.huawei.com (10.201.108.66) X-CFilter-Loop: Reflected Precedence: bulk List-ID: X-Mailing-List: linux-edac@vger.kernel.org In the compile-time options summary, rename the line HIP07 SAS HW errors : $USE_HISI_NS_DECODE to HISI Kunpeng errors : $USE_HISI_NS_DECODE. Signed-off-by: Shiju Jose --- configure.ac | 2 +- 1 file changed, 1 insertion(+), 1 deletion(-) diff --git a/configure.ac b/configure.ac index a6251d4..0b36f4f 100644 --- a/configure.ac +++ b/configure.ac @@ -184,7 +184,7 @@ compile time options summary EXTLOG : $USE_EXTLOG CPER non-standard : $USE_NON_STANDARD ABRT report : $USE_ABRT_REPORT - HIP07 SAS HW errors : $USE_HISI_NS_DECODE + HISI Kunpeng errors : $USE_HISI_NS_DECODE ARM events : $USE_ARM DEVLINK : $USE_DEVLINK Disk I/O errors : $USE_DISKERROR