From patchwork Thu Feb 15 11:32:28 2024 Content-Type: text/plain; charset="utf-8" MIME-Version: 1.0 Content-Transfer-Encoding: 7bit X-Patchwork-Submitter: Shiju Jose X-Patchwork-Id: 13558208 Received: from frasgout.his.huawei.com (frasgout.his.huawei.com [185.176.79.56]) (using TLSv1.2 with cipher ECDHE-RSA-AES256-GCM-SHA384 (256/256 bits)) (No client certificate requested) by smtp.subspace.kernel.org (Postfix) with ESMTPS id 6E84912C522; Thu, 15 Feb 2024 11:33:47 +0000 (UTC) Authentication-Results: smtp.subspace.kernel.org; arc=none smtp.client-ip=185.176.79.56 ARC-Seal: i=1; a=rsa-sha256; d=subspace.kernel.org; s=arc-20240116; t=1707996829; cv=none; b=D35Su/J8USbFeci01HmhMK4nWWJdNPBv05BRQgYFBvPvsIYztSGKchWBESVdgXmjOhAfnGkQEa1PRlAqgaS0LME2Q2+FgJw/6RzGH7YGFYPOQdZGW7gxt3SL50CbiNikD4Gp6hX2GLQRdscsUjPYEHoRDi95lL+zCtDs445PrNw= ARC-Message-Signature: i=1; a=rsa-sha256; d=subspace.kernel.org; s=arc-20240116; t=1707996829; c=relaxed/simple; bh=zRjOSWfrOF67zKLw16hITWkRmhVmpZZgHbehbksIzRs=; h=From:To:CC:Subject:Date:Message-ID:In-Reply-To:References: MIME-Version:Content-Type; b=luQy3BRGlF7QMIKUIPctaPMHXPlwZsYO/HNZXY3iHeEIn6c9fzmgxXHW8v9XXNf1YaXRcZwVnNg5tpQfm5v+NQHq/rtrram0kKR3TAdwJQyykG0GlgHTY9wN0M9SENaIcu9f6rdYHeWnxAzphfhd0Dn+WL2K9W0uzH7OBBKisQ0= ARC-Authentication-Results: i=1; smtp.subspace.kernel.org; dmarc=pass (p=quarantine dis=none) header.from=huawei.com; spf=pass smtp.mailfrom=huawei.com; arc=none smtp.client-ip=185.176.79.56 Authentication-Results: smtp.subspace.kernel.org; dmarc=pass (p=quarantine dis=none) header.from=huawei.com Authentication-Results: smtp.subspace.kernel.org; spf=pass smtp.mailfrom=huawei.com Received: from mail.maildlp.com (unknown [172.18.186.31]) by frasgout.his.huawei.com (SkyGuard) with ESMTP id 4TbCXn61X7z6K8wc; Thu, 15 Feb 2024 19:30:17 +0800 (CST) Received: from lhrpeml500006.china.huawei.com (unknown [7.191.161.198]) by mail.maildlp.com (Postfix) with ESMTPS id C9DE61400D4; Thu, 15 Feb 2024 19:33:45 +0800 (CST) Received: from 
SecurePC30232.china.huawei.com (10.122.247.234) by lhrpeml500006.china.huawei.com (7.191.161.198) with Microsoft SMTP Server (version=TLS1_2, cipher=TLS_ECDHE_RSA_WITH_AES_128_GCM_SHA256) id 15.1.2507.35; Thu, 15 Feb 2024 11:33:45 +0000 From: To: , , CC: , , , , Subject: [PATCH 1/2] rasdaemon: ras-memory-failure-handler: update memory failure action page types Date: Thu, 15 Feb 2024 19:32:28 +0800 Message-ID: <20240215113235.1498-3-shiju.jose@huawei.com> X-Mailer: git-send-email 2.35.1.windows.2 In-Reply-To: <20240215113235.1498-1-shiju.jose@huawei.com> References: <20240215113235.1498-1-shiju.jose@huawei.com> Precedence: bulk X-Mailing-List: linux-edac@vger.kernel.org List-Id: List-Subscribe: List-Unsubscribe: MIME-Version: 1.0 X-ClientProxiedBy: lhrpeml500005.china.huawei.com (7.191.163.240) To lhrpeml500006.china.huawei.com (7.191.161.198) From: Shiju Jose Drop MF_MSG_POISONED_HUGE, MF_MSG_NON_PMD_HUGE and MF_MSG_BUDDY_2ND so that the memory failure action page types match those in mm/memory-failure.c in the kernel. Signed-off-by: Shiju Jose --- ras-memory-failure-handler.c | 6 ------ 1 file changed, 6 deletions(-) diff --git a/ras-memory-failure-handler.c b/ras-memory-failure-handler.c index 97e8840..a5acc08 100644 --- a/ras-memory-failure-handler.c +++ b/ras-memory-failure-handler.c @@ -26,10 +26,8 @@ enum mf_action_page_type { MF_MSG_KERNEL_HIGH_ORDER, MF_MSG_SLAB, MF_MSG_DIFFERENT_COMPOUND, - MF_MSG_POISONED_HUGE, MF_MSG_HUGE, MF_MSG_FREE_HUGE, - MF_MSG_NON_PMD_HUGE, MF_MSG_UNMAP_FAILED, MF_MSG_DIRTY_SWAPCACHE, MF_MSG_CLEAN_SWAPCACHE, @@ -41,7 +39,6 @@ enum mf_action_page_type { MF_MSG_CLEAN_LRU, MF_MSG_TRUNCATED_LRU, MF_MSG_BUDDY, - MF_MSG_BUDDY_2ND, MF_MSG_DAX, MF_MSG_UNSPLIT_THP, MF_MSG_UNKNOWN, @@ -64,10 +61,8 @@ static const struct { { MF_MSG_KERNEL_HIGH_ORDER, "high-order kernel page"}, { MF_MSG_SLAB, "kernel slab page"}, { MF_MSG_DIFFERENT_COMPOUND, "different compound page after locking"}, - { MF_MSG_POISONED_HUGE, "huge page already hardware poisoned"}, { MF_MSG_HUGE, "huge page"}, { MF_MSG_FREE_HUGE, "free 
huge page"}, - { MF_MSG_NON_PMD_HUGE, "non-pmd-sized huge page"}, { MF_MSG_UNMAP_FAILED, "unmapping failed page"}, { MF_MSG_DIRTY_SWAPCACHE, "dirty swapcache page"}, { MF_MSG_CLEAN_SWAPCACHE, "clean swapcache page"}, @@ -79,7 +74,6 @@ static const struct { { MF_MSG_CLEAN_LRU, "clean LRU page"}, { MF_MSG_TRUNCATED_LRU, "already truncated LRU page"}, { MF_MSG_BUDDY, "free buddy page"}, - { MF_MSG_BUDDY_2ND, "free buddy page (2nd try)"}, { MF_MSG_DAX, "dax page"}, { MF_MSG_UNSPLIT_THP, "unsplit thp"}, { MF_MSG_UNKNOWN, "unknown page"}, From patchwork Thu Feb 15 11:32:29 2024 Content-Type: text/plain; charset="utf-8" MIME-Version: 1.0 Content-Transfer-Encoding: 7bit X-Patchwork-Submitter: Shiju Jose X-Patchwork-Id: 13558212 Received: from frasgout.his.huawei.com (frasgout.his.huawei.com [185.176.79.56]) (using TLSv1.2 with cipher ECDHE-RSA-AES256-GCM-SHA384 (256/256 bits)) (No client certificate requested) by smtp.subspace.kernel.org (Postfix) with ESMTPS id 4F14D12C542; Thu, 15 Feb 2024 11:33:48 +0000 (UTC) Authentication-Results: smtp.subspace.kernel.org; arc=none smtp.client-ip=185.176.79.56 ARC-Seal: i=1; a=rsa-sha256; d=subspace.kernel.org; s=arc-20240116; t=1707996830; cv=none; b=htLpywQZ2P2wCJQUn1sTP+waAgnlhiFQ+NQ8piJ4tKPiBHba8gvE3bFCLo8X0wyLx9zw/5RqdgtDflBE6jSqlHXy/WsAZqN+aAdwYZ5PSWZiggAUFiFkgUmmgtzTpzaZ53+kJIOPx7TAhboDE7/gXOJpeyw78JI+puBnXWh4kxc= ARC-Message-Signature: i=1; a=rsa-sha256; d=subspace.kernel.org; s=arc-20240116; t=1707996830; c=relaxed/simple; bh=pRbS/fS9qqvajQXy9MM7kbSll/pa2rMaQJYQxThhazE=; h=From:To:CC:Subject:Date:Message-ID:In-Reply-To:References: MIME-Version:Content-Type; b=ooy/wmQ4+Ce/MIbvIf8+IDPBxB9IfT5kqIg/RJaUY2p2RH6Pl1cXm/l811NH++sVxnJEe0mNt4DODReMuf7XuXRiwBsirpSsg4EKNLXBW7O9F674y/jK903m/g2QliYxEtSMNzPDS6p0k7I8nAfKp0GvXgqdIBsYb7dlhUWFf2w= ARC-Authentication-Results: i=1; smtp.subspace.kernel.org; dmarc=pass (p=quarantine dis=none) header.from=huawei.com; spf=pass smtp.mailfrom=huawei.com; arc=none smtp.client-ip=185.176.79.56 
Authentication-Results: smtp.subspace.kernel.org; dmarc=pass (p=quarantine dis=none) header.from=huawei.com Authentication-Results: smtp.subspace.kernel.org; spf=pass smtp.mailfrom=huawei.com Received: from mail.maildlp.com (unknown [172.18.186.31]) by frasgout.his.huawei.com (SkyGuard) with ESMTP id 4TbCX80BJhz6JB1C; Thu, 15 Feb 2024 19:29:44 +0800 (CST) Received: from lhrpeml500006.china.huawei.com (unknown [7.191.161.198]) by mail.maildlp.com (Postfix) with ESMTPS id 3C5AE1411CE; Thu, 15 Feb 2024 19:33:46 +0800 (CST) Received: from SecurePC30232.china.huawei.com (10.122.247.234) by lhrpeml500006.china.huawei.com (7.191.161.198) with Microsoft SMTP Server (version=TLS1_2, cipher=TLS_ECDHE_RSA_WITH_AES_128_GCM_SHA256) id 15.1.2507.35; Thu, 15 Feb 2024 11:33:45 +0000 From: To: , , CC: , , , , Subject: [RFC PATCH 2/8] rasdaemon: ras-mc-ctl: Add support for CXL AER correctable trace events Date: Thu, 15 Feb 2024 19:32:29 +0800 Message-ID: <20240215113235.1498-4-shiju.jose@huawei.com> X-Mailer: git-send-email 2.35.1.windows.2 In-Reply-To: <20240215113235.1498-1-shiju.jose@huawei.com> References: <20240215113235.1498-1-shiju.jose@huawei.com> Precedence: bulk X-Mailing-List: linux-edac@vger.kernel.org List-Id: List-Subscribe: List-Unsubscribe: MIME-Version: 1.0 X-ClientProxiedBy: lhrpeml500005.china.huawei.com (7.191.163.240) To lhrpeml500006.china.huawei.com (7.191.161.198) From: Shiju Jose Add support for CXL AER correctable events to the ras-mc-ctl tool. 
Signed-off-by: Shiju Jose --- util/ras-mc-ctl.in | 79 ++++++++++++++++++++++++++++++++++++++++++++++ 1 file changed, 79 insertions(+) diff --git a/util/ras-mc-ctl.in b/util/ras-mc-ctl.in index 630edde..7e2a921 100755 --- a/util/ras-mc-ctl.in +++ b/util/ras-mc-ctl.in @@ -1241,6 +1241,46 @@ sub get_cxl_ue_error_status_text return join (", ", @out); } +use constant { + CXL_AER_CE_CACHE_DATA_ECC => 0x0001, + CXL_AER_CE_MEM_DATA_ECC => 0x0002, + CXL_AER_CE_CRC_THRESH => 0x0004, + CXL_AER_CE_RETRY_THRESH => 0x0008, + CXL_AER_CE_CACHE_POISON => 0x0010, + CXL_AER_CE_MEM_POISON => 0x0020, + CXL_AER_CE_PHYS_LAYER_ERR => 0x0040, +}; + +sub get_cxl_ce_error_status_text +{ + my $error_status = $_[0]; + my @out; + + if ($error_status & CXL_AER_CE_CACHE_DATA_ECC) { + push @out, (sprintf "\'Cache Data ECC Error\' "); + } + if ($error_status & CXL_AER_CE_MEM_DATA_ECC) { + push @out, (sprintf "\'Memory Data ECC Error\' "); + } + if ($error_status & CXL_AER_CE_CRC_THRESH) { + push @out, (sprintf "\'CRC Threshold Hit\' "); + } + if ($error_status & CXL_AER_CE_RETRY_THRESH) { + push @out, (sprintf "\'Retry Threshold\' "); + } + if ($error_status & CXL_AER_CE_CACHE_POISON) { + push @out, (sprintf "\'Received Cache Poison From Peer\' "); + } + if ($error_status & CXL_AER_CE_MEM_POISON) { + push @out, (sprintf "\'Received Memory Poison From Peer\' "); + } + if ($error_status & CXL_AER_CE_PHYS_LAYER_ERR) { + push @out, (sprintf "\'Received Error From Physical Layer\' "); + } + + return join (", ", @out); +} + sub summary { require DBI; @@ -1321,6 +1361,22 @@ sub summary print "No CXL AER uncorrectable errors.\n\n"; } $query_handle->finish; + + # CXL AER correctable errors + $query = "select memdev, count(*) from cxl_aer_ce_event$conf{opt}{since} group by memdev"; + $query_handle = $dbh->prepare($query); + $query_handle->execute(); + $query_handle->bind_columns(\($memdev, $count)); + $out = ""; + while($query_handle->fetch()) { + $out .= "\t$memdev errors: $count\n"; + } + if ($out ne "") { 
+ print "CXL AER correctable events summary:\n$out\n"; + } else { + print "No CXL AER correctable errors.\n\n"; + } + $query_handle->finish; } # extlog errors @@ -1530,6 +1586,29 @@ sub errors print "No CXL AER uncorrectable errors.\n\n"; } $query_handle->finish; + + # CXL AER correctable errors + $query = "select id, timestamp, memdev, host, serial, error_status from cxl_aer_ce_event$conf{opt}{since} order by id"; + $query_handle = $dbh->prepare($query); + $query_handle->execute(); + $query_handle->bind_columns(\($id, $timestamp, $memdev, $host, $serial, $error_status)); + $out = ""; + while($query_handle->fetch()) { + $out .= "$id $timestamp error: "; + $out .= "memdev=$memdev, " if (defined $memdev && length $memdev); + $out .= "host=$host, " if (defined $host && length $host); + $out .= sprintf "serial=0x%llx, ", $serial if (defined $serial && length $serial); + if (defined $error_status && length $error_status) { + $out .= sprintf "error_status: %s, ", get_cxl_ce_error_status_text($error_status); + } + $out .= "\n"; + } + if ($out ne "") { + print "CXL AER correctable events:\n$out\n"; + } else { + print "No CXL AER correctable errors.\n\n"; + } + $query_handle->finish; } # Extlog errors From patchwork Thu Feb 15 11:32:30 2024 Content-Type: text/plain; charset="utf-8" MIME-Version: 1.0 Content-Transfer-Encoding: 7bit X-Patchwork-Submitter: Shiju Jose X-Patchwork-Id: 13558211 Received: from frasgout.his.huawei.com (frasgout.his.huawei.com [185.176.79.56]) (using TLSv1.2 with cipher ECDHE-RSA-AES256-GCM-SHA384 (256/256 bits)) (No client certificate requested) by smtp.subspace.kernel.org (Postfix) with ESMTPS id 4F10A12AAF7; Thu, 15 Feb 2024 11:33:48 +0000 (UTC) Authentication-Results: smtp.subspace.kernel.org; arc=none smtp.client-ip=185.176.79.56 ARC-Seal: i=1; a=rsa-sha256; d=subspace.kernel.org; s=arc-20240116; t=1707996830; cv=none; 
b=S/YO4PTI5vuiDQ1wlDIUhrjl+MUzMKq2A7y34nsbEKHK/cyZHEwwWhblYOG6h70MKEJf7s9flPVLu+NO+a1ASrwoeCLXoHyBD1VwcRfgOZloES8Yx+E1HPaA840nOylVOBQGLOXNRzbgtFhsx577QMHT4opioE3jEAjtrSCJjq8= ARC-Message-Signature: i=1; a=rsa-sha256; d=subspace.kernel.org; s=arc-20240116; t=1707996830; c=relaxed/simple; bh=wf8TnzCNfwQdO1ondE8t5/Utrvl0AFYZzhG7vHjCe+Y=; h=From:To:CC:Subject:Date:Message-ID:In-Reply-To:References: MIME-Version:Content-Type; b=XQwRN/4q2QcLJ26WVh9D7rI7sUUhoh7PrOtbbMTanfa10bHz5oR+BPOIGTz9aA6GAsKymKAizWfWHciQdnQaNxx8vjD2WZxqMr461X9Vs1vrUSFK3sQ3XnAjCPJG7TC59MfCmWOUozZis4v30TBM2gAfXmcfPxAP7KK1cD38Qr4= ARC-Authentication-Results: i=1; smtp.subspace.kernel.org; dmarc=pass (p=quarantine dis=none) header.from=huawei.com; spf=pass smtp.mailfrom=huawei.com; arc=none smtp.client-ip=185.176.79.56 Authentication-Results: smtp.subspace.kernel.org; dmarc=pass (p=quarantine dis=none) header.from=huawei.com Authentication-Results: smtp.subspace.kernel.org; spf=pass smtp.mailfrom=huawei.com Received: from mail.maildlp.com (unknown [172.18.186.31]) by frasgout.his.huawei.com (SkyGuard) with ESMTP id 4TbCXp3TTgz67Lqc; Thu, 15 Feb 2024 19:30:18 +0800 (CST) Received: from lhrpeml500006.china.huawei.com (unknown [7.191.161.198]) by mail.maildlp.com (Postfix) with ESMTPS id 555F9141DD2; Thu, 15 Feb 2024 19:33:46 +0800 (CST) Received: from SecurePC30232.china.huawei.com (10.122.247.234) by lhrpeml500006.china.huawei.com (7.191.161.198) with Microsoft SMTP Server (version=TLS1_2, cipher=TLS_ECDHE_RSA_WITH_AES_128_GCM_SHA256) id 15.1.2507.35; Thu, 15 Feb 2024 11:33:46 +0000 From: To: , , CC: , , , , Subject: [RFC PATCH 3/8] rasdaemon: ras-mc-ctl: Add support for CXL overflow trace events Date: Thu, 15 Feb 2024 19:32:30 +0800 Message-ID: <20240215113235.1498-5-shiju.jose@huawei.com> X-Mailer: git-send-email 2.35.1.windows.2 In-Reply-To: <20240215113235.1498-1-shiju.jose@huawei.com> References: <20240215113235.1498-1-shiju.jose@huawei.com> Precedence: bulk X-Mailing-List: linux-edac@vger.kernel.org 
List-Id: List-Subscribe: List-Unsubscribe: MIME-Version: 1.0 X-ClientProxiedBy: lhrpeml500005.china.huawei.com (7.191.163.240) To lhrpeml500006.china.huawei.com (7.191.161.198) From: Shiju Jose Add support for CXL overflow events to the ras-mc-ctl tool. Signed-off-by: Shiju Jose --- util/ras-mc-ctl.in | 38 ++++++++++++++++++++++++++++++++++++++ 1 file changed, 38 insertions(+) diff --git a/util/ras-mc-ctl.in b/util/ras-mc-ctl.in index 7e2a921..b1175dc 100755 --- a/util/ras-mc-ctl.in +++ b/util/ras-mc-ctl.in @@ -1377,6 +1377,22 @@ sub summary print "No CXL AER correctable errors.\n\n"; } $query_handle->finish; + + # CXL overflow errors + $query = "select memdev, count(*) from cxl_overflow_event$conf{opt}{since} group by memdev"; + $query_handle = $dbh->prepare($query); + $query_handle->execute(); + $query_handle->bind_columns(\($memdev, $count)); + $out = ""; + while($query_handle->fetch()) { + $out .= "\t$memdev errors: $count\n"; + } + if ($out ne "") { + print "CXL overflow events summary:\n$out\n"; + } else { + print "No CXL overflow errors.\n\n"; + } + $query_handle->finish; } # extlog errors @@ -1485,6 +1501,7 @@ sub errors my ($error_count, $affinity, $mpidr, $r_state, $psci_state); my ($pfn, $page_type, $action_result); my ($memdev, $host, $serial, $error_status, $first_error, $header_log); + my ($log_type, $first_ts, $last_ts); my $dbh = DBI->connect("dbi:SQLite:dbname=$dbname", "", "", {}); @@ -1609,6 +1626,27 @@ sub errors print "No CXL AER correctable errors.\n\n"; } $query_handle->finish; + + # CXL overflow errors + $query = "select id, timestamp, memdev, host, serial, log_type, count, first_ts, last_ts from cxl_overflow_event$conf{opt}{since} order by id"; + $query_handle = $dbh->prepare($query); + $query_handle->execute(); + $query_handle->bind_columns(\($id, $timestamp, $memdev, $host, $serial, $log_type, $count, $first_ts, $last_ts)); + $out = ""; + while($query_handle->fetch()) { + $out .= "$id $timestamp error: "; + $out .= "memdev=$memdev, " if 
(defined $memdev && length $memdev); + $out .= "host=$host, " if (defined $host && length $host); + $out .= sprintf "serial=0x%llx, ", $serial if (defined $serial && length $serial); + $out .= "log=$log_type, " if (defined $log_type && length $log_type); + $out .= sprintf "%u records from $first_ts to $last_ts", $count if (defined $count && length $count); + $out .= "\n"; + } + if ($out ne "") { + print "CXL overflow events:\n$out\n"; + } else { + print "No CXL overflow errors.\n\n"; + } } # Extlog errors From patchwork Thu Feb 15 11:32:31 2024 Content-Type: text/plain; charset="utf-8" MIME-Version: 1.0 Content-Transfer-Encoding: 7bit X-Patchwork-Submitter: Shiju Jose X-Patchwork-Id: 13558213 Received: from frasgout.his.huawei.com (frasgout.his.huawei.com [185.176.79.56]) (using TLSv1.2 with cipher ECDHE-RSA-AES256-GCM-SHA384 (256/256 bits)) (No client certificate requested) by smtp.subspace.kernel.org (Postfix) with ESMTPS id BFBE712C54E; Thu, 15 Feb 2024 11:33:48 +0000 (UTC) Authentication-Results: smtp.subspace.kernel.org; arc=none smtp.client-ip=185.176.79.56 ARC-Seal: i=1; a=rsa-sha256; d=subspace.kernel.org; s=arc-20240116; t=1707996830; cv=none; b=qYvqhNrhy0KCDxHdQzcf0g8D3jLTWdQKsRzE72x4jpss3u11k1Ef90+IZfCd83StIHHgIIfYvltrnKk+t9Sw6XXPYNw/72plQdEXJ/r/VCRbVnotVtEOdicgjNWrJbMnWrjzi38x6R9IA1rV1+G9vX36TN5ur6FmOEkbfDy+YSg= ARC-Message-Signature: i=1; a=rsa-sha256; d=subspace.kernel.org; s=arc-20240116; t=1707996830; c=relaxed/simple; bh=Z4zLins1INTCTDqGGK3jX5LkroiYpIYairS0xeK28CM=; h=From:To:CC:Subject:Date:Message-ID:In-Reply-To:References: MIME-Version:Content-Type; b=Ph3VjHvJAGAWRiA93NEviIZCtlIJEJuy5fhFW1QB0fg7GnDo9CVWd6CkB549MkIG/7Y5XtmJWggt5xQp8CCuVh04QXQ2Jw0kZ/qhyi7DKEruIeyHMyiT81Fhrhf5L31fBItty8DGAs5rEjA8Y82F+6tiXIHK8ebHG/BZ8RMQKWc= ARC-Authentication-Results: i=1; smtp.subspace.kernel.org; dmarc=pass (p=quarantine dis=none) header.from=huawei.com; spf=pass smtp.mailfrom=huawei.com; arc=none smtp.client-ip=185.176.79.56 Authentication-Results: 
smtp.subspace.kernel.org; dmarc=pass (p=quarantine dis=none) header.from=huawei.com Authentication-Results: smtp.subspace.kernel.org; spf=pass smtp.mailfrom=huawei.com Received: from mail.maildlp.com (unknown [172.18.186.216]) by frasgout.his.huawei.com (SkyGuard) with ESMTP id 4TbCX83hvJz6JB1b; Thu, 15 Feb 2024 19:29:44 +0800 (CST) Received: from lhrpeml500006.china.huawei.com (unknown [7.191.161.198]) by mail.maildlp.com (Postfix) with ESMTPS id B57D11400CD; Thu, 15 Feb 2024 19:33:46 +0800 (CST) Received: from SecurePC30232.china.huawei.com (10.122.247.234) by lhrpeml500006.china.huawei.com (7.191.161.198) with Microsoft SMTP Server (version=TLS1_2, cipher=TLS_ECDHE_RSA_WITH_AES_128_GCM_SHA256) id 15.1.2507.35; Thu, 15 Feb 2024 11:33:46 +0000 From: To: , , CC: , , , , Subject: [RFC PATCH 4/8] rasdaemon: ras-mc-ctl: Add support for CXL poison trace events Date: Thu, 15 Feb 2024 19:32:31 +0800 Message-ID: <20240215113235.1498-6-shiju.jose@huawei.com> X-Mailer: git-send-email 2.35.1.windows.2 In-Reply-To: <20240215113235.1498-1-shiju.jose@huawei.com> References: <20240215113235.1498-1-shiju.jose@huawei.com> Precedence: bulk X-Mailing-List: linux-edac@vger.kernel.org List-Id: List-Subscribe: List-Unsubscribe: MIME-Version: 1.0 X-ClientProxiedBy: lhrpeml500005.china.huawei.com (7.191.163.240) To lhrpeml500006.china.huawei.com (7.191.161.198) From: Shiju Jose Add support for CXL poison events to the ras-mc-ctl tool. 
Signed-off-by: Shiju Jose --- util/ras-mc-ctl.in | 45 +++++++++++++++++++++++++++++++++++++++++++++ 1 file changed, 45 insertions(+) diff --git a/util/ras-mc-ctl.in b/util/ras-mc-ctl.in index b1175dc..d8c1dec 100755 --- a/util/ras-mc-ctl.in +++ b/util/ras-mc-ctl.in @@ -1393,6 +1393,22 @@ sub summary print "No CXL overflow errors.\n\n"; } $query_handle->finish; + + # CXL poison errors + $query = "select memdev, count(*) from cxl_poison_event$conf{opt}{since} group by memdev"; + $query_handle = $dbh->prepare($query); + $query_handle->execute(); + $query_handle->bind_columns(\($memdev, $count)); + $out = ""; + while($query_handle->fetch()) { + $out .= "\t$memdev errors: $count\n"; + } + if ($out ne "") { + print "CXL poison events summary:\n$out\n"; + } else { + print "No CXL poison errors.\n\n"; + } + $query_handle->finish; } # extlog errors @@ -1502,6 +1518,7 @@ sub errors my ($pfn, $page_type, $action_result); my ($memdev, $host, $serial, $error_status, $first_error, $header_log); my ($log_type, $first_ts, $last_ts); + my ($trace_type, $region, $region_uuid, $hpa, $dpa, $dpa_length, $source, $flags, $overflow_ts); my $dbh = DBI->connect("dbi:SQLite:dbname=$dbname", "", "", {}); @@ -1647,6 +1664,34 @@ sub errors } else { print "No CXL overflow errors.\n\n"; } + + # CXL poison errors + $query = "select id, timestamp, memdev, host, serial, trace_type, region, region_uuid, hpa, dpa, dpa_length, source, flags, overflow_ts from cxl_poison_event$conf{opt}{since} order by id"; + $query_handle = $dbh->prepare($query); + $query_handle->execute(); + $query_handle->bind_columns(\($id, $timestamp, $memdev, $host, $serial, $trace_type, $region, $region_uuid, $hpa, $dpa, $dpa_length, $source, $flags, $overflow_ts)); + $out = ""; + while($query_handle->fetch()) { + $out .= "$id $timestamp error: "; + $out .= "memdev=$memdev, " if (defined $memdev && length $memdev); + $out .= "host=$host, " if (defined $host && length $host); + $out .= sprintf "serial=0x%llx, ", $serial if 
(defined $serial && length $serial); + $out .= "trace_type=$trace_type, " if (defined $trace_type && length $trace_type); + $out .= "region=$region, " if (defined $region && length $region); + $out .= "region_uuid=$region_uuid, " if (defined $region_uuid && length $region_uuid); + $out .= sprintf "hpa=0x%llx, ", $hpa if (defined $hpa && length $hpa); + $out .= sprintf "dpa=0x%llx, ", $dpa if (defined $dpa && length $dpa); + $out .= sprintf "dpa_length=0x%x, ", $dpa_length if (defined $dpa_length && length $dpa_length); + $out .= "source=$source, " if (defined $source && length $source); + $out .= sprintf "flags=%d, ", $flags if (defined $flags && length $flags); + $out .= "overflow timestamp=$overflow_ts " if (defined $overflow_ts && length $overflow_ts); + $out .= "\n"; + } + if ($out ne "") { + print "CXL poison events:\n$out\n"; + } else { + print "No CXL poison errors.\n\n"; + } } # Extlog errors From patchwork Thu Feb 15 11:32:32 2024 Content-Type: text/plain; charset="utf-8" MIME-Version: 1.0 Content-Transfer-Encoding: 7bit X-Patchwork-Submitter: Shiju Jose X-Patchwork-Id: 13558214 Received: from frasgout.his.huawei.com (frasgout.his.huawei.com [185.176.79.56]) (using TLSv1.2 with cipher ECDHE-RSA-AES256-GCM-SHA384 (256/256 bits)) (No client certificate requested) by smtp.subspace.kernel.org (Postfix) with ESMTPS id 14ED312C55B; Thu, 15 Feb 2024 11:33:48 +0000 (UTC) Authentication-Results: smtp.subspace.kernel.org; arc=none smtp.client-ip=185.176.79.56 ARC-Seal: i=1; a=rsa-sha256; d=subspace.kernel.org; s=arc-20240116; t=1707996830; cv=none; b=Lbgo40UAsnB10ALfH6ReKGQum6DjTz6RSJ6xp00YZI3U+ZuY9XeASO9lbW9BYpgRm4Ly43Uf5mbdyPALircd3CewgRXEVXWuIfbnAZkM1iFExC+PABIFbWNKI6lHjNXA1JU8GMlELmruGA5Xw1PP1sXWE6ONwhtPX6TjzPyZi7c= ARC-Message-Signature: i=1; a=rsa-sha256; d=subspace.kernel.org; s=arc-20240116; t=1707996830; c=relaxed/simple; bh=WBMuGaFByOBizSHSc9nVwDXvzMsp2kRr0IrhOv/Us1o=; h=From:To:CC:Subject:Date:Message-ID:In-Reply-To:References: MIME-Version:Content-Type; 
b=kdnHltZO0wEeAjuBS04U6gsx1m8sEDZma6nUwYEaxIx4HH1rPo6Pfb8XLfgBWtsZkDo+qlZFLrQ+h+C2i1FLhi6QYMuAQqIDlS4B06hagYcB+OVFKJMkI6Y2EULpyKB0UbFmSGMaEa4aqpa8SbDgJrsFHDgkg824jcUAdBfrZmI= ARC-Authentication-Results: i=1; smtp.subspace.kernel.org; dmarc=pass (p=quarantine dis=none) header.from=huawei.com; spf=pass smtp.mailfrom=huawei.com; arc=none smtp.client-ip=185.176.79.56 Authentication-Results: smtp.subspace.kernel.org; dmarc=pass (p=quarantine dis=none) header.from=huawei.com Authentication-Results: smtp.subspace.kernel.org; spf=pass smtp.mailfrom=huawei.com Received: from mail.maildlp.com (unknown [172.18.186.231]) by frasgout.his.huawei.com (SkyGuard) with ESMTP id 4TbCXq3f5Kz6K8wt; Thu, 15 Feb 2024 19:30:19 +0800 (CST) Received: from lhrpeml500006.china.huawei.com (unknown [7.191.161.198]) by mail.maildlp.com (Postfix) with ESMTPS id 1AD891400D4; Thu, 15 Feb 2024 19:33:47 +0800 (CST) Received: from SecurePC30232.china.huawei.com (10.122.247.234) by lhrpeml500006.china.huawei.com (7.191.161.198) with Microsoft SMTP Server (version=TLS1_2, cipher=TLS_ECDHE_RSA_WITH_AES_128_GCM_SHA256) id 15.1.2507.35; Thu, 15 Feb 2024 11:33:46 +0000 From: To: , , CC: , , , , Subject: [RFC PATCH 5/8] rasdaemon: ras-mc-ctl: Add support for CXL generic trace events Date: Thu, 15 Feb 2024 19:32:32 +0800 Message-ID: <20240215113235.1498-7-shiju.jose@huawei.com> X-Mailer: git-send-email 2.35.1.windows.2 In-Reply-To: <20240215113235.1498-1-shiju.jose@huawei.com> References: <20240215113235.1498-1-shiju.jose@huawei.com> Precedence: bulk X-Mailing-List: linux-edac@vger.kernel.org List-Id: List-Subscribe: List-Unsubscribe: MIME-Version: 1.0 X-ClientProxiedBy: lhrpeml500005.china.huawei.com (7.191.163.240) To lhrpeml500006.china.huawei.com (7.191.161.198) From: Shiju Jose Add support for CXL generic events to the ras-mc-ctl tool. 
Signed-off-by: Shiju Jose --- util/ras-mc-ctl.in | 83 ++++++++++++++++++++++++++++++++++++++++++++++ 1 file changed, 83 insertions(+) diff --git a/util/ras-mc-ctl.in b/util/ras-mc-ctl.in index d8c1dec..84cdf2c 100755 --- a/util/ras-mc-ctl.in +++ b/util/ras-mc-ctl.in @@ -1281,6 +1281,34 @@ sub get_cxl_ce_error_status_text return join (", ", @out); } +use constant { + CXL_EVENT_RECORD_FLAG_PERMANENT => 0x0004, + CXL_EVENT_RECORD_FLAG_MAINT_NEEDED => 0x0008, + CXL_EVENT_RECORD_FLAG_PERF_DEGRADED => 0x0010, + CXL_EVENT_RECORD_FLAG_HW_REPLACE => 0x0020, +}; + +sub get_cxl_hdr_flags_text +{ + my $flags = $_[0]; + my @out; + + if ($flags & CXL_EVENT_RECORD_FLAG_PERMANENT) { + push @out, (sprintf "\'PERMANENT_CONDITION\' "); + } + if ($flags & CXL_EVENT_RECORD_FLAG_MAINT_NEEDED) { + push @out, (sprintf "\'MAINTENANCE_NEEDED\' "); + } + if ($flags & CXL_EVENT_RECORD_FLAG_PERF_DEGRADED) { + push @out, (sprintf "\'PERFORMANCE_DEGRADED\' "); + } + if ($flags & CXL_EVENT_RECORD_FLAG_HW_REPLACE) { + push @out, (sprintf "\'HARDWARE_REPLACEMENT_NEEDED\' "); + } + + return join (", ", @out); +} + sub summary { require DBI; @@ -1409,6 +1437,22 @@ sub summary print "No CXL poison errors.\n\n"; } $query_handle->finish; + + # CXL generic errors + $query = "select memdev, count(*) from cxl_generic_event$conf{opt}{since} group by memdev"; + $query_handle = $dbh->prepare($query); + $query_handle->execute(); + $query_handle->bind_columns(\($memdev, $count)); + $out = ""; + while($query_handle->fetch()) { + $out .= "\t$memdev errors: $count\n"; + } + if ($out ne "") { + print "CXL generic events summary:\n$out\n"; + } else { + print "No CXL generic errors.\n\n"; + } + $query_handle->finish; } # extlog errors @@ -1519,6 +1563,7 @@ sub errors my ($memdev, $host, $serial, $error_status, $first_error, $header_log); my ($log_type, $first_ts, $last_ts); my ($trace_type, $region, $region_uuid, $hpa, $dpa, $dpa_length, $source, $flags, $overflow_ts); + my ($hdr_uuid, $hdr_flags, $hdr_handle, 
$hdr_related_handle, $hdr_ts, $hdr_length, $hdr_maint_op_class, $data); my $dbh = DBI->connect("dbi:SQLite:dbname=$dbname", "", "", {}); @@ -1692,6 +1737,44 @@ sub errors } else { print "No CXL poison errors.\n\n"; } + + # CXL generic errors + use constant CXL_EVENT_RECORD_DATA_LENGTH => 0x50; + $query = "select id, timestamp, memdev, host, serial, log_type, hdr_uuid, hdr_flags, hdr_handle, hdr_related_handle, hdr_ts, hdr_length, hdr_maint_op_class, data from cxl_generic_event$conf{opt}{since} order by id"; + $query_handle = $dbh->prepare($query); + $query_handle->execute(); + $query_handle->bind_columns(\($id, $timestamp, $memdev, $host, $serial, $log_type, $hdr_uuid, $hdr_flags, $hdr_handle, $hdr_related_handle, $hdr_ts, $hdr_length, $hdr_maint_op_class, $data)); + $out = ""; + while($query_handle->fetch()) { + $out .= "$id $timestamp error: "; + $out .= "memdev=$memdev, " if (defined $memdev && length $memdev); + $out .= "host=$host, " if (defined $host && length $host); + $out .= sprintf "serial=0x%llx, ", $serial if (defined $serial && length $serial); + $out .= "log=$log_type, " if (defined $log_type && length $log_type); + $out .= "hdr_uuid=$hdr_uuid, " if (defined $hdr_uuid && length $hdr_uuid); + $out .= sprintf "hdr_flags=0x%llx %s, ", $hdr_flags, get_cxl_hdr_flags_text($hdr_flags) if (defined $hdr_flags && length $hdr_flags); + $out .= sprintf "hdr_handle=0x%x, ", $hdr_handle if (defined $hdr_handle && length $hdr_handle); + $out .= sprintf "hdr_related_handle=0x%x, ", $hdr_related_handle if (defined $hdr_related_handle && length $hdr_related_handle); + $out .= "hdr_timestamp=$hdr_ts, " if (defined $hdr_ts && length $hdr_ts); + $out .= sprintf "hdr_length=%u, ", $hdr_length if (defined $hdr_length && length $hdr_length); + $out .= sprintf "hdr_maint_op_class=%u, ", $hdr_maint_op_class if (defined $hdr_maint_op_class && length $hdr_maint_op_class); + if (defined $data && length $data) { + $out .= sprintf "data:\n"; + my @bytes = unpack "C*", $data; + for 
(my $i = 0; $i < CXL_EVENT_RECORD_DATA_LENGTH; $i += 4) { + if (($i > 0) && (($i % 16) == 0)) { + $out .= sprintf "\n %08x: ", $i; + } + $out .= sprintf "%02x%02x%02x%02x ", $bytes[$i], $bytes[$i + 1], $bytes[$i + 2], $bytes[$i + 3]; + } + } + $out .= "\n"; + } + if ($out ne "") { + print "CXL generic events:\n$out\n"; + } else { + print "No CXL generic errors.\n\n"; + } } # Extlog errors From patchwork Thu Feb 15 11:32:33 2024 Content-Type: text/plain; charset="utf-8" MIME-Version: 1.0 Content-Transfer-Encoding: 7bit X-Patchwork-Submitter: Shiju Jose X-Patchwork-Id: 13558216 Received: from frasgout.his.huawei.com (frasgout.his.huawei.com [185.176.79.56]) (using TLSv1.2 with cipher ECDHE-RSA-AES256-GCM-SHA384 (256/256 bits)) (No client certificate requested) by smtp.subspace.kernel.org (Postfix) with ESMTPS id 3167E12C54B; Thu, 15 Feb 2024 11:33:49 +0000 (UTC) Authentication-Results: smtp.subspace.kernel.org; arc=none smtp.client-ip=185.176.79.56 ARC-Seal: i=1; a=rsa-sha256; d=subspace.kernel.org; s=arc-20240116; t=1707996831; cv=none; b=VnQrMHPKBwzzJ0AX4d+9RPs8SRVMASgywjiLrLi3ZDEtrKMfgj/+OC7W2WM5rH9SI7DLybbQtIEd6VOtoZDLbchL/kVKQfld4sERNQU5QctEezBdg9RULJTYE7gZxV6y49V6XODgIX2B+fdyZcS+wtw1TgOQ6G1EOPpNFVOf7lI= ARC-Message-Signature: i=1; a=rsa-sha256; d=subspace.kernel.org; s=arc-20240116; t=1707996831; c=relaxed/simple; bh=ki91YIiPp+hwGj2CcMBkns2E9AGAxSYmijbXopUzklQ=; h=From:To:CC:Subject:Date:Message-ID:In-Reply-To:References: MIME-Version:Content-Type; b=oe07o82kF85nvNCykp2CYLjsPTaV6QtUz8BxOd4x5KYIM3z8VFDAd691eYCXXsWjsB+k/+YoSMfClMjNm1Lbo8MXRFkXEGOdZheM/b1fn78Du6amsYn50BAY8Q4EI8wv+AJdrp3QP7X/glwByBMxXBf3m2P3F86aQzMejFAI3nE= ARC-Authentication-Results: i=1; smtp.subspace.kernel.org; dmarc=pass (p=quarantine dis=none) header.from=huawei.com; spf=pass smtp.mailfrom=huawei.com; arc=none smtp.client-ip=185.176.79.56 Authentication-Results: smtp.subspace.kernel.org; dmarc=pass (p=quarantine dis=none) header.from=huawei.com Authentication-Results: smtp.subspace.kernel.org; 
spf=pass smtp.mailfrom=huawei.com Received: from mail.maildlp.com (unknown [172.18.186.231]) by frasgout.his.huawei.com (SkyGuard) with ESMTP id 4TbCX90jSGz6JB1p; Thu, 15 Feb 2024 19:29:45 +0800 (CST) Received: from lhrpeml500006.china.huawei.com (unknown [7.191.161.198]) by mail.maildlp.com (Postfix) with ESMTPS id 50A77140DAF; Thu, 15 Feb 2024 19:33:47 +0800 (CST) Received: from SecurePC30232.china.huawei.com (10.122.247.234) by lhrpeml500006.china.huawei.com (7.191.161.198) with Microsoft SMTP Server (version=TLS1_2, cipher=TLS_ECDHE_RSA_WITH_AES_128_GCM_SHA256) id 15.1.2507.35; Thu, 15 Feb 2024 11:33:46 +0000 From: To: , , CC: , , , , Subject: [RFC PATCH 6/8] rasdaemon: ras-mc-ctl: Add support for CXL general media trace events Date: Thu, 15 Feb 2024 19:32:33 +0800 Message-ID: <20240215113235.1498-8-shiju.jose@huawei.com> X-Mailer: git-send-email 2.35.1.windows.2 In-Reply-To: <20240215113235.1498-1-shiju.jose@huawei.com> References: <20240215113235.1498-1-shiju.jose@huawei.com> Precedence: bulk X-Mailing-List: linux-edac@vger.kernel.org List-Id: List-Subscribe: List-Unsubscribe: MIME-Version: 1.0 X-ClientProxiedBy: lhrpeml500005.china.huawei.com (7.191.163.240) To lhrpeml500006.china.huawei.com (7.191.161.198) From: Shiju Jose Add support for CXL general media events to the ras-mc-ctl tool. 
Signed-off-by: Shiju Jose
---
 util/ras-mc-ctl.in | 138 +++++++++++++++++++++++++++++++++++++++++++++
 1 file changed, 138 insertions(+)

diff --git a/util/ras-mc-ctl.in b/util/ras-mc-ctl.in
index 84cdf2c..27b6962 100755
--- a/util/ras-mc-ctl.in
+++ b/util/ras-mc-ctl.in
@@ -1309,6 +1309,84 @@ sub get_cxl_hdr_flags_text
     return join (", ", @out);
 }
 
+use constant {
+    CXL_DPA_VOLATILE => 0x0001,
+    CXL_DPA_NOT_REPAIRABLE => 0x0002,
+};
+
+sub get_cxl_dpa_flags_text
+{
+    my $flags = $_[0];
+    my @out;
+
+    if ($flags & CXL_DPA_VOLATILE) {
+        push @out, (sprintf "\'VOLATILE\' ");
+    }
+    if ($flags & CXL_DPA_NOT_REPAIRABLE) {
+        push @out, (sprintf "\'NOT_REPAIRABLE\' ");
+    }
+
+    return join (", ", @out);
+}
+
+use constant {
+    CXL_GMER_EVT_DESC_UNCORECTABLE_EVENT => 0x0001,
+    CXL_GMER_EVT_DESC_THRESHOLD_EVENT => 0x0002,
+    CXL_GMER_EVT_DESC_POISON_LIST_OVERFLOW => 0x0004,
+};
+
+sub get_cxl_descriptor_flags_text
+{
+    my $flags = $_[0];
+    my @out;
+
+    if ($flags & CXL_GMER_EVT_DESC_UNCORECTABLE_EVENT) {
+        push @out, (sprintf "\'UNCORRECTABLE EVENT\' ");
+    }
+    if ($flags & CXL_GMER_EVT_DESC_THRESHOLD_EVENT) {
+        push @out, (sprintf "\'THRESHOLD EVENT\' ");
+    }
+    if ($flags & CXL_GMER_EVT_DESC_POISON_LIST_OVERFLOW) {
+        push @out, (sprintf "\'POISON LIST OVERFLOW\' ");
+    }
+
+    return join (", ", @out);
+}
+
+sub get_cxl_mem_event_type
+{
+    my @types;
+
+    if ($_[0] < 0 || $_[0] > 2) {
+        return "unknown-type";
+    }
+
+    @types = ("ECC Error",
+              "Invalid Address",
+              "Data Path Error");
+
+    return $types[$_[0]];
+}
+
+sub get_cxl_transaction_type
+{
+    my @types;
+
+    if ($_[0] < 0 || $_[0] > 6) {
+        return "unknown-type";
+    }
+
+    @types = ("Unknown",
+              "Host Read",
+              "Host Write",
+              "Host Scan Media",
+              "Host Inject Poison",
+              "Internal Media Scrub",
+              "Internal Media Management");
+
+    return $types[$_[0]];
+}
+
 sub summary
 {
     require DBI;
@@ -1453,6 +1531,22 @@ sub summary
             print "No CXL generic errors.\n\n";
         }
         $query_handle->finish;
+
+        # CXL general media errors
+        $query = "select memdev, count(*) from cxl_general_media_event$conf{opt}{since} group by memdev";
+        $query_handle = $dbh->prepare($query);
+        $query_handle->execute();
+        $query_handle->bind_columns(\($memdev, $count));
+        $out = "";
+        while($query_handle->fetch()) {
+            $out .= "\t$memdev errors: $count\n";
+        }
+        if ($out ne "") {
+            print "CXL general media events summary:\n$out\n";
+        } else {
+            print "No CXL general media errors.\n\n";
+        }
+        $query_handle->finish;
     }
 
     # extlog errors
@@ -1564,6 +1658,7 @@ sub errors
     my ($log_type, $first_ts, $last_ts);
     my ($trace_type, $region, $region_uuid, $hpa, $dpa, $dpa_length, $source, $flags, $overflow_ts);
     my ($hdr_uuid, $hdr_flags, $hdr_handle, $hdr_related_handle, $hdr_ts, $hdr_length, $hdr_maint_op_class, $data);
+    my ($dpa_flags, $descriptor, $mem_event_type, $transaction_type, $channel, $rank, $device, $comp_id);
 
     my $dbh = DBI->connect("dbi:SQLite:dbname=$dbname", "", "", {});
@@ -1775,6 +1870,49 @@ sub errors
         } else {
             print "No CXL generic errors.\n\n";
         }
+
+        # CXL general media errors
+        use constant CXL_EVENT_GEN_MED_COMP_ID_SIZE => 0x10;
+        $query = "select id, timestamp, memdev, host, serial, log_type, hdr_uuid, hdr_flags, hdr_handle, hdr_related_handle, hdr_ts, hdr_length, hdr_maint_op_class, dpa, dpa_flags, descriptor, type, transaction_type, channel, rank, device, comp_id from cxl_general_media_event$conf{opt}{since} order by id";
+        $query_handle = $dbh->prepare($query);
+        $query_handle->execute();
+        $query_handle->bind_columns(\($id, $timestamp, $memdev, $host, $serial, $log_type, $hdr_uuid, $hdr_flags, $hdr_handle, $hdr_related_handle, $hdr_ts, $hdr_length, $hdr_maint_op_class, $dpa, $dpa_flags, $descriptor, $mem_event_type, $transaction_type, $channel, $rank, $device, $comp_id));
+        $out = "";
+        while($query_handle->fetch()) {
+            $out .= "$id $timestamp error: ";
+            $out .= "memdev=$memdev, " if (defined $memdev && length $memdev);
+            $out .= "host=$host, " if (defined $host && length $host);
+            $out .= sprintf "serial=0x%llx, ", $serial if (defined $serial && length $serial);
+            $out .= "log=$log_type, " if (defined $log_type && length $log_type);
+            $out .= "hdr_uuid=$hdr_uuid, " if (defined $hdr_uuid && length $hdr_uuid);
+            $out .= sprintf "hdr_flags=0x%llx %s, ", $hdr_flags, get_cxl_hdr_flags_text($hdr_flags) if (defined $hdr_flags && length $hdr_flags);
+            $out .= sprintf "hdr_handle=0x%x, ", $hdr_handle if (defined $hdr_handle && length $hdr_handle);
+            $out .= sprintf "hdr_related_handle=0x%x, ", $hdr_related_handle if (defined $hdr_related_handle && length $hdr_related_handle);
+            $out .= "hdr_timestamp=$hdr_ts, " if (defined $hdr_ts && length $hdr_ts);
+            $out .= sprintf "hdr_length=%u, ", $hdr_length if (defined $hdr_length && length $hdr_length);
+            $out .= sprintf "hdr_maint_op_class=%u, ", $hdr_maint_op_class if (defined $hdr_maint_op_class && length $hdr_maint_op_class);
+            $out .= sprintf "dpa=0x%llx, ", $dpa if (defined $dpa && length $dpa);
+            $out .= sprintf "dpa_flags: %s, ", get_cxl_dpa_flags_text($dpa_flags) if (defined $dpa_flags && length $dpa_flags);
+            $out .= sprintf "descriptor_flags: %s, ", get_cxl_descriptor_flags_text($descriptor) if (defined $descriptor && length $descriptor);
+            $out .= sprintf "memory event type: %s, ", get_cxl_mem_event_type($mem_event_type) if (defined $mem_event_type && length $mem_event_type);
+            $out .= sprintf "transaction_type: %s, ", get_cxl_transaction_type($transaction_type) if (defined $transaction_type && length $transaction_type);
+            $out .= sprintf "channel=%u, ", $channel if (defined $channel && length $channel);
+            $out .= sprintf "rank=%u, ", $rank if (defined $rank && length $rank);
+            $out .= sprintf "device=0x%x, ", $device if (defined $device && length $device);
+            if (defined $comp_id && length $comp_id) {
+                $out .= sprintf "component_id:";
+                my @bytes = unpack "C*", $comp_id;
+                for (my $i = 0; $i < CXL_EVENT_GEN_MED_COMP_ID_SIZE; $i++) {
+                    $out .= sprintf "%02x ", $bytes[$i];
+                }
+            }
+            $out .= "\n";
+        }
+        if ($out ne "") {
+            print "CXL general media events:\n$out\n";
+        } else {
+            print "No CXL general media errors.\n\n";
+        }
     }
 
     # Extlog errors

From patchwork Thu Feb 15 11:32:34 2024
X-Patchwork-Submitter: Shiju Jose
X-Patchwork-Id: 13558215
Subject: [RFC PATCH 7/8] rasdaemon: ras-mc-ctl: Add support for CXL DRAM trace events
Date: Thu, 15 Feb 2024 19:32:34 +0800
Message-ID: <20240215113235.1498-9-shiju.jose@huawei.com>
In-Reply-To: <20240215113235.1498-1-shiju.jose@huawei.com>
References: <20240215113235.1498-1-shiju.jose@huawei.com>

From: Shiju Jose

Add support for CXL DRAM events to the ras-mc-ctl tool.
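A minimal standalone sketch of how a correction mask can be rendered as hex, for illustration only. It assumes the cor_mask column comes back from the database as a raw byte string; the helper name hex_dump is hypothetical. Unlike the fixed-count loop in this patch, the sketch bounds the dump by the number of bytes actually present, so a short value never indexes past the end of the array.

```perl
use strict;
use warnings;

# Size taken from the constant this patch defines.
use constant CXL_EVENT_DER_CORRECTION_MASK_SIZE => 0x20;

# Hypothetical helper: hex-dump at most $max bytes of a binary string.
sub hex_dump {
    my ($blob, $max) = @_;
    my @bytes = unpack "C*", $blob;           # one unsigned integer per byte
    $#bytes = $max - 1 if @bytes > $max;      # truncate anything beyond $max
    return join(" ", map { sprintf "%02x", $_ } @bytes);
}

# Made-up four-byte mask standing in for a database value.
my $cor_mask = pack "C*", 0xde, 0xad, 0xbe, 0xef;
print "correction_mask: ", hex_dump($cor_mask, CXL_EVENT_DER_CORRECTION_MASK_SIZE), "\n";
# prints: correction_mask: de ad be ef
```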
Signed-off-by: Shiju Jose
---
 util/ras-mc-ctl.in | 64 ++++++++++++++++++++++++++++++++++++++++++++++
 1 file changed, 64 insertions(+)

diff --git a/util/ras-mc-ctl.in b/util/ras-mc-ctl.in
index 27b6962..cae0e86 100755
--- a/util/ras-mc-ctl.in
+++ b/util/ras-mc-ctl.in
@@ -1547,6 +1547,22 @@ sub summary
             print "No CXL general media errors.\n\n";
         }
         $query_handle->finish;
+
+        # CXL DRAM errors
+        $query = "select memdev, count(*) from cxl_dram_event$conf{opt}{since} group by memdev";
+        $query_handle = $dbh->prepare($query);
+        $query_handle->execute();
+        $query_handle->bind_columns(\($memdev, $count));
+        $out = "";
+        while($query_handle->fetch()) {
+            $out .= "\t$memdev errors: $count\n";
+        }
+        if ($out ne "") {
+            print "CXL DRAM events summary:\n$out\n";
+        } else {
+            print "No CXL DRAM errors.\n\n";
+        }
+        $query_handle->finish;
     }
 
     # extlog errors
@@ -1659,6 +1675,7 @@ sub errors
     my ($trace_type, $region, $region_uuid, $hpa, $dpa, $dpa_length, $source, $flags, $overflow_ts);
     my ($hdr_uuid, $hdr_flags, $hdr_handle, $hdr_related_handle, $hdr_ts, $hdr_length, $hdr_maint_op_class, $data);
     my ($dpa_flags, $descriptor, $mem_event_type, $transaction_type, $channel, $rank, $device, $comp_id);
+    my ($nibble_mask, $bank_group, $row, $column, $cor_mask);
 
     my $dbh = DBI->connect("dbi:SQLite:dbname=$dbname", "", "", {});
@@ -1913,6 +1930,53 @@ sub errors
         } else {
             print "No CXL general media errors.\n\n";
         }
+
+        # CXL DRAM errors
+        use constant CXL_EVENT_DER_CORRECTION_MASK_SIZE => 0x20;
+        $query = "select id, timestamp, memdev, host, serial, log_type, hdr_uuid, hdr_flags, hdr_handle, hdr_related_handle, hdr_ts, hdr_length, hdr_maint_op_class, dpa, dpa_flags, descriptor, type, transaction_type, channel, rank, nibble_mask, bank_group, bank, row, column, cor_mask from cxl_dram_event$conf{opt}{since} order by id";
+        $query_handle = $dbh->prepare($query);
+        $query_handle->execute();
+        $query_handle->bind_columns(\($id, $timestamp, $memdev, $host, $serial, $log_type, $hdr_uuid, $hdr_flags, $hdr_handle, $hdr_related_handle, $hdr_ts, $hdr_length, $hdr_maint_op_class, $dpa, $dpa_flags, $descriptor, $type, $transaction_type, $channel, $rank, $nibble_mask, $bank_group, $bank, $row, $column, $cor_mask));
+        $out = "";
+        while($query_handle->fetch()) {
+            $out .= "$id $timestamp error: ";
+            $out .= "memdev=$memdev, " if (defined $memdev && length $memdev);
+            $out .= "host=$host, " if (defined $host && length $host);
+            $out .= sprintf "serial=0x%llx, ", $serial if (defined $serial && length $serial);
+            $out .= "log=$log_type, " if (defined $log_type && length $log_type);
+            $out .= "hdr_uuid=$hdr_uuid, " if (defined $hdr_uuid && length $hdr_uuid);
+            $out .= sprintf "hdr_flags=0x%llx, %s, ", $hdr_flags, get_cxl_hdr_flags_text($hdr_flags) if (defined $hdr_flags && length $hdr_flags);
+            $out .= sprintf "hdr_handle=0x%x, ", $hdr_handle if (defined $hdr_handle && length $hdr_handle);
+            $out .= sprintf "hdr_related_handle=0x%x, ", $hdr_related_handle if (defined $hdr_related_handle && length $hdr_related_handle);
+            $out .= "hdr_timestamp=$hdr_ts, " if (defined $hdr_ts && length $hdr_ts);
+            $out .= sprintf "hdr_length=%u, ", $hdr_length if (defined $hdr_length && length $hdr_length);
+            $out .= sprintf "hdr_maint_op_class=%u, ", $hdr_maint_op_class if (defined $hdr_maint_op_class && length $hdr_maint_op_class);
+            $out .= sprintf "dpa=0x%llx, ", $dpa if (defined $dpa && length $dpa);
+            $out .= sprintf "dpa_flags: %s, ", get_cxl_dpa_flags_text($dpa_flags) if (defined $dpa_flags && length $dpa_flags);
+            $out .= sprintf "descriptor_flags: %s, ", get_cxl_descriptor_flags_text($descriptor) if (defined $descriptor && length $descriptor);
+            $out .= sprintf "memory event type: %s, ", get_cxl_mem_event_type($type) if (defined $type && length $type);
+            $out .= sprintf "transaction_type: %s, ", get_cxl_transaction_type($transaction_type) if (defined $transaction_type && length $transaction_type);
+            $out .= sprintf "channel=%u, ", $channel if (defined $channel && length $channel);
+            $out .= sprintf "rank=%u, ", $rank if (defined $rank && length $rank);
+            $out .= sprintf "nibble_mask=%u, ", $nibble_mask if (defined $nibble_mask && length $nibble_mask);
+            $out .= sprintf "bank_group=%u, ", $bank_group if (defined $bank_group && length $bank_group);
+            $out .= sprintf "bank=%u, ", $bank if (defined $bank && length $bank);
+            $out .= sprintf "row=%u, ", $row if (defined $row && length $row);
+            $out .= sprintf "column=%u, ", $column if (defined $column && length $column);
+            if (defined $cor_mask && length $cor_mask) {
+                $out .= sprintf "correction_mask:";
+                my @bytes = unpack "C*", $cor_mask;
+                for (my $i = 0; $i < CXL_EVENT_DER_CORRECTION_MASK_SIZE; $i++) {
+                    $out .= sprintf "%02x ", $bytes[$i];
+                }
+            }
+            $out .= "\n";
+        }
+        if ($out ne "") {
+            print "CXL DRAM events:\n$out\n";
+        } else {
+            print "No CXL DRAM errors.\n\n";
+        }
     }
 
     # Extlog errors

From patchwork Thu Feb 15 11:32:35 2024
X-Patchwork-Submitter: Shiju Jose
X-Patchwork-Id: 13558217
Subject: [RFC PATCH 8/8] rasdaemon: ras-mc-ctl: Add support for CXL memory module trace events
Date: Thu, 15 Feb 2024 19:32:35 +0800
Message-ID: <20240215113235.1498-10-shiju.jose@huawei.com>
In-Reply-To: <20240215113235.1498-1-shiju.jose@huawei.com>
References: <20240215113235.1498-1-shiju.jose@huawei.com>

From: Shiju Jose

Add support for CXL memory module events to the ras-mc-ctl tool.
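A minimal standalone sketch of the bounded table lookup used to turn a numeric event type into text, for illustration only. The names are the device event types listed in this patch; the helper lookup_name and its extra check that the index is a plain non-negative integer are hypothetical and go slightly beyond the range test in the patch.

```perl
use strict;
use warnings;

# Device event type names as listed in this patch, indexed by numeric value.
my @CXL_DEV_EVENT_TYPES = (
    "Health Status Change",
    "Media Status Change",
    "Life Used Change",
    "Temperature Change",
    "Data Path Error",
    "LSA Error",
);

# Hypothetical helper: return the name at $index, or a fallback when the
# index is missing, not a non-negative integer, or past the end of the table.
sub lookup_name {
    my ($index, $names) = @_;
    return "unknown-type"
        unless defined $index && $index =~ /^\d+$/ && $index < @$names;
    return $names->[$index];
}

print lookup_name(3, \@CXL_DEV_EVENT_TYPES), "\n";   # prints: Temperature Change
print lookup_name(9, \@CXL_DEV_EVENT_TYPES), "\n";   # prints: unknown-type
```

Deriving the upper bound from the table size keeps the range check in step with the list if entries are added later.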
Signed-off-by: Shiju Jose
---
 util/ras-mc-ctl.in | 117 +++++++++++++++++++++++++++++++++++++++++++++
 1 file changed, 117 insertions(+)

diff --git a/util/ras-mc-ctl.in b/util/ras-mc-ctl.in
index cae0e86..a7ece13 100755
--- a/util/ras-mc-ctl.in
+++ b/util/ras-mc-ctl.in
@@ -1387,6 +1387,70 @@ sub get_cxl_transaction_type
     return $types[$_[0]];
 }
 
+sub get_cxl_dev_event_type
+{
+    my @types;
+
+    if ($_[0] < 0 || $_[0] > 5) {
+        return "unknown-type";
+    }
+
+    @types = ("Health Status Change",
+              "Media Status Change",
+              "Life Used Change",
+              "Temperature Change",
+              "Data Path Error",
+              "LSA Error");
+
+    return $types[$_[0]];
+}
+
+use constant {
+    CXL_DHI_HS_MAINTENANCE_NEEDED => 0x0001,
+    CXL_DHI_HS_PERFORMANCE_DEGRADED => 0x0002,
+    CXL_DHI_HS_HW_REPLACEMENT_NEEDED => 0x0004,
+};
+
+sub get_cxl_health_status_text
+{
+    my $flags = $_[0];
+    my @out;
+
+    if ($flags & CXL_DHI_HS_MAINTENANCE_NEEDED) {
+        push @out, (sprintf "\'MAINTENANCE_NEEDED\' ");
+    }
+    if ($flags & CXL_DHI_HS_PERFORMANCE_DEGRADED) {
+        push @out, (sprintf "\'PERFORMANCE_DEGRADED\' ");
+    }
+    if ($flags & CXL_DHI_HS_HW_REPLACEMENT_NEEDED) {
+        push @out, (sprintf "\'REPLACEMENT_NEEDED\' ");
+    }
+
+    return join (", ", @out);
+}
+
+sub get_cxl_media_status
+{
+    my @types;
+
+    if ($_[0] < 0 || $_[0] > 9) {
+        return "unknown";
+    }
+
+    @types = ("Normal",
+              "Not Ready",
+              "Write Persistency Lost",
+              "All Data Lost",
+              "Write Persistency Loss in the Event of Power Loss",
+              "Write Persistency Loss in Event of Shutdown",
+              "Write Persistency Loss Imminent",
+              "All Data Loss in Event of Power Loss",
+              "All Data loss in the Event of Shutdown",
+              "All Data Loss Imminent");
+
+    return $types[$_[0]];
+}
+
 sub summary
 {
     require DBI;
@@ -1563,6 +1627,22 @@ sub summary
             print "No CXL DRAM errors.\n\n";
         }
         $query_handle->finish;
+
+        # CXL memory module errors
+        $query = "select memdev, count(*) from cxl_memory_module_event$conf{opt}{since} group by memdev";
+        $query_handle = $dbh->prepare($query);
+        $query_handle->execute();
+        $query_handle->bind_columns(\($memdev, $count));
+        $out = "";
+        while($query_handle->fetch()) {
+            $out .= "\t$memdev errors: $count\n";
+        }
+        if ($out ne "") {
+            print "CXL memory module events summary:\n$out\n";
+        } else {
+            print "No CXL memory module errors.\n\n";
+        }
+        $query_handle->finish;
     }
 
     # extlog errors
@@ -1676,6 +1756,7 @@ sub errors
     my ($hdr_uuid, $hdr_flags, $hdr_handle, $hdr_related_handle, $hdr_ts, $hdr_length, $hdr_maint_op_class, $data);
     my ($dpa_flags, $descriptor, $mem_event_type, $transaction_type, $channel, $rank, $device, $comp_id);
     my ($nibble_mask, $bank_group, $row, $column, $cor_mask);
+    my ($event_type, $health_status, $media_status, $life_used, $dirty_shutdown_cnt, $cor_vol_err_cnt, $cor_per_err_cnt, $device_temp, $add_status);
 
     my $dbh = DBI->connect("dbi:SQLite:dbname=$dbname", "", "", {});
@@ -1977,6 +2058,42 @@ sub errors
         } else {
             print "No CXL DRAM errors.\n\n";
         }
+
+        # CXL memory module errors
+        $query = "select id, timestamp, memdev, host, serial, log_type, hdr_uuid, hdr_flags, hdr_handle, hdr_related_handle, hdr_ts, hdr_length, hdr_maint_op_class, event_type, health_status, media_status, life_used, dirty_shutdown_cnt, cor_vol_err_cnt, cor_per_err_cnt, device_temp, add_status from cxl_memory_module_event$conf{opt}{since} order by id";
+        $query_handle = $dbh->prepare($query);
+        $query_handle->execute();
+        $query_handle->bind_columns(\($id, $timestamp, $memdev, $host, $serial, $log_type, $hdr_uuid, $hdr_flags, $hdr_handle, $hdr_related_handle, $hdr_ts, $hdr_length, $hdr_maint_op_class, $event_type, $health_status, $media_status, $life_used, $dirty_shutdown_cnt, $cor_vol_err_cnt, $cor_per_err_cnt, $device_temp, $add_status));
+        $out = "";
+        while($query_handle->fetch()) {
+            $out .= "$id $timestamp error: ";
+            $out .= "memdev=$memdev, " if (defined $memdev && length $memdev);
+            $out .= "host=$host, " if (defined $host && length $host);
+            $out .= sprintf "serial=0x%llx, ", $serial if (defined $serial && length $serial);
+            $out .= "log=$log_type, " if (defined $log_type && length $log_type);
+            $out .= "hdr_uuid=$hdr_uuid, " if (defined $hdr_uuid && length $hdr_uuid);
+            $out .= sprintf "hdr_flags=0x%llx, %s, ", $hdr_flags, get_cxl_hdr_flags_text($hdr_flags) if (defined $hdr_flags && length $hdr_flags);
+            $out .= sprintf "hdr_handle=0x%x, ", $hdr_handle if (defined $hdr_handle && length $hdr_handle);
+            $out .= sprintf "hdr_related_handle=0x%x, ", $hdr_related_handle if (defined $hdr_related_handle && length $hdr_related_handle);
+            $out .= "hdr_timestamp=$hdr_ts, " if (defined $hdr_ts && length $hdr_ts);
+            $out .= sprintf "hdr_length=%u, ", $hdr_length if (defined $hdr_length && length $hdr_length);
+            $out .= sprintf "hdr_maint_op_class=%u, ", $hdr_maint_op_class if (defined $hdr_maint_op_class && length $hdr_maint_op_class);
+            $out .= sprintf "event_type: %s, ", get_cxl_dev_event_type($event_type) if (defined $event_type && length $event_type);
+            $out .= sprintf "health_status: %s, ", get_cxl_health_status_text($health_status) if (defined $health_status && length $health_status);
+            $out .= sprintf "media_status: %s, ", get_cxl_media_status($media_status) if (defined $media_status && length $media_status);
+            $out .= sprintf "life_used=%u, ", $life_used if (defined $life_used && length $life_used);
+            $out .= sprintf "dirty_shutdown_cnt=%u, ", $dirty_shutdown_cnt if (defined $dirty_shutdown_cnt && length $dirty_shutdown_cnt);
+            $out .= sprintf "cor_vol_err_cnt=%u, ", $cor_vol_err_cnt if (defined $cor_vol_err_cnt && length $cor_vol_err_cnt);
+            $out .= sprintf "cor_per_err_cnt=%u, ", $cor_per_err_cnt if (defined $cor_per_err_cnt && length $cor_per_err_cnt);
+            $out .= sprintf "device_temp=%u, ", $device_temp if (defined $device_temp && length $device_temp);
+            $out .= sprintf "add_status=%u ", $add_status if (defined $add_status && length $add_status);
+            $out .= "\n";
+        }
+        if ($out ne "") {
+            print "CXL memory module events:\n$out\n";
+        } else {
+            print "No CXL memory module errors.\n\n";
+        }
     }
 
     # Extlog errors