From patchwork Sat Oct 31 09:57:14 2020
Content-Type: text/plain; charset="utf-8"
MIME-Version: 1.0
Content-Transfer-Encoding: 7bit
X-Patchwork-Submitter: Lv Ying
X-Patchwork-Id: 11871291
Return-Path:
Received: from mail.kernel.org (pdx-korg-mail-1.web.codeaurora.org [172.30.200.123])
	by pdx-korg-patchwork-2.web.codeaurora.org (Postfix) with ESMTP id A1BEE157C
	for ; Sat, 31 Oct 2020 09:58:28 +0000 (UTC)
Received: from vger.kernel.org (vger.kernel.org [23.128.96.18])
	by mail.kernel.org (Postfix) with ESMTP id 9175320791
	for ; Sat, 31 Oct 2020 09:58:28 +0000 (UTC)
Received: (majordomo@vger.kernel.org) by vger.kernel.org via listexpand
	id S1726694AbgJaJ62 (ORCPT ); Sat, 31 Oct 2020 05:58:28 -0400
Received: from szxga05-in.huawei.com ([45.249.212.191]:7126 "EHLO
	szxga05-in.huawei.com" rhost-flags-OK-OK-OK-OK) by vger.kernel.org
	with ESMTP id S1726697AbgJaJ62 (ORCPT );
	Sat, 31 Oct 2020 05:58:28 -0400
Received: from DGGEMS403-HUB.china.huawei.com (unknown [172.30.72.58])
	by szxga05-in.huawei.com (SkyGuard) with ESMTP id 4CNZPX6YFyzLrH0;
	Sat, 31 Oct 2020 17:58:24 +0800 (CST)
Received: from huawei.com (10.175.104.175) by DGGEMS403-HUB.china.huawei.com
	(10.3.19.203) with Microsoft SMTP Server id 14.3.487.0;
	Sat, 31 Oct 2020 17:58:17 +0800
From: lvying6
To: ,
CC:
Subject: [PATCH rasdaemon 1/2] ras-page-isolation: fix do_page_offline always considers page offline is successful
Date: Sat, 31 Oct 2020 17:57:14 +0800
Message-ID: <1604138235-7142-2-git-send-email-lvying6@huawei.com>
X-Mailer: git-send-email 1.8.3.1
In-Reply-To: <1604138235-7142-1-git-send-email-lvying6@huawei.com>
References: <1604138235-7142-1-git-send-email-lvying6@huawei.com>
MIME-Version: 1.0
X-Originating-IP: [10.175.104.175]
X-CFilter-Loop: Reflected
Precedence: bulk
List-ID:
X-Mailing-List: linux-edac@vger.kernel.org

From: lvying

do_page_offline always considers the page offline successful, even if
the kernel failed to soft/hard offline the page.

I set PAGE_CE_THRESHOLD="1" in /etc/sysconfig/rasdaemon, i.e., when a
Corrected Error occurs at a page's address, rasdaemon triggers a soft
offline of that page. I also put a livepatch into the kernel's
store_soft_offline_page to observe this function's return value.

When I inject a CE into address 0x3f7ec30000:

kernel log:
soft_offline: 0x3f7ec30: unknown non LRU page type ffffe0000000000 ()
[store_soft_offline_page]return from soft_offline_page: -5

rasdaemon log:
rasdaemon[73711]: cpu 00:rasdaemon: Corrected Errors at 0x3f7ec30000 exceed threshold
rasdaemon[73711]: rasdaemon: Result of offlining page at 0x3f7ec30000: offlined

At the same time, I used strace to record rasdaemon's system calls:
strace -p 73711
openat(AT_FDCWD, "/sys/devices/system/memory/soft_offline_page", O_WRONLY|O_CREAT|O_TRUNC, 0666) = 28
fstat(28, {st_mode=S_IFREG|0200, st_size=4096, ...}) = 0
write(28, "0x3f7ec30000", 12) = -1 EIO (Input/output error)
close(28) = 0

So the kernel actually failed to soft offline pfn 0x3f7ec30, and
store_soft_offline_page returned -EIO. However, rasdaemon always
considers the page offline successful. According to the strace output,
ferror is unaware of the failure of the write syscall. So I change the
fopen-fprintf-ferror-fclose sequence to an open-write-close sequence,
which can detect the failure of the write syscall.
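
For illustration only (this is not part of the patch): the strace above shows the write being issued immediately before close, which fits ordinary stdio buffering. fprintf only fills the FILE buffer, ferror is checked while nothing has been written yet, and the failing write happens inside fclose, whose return value the old code ignored. The sketch below contrasts the two sequences. The function names offline_via_stdio and offline_via_write are hypothetical, and /dev/full (a Linux device whose writes fail with ENOSPC) is assumed as a stand-in for the sysfs file.

```c
#include <fcntl.h>
#include <stdio.h>
#include <string.h>
#include <unistd.h>

/*
 * Old sequence: fopen-fprintf-ferror-fclose.
 * Hypothetical helper; returns 0 when ferror() reports no error.
 */
int offline_via_stdio(const char *path, unsigned long long addr)
{
	FILE *f = fopen(path, "w");
	int err;

	if (!f)
		return -1;

	fprintf(f, "%#llx", addr);	/* data only lands in the stdio buffer */
	err = ferror(f) ? -1 : 0;	/* checked before any write(2) is issued */
	fclose(f);			/* flush happens here; its result is dropped */

	return err;
}

/*
 * New sequence: open-write-close.
 * Hypothetical helper; returns -1 when write(2) itself fails.
 */
int offline_via_write(const char *path, unsigned long long addr)
{
	char buf[20];
	ssize_t rc;
	int fd;

	fd = open(path, O_WRONLY);
	if (fd == -1)
		return -1;

	snprintf(buf, sizeof(buf), "%#llx", addr);
	rc = write(fd, buf, strlen(buf));	/* failure is visible right away */
	close(fd);

	return rc < 0 ? -1 : 0;
}
```

Assuming a Linux system that provides /dev/full, offline_via_stdio("/dev/full", 0x3f7ec30000ULL) is expected to report success while offline_via_write("/dev/full", 0x3f7ec30000ULL) reports the failure. Checking the return value of fclose (or calling fflush before ferror) would be another way to catch the error; the patch chose unbuffered I/O instead.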
Signed-off-by: lvying
---
 ras-page-isolation.c | 25 ++++++++++++++++---------
 1 file changed, 16 insertions(+), 9 deletions(-)

diff --git a/ras-page-isolation.c b/ras-page-isolation.c
index 50e4406..dc07545 100644
--- a/ras-page-isolation.c
+++ b/ras-page-isolation.c
@@ -17,6 +17,9 @@
 #include
 #include
 #include
+#include
+#include
+#include
 #include "ras-logger.h"
 #include "ras-page-isolation.h"
 
@@ -210,18 +213,22 @@ void ras_page_account_init(void)
 
 static int do_page_offline(unsigned long long addr, enum otype type)
 {
-	FILE *offline_file;
-	int err;
+	int fd, rc;
+	char buf[20];
 
-	offline_file = fopen(kernel_offline[type], "w");
-	if (!offline_file)
+	fd = open(kernel_offline[type], O_WRONLY);
+	if (fd == -1) {
+		log(TERM, LOG_ERR, "[%s]:open file: %s failed\n", __func__, kernel_offline[type]);
 		return -1;
+	}
 
-	fprintf(offline_file, "%#llx", addr);
-	err = ferror(offline_file) ? -1 : 0;
-	fclose(offline_file);
-
-	return err;
+	sprintf(buf, "%#llx", addr);
+	rc = write(fd, buf, strlen(buf));
+	if (rc < 0) {
+		log(TERM, LOG_ERR, "page offline addr(%s) by %s failed, errno:%d\n", buf, kernel_offline[type], errno);
+	}
+	close(fd);
+	return rc;
 }
 
 static void page_offline(struct page_record *pr)

From patchwork Sat Oct 31 09:57:15 2020
Content-Type: text/plain; charset="utf-8"
MIME-Version: 1.0
Content-Transfer-Encoding: 7bit
X-Patchwork-Submitter: Lv Ying
X-Patchwork-Id: 11871289
Return-Path:
Received: from mail.kernel.org (pdx-korg-mail-1.web.codeaurora.org [172.30.200.123])
	by pdx-korg-patchwork-2.web.codeaurora.org (Postfix) with ESMTP id 63CC214C0
	for ; Sat, 31 Oct 2020 09:58:28 +0000 (UTC)
Received: from vger.kernel.org (vger.kernel.org [23.128.96.18])
	by mail.kernel.org (Postfix) with ESMTP id 5504222202
	for ; Sat, 31 Oct 2020 09:58:28 +0000 (UTC)
Received: (majordomo@vger.kernel.org) by vger.kernel.org via listexpand
	id S1726627AbgJaJ62 (ORCPT ); Sat, 31 Oct 2020 05:58:28 -0400
Received: from szxga05-in.huawei.com ([45.249.212.191]:7125 "EHLO
	szxga05-in.huawei.com" rhost-flags-OK-OK-OK-OK) by vger.kernel.org
	with ESMTP id S1726694AbgJaJ62 (ORCPT );
	Sat, 31 Oct 2020 05:58:28 -0400
Received: from DGGEMS403-HUB.china.huawei.com (unknown [172.30.72.58])
	by szxga05-in.huawei.com (SkyGuard) with ESMTP id 4CNZPX6jcczLrK2;
	Sat, 31 Oct 2020 17:58:24 +0800 (CST)
Received: from huawei.com (10.175.104.175) by DGGEMS403-HUB.china.huawei.com
	(10.3.19.203) with Microsoft SMTP Server id 14.3.487.0;
	Sat, 31 Oct 2020 17:58:18 +0800
From: lvying6
To: ,
CC:
Subject: [PATCH rasdaemon 2/2] ras-page-isolation: page which is PAGE_OFFLINE_FAILED can be offlined again
Date: Sat, 31 Oct 2020 17:57:15 +0800
Message-ID: <1604138235-7142-3-git-send-email-lvying6@huawei.com>
X-Mailer: git-send-email 1.8.3.1
In-Reply-To: <1604138235-7142-1-git-send-email-lvying6@huawei.com>
References: <1604138235-7142-1-git-send-email-lvying6@huawei.com>
MIME-Version: 1.0
X-Originating-IP: [10.175.104.175]
X-CFilter-Loop: Reflected
Precedence: bulk
List-ID:
X-Mailing-List: linux-edac@vger.kernel.org

The OS may fail to offline a page on a previous attempt. After some
time, the page's state may change so that the OS can offline it. If
Correctable Errors on this page then reach the threshold again,
rasdaemon should try to offline the page again.
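
As a rough illustration of the behaviour change described above (not code from the patch), the two guard conditions can be written as predicates over the page state. The enum values and the names should_try_offline_old and should_try_offline_new are assumptions made for this sketch; only the state names PAGE_ONLINE, PAGE_OFFLINE and PAGE_OFFLINE_FAILED come from the patch.

```c
/* Assumed numeric values; only the state names appear in the patch. */
enum page_state {
	PAGE_ONLINE,
	PAGE_OFFLINE,
	PAGE_OFFLINE_FAILED,
};

/*
 * Guard before the patch: any page that is not PAGE_ONLINE is skipped,
 * so a page whose earlier offline attempt failed is never retried.
 */
int should_try_offline_old(enum page_state s)
{
	return s == PAGE_ONLINE;
}

/*
 * Guard after the patch: only pages that are already offline are
 * skipped, so PAGE_OFFLINE_FAILED pages get another attempt.
 */
int should_try_offline_new(enum page_state s)
{
	return s != PAGE_OFFLINE;
}
```

The two predicates differ only for PAGE_OFFLINE_FAILED, which is the case this patch targets.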
Signed-off-by: lvying6
---
 ras-page-isolation.c | 9 +++++++--
 1 file changed, 7 insertions(+), 2 deletions(-)

diff --git a/ras-page-isolation.c b/ras-page-isolation.c
index dc07545..fd7bd70 100644
--- a/ras-page-isolation.c
+++ b/ras-page-isolation.c
@@ -237,12 +237,17 @@ static void page_offline(struct page_record *pr)
 	int ret;
 
 	/* Offlining page is not required */
-	if (offline <= OFFLINE_ACCOUNT)
+	if (offline <= OFFLINE_ACCOUNT) {
+		log(TERM, LOG_INFO, "PAGE_CE_ACTION=%s, ignore to offline page at %#llx\n",
+		    offline_choice[offline].name, addr);
 		return;
+	}
 
 	/* Ignore offlined pages */
-	if (pr->offlined != PAGE_ONLINE)
+	if (pr->offlined == PAGE_OFFLINE) {
+		log(TERM, LOG_INFO, "page at %#llx is already offlined, ignore\n", addr);
 		return;
+	}
 
 	/* Time to silence this noisy page */
 	if (offline == OFFLINE_SOFT_THEN_HARD) {