From patchwork Fri Jan 7 19:44:50 2022
Content-Type: text/plain; charset="utf-8"
MIME-Version: 1.0
Content-Transfer-Encoding: 7bit
X-Patchwork-Submitter: Tony Luck
X-Patchwork-Id: 12707015
From: Tony Luck
To: Naoya Horiguchi
Cc: Andrew Morton, linux-mm@kvack.org, linux-kernel@vger.kernel.org,
 Youquan Song, Tony Luck
Subject: [PATCH] mm/hwpoison: Fix error page recovered but reported "not recovered"
Date: Fri, 7 Jan 2022 11:44:50 -0800
Message-Id: <20220107194450.1687264-1-tony.luck@intel.com>
X-Mailer: git-send-email 2.31.1

From: Youquan Song

When an uncorrected memory error is consumed there is a race between
the CMCI from the memory controller reporting an uncorrected error
with a UCNA signature, and the core reporting an SRAR signature
machine check when the data is about to be consumed.

If the CMCI wins that race, the page is marked poisoned when
uc_decode_notifier() calls memory_failure(), and the machine check
processing code then finds the page already poisoned.
It calls kill_accessing_process() to make sure a SIGBUS is sent, but
that function returns the wrong error code.

Console log looks like this:

[34775.674296] mce: Uncorrected hardware memory error in user-access at 3710b3400
[34775.675413] Memory failure: 0x3710b3: recovery action for dirty LRU page: Recovered
[34775.690310] Memory failure: 0x3710b3: already hardware poisoned
[34775.696247] Memory failure: 0x3710b3: Sending SIGBUS to einj_mem_uc:361438 due to hardware memory corruption
[34775.706072] mce: Memory error not recovered

Fix kill_accessing_process() to return -EHWPOISON when the walk
succeeds. This avoids the spurious "Memory error not recovered"
message and skips the duplicate SIGBUS.

[Tony: Reworded some parts of commit message]

Fixes: a3f5d80ea401 ("mm,hwpoison: send SIGBUS with error virutal address")
Signed-off-by: Youquan Song
Signed-off-by: Tony Luck
Reported-by: Youquan Song
---

This code is very subtle ... the fix makes the "not recovered" message
go away ... but I'm not more than 75% confident that this is the right
fix. Please check carefully :-)

 mm/memory-failure.c | 3 ++-
 1 file changed, 2 insertions(+), 1 deletion(-)

diff --git a/mm/memory-failure.c b/mm/memory-failure.c
index 3a274468f193..a67f558b08ea 100644
--- a/mm/memory-failure.c
+++ b/mm/memory-failure.c
@@ -707,7 +707,8 @@ static int kill_accessing_process(struct task_struct *p, unsigned long pfn,
 	if (ret == 1 && priv.tk.addr)
 		kill_proc(&priv.tk, pfn, flags);
 	mmap_read_unlock(p->mm);
-	return ret ? -EFAULT : -EHWPOISON;
+
+	return (ret < 0) ? -EFAULT : -EHWPOISON;
 }
 
 static const char *action_name[] = {
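
For reviewers, the effect of the one-line change can be sketched in
userspace. This is only an illustration, not kernel code. It assumes,
based on the "ret == 1" test in the hunk above, that the page walk
yields 1 when the poisoned page was found, 0 when it was not, and a
negative errno on failure. The names map_ret_before(), map_ret_after()
and check_mappings() are hypothetical, and the EHWPOISON fallback value
is only used where <errno.h> does not define it.

```c
#include <assert.h>
#include <errno.h>

/* EHWPOISON is Linux-specific; assumed fallback value for other systems. */
#ifndef EHWPOISON
#define EHWPOISON 133
#endif

/* Hypothetical helper: return-code mapping before the patch. */
int map_ret_before(int ret)
{
	/* Any nonzero walk result, including the "found" value 1, is an error. */
	return ret ? -EFAULT : -EHWPOISON;
}

/* Hypothetical helper: return-code mapping after the patch. */
int map_ret_after(int ret)
{
	/* Only a negative walk result is treated as an error. */
	return (ret < 0) ? -EFAULT : -EHWPOISON;
}

/* Compare both mappings for the three assumed classes of walk result. */
void check_mappings(void)
{
	/* Page found (ret == 1): the old mapping reported a failure. */
	assert(map_ret_before(1) == -EFAULT);
	assert(map_ret_after(1) == -EHWPOISON);

	/* Page not found (ret == 0): unchanged. */
	assert(map_ret_before(0) == -EHWPOISON);
	assert(map_ret_after(0) == -EHWPOISON);

	/* Walk error (ret < 0): unchanged. */
	assert(map_ret_before(-EINVAL) == -EFAULT);
	assert(map_ret_after(-EINVAL) == -EFAULT);
}
```

Under these assumptions only the "found" case changes, which is the
case that produced the "Memory error not recovered" line in the log
above.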