From: Tony Luck
To: Naoya Horiguchi, Andrew Morton
Cc: Miaohe Lin, Matthew Wilcox, Shuai Xue, Dan Williams, Michael Ellerman,
 Nicholas Piggin, Christophe Leroy, linux-mm@kvack.org,
 linux-kernel@vger.kernel.org, linuxppc-dev@lists.ozlabs.org, Tony Luck
Subject: [PATCH v3 0/2] Copy-on-write poison recovery
Date: Fri, 21 Oct 2022 13:01:18 -0700
Message-Id: <20221021200120.175753-1-tony.luck@intel.com>
In-Reply-To: <20221019170835.155381-1-tony.luck@intel.com>
References: <20221019170835.155381-1-tony.luck@intel.com>
X-Patchwork-Submitter: Tony Luck
X-Patchwork-Id: 13015348

Part 1 deals with the process that triggered the copy-on-write fault
with a store to a shared read-only page. That process is sent a SIGBUS
with the usual machine check decoration to specify the virtual address
of the lost page, together with the scope.

Part 2 arranges to asynchronously take the page with the uncorrected
error offline to prevent additional machine check faults. H/t to
Miaohe Lin and Shuai Xue for pointing me to the existing function to
queue a call to memory_failure().

On x86 there is some duplicate reporting (because the error is
signalled by the memory controller as well as by the core that
triggered the machine check).
Console logs look like this:

[ 1647.723403] mce: [Hardware Error]: Machine check events logged
        Machine check from kernel copy routine

[ 1647.723414] MCE: Killing einj_mem_uc:3600 due to hardware memory corruption fault at 7f3309503400
        x86 fault handler sends SIGBUS to child process

[ 1647.735183] Memory failure: 0x905b92d: recovery action for dirty LRU page: Recovered
        Async call to memory_failure() from copy on write path

[ 1647.748397] Memory failure: 0x905b92d: already hardware poisoned
        uc_decode_notifier() processes memory controller report

[ 1647.761313] MCE: Killing einj_mem_uc:3599 due to hardware memory corruption fault at 7f3309503400
        Parent process tries to read poisoned page. Page has been unmapped, so
        #PF handler sends SIGBUS

Tony Luck (2):
  mm, hwpoison: Try to recover from copy-on write faults
  mm, hwpoison: When copy-on-write hits poison, take page offline

 include/linux/highmem.h | 24 ++++++++++++++++++++++++
 include/linux/mm.h      |  5 ++++-
 mm/memory.c             | 32 ++++++++++++++++++++++----------
 3 files changed, 50 insertions(+), 11 deletions(-)

Tested-by: Shuai Xue
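
For anyone wanting to see the SIGBUS from part 1 from the user-space side, here is a minimal sketch of a handler that picks out the address and scope. It is illustrative only and not part of the series. It assumes Linux with glibc, and it assumes that "scope" corresponds to the si_addr_lsb field of siginfo_t. The cover letter does not say whether the signal carries BUS_MCEERR_AR or BUS_MCEERR_AO, so the sketch accepts both. The names struct poison_report, decode_sigbus(), sigbus_handler() and install_sigbus_handler() are hypothetical.

```c
#define _GNU_SOURCE
#include <assert.h>
#include <signal.h>
#include <stddef.h>
#include <string.h>
#include <unistd.h>

/* Hypothetical summary of a memory-error SIGBUS; not a kernel structure. */
struct poison_report {
	int is_memory_error;	/* si_code was BUS_MCEERR_AR or BUS_MCEERR_AO */
	int action_required;	/* BUS_MCEERR_AR: the faulting access cannot continue */
	void *addr;		/* virtual address reported in si_addr */
	unsigned long span;	/* bytes affected, assumed to be 1 << si_addr_lsb */
};

/* Pure decoding step, kept separate so it can be exercised without a real fault. */
static struct poison_report decode_sigbus(const siginfo_t *si)
{
	struct poison_report r = { 0, 0, NULL, 0 };

	assert(si != NULL);
	if (si->si_signo != SIGBUS)
		return r;
	if (si->si_code != BUS_MCEERR_AR && si->si_code != BUS_MCEERR_AO)
		return r;	/* some other bus error, e.g. BUS_ADRERR */

	r.is_memory_error = 1;
	r.action_required = (si->si_code == BUS_MCEERR_AR);
	r.addr = si->si_addr;
	r.span = 1UL << si->si_addr_lsb;
	return r;
}

/* Handler uses only async-signal-safe calls: write() and _exit(). */
static void sigbus_handler(int sig, siginfo_t *si, void *ucontext)
{
	static const char msg[] = "SIGBUS: lost a page to a memory error\n";
	struct poison_report r = decode_sigbus(si);

	(void)sig;
	(void)ucontext;
	if (r.is_memory_error)
		(void)write(STDERR_FILENO, msg, sizeof(msg) - 1);
	_exit(1);
}

/* Returns 0 on success, -1 on failure (as sigaction() does). */
static int install_sigbus_handler(void)
{
	struct sigaction sa;

	memset(&sa, 0, sizeof(sa));
	sa.sa_sigaction = sigbus_handler;
	sa.sa_flags = SA_SIGINFO;	/* needed to receive siginfo_t */
	sigemptyset(&sa.sa_mask);
	return sigaction(SIGBUS, &sa, NULL);
}
```

A test program would call install_sigbus_handler() once at startup. A real application would likely do more than exit, for example discard or rebuild the data that lived at r.addr, but that policy is outside what this series covers.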