From patchwork Mon Apr 30 21:33:52 2018
Content-Type: text/plain; charset="utf-8"
MIME-Version: 1.0
Content-Transfer-Encoding: 7bit
X-Patchwork-Submitter: "Alex G."
X-Patchwork-Id: 10372959
From: Alexandru Gagniuc
To: bp@alien8.de
Cc: alex_gagniuc@dellteam.com, austin_bolen@dell.com, shyam_iyer@dell.com,
 Alexandru Gagniuc, "Rafael J. Wysocki", Len Brown, Tony Luck,
 Mauro Carvalho Chehab, Robert Moore, Erik Schmauss, Tyler Baicar,
 Will Deacon, James Morse, Shiju Jose, "Jonathan (Zhixiong) Zhang",
 Dongjiu Geng, linux-acpi@vger.kernel.org, linux-kernel@vger.kernel.org,
 linux-edac@vger.kernel.org, devel@acpica.org
Subject: [RFC PATCH v4 3/3] acpi: apei: Do not panic() on PCIe errors
 reported through GHES
Date: Mon, 30 Apr 2018 16:33:52 -0500
Message-Id: <20180430213358.8319-3-mr.nuke.me@gmail.com>
X-Mailer: git-send-email 2.14.3
In-Reply-To: <20180430213358.8319-1-mr.nuke.me@gmail.com>
References: <20180430212836.7807-1-mr.nuke.me@gmail.com>
 <20180430213358.8319-1-mr.nuke.me@gmail.com>
Sender: linux-acpi-owner@vger.kernel.org
Precedence: bulk
X-Mailing-List: linux-acpi@vger.kernel.org

The policy was to panic() when GHES said that an error is "Fatal".
This logic is wrong for several reasons, as it doesn't take into
account what caused the error.

PCIe fatal errors indicate that the link to a device is either
unstable or unusable. They don't indicate that the machine is on
fire, and they are not severe enough that we need to panic(). Instead
of relying on crackmonkey firmware, evaluate the error severity based
on what caused the error (GHES subsections).

Signed-off-by: Alexandru Gagniuc
---
 drivers/acpi/apei/ghes.c | 45 ++++++++++++++++++++++++++++++++++++++++++---
 1 file changed, 42 insertions(+), 3 deletions(-)

diff --git a/drivers/acpi/apei/ghes.c b/drivers/acpi/apei/ghes.c
index c9f1971333c1..49318fba409c 100644
--- a/drivers/acpi/apei/ghes.c
+++ b/drivers/acpi/apei/ghes.c
@@ -425,8 +425,7 @@ static void ghes_handle_memory_failure(struct acpi_hest_generic_data *gdata, int
  * GHES_SEV_RECOVERABLE -> AER_NONFATAL
  * GHES_SEV_RECOVERABLE && CPER_SEC_RESET -> AER_FATAL
  *     These both need to be reported and recovered from by the AER driver.
- * GHES_SEV_PANIC does not make it to this handling since the kernel must
- *     panic.
+ * GHES_SEV_PANIC -> AER_FATAL
  */
 static void ghes_handle_aer(struct acpi_hest_generic_data *gdata)
 {
@@ -459,6 +458,46 @@ static void ghes_handle_aer(struct acpi_hest_generic_data *gdata)
 #endif
 }
 
+/* PCIe errors should not cause a panic. */
+static int ghes_sec_pcie_severity(struct acpi_hest_generic_data *gdata)
+{
+	struct cper_sec_pcie *pcie_err = acpi_hest_get_payload(gdata);
+
+	if (pcie_err->validation_bits & CPER_PCIE_VALID_DEVICE_ID &&
+	    pcie_err->validation_bits & CPER_PCIE_VALID_AER_INFO &&
+	    IS_ENABLED(CONFIG_ACPI_APEI_PCIEAER))
+		return CPER_SEV_RECOVERABLE;
+
+	return ghes_cper_severity(gdata->error_severity);
+}
+/*
+ * The severity field in the status block is oftentimes more severe than it
+ * needs to be. This makes it an unreliable metric for the severity. A more
+ * reliable way is to look at each subsection and correlate it with how well
+ * the error can be handled.
+ *   - SEC_PCIE: All PCIe errors can be handled by AER.
+ */
+static int ghes_severity(struct ghes *ghes)
+{
+	int worst_sev, sec_sev;
+	struct acpi_hest_generic_data *gdata;
+	const guid_t *section_type;
+	const struct acpi_hest_generic_status *estatus = ghes->estatus;
+
+	worst_sev = GHES_SEV_NO;
+	apei_estatus_for_each_section(estatus, gdata) {
+		section_type = (guid_t *)gdata->section_type;
+		sec_sev = ghes_cper_severity(gdata->error_severity);
+
+		if (guid_equal(section_type, &CPER_SEC_PCIE))
+			sec_sev = ghes_sec_pcie_severity(gdata);
+
+		worst_sev = max(worst_sev, sec_sev);
+	}
+
+	return worst_sev;
+}
+
 static void ghes_do_proc(struct ghes *ghes,
 			 const struct acpi_hest_generic_status *estatus)
 {
@@ -944,7 +983,7 @@ static int ghes_notify_nmi(unsigned int cmd, struct pt_regs *regs)
 			ret = NMI_HANDLED;
 		}
 
-		sev = ghes_cper_severity(ghes->estatus->error_severity);
+		sev = ghes_severity(ghes);
 		if (sev >= GHES_SEV_PANIC) {
 			oops_begin();
 			ghes_print_queued_estatus();
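
For readers without a kernel tree at hand, the per-section policy described
in the commit message can be tried in userspace. The following is only a
minimal sketch under stated assumptions, not the kernel code: every sketch_*
name is hypothetical, the section type is a plain enum rather than a GUID
match, the two validation bits are modelled as booleans, and all severities
are kept on a single ordered scale so that "worst" is simply the maximum.

```c
#include <assert.h>
#include <stdbool.h>
#include <stddef.h>

/* Hypothetical severity scale for this sketch; higher means worse. */
enum sketch_sev {
	SKETCH_SEV_NO = 0,
	SKETCH_SEV_CORRECTED = 1,
	SKETCH_SEV_RECOVERABLE = 2,
	SKETCH_SEV_PANIC = 3,
};

/* Hypothetical section kinds; the patch compares section GUIDs instead. */
enum sketch_sec_type {
	SKETCH_SEC_OTHER,
	SKETCH_SEC_PCIE,
};

/* Stand-in for one section of an error status block. */
struct sketch_section {
	enum sketch_sec_type type;
	enum sketch_sev reported_sev;	/* severity claimed by firmware */
	bool has_device_id;		/* models the "device ID valid" bit */
	bool has_aer_info;		/* models the "AER info valid" bit */
};

/*
 * A PCIe section with enough information for AER handling is treated as
 * recoverable; otherwise the firmware-reported severity is kept.
 */
static enum sketch_sev sketch_pcie_severity(const struct sketch_section *sec,
					    bool aer_enabled)
{
	if (sec->has_device_id && sec->has_aer_info && aer_enabled)
		return SKETCH_SEV_RECOVERABLE;

	return sec->reported_sev;
}

/* Walk every section and return the worst per-section severity. */
static enum sketch_sev sketch_severity(const struct sketch_section *secs,
				       size_t n, bool aer_enabled)
{
	enum sketch_sev worst = SKETCH_SEV_NO;
	size_t i;

	assert(secs != NULL || n == 0);

	for (i = 0; i < n; i++) {
		enum sketch_sev sev = secs[i].reported_sev;

		if (secs[i].type == SKETCH_SEC_PCIE)
			sev = sketch_pcie_severity(&secs[i], aer_enabled);

		if (sev > worst)
			worst = sev;
	}

	return worst;
}
```

With this model, a status block holding only a "fatal" PCIe section that
carries both validity flags evaluates to recoverable when AER handling is
enabled, while a block that also contains a fatal section of another type
still evaluates to panic.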