From patchwork Thu Nov 15 00:51:00 2012 Content-Type: text/plain; charset="utf-8" MIME-Version: 1.0 Content-Transfer-Encoding: 7bit X-Patchwork-Submitter: Bjorn Helgaas X-Patchwork-Id: 1744621 X-Patchwork-Delegate: bhelgaas@google.com Return-Path: X-Original-To: patchwork-linux-pci@patchwork.kernel.org Delivered-To: patchwork-process-083081@patchwork2.kernel.org Received: from vger.kernel.org (vger.kernel.org [209.132.180.67]) by patchwork2.kernel.org (Postfix) with ESMTP id C8E69DF264 for ; Thu, 15 Nov 2012 00:51:06 +0000 (UTC) Received: (majordomo@vger.kernel.org) by vger.kernel.org via listexpand id S1161081Ab2KOAvF (ORCPT ); Wed, 14 Nov 2012 19:51:05 -0500 Received: from mail-fa0-f74.google.com ([209.85.161.74]:43672 "EHLO mail-fa0-f74.google.com" rhost-flags-OK-OK-OK-OK) by vger.kernel.org with ESMTP id S1755462Ab2KOAvE (ORCPT ); Wed, 14 Nov 2012 19:51:04 -0500 Received: by mail-fa0-f74.google.com with SMTP id w1so71652fad.1 for ; Wed, 14 Nov 2012 16:51:02 -0800 (PST) DKIM-Signature: v=1; a=rsa-sha256; c=relaxed/relaxed; d=google.com; s=20120113; h=date:from:to:cc:subject:message-id:references:mime-version :content-type:content-disposition:in-reply-to:user-agent; bh=rexU93M5CEMmfsOsbWpJk7tkNbVJlM2rlQHnOHlThPI=; b=n+8/Zd9kwXEG6YD6ui5USwt91izbBC0duSzAQn+6pqsxUiLT/P6UoA/2FfKKyQU/PE 4UV16IhPkNiGgvVismcPBvOA8xhUxiGKf+vA8n/kB+brCx+tgjXG66kNAzVMIqc29dgP 1Ms8xktukKK/JLcrUXaViywP0lPwOsKw1KArTikd0czj52kRY8VhsfFc/GhOsMyOLUG0 2Ijcb2uh8zEtIWa6RXjtoFAhmXj6Yz+tIwrOFGdxsdow50cK26v4H+cpZxvzUnOfyZCe CN1OS3zJvlwe+6nXopM0dVOhEcumC7BiVfQzlV/eiZkxSha97nGok3saH5uHdwP/XgHT FChw== X-Google-DKIM-Signature: v=1; a=rsa-sha256; c=relaxed/relaxed; d=google.com; s=20120113; h=date:from:to:cc:subject:message-id:references:mime-version :content-type:content-disposition:in-reply-to:user-agent :x-gm-message-state; bh=rexU93M5CEMmfsOsbWpJk7tkNbVJlM2rlQHnOHlThPI=; b=AGjHM7MmRuZBjbTs16UCC/e+Ceb0ApatHsJviTUx0rRvQCKcl4vJVnIRoFZgEHaElx gaX1zVrptnG3ebD+jdhrNxdHqFXJ7n+//i10w/9egsuyWjuhaz6fGD8oxX3wOy2fzmL9 QRSN/Ba3cgW/DhajcmIn+T5WHF4ojin/qYwU+af/Uvz9TnpXwlOOwRCJfGeveY7PK0kz htnDSPQ9uwP/LH5Rv60x7k8GA75Nhd8ULek5HGlDRniw75z8H1jyd1zzt1EuoXu1Dd/E qLUG7eTNSp937dMk5x7y6E55dx4Uk5T+7GqALQJRmmbZPnUT1+Xsc+XvvvyNx14Zq37C Av2Q== Received: by 10.14.199.6 with SMTP id w6mr29012162een.0.1352940661977; Wed, 14 Nov 2012 16:51:01 -0800 (PST) Received: from hpza9.eem.corp.google.com ([74.125.121.33]) by gmr-mx.google.com with ESMTPS id u8si3477606een.1.2012.11.14.16.51.01 (version=TLSv1/SSLv3 cipher=AES128-SHA); Wed, 14 Nov 2012 16:51:01 -0800 (PST) Received: from bhelgaas.mtv.corp.google.com (bhelgaas.mtv.corp.google.com [172.17.131.112]) by hpza9.eem.corp.google.com (Postfix) with ESMTP id 72B0E5C0060; Wed, 14 Nov 2012 16:51:01 -0800 (PST) Received: by bhelgaas.mtv.corp.google.com (Postfix, from userid 131485) id BCF67180C00; Wed, 14 Nov 2012 16:51:00 -0800 (PST) Date: Wed, 14 Nov 2012 17:51:00 -0700 From: Bjorn Helgaas To: "Pandarathil, Vijaymohan R" Cc: "linux-kernel@vger.kernel.org" , "linux-pci@vger.kernel.org" , "linasvepstas@gmail.com" , Myron Stowe , Lance Ortiz , Huang Ying , Hidetoshi Seto , Andrew Patterson , Zhang Yanmin Subject: Re: [ PATCH RESEND ] PCI-AER: Do not report successful error recovery for devices with AER-unaware drivers Message-ID: <20121115005100.GA6613@google.com> References: MIME-Version: 1.0 Content-Disposition: inline In-Reply-To: User-Agent: Mutt/1.5.20 (2009-06-14) X-Gm-Message-State: ALoCoQnrVhViJotad6uq4cN0XpNjDN9NUhdsSfoqlj3jKuTh+YNTV8x5CDoQziynhWS09V0cVkAzVq1J48HozHXMzL551B08ZGP+p/c6neLNeZG+0xKYNANV6/RaPs1nARGEVoIhED5nOT85O1dbNmCXCuKDF+9M8cHUdFDiT2GjmcE5OWwpO/7AMEt0uwnBFuWsLDg+VJOLpINlIg1Zn45kI24XB967KA== Sender: linux-pci-owner@vger.kernel.org Precedence: bulk List-ID: X-Mailing-List: linux-pci@vger.kernel.org [+cc Lance, Huang, Hidetoshi, Andrew, Zhang] On Sat, Nov 10, 2012 at 07:41:04AM +0000, Pandarathil, Vijaymohan R wrote: > When an error is detected on a PCIe device which does not have an > AER-aware driver, prevent AER infrastructure from reporting > successful error recovery. > > This is because the report_error_detected() function that gets > called in the first phase of recovery process allows forward > progress even when the driver for the device does not have AER > capabilities. It seems that all callbacks (in pci_error_handlers > structure) registered by drivers that gets called during error > recovery are not mandatory. So the intention of the infrastructure > design seems to be to allow forward progress even when a specific > callback has not been registered by a driver. However, if error > handler structure itself has not been registered, it doesn't make > sense to allow forward progress. > > As a result of the current design, in the case of a single device > having an AER-unaware driver or in the case of any function in a > multi-function card having an AER-unaware driver, a successful > recovery is reported. > > Typical scenario this happens is when a PCI device is detached > from a KVM host and the pci-stub driver on the host claims the > device. The pci-stub driver does not have error handling capabilities > but the AER infrastructure still reports that the device recovered > successfully. > > The changes proposed here leaves the device in an unrecovered state > if the driver for the device or for any function in a multi-function > card does not have error handler structure registered. This reflects > the true state of the device and prevents any partial recovery (or no > recovery at all) reported as successful. > > Please also see comments from Linas Vepstas at the following link > http://www.spinics.net/lists/linux-pci/msg18572.html > > Reviewed-by: Linas Vepstas gmail.com> > Reviewed-by: Myron Stowe redhat.com> > Signed-off-by: Vijay Mohan Pandarathil hp.com> > > --- > > drivers/pci/pcie/aer/aerdrv_core.c | 6 ++++++ > 1 file changed, 6 insertions(+) > > diff --git a/drivers/pci/pcie/aer/aerdrv_core.c b/drivers/pci/pcie/aer/aerdrv_core.c > index 06bad96..030b229 100644 > --- a/drivers/pci/pcie/aer/aerdrv_core.c > +++ b/drivers/pci/pcie/aer/aerdrv_core.c > @@ -215,6 +215,12 @@ static int report_error_detected(struct pci_dev *dev, void *data) > > dev->error_state = result_data->state; > > + if ((!dev->driver || !dev->driver->err_handler) && > + !(dev->hdr_type & PCI_HEADER_TYPE_BRIDGE)) { > + dev_info(&dev->dev, "AER: Error detected but no driver has claimed this device or the driver is AER-unaware\n"); > + result_data->result = PCI_ERS_RESULT_NONE; > + return 1; This doesn't seem right because returning 1 here causes pci_walk_bus() to terminate, which means we won't set dev->error_state or notify drivers for any devices we haven't visited yet. > + } > if (!dev->driver || > !dev->driver->err_handler || > !dev->driver->err_handler->error_detected) { If none of the drivers in the affected hierarchy supports error handling, I think the call tree looks like this: do_recovery # uncorrectable only broadcast_error_message(dev, ..., report_error_detected) result_data.result = CAN_RECOVER pci_walk_bus(..., report_error_detected) report_error_detected # (each dev in subtree) return 0 # no change to result return result_data.result broadcast_error_message(dev, ..., report_mmio_enabled) result_data.result = PCI_ERS_RESULT_RECOVERED pci_walk_bus(..., report_mmio_enabled) report_mmio_enabled # (each dev in subtree) return 0 # no change to result dev_info("recovery successful") Specifically, there are no error_handler functions, so we never change result_data.result, and the default is that we treat the error as "recovered successfully." That seems broken. An uncorrectable error is by definition recoverable only by device-specific software, i.e., the driver. We didn't call any drivers, so we can't have recovered anything. What do you think of the following alternative? I don't know why you checked for bridge devices in your patch, so I don't know whether that's important here or not. --- To unsubscribe from this list: send the line "unsubscribe linux-pci" in the body of a message to majordomo@vger.kernel.org More majordomo info at http://vger.kernel.org/majordomo-info.html diff --git a/drivers/pci/pcie/aer/aerdrv_core.c b/drivers/pci/pcie/aer/aerdrv_core.c index 06bad96..a109c68 100644 --- a/drivers/pci/pcie/aer/aerdrv_core.c +++ b/drivers/pci/pcie/aer/aerdrv_core.c @@ -231,11 +231,11 @@ static int report_error_detected(struct pci_dev *dev, void *data) dev->driver ? "no AER-aware driver" : "no driver"); } - return 0; + vote = PCI_ERS_RESULT_DISCONNECT; + } else { + err_handler = dev->driver->err_handler; + vote = err_handler->error_detected(dev, result_data->state); } - - err_handler = dev->driver->err_handler; - vote = err_handler->error_detected(dev, result_data->state); result_data->result = merge_result(result_data->result, vote); return 0; }