Message ID | 1481184974-12505-1-git-send-email-caoj.fnst@cn.fujitsu.com (mailing list archive) |
---|---|
State | New, archived |
Delegated to: | Bjorn Helgaas |
On Thu, 8 Dec 2016 16:16:14 +0800
Cao jin <caoj.fnst@cn.fujitsu.com> wrote:

> The platform resets the link, and then calls the link_reset() callback
> on all affected device drivers. This is a PCI-Express specific state
> -and is done whenever a non-fatal error has been detected that can be
> +and is done whenever a fatal error has been detected that can be
> "solved" by resetting the link. This call informs the driver of the

As far as I can tell, the original text was correct here; why do you
think this change needs to be made?

Thanks,

jon

--
To unsubscribe from this list: send the line "unsubscribe linux-pci" in
the body of a message to majordomo@vger.kernel.org
More majordomo info at http://vger.kernel.org/majordomo-info.html
On 12/08/2016 10:05 PM, Jonathan Corbet wrote:
> On Thu, 8 Dec 2016 16:16:14 +0800
> Cao jin <caoj.fnst@cn.fujitsu.com> wrote:
>
>> The platform resets the link, and then calls the link_reset() callback
>> on all affected device drivers. This is a PCI-Express specific state
>> -and is done whenever a non-fatal error has been detected that can be
>> +and is done whenever a fatal error has been detected that can be
>> "solved" by resetting the link. This call informs the driver of the
>
> As far as I can tell, the original text was correct here; why do you
> think this change needs to be made?

See do_recovery() in the AER core: reset_link() is called only when a
fatal error is seen.
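The claim above, that the AER core resets the link only for fatal errors, can be pictured with a small toy model. This is a sketch in Python, not kernel code: `Severity` and `recovery_steps` are hypothetical names, the step names are borrowed from the callbacks discussed in the thread, and the driver return values that the real do_recovery() also consults are left out. The handling of correctable errors (logged only, no recovery) is an assumption of the sketch.

```python
from enum import Enum


class Severity(Enum):
    """Error classes used by this toy model (hypothetical names)."""
    CORRECTABLE = "correctable"
    NONFATAL = "non-fatal"
    FATAL = "fatal"


def recovery_steps(severity):
    """Return the ordered steps this toy model takes for one error."""
    if severity is Severity.CORRECTABLE:
        # Assumption: correctable errors are only logged, no recovery runs.
        return []
    steps = ["error_detected"]
    if severity is Severity.FATAL:
        # The point made in the thread: the link is reset only in this case.
        steps.append("reset_link")
    steps.append("resume")
    return steps


if __name__ == "__main__":
    for sev in Severity:
        print(sev.value, "->", recovery_steps(sev))
```

In this model a non-fatal error never reaches the link reset step, which is the behaviour the patch author says the documentation should describe.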
I suppose I'm confused, but I recall that link resets are non-fatal.
Fatal errors typically require that the PCI adapter be completely
reset, any adapter firmware be reloaded from scratch, and the device
driver kill all device state and start from scratch. It's huge.
If the fatal error is on a PCI device that is under a block device
holding a file system, then (usually) there is no way to recover,
because the block layer (and file system) cannot deal with a block
device that disappeared and then reappeared a few seconds later.
(Maybe some future zfs or lvm or btrfs might be able to deal with
this, but not today.)

By contrast, link resets are far more gentle: the device driver might
have to discard some half-full FIFOs, or cancel some in-flight
commands, but can otherwise gracefully recover without telling the
higher layers that there were any problems.

--linas
On 12/09/2016 02:24 PM, Linas Vepstas wrote:
> I suppose I'm confused, but I recall that link resets are non-fatal.
> Fatal errors typically require that the PCI adapter be completely
> reset, any adapter firmware be reloaded from scratch, and the device
> driver kill all device state and start from scratch. It's huge.
> If the fatal error is on a PCI device that is under a block device
> holding a file system, then (usually) there is no way to recover,
> because the block layer (and file system) cannot deal with a block
> device that disappeared and then reappeared a few seconds later.
> (Maybe some future zfs or lvm or btrfs might be able to deal with
> this, but not today.)
>
> By contrast, link resets are far more gentle: the device driver might
> have to discard some half-full FIFOs, or cancel some in-flight
> commands, but can otherwise gracefully recover without telling the
> higher layers that there were any problems.
>
> --linas

I am a little confused too, and not even sure we are talking about the
same *fatal error*. I mean the fatal error defined in the PCI Express
spec, section 6.2.2.2.1:

    Fatal errors are uncorrectable error conditions which render the
    particular Link and related hardware unreliable. For Fatal errors, a
    reset of the components on the Link may be required to return to
    reliable operation. Platform handling of Fatal errors, and any efforts
    to limit the effects of these errors, is platform implementation specific.

Link reset here means setting the *secondary bus reset* bit in the PCI
bridge's config space, which resets the link and the device
simultaneously. It is the strongest kind of reset as far as I know.
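The "secondary bus reset" mentioned in this message is a single bit in the bridge's Bridge Control register. The sketch below models it on a fake config space held in a `bytearray`; it does not touch real hardware. The register offset (0x3E) and bit mask (0x40) follow the standard PCI-to-PCI bridge header layout, while `read16`, `write16` and `secondary_bus_reset` are hypothetical helper names introduced for illustration. A real implementation has to wait between the two writes and again afterwards; the sketch only marks where those delays belong.

```python
PCI_BRIDGE_CONTROL = 0x3E        # 16-bit Bridge Control register (type 1 header)
PCI_BRIDGE_CTL_BUS_RESET = 0x40  # Secondary Bus Reset bit


def read16(cfg, offset):
    """Read a little-endian 16-bit value from the fake config space."""
    return int.from_bytes(cfg[offset:offset + 2], "little")


def write16(cfg, offset, value):
    """Write a little-endian 16-bit value to the fake config space."""
    cfg[offset:offset + 2] = value.to_bytes(2, "little")


def secondary_bus_reset(cfg):
    """Assert, then deassert, the Secondary Bus Reset bit.

    Returns the list of values written to Bridge Control, in order.
    """
    writes = []
    ctrl = read16(cfg, PCI_BRIDGE_CONTROL)

    ctrl |= PCI_BRIDGE_CTL_BUS_RESET
    write16(cfg, PCI_BRIDGE_CONTROL, ctrl)
    writes.append(ctrl)
    # Real hardware: hold the reset asserted for a short time here.

    ctrl &= ~PCI_BRIDGE_CTL_BUS_RESET
    write16(cfg, PCI_BRIDGE_CONTROL, ctrl)
    writes.append(ctrl)
    # Real hardware: wait for the link and devices to come back up.
    return writes


if __name__ == "__main__":
    cfg = bytearray(64)                      # fake bridge config header
    write16(cfg, PCI_BRIDGE_CONTROL, 0x0003)
    print([hex(v) for v in secondary_bus_reset(cfg)])
```

Everything below the bridge sees this reset, which is why the message calls it the strongest kind: the link and the device behind it are reset together.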
On Fri, Dec 9, 2016 at 2:37 PM, Cao jin <caoj.fnst@cn.fujitsu.com> wrote:
> I am little confused too, even not sure if we are talking the same
> *fatal error*, I am talking the fatal error defined in PCI Express spec,
> chapter 6.2.2.2.1:
>
> Fatal errors are uncorrectable error conditions which render the
> particular Link and related hardware unreliable. For Fatal errors, a
> reset of the components on the Link may be required to return to
> reliable operation. Platform handling of Fatal errors, and any efforts
> to limit the effects of these errors, is platform implementation specific.
>
> Link reset means set *secondary bus reset* bit in pci bridge config
> space, can reset the link and device simultaneously, is the strongest
> kind of reset as I know.

OK, well, it's been far too many years, and I don't have the PCI spec
at my fingertips. Isn't there a link reset that can be performed
without forcing a device reset?

The intent was that some PCI link errors are due to vibration,
ground-bounce, humidity, etc., and that these errors can be detected
and do not corrupt the device state or the device driver state. Since
they are not associated with data corruption (or rather, the
corruption is local to the link), these can be recovered by resetting
just the link, without resetting the whole adapter. They may require
resetting some device-driver state, but not all of it.

However, this was all decided before the PCI-E spec was written, so
maybe the newer PCI-E specs now say something different.

--linas
On 09/12/16 17:24, Linas Vepstas wrote:
> I suppose I'm confused, but I recall that link resets are non-fatal.
> Fatal errors typically require that the the pci adapter be completely
> reset, any adapter firmware to be reloaded from scratch, the device
> driver has to kill all device state and start from scratch. Its huge.

Is there a difference in terminology between an AER fatal error and what
EEH/IBM people think of as a fatal error?

> If the fatal error is on pci device that is under a block device
> holding a file system, then (usually) there is no way to recover,
> because the block layer (and file system) cannot deal with a block
> device that disappeared and then reappeared some few seconds later.
> (maybe some future zfs or lvm or btrfs might be able to deal with
> this, but not today)

Is this still true? I'm not at all familiar with the block device side of
it, but the cxlflash driver has reasonably full EEH support, including
surviving a full PHB fence and complete reset.
On 12/09/2016 02:44 PM, Linas Vepstas wrote:
> OK, well, its been far too many years, and I don't have the PCI spec
> at my fingertips.
> Isn't there a link reset that can be performed, without forcing a device reset?

At least I can't find the exact words saying that.
On Fri, 9 Dec 2016 14:37:47 +0800
Cao jin <caoj.fnst@cn.fujitsu.com> wrote:

> I am little confused too, even not sure if we are talking the same
> *fatal error*, I am talking the fatal error defined in PCI Express spec,
> chapter 6.2.2.2.1:

Therein lies my original discomfort with the change; it didn't seem to
make sense to talk about recovering from a fatal error. Perhaps making
it "is done whenever a fatal error (as defined in section 6.2.2.2.1) has
been detected that can be "solved" by resetting the link" or something
like that to make it clear how the term is being used?

Thanks,

jon
On Fri, 9 Dec 2016 14:44:25 +0800
Linas Vepstas <linasvepstas@gmail.com> wrote:

> OK, well, its been far too many years, and I don't have the PCI spec
> at my fingertips.
> Isn't there a link reset that can be performed, without forcing a device reset?
>
> The intent was that some PCI link errors are due to vibration,
> ground-bounce, humidity, etc. and that these errors can be detected
> and do not corrupt the device state or the device driver state. Since
> they are not associated with data corruption (or rather, the
> corruption is local to the link), these can be recovered by reseting
> just the link, without resetting the whole adapter. They may require
> reseting some device-driver state, but not all of it.
>
> However, this was all decided before the PCI-E spec was written, so
> maybe the newer PCI-E specs now say something different.

Perhaps you're thinking of link retraining? That sort of error would be
considered correctable, not fatal. Fatal errors are uncorrected errors
and a bigger hammer is needed to deal with them, such as a link reset.

Thanks,

Alex
On Fri, Dec 09, 2016 at 05:50:17PM +1100, Andrew Donnellan wrote:
> On 09/12/16 17:24, Linas Vepstas wrote:
>> I suppose I'm confused, but I recall that link resets are non-fatal.
>> Fatal errors typically require that the the pci adapter be completely
>> reset, any adapter firmware to be reloaded from scratch, the device
>> driver has to kill all device state and start from scratch. Its huge.
>
> Is there a difference in terminology between an AER fatal error and what
> EEH/IBM people think of as a fatal error?

They are different things. Depending on the PHB configuration, an AER
fatal error can lead to a frozen PE error, not a fenced PHB error.

>> If the fatal error is on pci device that is under a block device
>> holding a file system, then (usually) there is no way to recover,
>> because the block layer (and file system) cannot deal with a block
>> device that disappeared and then reappeared some few seconds later.
>> (maybe some future zfs or lvm or btrfs might be able to deal with
>> this, but not today)
>
> Is this still true? I'm not at all familiar with the block device side of it,
> but the cxlflash driver has reasonably full EEH support, including surviving
> a full PHB fence and complete reset.

It's still true, especially when the recovery is going to affect the
rootfs. On completion of error recovery, the driver (if necessary) and
the filesystem need to be reloaded. That depends on a script or daemon,
which is unavailable in this scenario.

Thanks,
Gavin
Sorry for the late reply.

On 12/09/2016 10:37 PM, Jonathan Corbet wrote:
> On Fri, 9 Dec 2016 14:37:47 +0800
> Cao jin <caoj.fnst@cn.fujitsu.com> wrote:
>
>> I am little confused too, even not sure if we are talking the same
>> *fatal error*, I am talking the fatal error defined in PCI Express spec,
>> chapter 6.2.2.2.1:
>
> Therein lies my original discomfort with the change; it didn't seem to
> make sense to talk about recovering from a fatal error. Perhaps making
> it "is done whenever a fatal error (as defined in section 6.2.2.2.1) has
> been detected that can be "solved" by resetting the link" or something
> like that to make it clear how the term is being used?

I find that the .link_reset callback of struct pci_error_handlers isn't
called by anyone (if I didn't miss anything). Only a few drivers
implement this callback, and their implementations seem meaningless.
Also, the reset_link() provided by the AER driver seems to be a
different thing from the .link_reset callback. So I am guessing this
patch is probably not quite suitable, and the doc may need a thorough
update.
diff --git a/Documentation/PCI/pci-error-recovery.txt b/Documentation/PCI/pci-error-recovery.txt
index ac26869c7db4..fcb29cdbeb1b 100644
--- a/Documentation/PCI/pci-error-recovery.txt
+++ b/Documentation/PCI/pci-error-recovery.txt
@@ -11,7 +11,7 @@

 Many PCI bus controllers are able to detect a variety of hardware
 PCI errors on the bus, such as parity errors on the data and address
-busses, as well as SERR and PERR errors. Some of the more advanced
+buses, as well as SERR and PERR errors. Some of the more advanced
 chipsets are able to deal with these errors; these include PCI-E chipsets,
 and the PCI-host bridges found on IBM Power4, Power5 and Power6-based
 pSeries boxes. A typical action taken is to disconnect the affected device,
@@ -175,7 +175,7 @@ is STEP 6 (Permanent Failure).
 >>> a value of 0xff on read, and writes will be dropped. If more than
 >>> EEH_MAX_FAILS I/O's are attempted to a frozen adapter, EEH
 >>> assumes that the device driver has gone into an infinite loop
->>> and prints an error to syslog. A reboot is then required to
+>>> and prints an error to syslog. A reboot is then required to
 >>> get the device working again.

 STEP 2: MMIO Enabled
@@ -234,7 +234,7 @@ STEP 3: Link Reset
 ------------------
 The platform resets the link, and then calls the link_reset() callback
 on all affected device drivers. This is a PCI-Express specific state
-and is done whenever a non-fatal error has been detected that can be
+and is done whenever a fatal error has been detected that can be
 "solved" by resetting the link. This call informs the driver of the
 reset and the driver should check to see if the device appears to be
 in working condition.
@@ -256,7 +256,7 @@ STEP 4: Slot Reset
 ------------------

 In response to a return value of PCI_ERS_RESULT_NEED_RESET, the
-the platform will perform a slot reset on the requesting PCI device(s).
+the platform will perform a slot reset on the requesting PCI device(s).
 The actual steps taken by a platform to perform a slot reset
 will be platform-dependent. Upon completion of slot reset, the
 platform will call the device slot_reset() callback.
@@ -276,7 +276,7 @@ configuration registers to initialize to their default conditions.

 For most PCI devices, a soft reset will be sufficient for recovery.
 Optional fundamental reset is provided to support a limited number
-of PCI Express PCI devices for which a soft reset is not sufficient
+of PCI Express PCI devices for which a soft reset is not sufficient
 for recovery.

 If the platform supports PCI hotplug, then the reset might be
@@ -321,7 +321,7 @@ driver performs device init only from PCI function 0:
 Same as above.

 Drivers for PCI Express cards that require a fundamental reset must
-set the needs_freset bit in the pci_dev structure in their probe function.
+set the needs_freset bit in the pci_dev structure in their probe function.
 For example, the QLogic qla2xxx driver sets the needs_freset bit for certain
 PCI card types:
Includes a typo fix, whitespace cleanup, and a mistake correction.

Signed-off-by: Cao jin <caoj.fnst@cn.fujitsu.com>
---
 Documentation/PCI/pci-error-recovery.txt | 12 ++++++------
 1 file changed, 6 insertions(+), 6 deletions(-)