pci-error-recover: doc cleanup

Message ID 1481184974-12505-1-git-send-email-caoj.fnst@cn.fujitsu.com (mailing list archive)
State New, archived
Delegated to: Bjorn Helgaas
Commit Message

Cao jin Dec. 8, 2016, 8:16 a.m. UTC
Fix typos, remove stray whitespace, and correct a mistake.

Signed-off-by: Cao jin <caoj.fnst@cn.fujitsu.com>
---
 Documentation/PCI/pci-error-recovery.txt | 12 ++++++------
 1 file changed, 6 insertions(+), 6 deletions(-)

Comments

Jonathan Corbet Dec. 8, 2016, 2:05 p.m. UTC | #1
On Thu, 8 Dec 2016 16:16:14 +0800
Cao jin <caoj.fnst@cn.fujitsu.com> wrote:

>  The platform resets the link, and then calls the link_reset() callback
>  on all affected device drivers.  This is a PCI-Express specific state
> -and is done whenever a non-fatal error has been detected that can be
> +and is done whenever a fatal error has been detected that can be
>  "solved" by resetting the link. This call informs the driver of the

As far as I can tell, the original text was correct here; why do you
think this change needs to be made?

Thanks,

jon
--
To unsubscribe from this list: send the line "unsubscribe linux-pci" in
the body of a message to majordomo@vger.kernel.org
More majordomo info at  http://vger.kernel.org/majordomo-info.html
Cao jin Dec. 8, 2016, 2:13 p.m. UTC | #2
On 12/08/2016 10:05 PM, Jonathan Corbet wrote:
> On Thu, 8 Dec 2016 16:16:14 +0800
> Cao jin <caoj.fnst@cn.fujitsu.com> wrote:
> 
>>  The platform resets the link, and then calls the link_reset() callback
>>  on all affected device drivers.  This is a PCI-Express specific state
>> -and is done whenever a non-fatal error has been detected that can be
>> +and is done whenever a fatal error has been detected that can be
>>  "solved" by resetting the link. This call informs the driver of the
> 
> As far as I can tell, the original text was correct here; why do you
> think this change needs to be made?
> 

See do_recovery() in the AER core: reset_link() is called only when a
fatal error is seen.
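
A rough userspace model of that flow may make the point easier to follow.
It is only a sketch: the enum names, the struct, and model_recovery() are
hypothetical stand-ins rather than the kernel's own code, and it assumes
the mock driver recovers as soon as it is given the chance.

```c
#include <stdbool.h>

/* Hypothetical stand-ins for the AER severities and driver result codes. */
enum severity { SEV_NONFATAL, SEV_FATAL };
enum ers_result { ERS_CAN_RECOVER, ERS_NEED_RESET, ERS_RECOVERED, ERS_DISCONNECT };

/* Records which recovery steps the model went through. */
struct recovery_trace {
	bool link_was_reset;    /* a reset_link()-style reset happened */
	bool mmio_enabled_sent; /* mmio_enabled step was reached */
	bool slot_reset_sent;   /* slot_reset step was reached */
	bool resumed;           /* resume step was reached */
};

/*
 * driver_vote is what the (mock) driver returned when it was told
 * about the error.
 */
static struct recovery_trace model_recovery(enum severity sev,
					    enum ers_result driver_vote)
{
	struct recovery_trace t = { false, false, false, false };
	enum ers_result status = driver_vote;

	/* The point under discussion: only a fatal error resets the link. */
	if (sev == SEV_FATAL)
		t.link_was_reset = true;

	if (status == ERS_CAN_RECOVER) {
		t.mmio_enabled_sent = true;
		status = ERS_RECOVERED; /* assumption: mock driver recovers */
	}

	if (status == ERS_NEED_RESET) {
		t.slot_reset_sent = true;
		status = ERS_RECOVERED; /* assumption: mock driver recovers */
	}

	if (status == ERS_RECOVERED)
		t.resumed = true;

	return t;
}
```

In this model a non-fatal error never sets link_was_reset, whatever the
driver votes, which is the behaviour the patch tries to describe.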
Linas Vepstas Dec. 9, 2016, 6:24 a.m. UTC | #3
I suppose I'm confused, but I recall that link resets are non-fatal.
Fatal errors typically require that the PCI adapter be completely
reset, any adapter firmware be reloaded from scratch, and the device
driver kill all device state and start from scratch. It's huge.
If the fatal error is on a PCI device that sits under a block device
holding a file system, then (usually) there is no way to recover,
because the block layer (and file system) cannot deal with a block
device that disappeared and then reappeared a few seconds later.
(Maybe some future zfs or lvm or btrfs might be able to deal with
this, but not today.)

By contrast, link resets are far more gentle: the device driver might
have to discard some half-full FIFOs, or cancel some in-flight
commands, but can otherwise gracefully recover without telling the
higher layers that there were any problems.

--linas

Cao jin Dec. 9, 2016, 6:37 a.m. UTC | #4
On 12/09/2016 02:24 PM, Linas Vepstas wrote:
> I suppose I'm confused, but I recall that link resets are non-fatal.
> Fatal errors typically require that the the pci adapter be completely
> reset, any adapter firmware to be reloaded from scratch, the device
> driver has to kill all device state and start from scratch. Its huge.
> If the fatal error is on pci device that is under a block device
> holding a file system, then (usually) there is no way to recover,
> because the block layer (and file system) cannot deal with a block
> device that disappeared and then reappeared some few seconds later.
> (maybe some future zfs or lvm or btrfs might be able to deal with
> this, but not today)
> 
> By contrast, link resets are far more gentle: the device driver might
> have to discard some half-full FIFO's, or cancel some in-flight
> commands, but can otherwise gracefully recover without telling the
> higher layers that there were any problems.
> 
> --linas
> 

I am a little confused too, and not even sure we are talking about the
same *fatal error*. I mean the fatal error defined in the PCI Express
spec, section 6.2.2.2.1:

Fatal errors are uncorrectable error conditions which render the
particular Link and related hardware unreliable. For Fatal errors, a
reset of the components on the Link may be required to return to
reliable operation. Platform handling of Fatal errors, and any efforts
to limit the effects of these errors, is platform implementation specific.

Link reset means setting the *secondary bus reset* bit in the PCI bridge
config space, which resets the link and the device simultaneously. It is
the strongest kind of reset, as far as I know.
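
To make that concrete, here is a small sketch against a mock config space.
The register offset (0x3E, Bridge Control) and the bit (bit 6, Secondary
Bus Reset) come from the PCI-to-PCI bridge header layout; struct
mock_bridge and the helper functions are made up for illustration, and the
delays real hardware needs are only noted in comments.

```c
#include <stdint.h>

/* Bridge (type 1) configuration header: Bridge Control register. */
#define BRIDGE_CONTROL_OFFSET 0x3E   /* 16-bit register */
#define BRIDGE_CTL_BUS_RESET  0x0040 /* bit 6: Secondary Bus Reset */

/* Mock config space; a real platform would use its config accessors. */
struct mock_bridge {
	uint8_t cfg[256];
	unsigned resets_seen; /* counts 0 -> 1 transitions of the reset bit */
};

static uint16_t cfg_read16(const struct mock_bridge *b, unsigned off)
{
	/* Config space is little-endian. */
	return (uint16_t)(b->cfg[off] | (b->cfg[off + 1] << 8));
}

static void cfg_write16(struct mock_bridge *b, unsigned off, uint16_t val)
{
	uint16_t old = cfg_read16(b, off);

	/* Model the hardware side effect: asserting the bit starts a reset. */
	if (off == BRIDGE_CONTROL_OFFSET &&
	    !(old & BRIDGE_CTL_BUS_RESET) && (val & BRIDGE_CTL_BUS_RESET))
		b->resets_seen++;

	b->cfg[off] = (uint8_t)(val & 0xFF);
	b->cfg[off + 1] = (uint8_t)(val >> 8);
}

/* Assert, then deassert, Secondary Bus Reset, preserving the other bits. */
static void secondary_bus_reset(struct mock_bridge *b)
{
	uint16_t ctrl = cfg_read16(b, BRIDGE_CONTROL_OFFSET);

	cfg_write16(b, BRIDGE_CONTROL_OFFSET,
		    (uint16_t)(ctrl | BRIDGE_CTL_BUS_RESET));
	/* Real hardware: hold the reset asserted for a short time here. */
	cfg_write16(b, BRIDGE_CONTROL_OFFSET,
		    (uint16_t)(ctrl & ~BRIDGE_CTL_BUS_RESET));
	/* Real hardware: wait again before touching devices below. */
}
```

Everything below the bridge sees the reset, which is why this acts on the
link and the device together rather than on the link alone.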

Linas Vepstas Dec. 9, 2016, 6:44 a.m. UTC | #5
On Fri, Dec 9, 2016 at 2:37 PM, Cao jin <caoj.fnst@cn.fujitsu.com> wrote:
>
>
> On 12/09/2016 02:24 PM, Linas Vepstas wrote:
>> I suppose I'm confused, but I recall that link resets are non-fatal.
>> Fatal errors typically require that the the pci adapter be completely
>> reset, any adapter firmware to be reloaded from scratch, the device
>> driver has to kill all device state and start from scratch. Its huge.
>> If the fatal error is on pci device that is under a block device
>> holding a file system, then (usually) there is no way to recover,
>> because the block layer (and file system) cannot deal with a block
>> device that disappeared and then reappeared some few seconds later.
>> (maybe some future zfs or lvm or btrfs might be able to deal with
>> this, but not today)
>>
>> By contrast, link resets are far more gentle: the device driver might
>> have to discard some half-full FIFO's, or cancel some in-flight
>> commands, but can otherwise gracefully recover without telling the
>> higher layers that there were any problems.
>>
>> --linas
>>
>
> I am little confused too, even not sure if we are talking the same
> *fatal error*, I am talking the fatal error defined in PCI Express spec,
> chapter 6.2.2.2.1:
>
> Fatal errors are uncorrectable error conditions which render the
> particular Link and related hardware unreliable. For Fatal errors, a
> reset of the components on the Link may be required to return to
> reliable operation. Platform handling of Fatal errors, and any efforts
> to limit the effects of these errors, is platform implementation specific.
>
> Link reset means set *secondary bus reset* bit in pci bridge config
> space, can reset the link and device simultaneously, is the strongest
> kind of reset as I know.

OK, well, it's been far too many years, and I don't have the PCI spec
at my fingertips.
Isn't there a link reset that can be performed without forcing a device reset?

The intent was that some PCI link errors are due to vibration,
ground-bounce, humidity, etc. and that these errors can be detected
and do not corrupt the device state or the device driver state.  Since
they are not associated with data corruption (or rather, the
corruption is local to the link), these can be recovered by resetting
just the link, without resetting the whole adapter. They may require
resetting some device-driver state, but not all of it.

However, this was all decided before the PCI-E spec was written, so
maybe the newer PCI-E specs now say something different.

--linas

Andrew Donnellan Dec. 9, 2016, 6:50 a.m. UTC | #6
On 09/12/16 17:24, Linas Vepstas wrote:
> I suppose I'm confused, but I recall that link resets are non-fatal.
> Fatal errors typically require that the the pci adapter be completely
> reset, any adapter firmware to be reloaded from scratch, the device
> driver has to kill all device state and start from scratch. Its huge.

Is there a difference in terminology between an AER fatal error and what 
EEH/IBM people think of as a fatal error?

> If the fatal error is on pci device that is under a block device
> holding a file system, then (usually) there is no way to recover,
> because the block layer (and file system) cannot deal with a block
> device that disappeared and then reappeared some few seconds later.
> (maybe some future zfs or lvm or btrfs might be able to deal with
> this, but not today)

Is this still true? I'm not at all familiar with the block device side 
of it, but the cxlflash driver has reasonably full EEH support, 
including surviving a full PHB fence and complete reset.
Cao jin Dec. 9, 2016, 7:59 a.m. UTC | #7
On 12/09/2016 02:44 PM, Linas Vepstas wrote:
> On Fri, Dec 9, 2016 at 2:37 PM, Cao jin <caoj.fnst@cn.fujitsu.com> wrote:
>>
>>
>> On 12/09/2016 02:24 PM, Linas Vepstas wrote:
>>> I suppose I'm confused, but I recall that link resets are non-fatal.
>>> Fatal errors typically require that the the pci adapter be completely
>>> reset, any adapter firmware to be reloaded from scratch, the device
>>> driver has to kill all device state and start from scratch. Its huge.
>>> If the fatal error is on pci device that is under a block device
>>> holding a file system, then (usually) there is no way to recover,
>>> because the block layer (and file system) cannot deal with a block
>>> device that disappeared and then reappeared some few seconds later.
>>> (maybe some future zfs or lvm or btrfs might be able to deal with
>>> this, but not today)
>>>
>>> By contrast, link resets are far more gentle: the device driver might
>>> have to discard some half-full FIFO's, or cancel some in-flight
>>> commands, but can otherwise gracefully recover without telling the
>>> higher layers that there were any problems.
>>>
>>> --linas
>>>
>>
>> I am little confused too, even not sure if we are talking the same
>> *fatal error*, I am talking the fatal error defined in PCI Express spec,
>> chapter 6.2.2.2.1:
>>
>> Fatal errors are uncorrectable error conditions which render the
>> particular Link and related hardware unreliable. For Fatal errors, a
>> reset of the components on the Link may be required to return to
>> reliable operation. Platform handling of Fatal errors, and any efforts
>> to limit the effects of these errors, is platform implementation specific.
>>
>> Link reset means set *secondary bus reset* bit in pci bridge config
>> space, can reset the link and device simultaneously, is the strongest
>> kind of reset as I know.
> 
> OK, well, its been far too many years, and I don't have the PCI spec
> at my fingertips.
> Isn't there a link reset that can be performed, without forcing a device reset?
> 

At least, I can't find wording in the spec that says so.
Jonathan Corbet Dec. 9, 2016, 2:37 p.m. UTC | #8
On Fri, 9 Dec 2016 14:37:47 +0800
Cao jin <caoj.fnst@cn.fujitsu.com> wrote:

> I am little confused too, even not sure if we are talking the same
> *fatal error*, I am talking the fatal error defined in PCI Express spec,
> chapter 6.2.2.2.1:

Therein lies my original discomfort with the change; it didn't seem to
make sense to talk about recovering from a fatal error.  Perhaps making
it "is done whenever a fatal error (as defined in section 6.2.2.2.1) has
been detected that can be "solved" by resetting the link" or something
like that to make it clear how the term is being used?

Thanks,

jon
Alex Williamson Dec. 9, 2016, 4:11 p.m. UTC | #9
On Fri, 9 Dec 2016 14:44:25 +0800
Linas Vepstas <linasvepstas@gmail.com> wrote:

> On Fri, Dec 9, 2016 at 2:37 PM, Cao jin <caoj.fnst@cn.fujitsu.com> wrote:
> >
> >
> > On 12/09/2016 02:24 PM, Linas Vepstas wrote:  
> >> I suppose I'm confused, but I recall that link resets are non-fatal.
> >> Fatal errors typically require that the the pci adapter be completely
> >> reset, any adapter firmware to be reloaded from scratch, the device
> >> driver has to kill all device state and start from scratch. Its huge.
> >> If the fatal error is on pci device that is under a block device
> >> holding a file system, then (usually) there is no way to recover,
> >> because the block layer (and file system) cannot deal with a block
> >> device that disappeared and then reappeared some few seconds later.
> >> (maybe some future zfs or lvm or btrfs might be able to deal with
> >> this, but not today)
> >>
> >> By contrast, link resets are far more gentle: the device driver might
> >> have to discard some half-full FIFO's, or cancel some in-flight
> >> commands, but can otherwise gracefully recover without telling the
> >> higher layers that there were any problems.
> >>
> >> --linas
> >>  
> >
> > I am little confused too, even not sure if we are talking the same
> > *fatal error*, I am talking the fatal error defined in PCI Express spec,
> > chapter 6.2.2.2.1:
> >
> > Fatal errors are uncorrectable error conditions which render the
> > particular Link and related hardware unreliable. For Fatal errors, a
> > reset of the components on the Link may be required to return to
> > reliable operation. Platform handling of Fatal errors, and any efforts
> > to limit the effects of these errors, is platform implementation specific.
> >
> > Link reset means set *secondary bus reset* bit in pci bridge config
> > space, can reset the link and device simultaneously, is the strongest
> > kind of reset as I know.  
> 
> OK, well, its been far too many years, and I don't have the PCI spec
> at my fingertips.
> Isn't there a link reset that can be performed, without forcing a device reset?
> 
> The intent was that some PCI link errors are due to vibration,
> ground-bounce, humidity, etc. and that these errors can be detected
> and do not corrupt the device state or the device driver state.  Since
> they are not associated with data corruption (or rather, the
> corruption is local to the link), these can be recovered by reseting
> just the link, without resetting the whole adapter. They may require
> reseting some device-driver state, but not all of it.
> 
> However, this was all decided before the PCI-E spec was written, so
> maybe the newer PCI-E specs now say something different.

Perhaps you're thinking of link retraining?  That sort of error would
be considered correctable, not fatal.  Fatal errors are uncorrected
errors and a bigger hammer is needed to deal with them, such as a link
reset.  Thanks,

Alex
Gavin Shan Dec. 14, 2016, 2:39 a.m. UTC | #10
On Fri, Dec 09, 2016 at 05:50:17PM +1100, Andrew Donnellan wrote:
>On 09/12/16 17:24, Linas Vepstas wrote:
>>I suppose I'm confused, but I recall that link resets are non-fatal.
>>Fatal errors typically require that the the pci adapter be completely
>>reset, any adapter firmware to be reloaded from scratch, the device
>>driver has to kill all device state and start from scratch. Its huge.
>
>Is there a difference in terminology between an AER fatal error and what
>EEH/IBM people think of as a fatal error?
>

They are different things. An AER fatal error can lead to a frozen PE
error, not a fenced PHB error, depending on the PHB configuration.

>>If the fatal error is on pci device that is under a block device
>>holding a file system, then (usually) there is no way to recover,
>>because the block layer (and file system) cannot deal with a block
>>device that disappeared and then reappeared some few seconds later.
>>(maybe some future zfs or lvm or btrfs might be able to deal with
>>this, but not today)
>
>Is this still true? I'm not at all familiar with the block device side of it,
>but the cxlflash driver has reasonably full EEH support, including surviving
>a full PHB fence and complete reset.
>

It's still true, especially when the recovery is going to affect the
rootfs. On completion of error recovery, the driver (if necessary)
and the filesystem need to be reloaded, which depends on a script or
daemon, and those are unavailable in this scenario.

Thanks,
Gavin

Cao jin Dec. 19, 2016, 3:25 a.m. UTC | #11
Sorry for the late reply.

On 12/09/2016 10:37 PM, Jonathan Corbet wrote:
> On Fri, 9 Dec 2016 14:37:47 +0800
> Cao jin <caoj.fnst@cn.fujitsu.com> wrote:
> 
>> I am little confused too, even not sure if we are talking the same
>> *fatal error*, I am talking the fatal error defined in PCI Express spec,
>> chapter 6.2.2.2.1:
> 
> Therein lies my original discomfort with the change; it didn't seem to
> make sense to talk about recovering from a fatal error.  Perhaps making
> it "is done whenever a fatal error (as defined in section 6.2.2.2.1) has
> been detected that can be "solved" by resetting the link" or something
> like that to make it clear how the term is being used?
> 

I find that the .link_reset callback of struct pci_error_handlers isn't
called by anyone (if I didn't miss anything). Only a few drivers
implement this callback, and their implementations seem meaningless.

Also, the reset_link() provided by the AER driver seems to be a different
thing from the .link_reset callback. So I am guessing this patch is
probably not quite suitable, and the doc may need a complete update.
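
For readers who want the distinction spelled out, here is a mock of a
per-driver callback table. The field names mirror the callbacks this
document describes, but the types, demo_link_reset(), and
call_link_reset() are all hypothetical; this is not the kernel header and
not any real driver.

```c
#include <stddef.h>

enum mock_result { MOCK_NONE, MOCK_CAN_RECOVER, MOCK_NEED_RESET, MOCK_RECOVERED };

struct mock_dev {
	int healthy; /* nonzero if the device looks usable after the reset */
};

/* Mock callback table; every entry is optional and may be NULL. */
struct mock_error_handlers {
	enum mock_result (*error_detected)(struct mock_dev *dev);
	enum mock_result (*mmio_enabled)(struct mock_dev *dev);
	enum mock_result (*link_reset)(struct mock_dev *dev);
	enum mock_result (*slot_reset)(struct mock_dev *dev);
	void (*resume)(struct mock_dev *dev);
};

/*
 * A link_reset callback in the spirit of the documentation: the link
 * has already been reset, so only check that the device still works.
 */
static enum mock_result demo_link_reset(struct mock_dev *dev)
{
	return dev->healthy ? MOCK_RECOVERED : MOCK_NEED_RESET;
}

/* Dispatcher: a missing table or callback yields MOCK_NONE. */
static enum mock_result call_link_reset(const struct mock_error_handlers *h,
					struct mock_dev *dev)
{
	if (!h || !h->link_reset)
		return MOCK_NONE;
	return h->link_reset(dev);
}
```

The callback is what a driver would provide; the reset itself is done by
the platform before the callback runs. If nothing ever calls a dispatcher
like call_link_reset(), the callback is dead code, which is the situation
described above.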
Patch

diff --git a/Documentation/PCI/pci-error-recovery.txt b/Documentation/PCI/pci-error-recovery.txt
index ac26869c7db4..fcb29cdbeb1b 100644
--- a/Documentation/PCI/pci-error-recovery.txt
+++ b/Documentation/PCI/pci-error-recovery.txt
@@ -11,7 +11,7 @@ 
 
 Many PCI bus controllers are able to detect a variety of hardware
 PCI errors on the bus, such as parity errors on the data and address
-busses, as well as SERR and PERR errors.  Some of the more advanced
+buses, as well as SERR and PERR errors.  Some of the more advanced
 chipsets are able to deal with these errors; these include PCI-E chipsets,
 and the PCI-host bridges found on IBM Power4, Power5 and Power6-based
 pSeries boxes. A typical action taken is to disconnect the affected device,
@@ -175,7 +175,7 @@  is STEP 6 (Permanent Failure).
 >>> a value of 0xff on read, and writes will be dropped. If more than
 >>> EEH_MAX_FAILS I/O's are attempted to a frozen adapter, EEH
 >>> assumes that the device driver has gone into an infinite loop
->>> and prints an error to syslog.  A reboot is then required to 
+>>> and prints an error to syslog.  A reboot is then required to
 >>> get the device working again.
 
 STEP 2: MMIO Enabled
@@ -234,7 +234,7 @@  STEP 3: Link Reset
 ------------------
 The platform resets the link, and then calls the link_reset() callback
 on all affected device drivers.  This is a PCI-Express specific state
-and is done whenever a non-fatal error has been detected that can be
+and is done whenever a fatal error has been detected that can be
 "solved" by resetting the link. This call informs the driver of the
 reset and the driver should check to see if the device appears to be
 in working condition.
@@ -256,7 +256,7 @@  STEP 4: Slot Reset
 ------------------
 
 In response to a return value of PCI_ERS_RESULT_NEED_RESET, the
-the platform will perform a slot reset on the requesting PCI device(s). 
+the platform will perform a slot reset on the requesting PCI device(s).
 The actual steps taken by a platform to perform a slot reset
 will be platform-dependent. Upon completion of slot reset, the
 platform will call the device slot_reset() callback.
@@ -276,7 +276,7 @@  configuration registers to initialize to their default conditions.
 
 For most PCI devices, a soft reset will be sufficient for recovery.
 Optional fundamental reset is provided to support a limited number
-of PCI Express PCI devices  for which a soft reset is not sufficient
+of PCI Express PCI devices for which a soft reset is not sufficient
 for recovery.
 
 If the platform supports PCI hotplug, then the reset might be
@@ -321,7 +321,7 @@  driver performs device init only from PCI function 0:
 		Same as above.
 
 Drivers for PCI Express cards that require a fundamental reset must
-set the needs_freset bit in the pci_dev structure in their probe function.  
+set the needs_freset bit in the pci_dev structure in their probe function.
 For example, the QLogic qla2xxx driver sets the needs_freset bit for certain
 PCI card types: