mbox series

[RFC,0/2] CXL UE RAS Multiple Header Logging support

Message ID 20230113154011.16205-1-Jonathan.Cameron@huawei.com
Headers show
Series CXL UE RAS Multiple Header Logging support | expand

Message

Jonathan Cameron Jan. 13, 2023, 3:40 p.m. UTC
CXL UE RAS Error reporting allows an EP to report the capability of
recording Multiple Header Logs for uncorrectable errors.
Unlike equivalent feature in PCIe, there is no enable control
for this feature, so a supporting device may be expecting
a more complex software flow than that necessary for devices
that do not support this feature. Documentation of this feature
is sparse, with assumption it works the same as PCIe.

There are hardware implementation choices allowed in the
equivalent PCIe r6.0 base spec section (6.4.2.4) that could
be safely used with the existing code, even with Multiple
Header Recording support but there are others that cannot.

The issue is what happens when the EP is doing Multiple Header
Recording but then the software writes 1 to clear more than one
status bit at the time (PCIe spec warns against doing this
 - but it is what the current kernel code will do):
Option 1)
  It does the nice thing and clears all matching errors.
  Note this is a bit strange for the case where the device
  supports logging multiple instances of a given error - so
  the two can't be combined cleanly.  With that feature
  I can't see how anyone could implement hardware that coped
  cleanly with the wrong software flow.
Option 2)
  It clears only the first error bit leaving a bunch of error
  bits set (note that if it has recorded multiple errors of
  same type it might not even do that). These are sticky
  across resets, so you will probably end up coming back up
  and immediately seeing an error.

So whilst you can design an EP to safe against non MH recording
aware software, it isn't generally the case. As we don't have
an explicit enable on CXL we have to handle anything reporting
the capability in a MH safe fashion.

This feature was developed against emulation in QEMU.
The relevant patches have not yet been posted but can be found on
https://gitlab.com/jic23/qemu/-/commits/cxl-2023-01-11
along with description of how to inject errors in the patch
descriptions.  I'll post them for review for QEMU inclusion
shortly.

RFC simply because the lack of specification detail means I am
less sure on this code than I would normally be. Unfortunately it
could be argued that the first patch is a fix for the
current upstream CXL RAS support.  If we want a simpler fix
one option would be to just fail to enable RAS support if
Multiple Header recording capability bit is set.  Or we
decide that it doesn't matter for now and add support for this
feature via the normal merge cycle.

Second patch is just there to make this easier to test as
no additional software is needed to print the header log.

Base is rather messy due to a clash between multiple cxl tree
branches.
cxl/fixes with the trace move on cxl/next cherry picked on top
as it moves the code that was fixed.

Jonathan Cameron (2):
  cxl: RAS: Multiple header recording support
  cxl: Add tprintk support for header log hex dump

 drivers/cxl/core/pci.c   | 17 ++++++++++++-----
 drivers/cxl/core/trace.h |  7 +++++--
 drivers/cxl/cxl.h        |  1 +
 3 files changed, 18 insertions(+), 7 deletions(-)

Comments

Jonathan Cameron Jan. 13, 2023, 3:53 p.m. UTC | #1
On Fri, 13 Jan 2023 15:40:09 +0000
Jonathan Cameron <Jonathan.Cameron@huawei.com> wrote:

Missed Dave Jiang off cc so resent
(I thought I'd hit cancel fast enough but apparently not)
Sorry for the noise!

> CXL UE RAS Error reporting allows an EP to report the capability of
> recording Multiple Header Logs for uncorrectable errors.
> Unlike equivalent feature in PCIe, there is no enable control
> for this feature, so a supporting device may be expecting
> a more complex software flow than that necessary for devices
> that do not support this feature. Documentation of this feature
> is sparse, with assumption it works the same as PCIe.
> 
> There are hardware implementation choices allowed in the
> equivalent PCIe r6.0 base spec section (6.4.2.4) that could
> be safely used with the existing code, even with Multiple
> Header Recording support but there are others that cannot.
> 
> The issue is what happens when the EP is doing Multiple Header
> Recording but then the software writes 1 to clear more than one
> status bit at the time (PCIe spec warns against doing this
>  - but it is what the current kernel code will do):
> Option 1)
>   It does the nice thing and clears all matching errors.
>   Note this is a bit strange for the case where the device
>   supports logging multiple instances of a given error - so
>   the two can't be combined cleanly.  With that feature
>   I can't see how anyone could implement hardware that coped
>   cleanly with the wrong software flow.
> Option 2)
>   It clears only the first error bit leaving a bunch of error
>   bits set (note that if it has recorded multiple errors of
>   same type it might not even do that). These are sticky
>   across resets, so you will probably end up coming back up
>   and immediately seeing an error.
> 
> So whilst you can design an EP to safe against non MH recording
> aware software, it isn't generally the case. As we don't have
> an explicit enable on CXL we have to handle anything reporting
> the capability in a MH safe fashion.
> 
> This feature was developed against emulation in QEMU.
> The relevant patches have not yet been posted but can be found on
> https://gitlab.com/jic23/qemu/-/commits/cxl-2023-01-11
> along with description of how to inject errors in the patch
> descriptions.  I'll post them for review for QEMU inclusion
> shortly.
> 
> RFC simply because the lack of specification detail means I am
> less sure on this code than I would normally be. Unfortunately it
> could be argued that the first patch is a fix for the
> current upstream CXL RAS support.  If we want a simpler fix
> one option would be to just fail to enable RAS support if
> Multiple Header recording capability bit is set.  Or we
> decide that it doesn't matter for now and add support for this
> feature via the normal merge cycle.
> 
> Second patch is just there to make this easier to test as
> no additional software is needed to print the header log.
> 
> Base is rather messy due to a clash between multiple cxl tree
> branches.
> cxl/fixes with the trace move on cxl/next cherry picked on top
> as it moves the code that was fixed.
> 
> Jonathan Cameron (2):
>   cxl: RAS: Multiple header recording support
>   cxl: Add tprintk support for header log hex dump
> 
>  drivers/cxl/core/pci.c   | 17 ++++++++++++-----
>  drivers/cxl/core/trace.h |  7 +++++--
>  drivers/cxl/cxl.h        |  1 +
>  3 files changed, 18 insertions(+), 7 deletions(-)
>
Dave Jiang Jan. 17, 2023, 5:36 p.m. UTC | #2
On 1/13/23 8:53 AM, Jonathan Cameron wrote:
> On Fri, 13 Jan 2023 15:40:09 +0000
> Jonathan Cameron <Jonathan.Cameron@huawei.com> wrote:
> 
> Missed Dave Jiang off cc so resent
> (I thought I'd hit cancel fast enough but apparently not)
> Sorry for the noise!

No worries. I see them from the list either way.

> 
>> CXL UE RAS Error reporting allows an EP to report the capability of
>> recording Multiple Header Logs for uncorrectable errors.
>> Unlike equivalent feature in PCIe, there is no enable control
>> for this feature, so a supporting device may be expecting
>> a more complex software flow than that necessary for devices
>> that do not support this feature. Documentation of this feature
>> is sparse, with assumption it works the same as PCIe.
>>
>> There are hardware implementation choices allowed in the
>> equivalent PCIe r6.0 base spec section (6.4.2.4) that could
>> be safely used with the existing code, even with Multiple
>> Header Recording support but there are others that cannot.
>>
>> The issue is what happens when the EP is doing Multiple Header
>> Recording but then the software writes 1 to clear more than one
>> status bit at the time (PCIe spec warns against doing this
>>   - but it is what the current kernel code will do):
>> Option 1)
>>    It does the nice thing and clears all matching errors.
>>    Note this is a bit strange for the case where the device
>>    supports logging multiple instances of a given error - so
>>    the two can't be combined cleanly.  With that feature
>>    I can't see how anyone could implement hardware that coped
>>    cleanly with the wrong software flow.
>> Option 2)
>>    It clears only the first error bit leaving a bunch of error
>>    bits set (note that if it has recorded multiple errors of
>>    same type it might not even do that). These are sticky
>>    across resets, so you will probably end up coming back up
>>    and immediately seeing an error.
>>
>> So whilst you can design an EP to safe against non MH recording
>> aware software, it isn't generally the case. As we don't have
>> an explicit enable on CXL we have to handle anything reporting
>> the capability in a MH safe fashion.
>>
>> This feature was developed against emulation in QEMU.
>> The relevant patches have not yet been posted but can be found on
>> https://gitlab.com/jic23/qemu/-/commits/cxl-2023-01-11
>> along with description of how to inject errors in the patch
>> descriptions.  I'll post them for review for QEMU inclusion
>> shortly.
>>
>> RFC simply because the lack of specification detail means I am
>> less sure on this code than I would normally be. Unfortunately it
>> could be argued that the first patch is a fix for the
>> current upstream CXL RAS support.  If we want a simpler fix
>> one option would be to just fail to enable RAS support if
>> Multiple Header recording capability bit is set.  Or we
>> decide that it doesn't matter for now and add support for this
>> feature via the normal merge cycle.
>>
>> Second patch is just there to make this easier to test as
>> no additional software is needed to print the header log.
>>
>> Base is rather messy due to a clash between multiple cxl tree
>> branches.
>> cxl/fixes with the trace move on cxl/next cherry picked on top
>> as it moves the code that was fixed.
>>
>> Jonathan Cameron (2):
>>    cxl: RAS: Multiple header recording support
>>    cxl: Add tprintk support for header log hex dump
>>
>>   drivers/cxl/core/pci.c   | 17 ++++++++++++-----
>>   drivers/cxl/core/trace.h |  7 +++++--
>>   drivers/cxl/cxl.h        |  1 +
>>   3 files changed, 18 insertions(+), 7 deletions(-)
>>
>