Message ID | 20230113154011.16205-1-Jonathan.Cameron@huawei.com |
---|---|
Headers | show |
Series | CXL UE RAS Multiple Header Logging support | expand |
On Fri, 13 Jan 2023 15:40:09 +0000 Jonathan Cameron <Jonathan.Cameron@huawei.com> wrote: Missed Dave Jiang off cc so resent (I thought I'd hit cancel fast enough but apparently not) Sorry for the noise! > CXL UE RAS Error reporting allows an EP to report the capability of > recording Multiple Header Logs for uncorrectable errors. > Unlike equivalent feature in PCIe, there is no enable control > for this feature, so a supporting device may be expecting > a more complex software flow than that necessary for devices > that do not support this feature. Documentation of this feature > is sparse, with assumption it works the same as PCIe. > > There are hardware implementation choices allowed in the > equivalent PCIe r6.0 base spec section (6.4.2.4) that could > be safely used with the existing code, even with Multiple > Header Recording support but there are others that cannot. > > The issue is what happens when the EP is doing Multiple Header > Recording but then the software writes 1 to clear more than one > status bit at the time (PCIe spec warns against doing this > - but it is what the current kernel code will do): > Option 1) > It does the nice thing and clears all matching errors. > Note this is a bit strange for the case where the device > supports logging multiple instances of a given error - so > the two can't be combined cleanly. With that feature > I can't see how anyone could implement hardware that coped > cleanly with the wrong software flow. > Option 2) > It clears only the first error bit leaving a bunch of error > bits set (note that if it has recorded multiple errors of > same type it might not even do that). These are sticky > across resets, so you will probably end up coming back up > and immediately seeing an error. > > So whilst you can design an EP to safe against non MH recording > aware software, it isn't generally the case. As we don't have > an explicit enable on CXL we have to handle anything reporting > the capability in a MH safe fashion. > > This feature was developed against emulation in QEMU. > The relevant patches have not yet been posted but can be found on > https://gitlab.com/jic23/qemu/-/commits/cxl-2023-01-11 > along with description of how to inject errors in the patch > descriptions. I'll post them for review for QEMU inclusion > shortly. > > RFC simply because the lack of specification detail means I am > less sure on this code than I would normally be. Unfortunately it > could be argued that the first patch is a fix for the > current upstream CXL RAS support. If we want a simpler fix > one option would be to just fail to enable RAS support if > Multiple Header recording capability bit is set. Or we > decide that it doesn't matter for now and add support for this > feature via the normal merge cycle. > > Second patch is just there to make this easier to test as > no additional software is needed to print the header log. > > Base is rather messy due to a clash between multiple cxl tree > branches. > cxl/fixes with the trace move on cxl/next cherry picked on top > as it moves the code that was fixed. > > Jonathan Cameron (2): > cxl: RAS: Multiple header recording support > cxl: Add tprintk support for header log hex dump > > drivers/cxl/core/pci.c | 17 ++++++++++++----- > drivers/cxl/core/trace.h | 7 +++++-- > drivers/cxl/cxl.h | 1 + > 3 files changed, 18 insertions(+), 7 deletions(-) >
On 1/13/23 8:53 AM, Jonathan Cameron wrote: > On Fri, 13 Jan 2023 15:40:09 +0000 > Jonathan Cameron <Jonathan.Cameron@huawei.com> wrote: > > Missed Dave Jiang off cc so resent > (I thought I'd hit cancel fast enough but apparently not) > Sorry for the noise! No worries. I see them from the list either way. > >> CXL UE RAS Error reporting allows an EP to report the capability of >> recording Multiple Header Logs for uncorrectable errors. >> Unlike equivalent feature in PCIe, there is no enable control >> for this feature, so a supporting device may be expecting >> a more complex software flow than that necessary for devices >> that do not support this feature. Documentation of this feature >> is sparse, with assumption it works the same as PCIe. >> >> There are hardware implementation choices allowed in the >> equivalent PCIe r6.0 base spec section (6.4.2.4) that could >> be safely used with the existing code, even with Multiple >> Header Recording support but there are others that cannot. >> >> The issue is what happens when the EP is doing Multiple Header >> Recording but then the software writes 1 to clear more than one >> status bit at the time (PCIe spec warns against doing this >> - but it is what the current kernel code will do): >> Option 1) >> It does the nice thing and clears all matching errors. >> Note this is a bit strange for the case where the device >> supports logging multiple instances of a given error - so >> the two can't be combined cleanly. With that feature >> I can't see how anyone could implement hardware that coped >> cleanly with the wrong software flow. >> Option 2) >> It clears only the first error bit leaving a bunch of error >> bits set (note that if it has recorded multiple errors of >> same type it might not even do that). These are sticky >> across resets, so you will probably end up coming back up >> and immediately seeing an error. >> >> So whilst you can design an EP to safe against non MH recording >> aware software, it isn't generally the case. As we don't have >> an explicit enable on CXL we have to handle anything reporting >> the capability in a MH safe fashion. >> >> This feature was developed against emulation in QEMU. >> The relevant patches have not yet been posted but can be found on >> https://gitlab.com/jic23/qemu/-/commits/cxl-2023-01-11 >> along with description of how to inject errors in the patch >> descriptions. I'll post them for review for QEMU inclusion >> shortly. >> >> RFC simply because the lack of specification detail means I am >> less sure on this code than I would normally be. Unfortunately it >> could be argued that the first patch is a fix for the >> current upstream CXL RAS support. If we want a simpler fix >> one option would be to just fail to enable RAS support if >> Multiple Header recording capability bit is set. Or we >> decide that it doesn't matter for now and add support for this >> feature via the normal merge cycle. >> >> Second patch is just there to make this easier to test as >> no additional software is needed to print the header log. >> >> Base is rather messy due to a clash between multiple cxl tree >> branches. >> cxl/fixes with the trace move on cxl/next cherry picked on top >> as it moves the code that was fixed. >> >> Jonathan Cameron (2): >> cxl: RAS: Multiple header recording support >> cxl: Add tprintk support for header log hex dump >> >> drivers/cxl/core/pci.c | 17 ++++++++++++----- >> drivers/cxl/core/trace.h | 7 +++++-- >> drivers/cxl/cxl.h | 1 + >> 3 files changed, 18 insertions(+), 7 deletions(-) >> >