Message ID | 20121130213330.7649.51101.stgit@grignak.americas.hpqcorp.net (mailing list archive) |
---|---|
State | New, archived |
Delegated to: | Bjorn Helgaas |
Headers | show |
On Fri, Nov 30, 2012 at 02:33:30PM -0700, Lance Ortiz wrote: > This header file will define a new trace event that will be triggered when > a AER event occurs. The following data will be provided to the trace > event. > > char * name - String containing the device path Can we explain that device path more, even with an example? Something like "position of the device in the PCI hierarchy, i.e. 0000:05:00.0" > u32 status - Either the correctable or uncorrectable register > indicating what error or errors have been see. seen > u8 severity - error severity 0:NONFATAL 1:FATAL 2:CORRECTED > > The trace event will also provide a trace string that may look like: > > "0000:05:00.0 PCIe Bus Error:severity=Uncorrected (Non-Fatal), Poisoned > TLP" > > v1-v2 Move header from include/ras/aer_event.h to > include/trace/events/ras.h > > Signed-off-by: Lance Ortiz <lance.ortiz@hp.com> > --- > > include/trace/events/ras.h | 77 ++++++++++++++++++++++++++++++++++++++++++++ > 1 files changed, 77 insertions(+), 0 deletions(-) > create mode 100644 include/trace/events/ras.h > > diff --git a/include/trace/events/ras.h b/include/trace/events/ras.h > new file mode 100644 > index 0000000..f77d009 > --- /dev/null > +++ b/include/trace/events/ras.h > @@ -0,0 +1,77 @@ > +#undef TRACE_SYSTEM > +#define TRACE_SYSTEM aer_event > +#define TRACE_INCLUDE_FILE ras Applying: aerdrv: Trace Event for AER /home/boris/kernel/linux-2.6/.git/rebase-apply/patch:15: trailing whitespace. #define TRACE_INCLUDE_FILE ras warning: 1 line adds whitespace errors. > + > +#if !defined(_TRACE_AER_H) || defined(TRACE_HEADER_MULTI_READ) > +#define _TRACE_AER_H > + > +#include <linux/tracepoint.h> > +#include <linux/edac.h> > + > + > +/* > + * Anhance Error Reporting (AER) PCIE Report Error PCIE Advanced Error Reporting (AER) tracepoint > + * > + * These events are generated when hardware detects a corrected or > + * uncorrected event on a pci express device and reports PCIE device. (no need for "and reports errors") > + * errors. The event reports the following data. The event record has the following structure: > + * > + * char * dev_name - String containing the device identification > + * u32 status - Either the correctable or uncorrectable register > + * indicating what error or errors have been seen > + * u8 severity - error severity 0:NONFATAL 1:FATAL 2:CORRECTED > + */ > + > +#define correctable_error_string \ > + {BIT(0), "Receiver Error"}, \ > + {BIT(6), "Bad TLP"}, \ > + {BIT(7), "Bad DLLP"}, \ > + {BIT(8), "RELAY_NUM Rollover"}, \ > + {BIT(12), "Replay Timer Timeout"}, \ > + {BIT(13), "Advisory Non-Fatal"} > + > +#define uncorrectable_error_string \ > + {BIT(4), "Data Link Protocol"}, \ > + {BIT(12), "Poisoned TLP"}, \ > + {BIT(13), "Flow Control Protocol"}, \ > + {BIT(14), "Completion Timeout"}, \ > + {BIT(15), "Completer Abort"}, \ > + {BIT(16), "Unexpected Completion"}, \ > + {BIT(17), "Receiver Overflow"}, \ > + {BIT(18), "Malformed TLP"}, \ > + {BIT(19), "ECRC"}, \ > + {BIT(20), "Unsupported Request"} I'm wondering whether something else PCIe could use those error types but I guess they'll pick them up from here if needed. > + > +TRACE_EVENT(aer_event, > + TP_PROTO(const char *dev_name, > + const u32 status, > + const u8 severity), > + > + TP_ARGS(dev_name, status, severity), > + > + TP_STRUCT__entry( > + __string( dev_name, dev_name ) > + __field( u32, status ) > + __field( u8, severity ) > + ), > + > + TP_fast_assign( > + __assign_str(dev_name, dev_name); > + __entry->status = status; > + __entry->severity = severity; > + ), > + > + TP_printk("%s PCIe Bus Error: severity=%s, %s\n", > + __get_str(dev_name), > + (__entry->severity == HW_EVENT_ERR_CORRECTED) ? "Corrected" : > + ((__entry->severity == HW_EVENT_ERR_FATAL) ? > + "Fatal" : "Uncorrected"), > + __entry->severity == HW_EVENT_ERR_CORRECTED ? > + __print_flags(__entry->status, "|", correctable_error_string) : > + __print_flags(__entry->status, "|", uncorrectable_error_string)) > +); > + > +#endif /* _TRACE_AER_H */ > + > +/* This part must be outside protection */ > +#include <trace/define_trace.h> Apart from the minor issues above: Acked-by: Borislav Petkov <bp@alien8.de> Thanks.
Em Fri, 30 Nov 2012 14:33:30 -0700 Lance Ortiz <lance.ortiz@hp.com> escreveu: > This header file will define a new trace event that will be triggered when > a AER event occurs. The following data will be provided to the trace > event. > > char * name - String containing the device path You renamed it to dev_name. Please fix it at the commit comments. > > u32 status - Either the correctable or uncorrectable register > indicating what error or errors have been see. > > u8 severity - error severity 0:NONFATAL 1:FATAL 2:CORRECTED > > The trace event will also provide a trace string that may look like: > > "0000:05:00.0 PCIe Bus Error:severity=Uncorrected (Non-Fatal), Poisoned > TLP" > > v1-v2 Move header from include/ras/aer_event.h to > include/trace/events/ras.h Please don't call it as "ras.h". Call it, instead, ras_aer.h (or something similar) as, if we're moving the tracing back to include/trace/events/, the same should happen for the memory error events. > > Signed-off-by: Lance Ortiz <lance.ortiz@hp.com> > --- > > include/trace/events/ras.h | 77 ++++++++++++++++++++++++++++++++++++++++++++ > 1 files changed, 77 insertions(+), 0 deletions(-) > create mode 100644 include/trace/events/ras.h > > diff --git a/include/trace/events/ras.h b/include/trace/events/ras.h > new file mode 100644 > index 0000000..f77d009 > --- /dev/null > +++ b/include/trace/events/ras.h > @@ -0,0 +1,77 @@ > +#undef TRACE_SYSTEM > +#define TRACE_SYSTEM aer_event > +#define TRACE_INCLUDE_FILE ras > + > +#if !defined(_TRACE_AER_H) || defined(TRACE_HEADER_MULTI_READ) > +#define _TRACE_AER_H > + > +#include <linux/tracepoint.h> > +#include <linux/edac.h> > + > + > +/* > + * Anhance Error Reporting (AER) PCIE Report Error > + * > + * These events are generated when hardware detects a corrected or > + * uncorrected event on a pci express device and reports > + * errors. The event reports the following data. > + * > + * char * dev_name - String containing the device identification > + * u32 status - Either the correctable or uncorrectable register > + * indicating what error or errors have been seen > + * u8 severity - error severity 0:NONFATAL 1:FATAL 2:CORRECTED > + */ > + > +#define correctable_error_string \ > + {BIT(0), "Receiver Error"}, \ > + {BIT(6), "Bad TLP"}, \ > + {BIT(7), "Bad DLLP"}, \ > + {BIT(8), "RELAY_NUM Rollover"}, \ > + {BIT(12), "Replay Timer Timeout"}, \ > + {BIT(13), "Advisory Non-Fatal"} Hmm... isn't something missing here? I'm seeing more bits defined at the PCIe V3.0 spec for Offset 10h: bit 14 - Corrected Internal Error bit 15 - Header Log Overflow > +#define uncorrectable_error_string \ > + {BIT(4), "Data Link Protocol"}, \ > + {BIT(12), "Poisoned TLP"}, \ > + {BIT(13), "Flow Control Protocol"}, \ > + {BIT(14), "Completion Timeout"}, \ > + {BIT(15), "Completer Abort"}, \ > + {BIT(16), "Unexpected Completion"}, \ > + {BIT(17), "Receiver Overflow"}, \ > + {BIT(18), "Malformed TLP"}, \ > + {BIT(19), "ECRC"}, \ > + {BIT(20), "Unsupported Request"} Hmm... isn't something missing here? I'm seeing more bits defined at the PCIe V3.0 spec for Offset 04h: bit 5 - Surprise Down Error bit 21 - ACS Violation bit 22 - Uncorrectable Internal Error bit 23 - MC Blocked TLP bit 24 - AtomicOp Egress Blocked bit 25 - TLP Prefix Blocked Error > + > +TRACE_EVENT(aer_event, > + TP_PROTO(const char *dev_name, > + const u32 status, > + const u8 severity), > + > + TP_ARGS(dev_name, status, severity), > + > + TP_STRUCT__entry( > + __string( dev_name, dev_name ) > + __field( u32, status ) > + __field( u8, severity ) > + ), > + > + TP_fast_assign( > + __assign_str(dev_name, dev_name); > + __entry->status = status; > + __entry->severity = severity; > + ), > + > + TP_printk("%s PCIe Bus Error: severity=%s, %s\n", > + __get_str(dev_name), > + (__entry->severity == HW_EVENT_ERR_CORRECTED) ? "Corrected" : > + ((__entry->severity == HW_EVENT_ERR_FATAL) ? > + "Fatal" : "Uncorrected"), > + __entry->severity == HW_EVENT_ERR_CORRECTED ? > + __print_flags(__entry->status, "|", correctable_error_string) : > + __print_flags(__entry->status, "|", uncorrectable_error_string)) > +); > + > +#endif /* _TRACE_AER_H */ > + > +/* This part must be outside protection */ > +#include <trace/define_trace.h> >
On Sat, Dec 01, 2012 at 09:36:14AM -0200, Mauro Carvalho Chehab wrote: > Please don't call it as "ras.h". Call it, instead, ras_aer.h > (or something similar) as, if we're moving the tracing back to > include/trace/events/, the same should happen for the memory error > events. As you can see yourself, include/trace/events/ contains one header per topic so having ras_event.h and ras_aer.h or whatever, doesn't fit the scheme. IOW, we want all RAS-specific tracepoints to be collected in a header called ras.h like the rest of the subsystems do it. See rcu.h there for a good example.
Cheers, Mauro On Sat, 1 Dec 2012, Borislav Petkov wrote: > On Sat, Dec 01, 2012 at 09:36:14AM -0200, Mauro Carvalho Chehab wrote: >> Please don't call it as "ras.h". Call it, instead, ras_aer.h >> (or something similar) as, if we're moving the tracing back to >> include/trace/events/, the same should happen for the memory error >> events. > > As you can see yourself, include/trace/events/ contains one header per > topic so having ras_event.h and ras_aer.h or whatever, doesn't fit the > scheme. IOW, we want all RAS-specific tracepoints to be collected in a > header called ras.h like the rest of the subsystems do it. See rcu.h > there for a good example. Works for me. Regards, Mauro -- To unsubscribe from this list: send the line "unsubscribe linux-pci" in the body of a message to majordomo@vger.kernel.org More majordomo info at http://vger.kernel.org/majordomo-info.html
> > +#define correctable_error_string \ > > + {BIT(0), "Receiver Error"}, \ > > + {BIT(6), "Bad TLP"}, \ > > + {BIT(7), "Bad DLLP"}, \ > > + {BIT(8), "RELAY_NUM Rollover"}, \ > > + {BIT(12), "Replay Timer Timeout"}, \ > > + {BIT(13), "Advisory Non-Fatal"} > > Hmm... isn't something missing here? I'm seeing more bits defined at > the > PCIe V3.0 spec for Offset 10h: > > bit 14 - Corrected Internal Error > bit 15 - Header Log Overflow > > > +#define uncorrectable_error_string \ > > + {BIT(4), "Data Link Protocol"}, \ > > + {BIT(12), "Poisoned TLP"}, \ > > + {BIT(13), "Flow Control Protocol"}, \ > > + {BIT(14), "Completion Timeout"}, \ > > + {BIT(15), "Completer Abort"}, \ > > + {BIT(16), "Unexpected Completion"}, \ > > + {BIT(17), "Receiver Overflow"}, \ > > + {BIT(18), "Malformed TLP"}, \ > > + {BIT(19), "ECRC"}, \ > > + {BIT(20), "Unsupported Request"} > > Hmm... isn't something missing here? I'm seeing more bits defined at > the > PCIe V3.0 spec for Offset 04h: > > bit 5 - Surprise Down Error > bit 21 - ACS Violation > bit 22 - Uncorrectable Internal Error > bit 23 - MC Blocked TLP > bit 24 - AtomicOp Egress Blocked > bit 25 - TLP Prefix Blocked Error > I used the same errors defined in the string arrays at the top of aerdrv_errprint.c. I am not sure why they were left out in that file. I will investigate and probably add them as a later patch and then include them in aerdrv_errprint.c also. Sound good? Lance -- To unsubscribe from this list: send the line "unsubscribe linux-pci" in the body of a message to majordomo@vger.kernel.org More majordomo info at http://vger.kernel.org/majordomo-info.html
On Mon, Dec 03, 2012 at 08:01:46PM +0000, Ortiz, Lance E wrote: > > > +#define correctable_error_string \ > > > + {BIT(0), "Receiver Error"}, \ > > > + {BIT(6), "Bad TLP"}, \ > > > + {BIT(7), "Bad DLLP"}, \ > > > + {BIT(8), "RELAY_NUM Rollover"}, \ > > > + {BIT(12), "Replay Timer Timeout"}, \ > > > + {BIT(13), "Advisory Non-Fatal"} > > > > Hmm... isn't something missing here? I'm seeing more bits defined at > > the > > PCIe V3.0 spec for Offset 10h: > > > > bit 14 - Corrected Internal Error > > bit 15 - Header Log Overflow > > > > > +#define uncorrectable_error_string \ > > > + {BIT(4), "Data Link Protocol"}, \ > > > + {BIT(12), "Poisoned TLP"}, \ > > > + {BIT(13), "Flow Control Protocol"}, \ > > > + {BIT(14), "Completion Timeout"}, \ > > > + {BIT(15), "Completer Abort"}, \ > > > + {BIT(16), "Unexpected Completion"}, \ > > > + {BIT(17), "Receiver Overflow"}, \ > > > + {BIT(18), "Malformed TLP"}, \ > > > + {BIT(19), "ECRC"}, \ > > > + {BIT(20), "Unsupported Request"} > > > > Hmm... isn't something missing here? I'm seeing more bits defined at > > the > > PCIe V3.0 spec for Offset 04h: > > > > bit 5 - Surprise Down Error > > bit 21 - ACS Violation > > bit 22 - Uncorrectable Internal Error > > bit 23 - MC Blocked TLP > > bit 24 - AtomicOp Egress Blocked > > bit 25 - TLP Prefix Blocked Error > > > I used the same errors defined in the string arrays at the top of > aerdrv_errprint.c. I am not sure why they were left out in that file. > I will investigate and probably add them as a later patch and then > include them in aerdrv_errprint.c also. Maybe we should ask Yanmin who added those strings in 6c2b374d74857e892080ee726184ec1d15e7d4e4; CCed. Also, it would make a lot of sense to have those string definitions at one place and then reuse them instead of define them again in the tracepoint header and they get out of sync and have needless duplication in the kernel, etc, etc. So, instead, you could define some macros which can generate your strings above like this: #define aer_uncor_err_str(bitnum) \ { BIT(bitnum), aer_uncorrectable_error_string(bitnum) } and use that macro for the __print_flags flag_array argument. Or something to that effect. Even better would it be if the error strings in <drivers/pci/pcie/aer/aerdrv_errprint.c> could be shared with the tracepoint ones. That would require a bit more changes though but something like using an array of trace_print_flags instead an array of strings could be doable. Then, when you need the string, you do uncor_err_array[i].name and so on. Thanks.
PiANCj4gTWF5YmUgd2Ugc2hvdWxkIGFzayBZYW5taW4gd2hvIGFkZGVkIHRob3NlIHN0cmluZ3Mg aW4NCj4gNmMyYjM3NGQ3NDg1N2U4OTIwODBlZTcyNjE4NGVjMWQxNWU3ZDRlNDsgQ0NlZC4NCj4g DQo+IEFsc28sIGl0IHdvdWxkIG1ha2UgYSBsb3Qgb2Ygc2Vuc2UgdG8gaGF2ZSB0aG9zZSBzdHJp bmcgZGVmaW5pdGlvbnMNCj4gYXQgb25lIHBsYWNlIGFuZCB0aGVuIHJldXNlIHRoZW0gaW5zdGVh ZCBvZiBkZWZpbmUgdGhlbSBhZ2FpbiBpbiB0aGUNCj4gdHJhY2Vwb2ludCBoZWFkZXIgYW5kIHRo ZXkgZ2V0IG91dCBvZiBzeW5jIGFuZCBoYXZlIG5lZWRsZXNzDQo+IGR1cGxpY2F0aW9uDQo+IGlu IHRoZSBrZXJuZWwsIGV0YywgZXRjLg0KPiANCj4gU28sIGluc3RlYWQsIHlvdSBjb3VsZCBkZWZp bmUgc29tZSBtYWNyb3Mgd2hpY2ggY2FuIGdlbmVyYXRlIHlvdXINCj4gc3RyaW5ncyBhYm92ZSBs aWtlIHRoaXM6DQo+IA0KPiAjZGVmaW5lIGFlcl91bmNvcl9lcnJfc3RyKGJpdG51bSkJXA0KPiAJ eyBCSVQoYml0bnVtKSwgYWVyX3VuY29ycmVjdGFibGVfZXJyb3Jfc3RyaW5nKGJpdG51bSkgfQ0K PiANCj4gYW5kIHVzZSB0aGF0IG1hY3JvIGZvciB0aGUgX19wcmludF9mbGFncyBmbGFnX2FycmF5 IGFyZ3VtZW50Lg0KPiANCj4gT3Igc29tZXRoaW5nIHRvIHRoYXQgZWZmZWN0Lg0KPiANCj4gRXZl biBiZXR0ZXIgd291bGQgaXQgYmUgaWYgdGhlIGVycm9yIHN0cmluZ3MgaW4NCj4gPGRyaXZlcnMv cGNpL3BjaWUvYWVyL2FlcmRydl9lcnJwcmludC5jPiBjb3VsZCBiZSBzaGFyZWQgd2l0aCB0aGUN Cj4gdHJhY2Vwb2ludCBvbmVzLiBUaGF0IHdvdWxkIHJlcXVpcmUgYSBiaXQgbW9yZSBjaGFuZ2Vz IHRob3VnaCBidXQNCj4gc29tZXRoaW5nIGxpa2UgdXNpbmcgYW4gYXJyYXkgb2YgdHJhY2VfcHJp bnRfZmxhZ3MgaW5zdGVhZCBhbiBhcnJheSBvZg0KPiBzdHJpbmdzIGNvdWxkIGJlIGRvYWJsZS4N Cj4gDQo+IFRoZW4sIHdoZW4geW91IG5lZWQgdGhlIHN0cmluZywgeW91IGRvIHVuY29yX2Vycl9h cnJheVtpXS5uYW1lIGFuZCBzbw0KPiBvbi4NCg0KVGhpcyBhIGZpbmUgaWRlYS4gIEknbSB0aGlu a2luZyB3ZSBtaWdodCB3YW50IHRvIGltcGxlbWVudCB0aGF0IGluIGEgbGF0ZXIgcGF0Y2ggc2lu Y2UgaXQgaXMgYSBiaWdnZXIgY2hhbmdlIGFuZCB3ZSBzdGlsbCBuZWVkIHRvIHNlZSBhYm91dCBh ZGRpbmcgYWRkaXRpb25hbCBzdHJpbmdzLiAgSSBjYW4gc2VuZCBvdXQgb25lIG1vcmUgdmVyc2lv biBvZiB0aGUgcGF0Y2ggd2l0aCB0aGUgY29tbWVudCBoZWFkZXIgY2hhbmdlcyB5b3Ugc3VnZ2Vz dGVkLCBhbmQgdGhlIGNvbW1lbnRzIEpvZSBoYWQgYW5kIG1heWJlIHdlIGNhbiBtb3ZlIGZvcndh cmQgd2l0aCB0aGF0Lg0KDQpMYW5jZSANCg0K -- To unsubscribe from this list: send the line "unsubscribe linux-pci" in the body of a message to majordomo@vger.kernel.org More majordomo info at http://vger.kernel.org/majordomo-info.html
On Mon, Dec 03, 2012 at 09:13:49PM +0000, Ortiz, Lance E wrote: > This a fine idea. I'm thinking we might want to implement that in a > later patch since it is a bigger change and we still need to see about > adding additional strings. Yes, definitely in a later patch or more depending on how much changes are needed (haven't looked closely at the whole deal). > I can send out one more version of the patch with the comment header > changes you suggested, and the comments Joe had and maybe we can move > forward with that. Sure, thanks.
diff --git a/include/trace/events/ras.h b/include/trace/events/ras.h new file mode 100644 index 0000000..f77d009 --- /dev/null +++ b/include/trace/events/ras.h @@ -0,0 +1,77 @@ +#undef TRACE_SYSTEM +#define TRACE_SYSTEM aer_event +#define TRACE_INCLUDE_FILE ras + +#if !defined(_TRACE_AER_H) || defined(TRACE_HEADER_MULTI_READ) +#define _TRACE_AER_H + +#include <linux/tracepoint.h> +#include <linux/edac.h> + + +/* + * Anhance Error Reporting (AER) PCIE Report Error + * + * These events are generated when hardware detects a corrected or + * uncorrected event on a pci express device and reports + * errors. The event reports the following data. + * + * char * dev_name - String containing the device identification + * u32 status - Either the correctable or uncorrectable register + * indicating what error or errors have been seen + * u8 severity - error severity 0:NONFATAL 1:FATAL 2:CORRECTED + */ + +#define correctable_error_string \ + {BIT(0), "Receiver Error"}, \ + {BIT(6), "Bad TLP"}, \ + {BIT(7), "Bad DLLP"}, \ + {BIT(8), "RELAY_NUM Rollover"}, \ + {BIT(12), "Replay Timer Timeout"}, \ + {BIT(13), "Advisory Non-Fatal"} + +#define uncorrectable_error_string \ + {BIT(4), "Data Link Protocol"}, \ + {BIT(12), "Poisoned TLP"}, \ + {BIT(13), "Flow Control Protocol"}, \ + {BIT(14), "Completion Timeout"}, \ + {BIT(15), "Completer Abort"}, \ + {BIT(16), "Unexpected Completion"}, \ + {BIT(17), "Receiver Overflow"}, \ + {BIT(18), "Malformed TLP"}, \ + {BIT(19), "ECRC"}, \ + {BIT(20), "Unsupported Request"} + +TRACE_EVENT(aer_event, + TP_PROTO(const char *dev_name, + const u32 status, + const u8 severity), + + TP_ARGS(dev_name, status, severity), + + TP_STRUCT__entry( + __string( dev_name, dev_name ) + __field( u32, status ) + __field( u8, severity ) + ), + + TP_fast_assign( + __assign_str(dev_name, dev_name); + __entry->status = status; + __entry->severity = severity; + ), + + TP_printk("%s PCIe Bus Error: severity=%s, %s\n", + __get_str(dev_name), + (__entry->severity == HW_EVENT_ERR_CORRECTED) ? "Corrected" : + ((__entry->severity == HW_EVENT_ERR_FATAL) ? + "Fatal" : "Uncorrected"), + __entry->severity == HW_EVENT_ERR_CORRECTED ? + __print_flags(__entry->status, "|", correctable_error_string) : + __print_flags(__entry->status, "|", uncorrectable_error_string)) +); + +#endif /* _TRACE_AER_H */ + +/* This part must be outside protection */ +#include <trace/define_trace.h>
This header file will define a new trace event that will be triggered when a AER event occurs. The following data will be provided to the trace event. char * name - String containing the device path u32 status - Either the correctable or uncorrectable register indicating what error or errors have been see. u8 severity - error severity 0:NONFATAL 1:FATAL 2:CORRECTED The trace event will also provide a trace string that may look like: "0000:05:00.0 PCIe Bus Error:severity=Uncorrected (Non-Fatal), Poisoned TLP" v1-v2 Move header from include/ras/aer_event.h to include/trace/events/ras.h Signed-off-by: Lance Ortiz <lance.ortiz@hp.com> --- include/trace/events/ras.h | 77 ++++++++++++++++++++++++++++++++++++++++++++ 1 files changed, 77 insertions(+), 0 deletions(-) create mode 100644 include/trace/events/ras.h -- To unsubscribe from this list: send the line "unsubscribe linux-pci" in the body of a message to majordomo@vger.kernel.org More majordomo info at http://vger.kernel.org/majordomo-info.html