Message ID: 20240125184857.851355-1-avadhut.naik@amd.com
Series: Update mce_record tracepoint
On Thu, Jan 25, 2024 at 12:48:55PM -0600, Avadhut Naik wrote:
> This patchset updates the mce_record tracepoint so that the recently added
> fields of struct mce are exported through it to userspace.
>
> The first patch adds PPIN (Protected Processor Inventory Number) field to
> the tracepoint.
>
> The second patch adds the microcode field (Microcode Revision) to the
> tracepoint.

This is a lot of static information to add to *every* MCE.

And where does it end? Stick full dmesg in the tracepoint too?

What is the real-life use case here?
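For context, the change under discussion amounts to carrying two more members of struct mce in the trace record. A minimal, abridged sketch of what such an extension to TRACE_EVENT(mce_record) in include/trace/events/mce.h could look like (illustrative only, not the literal patches; the existing fields and the full format string are elided):

    TRACE_EVENT(mce_record,

        TP_PROTO(struct mce *m),

        TP_ARGS(m),

        TP_STRUCT__entry(
            /* ... existing fields (status, addr, misc, ...) ... */
            __field(u64, ppin)      /* 8 bytes */
            __field(u32, microcode) /* 4 bytes */
        ),

        TP_fast_assign(
            /* ... existing assignments ... */
            __entry->ppin      = m->ppin;
            __entry->microcode = m->microcode;
        ),

        TP_printk("... PPIN %llx, microcode %x",
                  __entry->ppin, __entry->microcode)
    );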
> > The first patch adds PPIN (Protected Processor Inventory Number) field to
> > the tracepoint.
> >
> > The second patch adds the microcode field (Microcode Revision) to the
> > tracepoint.
>
> This is a lot of static information to add to *every* MCE.

8 bytes for PPIN, 4 more for microcode.

Number of recoverable machine checks per system .... I hope the monthly rate should
be countable on my fingers. If a system is getting more than that, then people should
be looking at fixing the underlying problem.

Corrected errors are much more common. Though Linux takes action to limit the
rate when storms occur. So maybe hundreds or small numbers of thousands of
error trace records? Increase in trace buffer consumption still measured in Kbytes
not Mbytes. Server systems that do machine check reporting now start at tens of
GBytes memory.

> And where does it end? Stick full dmesg in the tracepoint too?

Seems like overkill.

> What is the real-life use case here?

Systems using rasdaemon to track errors will be able to track both of these
(I assume that Naik has plans to update rasdaemon to capture and save these
new fields).

PPIN is useful when talking to the CPU vendor about patterns of similar errors
seen across a cluster.

MICROCODE - gives a fast path to root cause problems that have already
been fixed in a microcode update.

-Tony
Hi,

On 1/25/2024 1:19 PM, Luck, Tony wrote:
>>> The first patch adds PPIN (Protected Processor Inventory Number) field to
>>> the tracepoint.
>>>
>>> The second patch adds the microcode field (Microcode Revision) to the
>>> tracepoint.
>>
>> This is a lot of static information to add to *every* MCE.
>
> 8 bytes for PPIN, 4 more for microcode.
>
> Number of recoverable machine checks per system .... I hope the monthly rate should
> be countable on my fingers. If a system is getting more than that, then people should
> be looking at fixing the underlying problem.
>
> Corrected errors are much more common. Though Linux takes action to limit the
> rate when storms occur. So maybe hundreds or small numbers of thousands of
> error trace records? Increase in trace buffer consumption still measured in Kbytes
> not Mbytes. Server systems that do machine check reporting now start at tens of
> GBytes memory.
>
>> And where does it end? Stick full dmesg in the tracepoint too?
>
> Seems like overkill.
>
>> What is the real-life use case here?
>
> Systems using rasdaemon to track errors will be able to track both of these
> (I assume that Naik has plans to update rasdaemon to capture and save these
> new fields).
>

Yes, I do intend to submit a pull request to the rasdaemon to parse and log
these new fields.

> PPIN is useful when talking to the CPU vendor about patterns of similar errors
> seen across a cluster.
>
> MICROCODE - gives a fast path to root cause problems that have already
> been fixed in a microcode update.
>
> -Tony
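On the rasdaemon side, tracepoint fields are looked up by name through libtraceevent, so consuming the new fields could look roughly like the sketch below (a hypothetical helper, assuming libtraceevent's tep_get_field_val() and trace_seq_printf(); only the field names "ppin" and "microcode" come from the series itself):

    #include <traceevent/event-parse.h>

    /* Hypothetical sketch: fetch the new mce_record fields by name. */
    static void mce_report_new_fields(struct trace_seq *s,
                                      struct tep_event *event,
                                      struct tep_record *record)
    {
        unsigned long long val;

        /* Lookups are by name, so they simply fail on kernels without the fields. */
        if (tep_get_field_val(s, event, "ppin", record, &val, 0) == 0)
            trace_seq_printf(s, "ppin=0x%llx ", val);

        if (tep_get_field_val(s, event, "microcode", record, &val, 0) == 0)
            trace_seq_printf(s, "microcode=0x%llx ", val);
    }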
On Thu, Jan 25, 2024 at 07:19:22PM +0000, Luck, Tony wrote:
> 8 bytes for PPIN, 4 more for microcode.

I know, nothing leads to bloat like 0.01% here, 0.001% there...

> Number of recoverable machine checks per system .... I hope the
> monthly rate should be countable on my fingers...

That's not the point. Rather, when you look at MCE reports, you pretty
much always go and collect additional information from the target
machine because you want to figure out what exactly is going on.

So what's stopping you from collecting all that static information
instead of parroting it through the tracepoint with each error?

> PPIN is useful when talking to the CPU vendor about patterns of
> similar errors seen across a cluster.

I guess that is perhaps the only thing of the two that makes some sense
at least - the identifier uniquely describes which CPU the error comes
from...

> MICROCODE - gives a fast path to root cause problems that have already
> been fixed in a microcode update.

But that, nah. See above.

Thx.
> > 8 bytes for PPIN, 4 more for microcode.
>
> I know, nothing leads to bloat like 0.01% here, 0.001% there...

12 extra bytes divided by (say) 64GB (a very small server these days, my laptop has that much)
= 0.00000001746%

We will need 57000 changes like this one before we get to 0.001% :-)

> > Number of recoverable machine checks per system .... I hope the
> > monthly rate should be countable on my fingers...
>
> That's not the point. Rather, when you look at MCE reports, you pretty
> much always go and collect additional information from the target
> machine because you want to figure out what exactly is going on.
>
> So what's stopping you from collecting all that static information
> instead of parroting it through the tracepoint with each error?

PPIN is static. So with good tracking to keep source platform information
attached to the error record as it gets passed around among the people trying
to triage the problem, you could say it can be retrieved later (or earlier,
when setting up a database of attributes of each machine in the fleet). But
the key there is keeping the details of the source machine attached to
the error record. My first contact with machine check debugging is always
just the raw error record (from mcelog, rasdaemon, or console log).

> > PPIN is useful when talking to the CPU vendor about patterns of
> > similar errors seen across a cluster.
>
> I guess that is perhaps the only thing of the two that makes some sense
> at least - the identifier uniquely describes which CPU the error comes
> from...
>
> > MICROCODE - gives a fast path to root cause problems that have already
> > been fixed in a microcode update.
>
> But that, nah. See above.

Knowing which microcode version was loaded on a core *at the time of
the error* is critical. You've spent enough time with Ashok and Thomas
tweaking the Linux microcode driver to know that going back to the machine
the next day to ask about microcode version has a bunch of ways to get a
wrong answer.

-Tony
On Fri, Jan 26, 2024 at 05:10:20PM +0000, Luck, Tony wrote:
> 12 extra bytes divided by (say) 64GB (a very small server these days, my laptop has that much)
> = 0.00000001746%
>
> We will need 57000 changes like this one before we get to 0.001% :-)

You're forgetting that those 12 bytes repeat per MCE tracepoint logged.

And there's other code which adds more 0.01% here and there, well, because
we can.

> But the key there is keeping the details of the source machine attached to
> the error record. My first contact with machine check debugging is always
> just the raw error record (from mcelog, rasdaemon, or console log).

Yes, that is a somewhat sensible reason to have the PPIN together with the
MCE record.

> Knowing which microcode version was loaded on a core *at the time of
> the error* is critical.

So is the rest of the debug info you're going to need from that machine. And
yet we're not adding that to the tracepoint.

> You've spent enough time with Ashok and Thomas tweaking the Linux
> microcode driver to know that going back to the machine the next day
> to ask about microcode version has a bunch of ways to get a wrong
> answer.

Huh, what does that have to do with this?

IIUC, if someone changes something on the system, whether that is
updating microcode or swapping a harddrive or swapping memory or
whatever, that invalidates the errors reported, pretty much.

You can't put it all in the trace record, you just can't.
> > You've spent enough time with Ashok and Thomas tweaking the Linux
> > microcode driver to know that going back to the machine the next day
> > to ask about microcode version has a bunch of ways to get a wrong
> > answer.
>
> Huh, what does that have to do with this?

If deployment of a microcode update across a fleet always went
flawlessly, life would be simpler. But things can fail. And maybe the
failure wasn't noticed. Perhaps a node was rebooting when the sysadmin
pushed the update to the fleet and so missed the deployment. Perhaps
one core was already acting weird and the microcode update didn't get
applied to that core.

> IIUC, if someone changes something on the system, whether that is
> updating microcode or swapping a harddrive or swapping memory or
> whatever, that invalidates the errors reported, pretty much.
>
> You can't put it all in the trace record, you just can't.

Swapping a hard drive, or hot plugging a NIC isn't very likely
to correlate with an error reported by the CPU in machine
check banks. But microcode can be (and has been) the issue in
enough cases that knowing the version at the time of the error matters.

You seemed to agree with this argument when the microcode field was added
to "struct mce" back in 2018:

  fa94d0c6e0f3 ("x86/MCE: Save microcode revision in machine check records")

Is it so very different to add this to a trace record so that rasdaemon
can have feature parity with mcelog(8)?

-Tony
On Fri, Jan 26, 2024 at 07:15:50PM +0000, Luck, Tony wrote:
> If deployment of a microcode update across a fleet always went
> flawlessly, life would be simpler. But things can fail. And maybe the
> failure wasn't noticed. Perhaps a node was rebooting when the sysadmin
> pushed the update to the fleet and so missed the deployment. Perhaps
> one core was already acting weird and the microcode update didn't get
> applied to that core.

Yes, and you go collect data from that box. You will have to anyway to
figure out why the microcode didn't update.

> Swapping a hard drive, or hot plugging a NIC isn't very likely
> to correlate with an error reported by the CPU in machine
> check banks.

Ofc it is - a coherent probe timing out due to a problematic insertion could
be reported with an MCE, and so on and so on.

> Is it so very different to add this to a trace record so that rasdaemon
> can have feature parity with mcelog(8)?

I knew you were gonna say that. When someone decides that it is
a splendid idea to add more stuff to struct mce then said someone would
want it in the tracepoint too.

And then we're back to my original question:

"And where does it end? Stick full dmesg in the tracepoint too?"

Where do you draw the line in the sand and say: no more, especially
static, fields bloating the trace record should be added, and from then
on you should go collect the info from that box. Something which you're
supposed to do anyway.
> > Is it so very different to add this to a trace record so that rasdaemon
> > can have feature parity with mcelog(8)?
>
> I knew you were gonna say that. When someone decides that it is
> a splendid idea to add more stuff to struct mce then said someone would
> want it in the tracepoint too.
>
> And then we're back to my original question:
>
> "And where does it end? Stick full dmesg in the tracepoint too?"
>
> Where do you draw the line in the sand and say: no more, especially
> static, fields bloating the trace record should be added, and from then
> on you should go collect the info from that box. Something which you're
> supposed to do anyway.

Every patch that adds new code or data structures adds to the kernel
memory footprint. Each should be considered on its merits. The basic
question being:

"Is the new functionality worth the cost?"

Where does it end? It would end if Linus declared:

"Linux is now complete. Stop sending patches".

I.e. it is never going to end.

If somebody posts a patch asking to add the full dmesg to a tracepoint,
I'll stand with you to say: "Not only no, but hell no".

So for Naik's two patches we have:

1) PPIN
   Cost = 8 bytes.
   Benefit: Embeds a system identifier into the trace record so there
   can be no ambiguity about which machine generated this error.
   Also definitively indicates which socket on a multi-socket system.

2) MICROCODE
   Cost = 4 bytes.
   Benefit: Certainty about the microcode version active on the core
   at the time the error was detected.

RAS = Reliability, Availability, Serviceability

These changes fall into the serviceability bucket. They make it
easier to diagnose what went wrong.

-Tony
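For reference, the 8 and 4 bytes above correspond to struct mce members that already exist in the exported header; roughly (an excerpt, assuming the current arch/x86/include/uapi/asm/mce.h layout, with the other fields elided):

    struct mce {
        /* ... */
        __u64 ppin;      /* Protected Processor Inventory Number */
        __u32 microcode; /* Microcode revision */
        /* ... */
    };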
On Fri, Jan 26, 2024 at 08:49:03PM +0000, Luck, Tony wrote:
> Every patch that adds new code or data structures adds to the kernel
> memory footprint. Each should be considered on its merits. The basic
> question being:
>
> "Is the new functionality worth the cost?"
>
> Where does it end? It would end if Linus declared:
>
> "Linux is now complete. Stop sending patches".
>
> I.e. it is never going to end.

No, it's not that - it is the merit thing which determines.

> 1) PPIN
>    Cost = 8 bytes.
>    Benefit: Embeds a system identifier into the trace record so there
>    can be no ambiguity about which machine generated this error.
>    Also definitively indicates which socket on a multi-socket system.
>
> 2) MICROCODE
>    Cost = 4 bytes.
>    Benefit: Certainty about the microcode version active on the core
>    at the time the error was detected.
>
> RAS = Reliability, Availability, Serviceability
>
> These changes fall into the serviceability bucket. They make it
> easier to diagnose what went wrong.

So does dmesg. Let's add it to the tracepoint...

But no, that's not the right question to ask.

It is rather: which bits of information are very relevant to an error
record and which are transient enough that they cannot be gathered
from a system by other means, or can only be gathered in a difficult way,
and so should be part of that record.

The PPIN is not transient but you have to go map ->extcpu to the PPIN so
adding it to the tracepoint is purely a convenience thing. More or less.

The microcode revision thing I still don't buy but it is already there
so whateva...

So we'd need a rule hammered out and put there in a prominent place so
that it is clear what goes into struct mce and what not.

Thx.
> But no, that's not the right question to ask.
>
> It is rather: which bits of information are very relevant to an error
> record and which are transient enough that they cannot be gathered
> from a system by other means, or can only be gathered in a difficult way,
> and so should be part of that record.
>
> The PPIN is not transient but you have to go map ->extcpu to the PPIN so
> adding it to the tracepoint is purely a convenience thing. More or less.
>
> The microcode revision thing I still don't buy but it is already there
> so whateva...
>
> So we'd need a rule hammered out and put there in a prominent place so
> that it is clear what goes into struct mce and what not.

My personal evaluation of the value of these two additions to the trace record:

PPIN: Nice to have. People that send stuff to me are terrible about
providing surrounding details. (The record already includes
CPUID(1).EAX ... so I can at least skip the step of asking them which
CPU family/model/stepping they were using.) But PPIN can be recovered
(so long as the submitter kept good records about which system
generated the record).

MICROCODE: Must have. Microcode version can be changed at run time.
Going back to the system to check later may not give the correct
answer to what was active at the time of the error. Especially for an
error reported while a microcode update is walking across the CPUs
poking the MSR on each in turn.

-Tony
On Fri, Jan 26, 2024 at 10:01:29PM +0000, Luck, Tony wrote:
> PPIN: Nice to have. People that send stuff to me are terrible about
> providing surrounding details. (The record already includes
> CPUID(1).EAX ... so I can at least skip the step of asking them which
> CPU family/model/stepping they were using.) But PPIN can be recovered
> (so long as the submitter kept good records about which system
> generated the record).

Yes.

> MICROCODE: Must have. Microcode version can be changed at run time.
> Going back to the system to check later may not give the correct
> answer to what was active at the time of the error. Especially for an
> error reported while a microcode update is walking across the CPUs
> poking the MSR on each in turn.

Easy:

- You've got an MCE? Was it during scheduled microcode updates?
- Yes.
- Come back to me when it happens again, *outside* of the microcode update
  schedule.

Anyway, I still don't buy that. Maybe I'm wrong and maybe I need to talk to
data center operators more, but this sounds like microcode updates failing is
such a common thing that we *absolutely* *must* capture the microcode
revision when an MCE happens.

Maybe we should make microcode updates more resilient and add a retry
mechanism which doesn't back off as easily. Or maybe people should script
around it and keep retrying, dunno.

In my world, a microcode update just works in the vast majority of cases and
if it doesn't, then those cases need a specific look.

And if I am debugging an issue and I want to see the microcode revision, I
look at /proc/cpuinfo.

Thx.
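The "look at /proc/cpuinfo" step is just reading the per-CPU microcode lines the kernel already exports; a minimal standalone sketch (plain C, nothing assumed beyond procfs):

    #include <stdio.h>
    #include <string.h>

    /* Print each logical CPU number and its currently loaded microcode revision. */
    int main(void)
    {
        char line[256];
        FILE *f = fopen("/proc/cpuinfo", "r");

        if (!f) {
            perror("/proc/cpuinfo");
            return 1;
        }

        while (fgets(line, sizeof(line), f)) {
            if (!strncmp(line, "processor", 9) ||
                !strncmp(line, "microcode", 9))
                fputs(line, stdout);
        }

        fclose(f);
        return 0;
    }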