Message ID: 20240125184857.851355-1-avadhut.naik@amd.com
Series: Update mce_record tracepoint
On Thu, Jan 25, 2024 at 12:48:55PM -0600, Avadhut Naik wrote:
> This patchset updates the mce_record tracepoint so that the recently added
> fields of struct mce are exported through it to userspace.
>
> The first patch adds PPIN (Protected Processor Inventory Number) field to
> the tracepoint.
>
> The second patch adds the microcode field (Microcode Revision) to the
> tracepoint.

This is a lot of static information to add to *every* MCE.

And where does it end? Stick full dmesg in the tracepoint too?

What is the real-life use case here?
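For context, the change under discussion amounts to carrying two more members of struct mce in the trace record. A minimal, abridged sketch of what such an extension to TRACE_EVENT(mce_record) in include/trace/events/mce.h could look like (illustrative only, not the literal patches; the existing fields and the full format string are elided):

    TRACE_EVENT(mce_record,

        TP_PROTO(struct mce *m),

        TP_ARGS(m),

        TP_STRUCT__entry(
            /* ... existing fields (status, addr, misc, ...) ... */
            __field(u64, ppin)      /* 8 bytes */
            __field(u32, microcode) /* 4 bytes */
        ),

        TP_fast_assign(
            /* ... existing assignments ... */
            __entry->ppin      = m->ppin;
            __entry->microcode = m->microcode;
        ),

        TP_printk("... PPIN %llx, microcode %x",
                  __entry->ppin, __entry->microcode)
    );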
> > The first patch adds PPIN (Protected Processor Inventory Number) field to
> > the tracepoint.
> >
> > The second patch adds the microcode field (Microcode Revision) to the
> > tracepoint.
>
> This is a lot of static information to add to *every* MCE.

8 bytes for PPIN, 4 more for microcode.

Number of recoverable machine checks per system .... I hope the monthly rate should
be countable on my fingers. If a system is getting more than that, then people should
be looking at fixing the underlying problem.

Corrected errors are much more common. Though Linux takes action to limit the
rate when storms occur. So maybe hundreds or small numbers of thousands of
error trace records? Increase in trace buffer consumption still measured in Kbytes
not Mbytes. Server systems that do machine check reporting now start at tens of
GBytes memory.

> And where does it end? Stick full dmesg in the tracepoint too?

Seems like overkill.

> What is the real-life use case here?

Systems using rasdaemon to track errors will be able to track both of these
(I assume that Naik has plans to update rasdaemon to capture and save these
new fields).

PPIN is useful when talking to the CPU vendor about patterns of similar errors
seen across a cluster.

MICROCODE - gives a fast path to root cause problems that have already
been fixed in a microcode update.

-Tony
Hi,

On 1/25/2024 1:19 PM, Luck, Tony wrote:
>>> The first patch adds PPIN (Protected Processor Inventory Number) field to
>>> the tracepoint.
>>>
>>> The second patch adds the microcode field (Microcode Revision) to the
>>> tracepoint.
>>
>> This is a lot of static information to add to *every* MCE.
>
> 8 bytes for PPIN, 4 more for microcode.
>
> Number of recoverable machine checks per system .... I hope the monthly rate should
> be countable on my fingers. If a system is getting more than that, then people should
> be looking at fixing the underlying problem.
>
> Corrected errors are much more common. Though Linux takes action to limit the
> rate when storms occur. So maybe hundreds or small numbers of thousands of
> error trace records? Increase in trace buffer consumption still measured in Kbytes
> not Mbytes. Server systems that do machine check reporting now start at tens of
> GBytes memory.
>
>> And where does it end? Stick full dmesg in the tracepoint too?
>
> Seems like overkill.
>
>> What is the real-life use case here?
>
> Systems using rasdaemon to track errors will be able to track both of these
> (I assume that Naik has plans to update rasdaemon to capture and save these
> new fields).
>

Yes, I do intend to submit a pull request to the rasdaemon to parse and log
these new fields.

> PPIN is useful when talking to the CPU vendor about patterns of similar errors
> seen across a cluster.
>
> MICROCODE - gives a fast path to root cause problems that have already
> been fixed in a microcode update.
>
> -Tony
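On the rasdaemon side, tracepoint fields are looked up by name through libtraceevent, so consuming the new fields could look roughly like the sketch below (a hypothetical helper, assuming libtraceevent's tep_get_field_val() and trace_seq_printf(); only the field names "ppin" and "microcode" come from the series itself):

    #include <traceevent/event-parse.h>

    /* Hypothetical sketch: fetch the new mce_record fields by name. */
    static void mce_report_new_fields(struct trace_seq *s,
                                      struct tep_event *event,
                                      struct tep_record *record)
    {
        unsigned long long val;

        /* Lookups are by name, so they simply fail on kernels without the fields. */
        if (tep_get_field_val(s, event, "ppin", record, &val, 0) == 0)
            trace_seq_printf(s, "ppin=0x%llx ", val);

        if (tep_get_field_val(s, event, "microcode", record, &val, 0) == 0)
            trace_seq_printf(s, "microcode=0x%llx ", val);
    }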
On Thu, Jan 25, 2024 at 07:19:22PM +0000, Luck, Tony wrote:
> 8 bytes for PPIN, 4 more for microcode.

I know, nothing leads to bloat like 0.01% here, 0.001% there...

> Number of recoverable machine checks per system .... I hope the
> monthly rate should be countable on my fingers...

That's not the point. Rather, when you look at MCE reports, you pretty
much always go and collect additional information from the target
machine because you want to figure out what exactly is going on.

So what's stopping you from collecting all that static information
instead of parroting it through the tracepoint with each error?

> PPIN is useful when talking to the CPU vendor about patterns of
> similar errors seen across a cluster.

I guess that is perhaps the only thing of the two that makes some sense
at least - the identifier uniquely describes which CPU the error comes
from...

> MICROCODE - gives a fast path to root cause problems that have already
> been fixed in a microcode update.

But that, nah. See above.

Thx.
> > 8 bytes for PPIN, 4 more for microcode.
>
> I know, nothing leads to bloat like 0.01% here, 0.001% there...

12 extra bytes divided by (say) 64GB (a very small server these days, my laptop has that much)
= 0.00000001746%

We will need 57000 changes like this one before we get to 0.001% :-)

> > Number of recoverable machine checks per system .... I hope the
> > monthly rate should be countable on my fingers...
>
> That's not the point. Rather, when you look at MCE reports, you pretty
> much always go and collect additional information from the target
> machine because you want to figure out what exactly is going on.
>
> So what's stopping you from collecting all that static information
> instead of parroting it through the tracepoint with each error?

PPIN is static. So with good tracking to keep source platform information
attached to the error record as it gets passed around among the people trying
to triage the problem, you could say it can be retrieved later (or earlier,
when setting up a database of attributes of each machine in the fleet). But
the key there is keeping the details of the source machine attached to
the error record. My first contact with machine check debugging is always
just the raw error record (from mcelog, rasdaemon, or console log).

> > PPIN is useful when talking to the CPU vendor about patterns of
> > similar errors seen across a cluster.
>
> I guess that is perhaps the only thing of the two that makes some sense
> at least - the identifier uniquely describes which CPU the error comes
> from...
>
> > MICROCODE - gives a fast path to root cause problems that have already
> > been fixed in a microcode update.
>
> But that, nah. See above.

Knowing which microcode version was loaded on a core *at the time of
the error* is critical. You've spent enough time with Ashok and Thomas
tweaking the Linux microcode driver to know that going back to the machine
the next day to ask about microcode version has a bunch of ways to get a
wrong answer.

-Tony
On Fri, Jan 26, 2024 at 05:10:20PM +0000, Luck, Tony wrote:
> 12 extra bytes divided by (say) 64GB (a very small server these days, my laptop has that much)
> = 0.00000001746%
>
> We will need 57000 changes like this one before we get to 0.001% :-)

You're forgetting that those 12 bytes repeat per MCE tracepoint logged.

And there's other code which adds more 0.01% here and there, well, because
we can.

> But the key there is keeping the details of the source machine attached to
> the error record. My first contact with machine check debugging is always
> just the raw error record (from mcelog, rasdaemon, or console log).

Yes, that is a somewhat sensible reason to have the PPIN together with the
MCE record.

> Knowing which microcode version was loaded on a core *at the time of
> the error* is critical.

So is the rest of the debug info you're going to need from that machine. And
yet we're not adding that to the tracepoint.

> You've spent enough time with Ashok and Thomas tweaking the Linux
> microcode driver to know that going back to the machine the next day
> to ask about microcode version has a bunch of ways to get a wrong
> answer.

Huh, what does that have to do with this?

IIUC, if someone changes something on the system, whether that is
updating microcode or swapping a harddrive or swapping memory or
whatever, that invalidates the errors reported, pretty much.

You can't put it all in the trace record, you just can't.
> > You've spent enough time with Ashok and Thomas tweaking the Linux
> > microcode driver to know that going back to the machine the next day
> > to ask about microcode version has a bunch of ways to get a wrong
> > answer.
>
> Huh, what does that have to do with this?

If deployment of a microcode update across a fleet always went
flawlessly, life would be simpler. But things can fail. And maybe the
failure wasn't noticed. Perhaps a node was rebooting when the sysadmin
pushed the update to the fleet and so missed the deployment. Perhaps
one core was already acting weird and the microcode update didn't get
applied to that core.

> IIUC, if someone changes something on the system, whether that is
> updating microcode or swapping a harddrive or swapping memory or
> whatever, that invalidates the errors reported, pretty much.
>
> You can't put it all in the trace record, you just can't.

Swapping a hard drive, or hot plugging a NIC isn't very likely
to correlate with an error reported by the CPU in machine
check banks. But microcode can be (and has been) the issue in
enough cases that knowing the version at the time of the error matters.

You seemed to agree with this argument when the microcode field was added
to "struct mce" back in 2018:

  fa94d0c6e0f3 ("x86/MCE: Save microcode revision in machine check records")

Is it so very different to add this to a trace record so that rasdaemon
can have feature parity with mcelog(8)?

-Tony
On Fri, Jan 26, 2024 at 07:15:50PM +0000, Luck, Tony wrote:
> If deployment of a microcode update across a fleet always went
> flawlessly, life would be simpler. But things can fail. And maybe the
> failure wasn't noticed. Perhaps a node was rebooting when the sysadmin
> pushed the update to the fleet and so missed the deployment. Perhaps
> one core was already acting weird and the microcode update didn't get
> applied to that core.

Yes, and you go collect data from that box. You will have to anyway to
figure out why the microcode didn't update.

> Swapping a hard drive, or hot plugging a NIC isn't very likely
> to correlate with an error reported by the CPU in machine
> check banks.

Ofc it is - a coherent probe timing out due to a problematic insertion could
be reported with an MCE, and so on and so on.

> Is it so very different to add this to a trace record so that rasdaemon
> can have feature parity with mcelog(8)?

I knew you were gonna say that. When someone decides that it is
a splendid idea to add more stuff to struct mce then said someone would
want it in the tracepoint too.

And then we're back to my original question:

"And where does it end? Stick full dmesg in the tracepoint too?"

Where do you draw the line in the sand and say: no more, especially
static, fields bloating the trace record should be added, and from then
on you should go collect the info from that box. Something which you're
supposed to do anyway.
> > Is it so very different to add this to a trace record so that rasdaemon
> > can have feature parity with mcelog(8)?
>
> I knew you were gonna say that. When someone decides that it is
> a splendid idea to add more stuff to struct mce then said someone would
> want it in the tracepoint too.
>
> And then we're back to my original question:
>
> "And where does it end? Stick full dmesg in the tracepoint too?"
>
> Where do you draw the line in the sand and say: no more, especially
> static, fields bloating the trace record should be added, and from then
> on you should go collect the info from that box. Something which you're
> supposed to do anyway.

Every patch that adds new code or data structures adds to the kernel
memory footprint. Each should be considered on its merits. The basic
question being:

"Is the new functionality worth the cost?"

Where does it end? It would end if Linus declared:

"Linux is now complete. Stop sending patches".

I.e. it is never going to end.

If somebody posts a patch asking to add the full dmesg to a tracepoint,
I'll stand with you to say: "Not only no, but hell no".

So for Naik's two patches we have:

1) PPIN
   Cost = 8 bytes.
   Benefit: Embeds a system identifier into the trace record so there
   can be no ambiguity about which machine generated this error.
   Also definitively indicates which socket on a multi-socket system.

2) MICROCODE
   Cost = 4 bytes.
   Benefit: Certainty about the microcode version active on the core
   at the time the error was detected.

RAS = Reliability, Availability, Serviceability

These changes fall into the serviceability bucket. They make it
easier to diagnose what went wrong.

-Tony
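For reference, the 8 and 4 bytes above correspond to struct mce members that already exist in the exported header; roughly (an excerpt, assuming the current arch/x86/include/uapi/asm/mce.h layout, with the other fields elided):

    struct mce {
        /* ... */
        __u64 ppin;      /* Protected Processor Inventory Number */
        __u32 microcode; /* Microcode revision */
        /* ... */
    };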
On Fri, Jan 26, 2024 at 08:49:03PM +0000, Luck, Tony wrote:
> Every patch that adds new code or data structures adds to the kernel
> memory footprint. Each should be considered on its merits. The basic
> question being:
>
> "Is the new functionality worth the cost?"
>
> Where does it end? It would end if Linus declared:
>
> "Linux is now complete. Stop sending patches".
>
> I.e. it is never going to end.

No, it's not that - it is the merit thing which determines.

> 1) PPIN
>    Cost = 8 bytes.
>    Benefit: Embeds a system identifier into the trace record so there
>    can be no ambiguity about which machine generated this error.
>    Also definitively indicates which socket on a multi-socket system.
>
> 2) MICROCODE
>    Cost = 4 bytes.
>    Benefit: Certainty about the microcode version active on the core
>    at the time the error was detected.
>
> RAS = Reliability, Availability, Serviceability
>
> These changes fall into the serviceability bucket. They make it
> easier to diagnose what went wrong.

So does dmesg. Let's add it to the tracepoint...

But no, that's not the right question to ask.

It is rather: which bits of information are very relevant to an error
record and which are transient enough that they cannot be gathered
from a system by other means, or can only be gathered in a difficult way,
and so should be part of that record.

The PPIN is not transient but you have to go map ->extcpu to the PPIN so
adding it to the tracepoint is purely a convenience thing. More or less.

The microcode revision thing I still don't buy but it is already there
so whateva...

So we'd need a rule hammered out and put there in a prominent place so
that it is clear what goes into struct mce and what not.

Thx.
> But no, that's not the right question to ask.
>
> It is rather: which bits of information are very relevant to an error
> record and which are transient enough that they cannot be gathered
> from a system by other means, or can only be gathered in a difficult way,
> and so should be part of that record.
>
> The PPIN is not transient but you have to go map ->extcpu to the PPIN so
> adding it to the tracepoint is purely a convenience thing. More or less.
>
> The microcode revision thing I still don't buy but it is already there
> so whateva...
>
> So we'd need a rule hammered out and put there in a prominent place so
> that it is clear what goes into struct mce and what not.

My personal evaluation of the value of these two additions to the trace record:

PPIN: Nice to have. People that send stuff to me are terrible about
providing surrounding details. (The record already includes
CPUID(1).EAX ... so I can at least skip the step of asking them which
CPU family/model/stepping they were using.) But PPIN can be recovered
(so long as the submitter kept good records about which system
generated the record).

MICROCODE: Must have. Microcode version can be changed at run time.
Going back to the system to check later may not give the correct
answer to what was active at the time of the error. Especially for an
error reported while a microcode update is walking across the CPUs
poking the MSR on each in turn.

-Tony
On Fri, Jan 26, 2024 at 10:01:29PM +0000, Luck, Tony wrote:
> PPIN: Nice to have. People that send stuff to me are terrible about
> providing surrounding details. (The record already includes
> CPUID(1).EAX ... so I can at least skip the step of asking them which
> CPU family/model/stepping they were using.) But PPIN can be recovered
> (so long as the submitter kept good records about which system
> generated the record).

Yes.

> MICROCODE: Must have. Microcode version can be changed at run time.
> Going back to the system to check later may not give the correct
> answer to what was active at the time of the error. Especially for an
> error reported while a microcode update is walking across the CPUs
> poking the MSR on each in turn.

Easy:

- You've got an MCE? Was it during scheduled microcode updates?
- Yes.
- Come back to me when it happens again, *outside* of the microcode update
  schedule.

Anyway, I still don't buy that. Maybe I'm wrong and maybe I need to talk to
data center operators more, but this sounds like microcode updates failing is
such a common thing that we *absolutely* *must* capture the microcode
revision when an MCE happens.

Maybe we should make microcode updates more resilient and add a retry
mechanism which doesn't back off as easily. Or maybe people should script
around it and keep retrying, dunno.

In my world, a microcode update just works in the vast majority of cases and
if it doesn't, then those cases need a specific look.

And if I am debugging an issue and I want to see the microcode revision, I
look at /proc/cpuinfo.

Thx.
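The "look at /proc/cpuinfo" step is just reading the per-CPU microcode lines the kernel already exports; a minimal standalone sketch (plain C, nothing assumed beyond procfs):

    #include <stdio.h>
    #include <string.h>

    /* Print each logical CPU number and its currently loaded microcode revision. */
    int main(void)
    {
        char line[256];
        FILE *f = fopen("/proc/cpuinfo", "r");

        if (!f) {
            perror("/proc/cpuinfo");
            return 1;
        }

        while (fgets(line, sizeof(line), f)) {
            if (!strncmp(line, "processor", 9) ||
                !strncmp(line, "microcode", 9))
                fputs(line, stdout);
        }

        fclose(f);
        return 0;
    }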