[v6,1/2] acpi: apei: Rename ghes_severity() to ghes_cper_severity()

Message ID 20180521135003.32459-2-mr.nuke.me@gmail.com (mailing list archive)
State Rejected, archived

Commit Message

Alex G. May 21, 2018, 1:49 p.m. UTC
ghes_severity() is a misnomer in this case, as it implies the severity
of the entire GHES structure. Instead, it maps one CPER value to a
GHES_SEV* value. ghes_cper_severity() is clearer.

Signed-off-by: Alexandru Gagniuc <mr.nuke.me@gmail.com>
---
 drivers/acpi/apei/ghes.c | 17 ++++++++---------
 1 file changed, 8 insertions(+), 9 deletions(-)
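
For reference, the helper being renamed is the small switch that maps a
CPER severity value onto the corresponding GHES_SEV_* constant; its body
is unchanged by this patch. A rough sketch (case values paraphrased from
the kernel source, not shown in this diff):

static inline int ghes_cper_severity(int severity)
{
	switch (severity) {
	case CPER_SEV_INFORMATIONAL:
		return GHES_SEV_NO;
	case CPER_SEV_CORRECTED:
		return GHES_SEV_CORRECTED;
	case CPER_SEV_RECOVERABLE:
		return GHES_SEV_RECOVERABLE;
	case CPER_SEV_FATAL:
	default:
		/* Unknown severities are treated as worst case. */
		return GHES_SEV_PANIC;
	}
}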

Comments

Rafael J. Wysocki May 22, 2018, 8:55 a.m. UTC | #1
On Mon, May 21, 2018 at 3:49 PM, Alexandru Gagniuc <mr.nuke.me@gmail.com> wrote:
> ghes_severity() is a misnomer in this case, as it implies the severity
> of the entire GHES structure. Instead, it maps one CPER value to a
> GHES_SEV* value. ghes_cper_severity() is clearer.

It looks like the *real* reason for this change is that you
re-introduce ghes_severity() as a different function in the second
patch.

There are a couple of reasons to avoid that, one of them being that
people will now have to remember what the function did in which kernel
versions.  Also, the current name is good enough IMO, so I'm not going
to apply this patch.
Alex G. May 22, 2018, 1:38 p.m. UTC | #2
On 05/22/2018 03:55 AM, Rafael J. Wysocki wrote:
> On Mon, May 21, 2018 at 3:49 PM, Alexandru Gagniuc <mr.nuke.me@gmail.com> wrote:
>> ghes_severity() is a misnomer in this case, as it implies the severity
>> of the entire GHES structure. Instead, it maps one CPER value to a
>> GHES_SEV* value. ghes_cper_severity() is clearer.
> 
> It looks like the *real* reason for this change is that you
> re-introduce ghes_severity() as a different function in the second
> patch.

/me holds fist at Borislav

> There are a couple of reasons to avoid that, one of them being that
> people will now have to remember what the function did in which kernel
> versions. 

So?

> Also, the current name is good enough IMO,

Two other reviewers were extremely confused by the vague name, so no,
this is not good enough.

> so I'm not going to apply this patch.


Borislav Petkov May 22, 2018, 1:50 p.m. UTC | #3
On Tue, May 22, 2018 at 08:38:39AM -0500, Alex G. wrote:
> > It looks like the *real* reason for this change is that you
> > re-introduce ghes_severity() as a different function in the second
> > patch.
> 
> /me holds fist at Borislav

That was a misunderstanding between Rafael and me - we fixed it on IRC.

But this is not the real problem with your approach - it is the marking
of all PCIe errors as recoverable, regardless of the signature. That's a
no-no, IMO.
Alex G. May 22, 2018, 2:39 p.m. UTC | #4
On 05/22/2018 08:50 AM, Borislav Petkov wrote:
> On Tue, May 22, 2018 at 08:38:39AM -0500, Alex G. wrote:
>>> It looks like the *real* reason for this change is that you
>>> re-introduce ghes_severity() as a different function in the second
>>> patch.
>>
>> /me holds fist at Borislav
> 
> That was a misunderstanding between Rafael and me - we fixed it on IRC.

You mean to say this whole time I've been struggling to write emails,
there was an IRC?

> But this is not the real problem with your approach - it is the marking
> of all PCIe errors as recoverable, regardless of the signature. That's a
> no-no, IMO.

No, the problem is with the current approach, not with mine. The problem
is trying to handle the error outside of the existing handler. That's a
no-no, IMO.

Alex
Borislav Petkov May 22, 2018, 2:54 p.m. UTC | #5
On Tue, May 22, 2018 at 09:39:15AM -0500, Alex G. wrote:
> No, the problem is with the current approach, not with mine. The problem
> is trying to handle the error outside of the existing handler. That's a
> no-no, IMO.

Let me save you some time: until you come up with a proper solution for
*all* PCIe errors so that the kernel can correctly decide what to do for
each error based on its actual severity, consider this NAKed.

I don't care about outside or inside of the handler - this thing needs
to be done properly and not just to serve your particular use case of
abrupt removal of devices causing PCIe errors, and punish the rest.

I especially don't want to have the case where a PCIe error is *really*
fatal and then we noodle in some handlers debating about the severity
because it got marked as recoverable intermittently and end up causing
data corruption on the storage device. Here's a real no-no for ya.
Alex G. May 22, 2018, 3:22 p.m. UTC | #6
On 05/22/2018 09:54 AM, Borislav Petkov wrote:
> On Tue, May 22, 2018 at 09:39:15AM -0500, Alex G. wrote:
>> No, the problem is with the current approach, not with mine. The problem
>> is trying to handle the error outside of the existing handler. That's a
>> no-no, IMO.
> 
> Let me save you some time: until you come up with a proper solution for
> *all* PCIe errors so that the kernel can correctly decide what to do for
> each error based on its actual severity, consider this NAKed.

I do have a proper solution for _all_ PCIe errors. In fact, we discussed
several valid approaches already.

> I don't care about outside or inside of the handler 

I do. I have a handler that can handle (no pun intended) errors. I want
to use the same code path in the native and GHES cases. If I allow
ghes.c to make different decisions than aer_do_recovery() would, I've
failed.
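
Concretely, ghes_do_proc() already hands the PCIe section off to that
path; roughly (a sketch abbreviated from the existing kernel code, not
from this patch -- the full version also checks the overall and section
severities first):

	struct cper_sec_pcie *pcie_err = acpi_hest_get_payload(gdata);

	if (pcie_err->validation_bits & CPER_PCIE_VALID_DEVICE_ID &&
	    pcie_err->validation_bits & CPER_PCIE_VALID_AER_INFO) {
		unsigned int devfn = PCI_DEVFN(pcie_err->device_id.device,
					       pcie_err->device_id.function);

		/* Queue the error for the same recovery path native AER uses. */
		aer_recover_queue(pcie_err->device_id.segment,
				  pcie_err->device_id.bus, devfn,
				  cper_severity_to_aer(gdata->error_severity),
				  (struct aer_capability_regs *)pcie_err->aer_info);
	}

All ghes.c should decide is whether to take that path, not what recovery
itself looks like.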

> - this thing needs to be done properly

Exactly!

> and not just to serve your particular use case of
> abrupt removal of devices causing PCIe errors, and punish the rest.

I think you're confused about what I'm actually trying to do. Or maybe
you're confused about how PCIe errors work. That's understandable. PCIe
uses the term "fatal" for errors that may render the link unusable and
may require a link reset, whereas in most other specs "fatal" means "on
fire". I understand your confusion, and I hope I've cleared it up.

You're trying to make the case that surprise removal is my only concern
and use case, because that's the example I gave. It makes your argument
stronger, but it's wrong. You don't know our test setup or all the
things I'm testing for, and whenever I try to tell you, you fall back to
the 'surprise removal' example.

I don't know why you'd think Dell would pay me to work on this if I were
to allow things like silent data corruption to creep in. This isn't a
traditional company from Redmond, Washington.

> I especially don't want to have the case where a PCIe error is *really*
> fatal and then we noodle in some handlers debating about the severity
> because it got marked as recoverable intermittently and end up causing
> data corruption on the storage device. Here's a real no-no for ya.

I especially don't want a kernel maintainer who hasn't even read the
recovery handler (let alone the spec around which the handler was
written) telling me how the recovery handler works and what it's
supposed to do (see, I can be an ass).

PCIe fatal errors really are fatal, in the PCIe sense: they might
require unloading the driver and removing the device. But somebody set
the questionable policy that "fatal"=="panic", and that is completely
inappropriate for a larger class of errors -- PCIe happens to be the
easiest example to pick on.

And even you realize that the argument that a panic() will somehow
prevent data corruption is complete noodle sauce.

Alex

Borislav Petkov May 22, 2018, 3:33 p.m. UTC | #7
On Tue, May 22, 2018 at 10:22:19AM -0500, Alex G. wrote:
> I especially don't want a kernel maintainer who hasn't even read the
> recovery handler (let alone the spec around which the handler was
> written) tell me how the recovery handler works and what it's supposed
> to do (see, I can be an ass).

Well, since you're such an ass and since you know better, go find
someone else to review your crap. I'm done wasting time with you.
Tony Luck May 22, 2018, 5:57 p.m. UTC | #8
On Tue, May 22, 2018 at 04:54:26PM +0200, Borislav Petkov wrote:
> I especially don't want to have the case where a PCIe error is *really*
> fatal and then we noodle in some handlers debating about the severity
> because it got marked as recoverable intermittently and end up causing
> data corruption on the storage device. Here's a real no-no for ya.

All that we have is a message from the BIOS that this is a "fatal"
error.  When did we start trusting the BIOS to give us accurate
information?

PCIe fatal means that the link or the device is broken. But that
seems a poor reason to take down a large server that may have
dozens of devices (some of them set up specifically to handle
errors ... e.g. mirrored disks on separate controllers, or NIC
devices that have been "bonded" together).

So, as long as the action for a "fatal" error is to mark a link
down and offline the device, that seems a pretty reasonable course
of action.

The argument gets a lot more marginal if you simply reset the
link and re-enable the device to "fix" it. That might be enough,
but I don't think the OS has enough data to make the call.

-Tony

P.S. I deliberately put "fatal" in quotes above because to
quote "The Princess Bride" -- "that word, I do not think it
means what you think it means". :-)
Rafael J. Wysocki May 22, 2018, 6:10 p.m. UTC | #9
On Tue, May 22, 2018 at 7:57 PM, Luck, Tony <tony.luck@intel.com> wrote:
> On Tue, May 22, 2018 at 04:54:26PM +0200, Borislav Petkov wrote:
>> I especially don't want to have the case where a PCIe error is *really*
>> fatal and then we noodle in some handlers debating about the severity
>> because it got marked as recoverable intermittently and end up causing
>> data corruption on the storage device. Here's a real no-no for ya.
>
> All that we have is a message from the BIOS that this is a "fatal"
> error.  When did we start trusting the BIOS to give us accurate
> information?

Some time ago, actually.

This is about changing the existing behavior, which AFAICS has been to
treat "fatal" errors reported by the BIOS as a good enough reason to
panic for quite a while.

> PCIe fatal means that the link or the device is broken.

And that may really mean that the component in question is on fire.
We just don't know.

> But that seems a poor reason to take down a large server that may have
> dozens of devices (some of them set up specifically to handle
> errors ... e.g. mirrored disks on separate controllers, or NIC
> devices that have been "bonded" together).
>
> So, as long as the action for a "fatal" error is to mark a link
> down and offline the device, that seems a pretty reasonable course
> of action.
>
> The argument gets a lot more marginal if you simply reset the
> link and re-enable the device to "fix" it. That might be enough,
> but I don't think the OS has enough data to make the call.

Again, that's about changing the existing behavior or the existing policy even.

What exactly has changed to make us consider this now?

Thanks,
Rafael
Alex G. May 22, 2018, 6:13 p.m. UTC | #10
On 05/22/2018 12:57 PM, Luck, Tony wrote:
> On Tue, May 22, 2018 at 04:54:26PM +0200, Borislav Petkov wrote:
>> I especially don't want to have the case where a PCIe error is *really*
>> fatal and then we noodle in some handlers debating about the severity
>> because it got marked as recoverable intermittently and end up causing
>> data corruption on the storage device. Here's a real no-no for ya.
> 
> All that we have is a message from the BIOS that this is a "fatal"
> error.  When did we start trusting the BIOS to give us accurate
> information?

When we merged ACPI handling.

> PCIe fatal means that the link or the device is broken. But that
> seems a poor reason to take down a large server that may have
> dozens of devices (some of them set up specifically to handle
> errors ... e.g. mirrored disks on separate controllers, or NIC
> devices that have been "bonded" together).
> 
> So, as long as the action for a "fatal" error is to mark a link
> down and offline the device, that seems a pretty reasonable course
> of action.
> 
> The argument gets a lot more marginal if you simply reset the
> link and re-enable the device to "fix" it. That might be enough,
> but I don't think the OS has enough data to make the call.

I'm not 100% satisfied with how the AER handler works, and how certain 
drivers (nvme!!!) interface with AER handling. But that is an argument 
that belongs in PCI code, and a fight I will fight with Bjorn and Keith. 
The issue we're having with Borislav and Rafael's estate is that we 
can't make it to PCI land.

I'm seeing here the same fight that I saw with firmware vs. OS, where 
the firmware wants to have control and the OS wants to have control. I 
saw the same with the ME/CSE/CSME team vs. the ChromeOS team, where the 
ME team did everything possible to make sure only they could access the 
boot vector and boot the processor, and the ChromeOS team couldn't use 
that approach because they wanted their own root of trust. I've seen 
this in other places as well, though confidentiality agreements prevent 
me from talking about it.

It's the issue of control, and it's a fact of life. Borislav and Rafael 
don't want to relinquish control until they can be 100% certain that 
going further will result in 100% recovery. That is a goal I aspire to 
as well, but an unachievable ideal nonetheless.

I thought the best compromise would be to be as close as possible to 
native handling. That is, if AER can't recover, we don't recover the 
device, but the machine keeps running. I think there's some deeper 
history to GHES handling, which I didn't take into consideration. The 
fight is to convince appropriate parties to share the responsibility in 
a way which doesn't kill the machine. We still have a ways to go until 
we get there.

Alex

> -Tony
> 
> P.S. I deliberately put "fatal" in quotes above because to
> quote "The Princess Bride" -- "that word, I do not think it
> means what you think it means". :-)
> 
Rafael J. Wysocki May 22, 2018, 6:13 p.m. UTC | #11
On Tue, May 22, 2018 at 3:38 PM, Alex G. <mr.nuke.me@gmail.com> wrote:
>
>
> On 05/22/2018 03:55 AM, Rafael J. Wysocki wrote:
>> On Mon, May 21, 2018 at 3:49 PM, Alexandru Gagniuc <mr.nuke.me@gmail.com> wrote:
>>> ghes_severity() is a misnomer in this case, as it implies the severity
>>> of the entire GHES structure. Instead, it maps one CPER value to a
>>> GHES_SEV* value. ghes_cper_severity() is clearer.
>>
>> It looks like the *real* reason for this change is that you
>> re-introduce ghes_severity() as a different function in the second
>> patch.
>
> /me holds fist at Borislav
>
>> There are a couple of reasons to avoid that, one of them being that
>> people will now have to remember what the function did in which kernel
>> versions.
>
> So?
>
>> Also, the current name is good enough IMO,
>
> Two other reviewers were extremely confused by the vague name, so no,
> this is not good enough.

Of course, you are free to have a differing opinion and I don't have
to convince you about my point.  You need to convince me about your
point to get the patch in through my tree, which you haven't done so
far.

>> so I'm not going to apply this patch.
Alex G. May 22, 2018, 6:19 p.m. UTC | #12
On 05/22/2018 01:10 PM, Rafael J. Wysocki wrote:
> On Tue, May 22, 2018 at 7:57 PM, Luck, Tony <tony.luck@intel.com> wrote:
>> On Tue, May 22, 2018 at 04:54:26PM +0200, Borislav Petkov wrote:
>>> I especially don't want to have the case where a PCIe error is *really*
>>> fatal and then we noodle in some handlers debating about the severity
>>> because it got marked as recoverable intermittently and end up causing
>>> data corruption on the storage device. Here's a real no-no for ya.
>>
>> All that we have is a message from the BIOS that this is a "fatal"
>> error.  When did we start trusting the BIOS to give us accurate
>> information?
> 
> Some time ago, actually.
> 
> This is about changing the existing behavior which has been to treat
> "fatal" errors reported by the BIOS as good enough reasons for a panic
> for quite a while AFAICS.

Yes, you are correct. I'd actually like to go deeper and remove the 
policy of panic()ing on fatal errors. Now, whether we blacklist or 
whitelist which errors can go through is up for debate, but the current 
policy seems broken.
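
The policy I mean is the check in ghes_proc(), visible in the hunk at
the bottom of this patch: anything the firmware marks fatal is escalated
straight to panic, before any per-section handler gets a chance to
attempt recovery.

	if (ghes_cper_severity(ghes->estatus->error_severity) >= GHES_SEV_PANIC)
		__ghes_panic(ghes);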

>> PCIe fatal means that the link or the device is broken.
> 
> And that may really mean that the component in question is on fire.
> We just don't know.

Should there be a physical fire, we have much bigger issues. At the same 
time, we could retrain the link, call the driver, and release freon gas 
to put out the fire. That's something we don't currently have the option 
to do.

>> But that seems a poor reason to take down a large server that may have
>> dozens of devices (some of them set up specifically to handle
>> errors ... e.g. mirrored disks on separate controllers, or NIC
>> devices that have been "bonded" together).
>>
>> So, as long as the action for a "fatal" error is to mark a link
>> down and offline the device, that seems a pretty reasonable course
>> of action.
>>
>> The argument gets a lot more marginal if you simply reset the
>> link and re-enable the device to "fix" it. That might be enough,
>> but I don't think the OS has enough data to make the call.
> 
> Again, that's about changing the existing behavior or the existing policy even.
> 
> What exactly has changed to make us consider this now?

Firmware started passing "fatal" GHES headers with the explicit intent 
of crashing an OS. At the same time, we've learnt how to handle these 
errors in a number of cases. With DPC (coming soon to firmware-first) 
the error is contained, and a non-issue.

As specs and hardware implementations evolve, we have to adapt. I'm here 
until November, and one of my goals is to involve linux upstream in the 
development of these features so that when the hardware hits the market, 
we're ready. That does mean we have to drop some of the silly things 
we're doing.

Alex

> Thanks,
> Rafael
> 
Alex G. May 22, 2018, 6:20 p.m. UTC | #13
On 05/22/2018 01:13 PM, Rafael J. Wysocki wrote:
(snip)
> Of course, you are free to have a differing opinion and I don't have
> to convince you about my point.  You need to convince me about your
> point to get the patch in through my tree, which you haven't done so
> far.

My point is that crossing your arms and saying "impress me" doesn't give 
me a good idea of how you'd like to see the problem resolved.

Alex
Tony Luck May 22, 2018, 6:33 p.m. UTC | #14
On Tue, May 22, 2018 at 08:10:47PM +0200, Rafael J. Wysocki wrote:
> > PCIe fatal means that the link or the device is broken.
> 
> And that may really mean that the component in question is on fire.
> We just don't know.

Components on fire could be the root cause of many errors. If we really
believe that is a problem, we should power the system off rather than
just calling panic() [not just for PCIe errors, but also for machine
checks, and perhaps a bunch of other places in the kernel].

True story: I used to work for Stratus Computer on fault tolerant
systems. A customer once called in with a "my computer is on fire"
report and asked what to do. The support person told them to power it
off. Customer asked "Isn't there something else? It's still running
just fine".

-Tony
Tony Luck May 22, 2018, 6:45 p.m. UTC | #15
On Tue, May 22, 2018 at 01:19:34PM -0500, Alex G. wrote:
> Firmware started passing "fatal" GHES headers with the explicit intent of
> crashing an OS. At the same time, we've learnt how to handle these errors in
> a number of cases. With DPC (coming soon to firmware-first) the error is
> contained, and a non-issue.

Perhaps DPC is the change that you need to emphasize as to
why things are different now, so we can change the default
Linux behavior.

With the h/w guaranteeing that corrupt data is contained, we
should be safe to disregard BIOS indications of "fatal" problems
that could be anything and might show up in unknown ways some
time later if we keep running.

-Tony
Alex G. May 22, 2018, 6:49 p.m. UTC | #16
On 05/22/2018 01:45 PM, Luck, Tony wrote:
> On Tue, May 22, 2018 at 01:19:34PM -0500, Alex G. wrote:
>> Firmware started passing "fatal" GHES headers with the explicit intent of
>> crashing an OS. At the same time, we've learnt how to handle these errors in
>> a number of cases. With DPC (coming soon to firmware-first) the error is
>> contained, and a non-issue.
> 
> Perhaps DPC is the change that you need to emphasize as to
> why things are different now, so we can change the default
> Linux behavior.
> 
> With the h/w guaranteeing that corrupt data is contained, we
> should be safe to disregard BIOS indications of "fatal" problems
> that could be anything and might show up in unknown ways some
> time later if we keep running.

Sure. DPC is much harder to contest as a reason. However, the AER path
also benefits from this change in behavior. I'm certain there are other
classes of errors that benefit from it as well, though I haven't had the
time or the inclination to look for them.

Alex
Rafael J. Wysocki May 22, 2018, 9:20 p.m. UTC | #17
On Tue, May 22, 2018 at 8:20 PM, Alex G. <mr.nuke.me@gmail.com> wrote:
> On 05/22/2018 01:13 PM, Rafael J. Wysocki wrote:
> (snip)
>>
>> Of course, you are free to have a differing opinion and I don't have
>> to convince you about my point.  You need to convince me about your
>> point to get the patch in through my tree, which you haven't done so
>> far.
>
>
> My point is that crossing your arms and saying "impress me" doesn't give me
> a good idea of how you'd like to see the problem resolved.

I do not recall having said "impress me" at any time during our
conversation so far.

Also, this problem space is not the one I'm focusing on, and I can't
tell you off the top of my head how to address the problem you want to
address in that space.  I can only look at what you post and decide
whether or not I agree with it.  It has to be well documented and
convincing enough, however, and it hasn't been so far in my view.  As
a rule, I'd rather not apply patches that I don't feel confident about.

Tony seems to agree with you, so maybe work with him on making your
case stronger.

Patch

diff --git a/drivers/acpi/apei/ghes.c b/drivers/acpi/apei/ghes.c
index 1efefe919555..7c1a16b106ba 100644
--- a/drivers/acpi/apei/ghes.c
+++ b/drivers/acpi/apei/ghes.c
@@ -271,7 +271,7 @@  static void ghes_fini(struct ghes *ghes)
 		unmap_gen_v2(ghes);
 }
 
-static inline int ghes_severity(int severity)
+static inline int ghes_cper_severity(int severity)
 {
 	switch (severity) {
 	case CPER_SEV_INFORMATIONAL:
@@ -388,7 +388,7 @@  static void ghes_handle_memory_failure(struct acpi_hest_generic_data *gdata, int
 #ifdef CONFIG_ACPI_APEI_MEMORY_FAILURE
 	unsigned long pfn;
 	int flags = -1;
-	int sec_sev = ghes_severity(gdata->error_severity);
+	int sec_sev = ghes_cper_severity(gdata->error_severity);
 	struct cper_sec_mem_err *mem_err = acpi_hest_get_payload(gdata);
 
 	if (!(mem_err->validation_bits & CPER_MEM_VALID_PA))
@@ -468,10 +468,10 @@  static void ghes_do_proc(struct ghes *ghes,
 	guid_t *fru_id = &NULL_UUID_LE;
 	char *fru_text = "";
 
-	sev = ghes_severity(estatus->error_severity);
+	sev = ghes_cper_severity(estatus->error_severity);
 	apei_estatus_for_each_section(estatus, gdata) {
 		sec_type = (guid_t *)gdata->section_type;
-		sec_sev = ghes_severity(gdata->error_severity);
+		sec_sev = ghes_cper_severity(gdata->error_severity);
 		if (gdata->validation_bits & CPER_SEC_VALID_FRU_ID)
 			fru_id = (guid_t *)gdata->fru_id;
 
@@ -512,7 +512,7 @@  static void __ghes_print_estatus(const char *pfx,
 	char pfx_seq[64];
 
 	if (pfx == NULL) {
-		if (ghes_severity(estatus->error_severity) <=
+		if (ghes_cper_severity(estatus->error_severity) <=
 		    GHES_SEV_CORRECTED)
 			pfx = KERN_WARNING;
 		else
@@ -534,7 +534,7 @@  static int ghes_print_estatus(const char *pfx,
 	static DEFINE_RATELIMIT_STATE(ratelimit_uncorrected, 5*HZ, 2);
 	struct ratelimit_state *ratelimit;
 
-	if (ghes_severity(estatus->error_severity) <= GHES_SEV_CORRECTED)
+	if (ghes_cper_severity(estatus->error_severity) <= GHES_SEV_CORRECTED)
 		ratelimit = &ratelimit_corrected;
 	else
 		ratelimit = &ratelimit_uncorrected;
@@ -705,9 +705,8 @@  static int ghes_proc(struct ghes *ghes)
 	if (rc)
 		goto out;
 
-	if (ghes_severity(ghes->estatus->error_severity) >= GHES_SEV_PANIC) {
+	if (ghes_cper_severity(ghes->estatus->error_severity) >= GHES_SEV_PANIC)
 		__ghes_panic(ghes);
-	}
 
 	if (!ghes_estatus_cached(ghes->estatus)) {
 		if (ghes_print_estatus(NULL, ghes->generic, ghes->estatus))
@@ -945,7 +944,7 @@  static int ghes_notify_nmi(unsigned int cmd, struct pt_regs *regs)
 			ret = NMI_HANDLED;
 		}
 
-		sev = ghes_severity(ghes->estatus->error_severity);
+		sev = ghes_cper_severity(ghes->estatus->error_severity);
 		if (sev >= GHES_SEV_PANIC) {
 			oops_begin();
 			ghes_print_queued_estatus();