Message ID | 20180521135003.32459-2-mr.nuke.me@gmail.com (mailing list archive) |
---|---|
State | Rejected, archived |
Headers | show |
On Mon, May 21, 2018 at 3:49 PM, Alexandru Gagniuc <mr.nuke.me@gmail.com> wrote: > ghes_severity() is a misnomer in this case, as it implies the severity > of the entire GHES structure. Instead, it maps one CPER value to a > GHES_SEV* value. ghes_cper_severity() is clearer. It looks like the *real* reason for this change is that you re-introduce ghes_severity() as a different function in the second patch. There are a couple of reasons to avoid that, one of them being that people will now have to remember what the function did in which kernel versions. Also, the current name is good enough IMO, so I'm not going to apply this patch. -- To unsubscribe from this list: send the line "unsubscribe linux-acpi" in the body of a message to majordomo@vger.kernel.org More majordomo info at http://vger.kernel.org/majordomo-info.html
On 05/22/2018 03:55 AM, Rafael J. Wysocki wrote: > On Mon, May 21, 2018 at 3:49 PM, Alexandru Gagniuc <mr.nuke.me@gmail.com> wrote: >> ghes_severity() is a misnomer in this case, as it implies the severity >> of the entire GHES structure. Instead, it maps one CPER value to a >> GHES_SEV* value. ghes_cper_severity() is clearer. > > It looks like the *real* reason for this change is that you > re-introduce ghes_severity() as a different function in the second > patch. /me holds fist at Borislav > There are a couple of reasons to avoid that, one of them being that > people will now have to remember what the function did in which kernel > versions. So? > Also, the current name is good enough IMO, Two other reviewers were extremely confused by the vague name, so no, this is not good enough. > so I'm not going to apply this patch. -- To unsubscribe from this list: send the line "unsubscribe linux-acpi" in the body of a message to majordomo@vger.kernel.org More majordomo info at http://vger.kernel.org/majordomo-info.html
On Tue, May 22, 2018 at 08:38:39AM -0500, Alex G. wrote: > > It looks like the *real* reason for this change is that you > > re-introduce ghes_severity() as a different function in the second > > patch. > > /me holds fist at Borislav That was a misunderstanding with Rafael and me - we fixed it on IRC. But this is not the real problem with your approach - it is the marking of all PCIe errors as recoverable, regardless of the signature. That's a no-no, IMO.
On 05/22/2018 08:50 AM, Borislav Petkov wrote: > On Tue, May 22, 2018 at 08:38:39AM -0500, Alex G. wrote: >>> It looks like the *real* reason for this change is that you >>> re-introduce ghes_severity() as a different function in the second >>> patch. >> >> /me holds fist at Borislav > > That was a misunderstanding with Rafael and me - we fixed it on IRC. You mean to say this whole time I've been struggling to write emails, there was an IRC? > But this is not the real problem with your approach - it is the marking > of all PCIe errors as recoverable, regardless of the signature. That's a > no-no, IMO. No, the problem is with the current approach, not with mine. The problem is trying to handle the error outside of the existing handler. That's a no-no, IMO. Alex -- To unsubscribe from this list: send the line "unsubscribe linux-acpi" in the body of a message to majordomo@vger.kernel.org More majordomo info at http://vger.kernel.org/majordomo-info.html
On Tue, May 22, 2018 at 09:39:15AM -0500, Alex G. wrote: > No, the problem is with the current approach, not with mine. The problem > is trying to handle the error outside of the existing handler. That's a > no-no, IMO. Let me save you some time: until you come up with a proper solution for *all* PCIe errors so that the kernel can correctly decide what to do for each error based on its actual severity, consider this NAKed. I don't care about outside or inside of the handler - this thing needs to be done properly and not just to serve your particular use case of abrupt removal of devices causing PCIe errors, and punish the rest. I especially don't want to have the case where a PCIe error is *really* fatal and then we noodle in some handlers debating about the severity because it got marked as recoverable intermittently and end up causing data corruption on the storage device. Here's a real no-no for ya.
On 05/22/2018 09:54 AM, Borislav Petkov wrote: > On Tue, May 22, 2018 at 09:39:15AM -0500, Alex G. wrote: >> No, the problem is with the current approach, not with mine. The problem >> is trying to handle the error outside of the existing handler. That's a >> no-no, IMO. > > Let me save you some time: until you come up with a proper solution for > *all* PCIe errors so that the kernel can correctly decide what to do for > each error based on its actual severity, consider this NAKed. I do have a proper solution for _all_ PCIe errors. In fact, we discussed several valid approaches already. > I don't care about outside or inside of the handler I do. I have a handler that can handle (no pun intended) errors. I want to use the same code path in native and GHES cases. If I allow ghes.c to take different decisions than what aer_do_recovery() would, I've failed. >- this thing needs to be done properly Exactly! > and not just to serve your particular use case of > abrupt removal of devices causing PCIe errors, and punish the rest. I think you're confused about what I'm actually trying to do. Or maybe you're confused about how PCIe errors work. That's understandable. PCIe uses the term "fatal" for errors that may make the link unusable, and which may require a link reset, and in most other specs "fatal" means "on fire". I understand your confusion, and I hope I cleared it up. You're trying to make the case that surprise removal is my only concern and use case, because that's the example that I gave. It makes your argument stronger, but it's wrong. You don't know our test setup, and all the things I'm testing for, and whenever I try to tell you, you fall back to the 'surprise removal' example. I don't know why you'd think Dell would pay me to work on this if I were to allow things like silent data corruption to creep in. This isn't a traditional company from Redmond, Washington. > I especially don't want to have the case where a PCIe error is *really* > fatal and then we noodle in some handlers debating about the severity > because it got marked as recoverable intermittently and end up causing > data corruption on the storage device. Here's a real no-no for ya. I especially don't want a kernel maintainer who hasn't even read the recovery handler (let alone the spec around which the handler was written) tell me how the recovery handler works and what it's supposed to do (see, I can be an ass). PCIe errors really are fatal. They might need to unload the driver and remove the device. But somebody set the questionable policy that "fatal"=="panic", and that is completely inappropriate for a larger class of errors -- PCIe happens to be the easiest example to pick on. And even you realize that the argument that a panic() will somehow prevent data corruption is complete noodle sauce. Alex -- To unsubscribe from this list: send the line "unsubscribe linux-acpi" in the body of a message to majordomo@vger.kernel.org More majordomo info at http://vger.kernel.org/majordomo-info.html
On Tue, May 22, 2018 at 10:22:19AM -0500, Alex G. wrote: > I especially don't want a kernel maintainer who hasn't even read the > recovery handler (let alone the spec around which the handler was > written) tell me how the recovery handler works and what it's supposed > to do (see, I can be an ass). Well, since you're such an ass and since you know better, go find someone else to review your crap. I'm done wasting time with you.
On Tue, May 22, 2018 at 04:54:26PM +0200, Borislav Petkov wrote: > I especially don't want to have the case where a PCIe error is *really* > fatal and then we noodle in some handlers debating about the severity > because it got marked as recoverable intermittently and end up causing > data corruption on the storage device. Here's a real no-no for ya. All that we have is a message from the BIOS that this is a "fatal" error. When did we start trusting the BIOS to give us accurate information? PCIe fatal means that the link or the device is broken. But that seems a poor reason to take down a large server that may have dozens of devices (some of them set up specifically to handle errors ... e.g. mirrored disks on separate controllers, or NIC devices that have been "bonded" together). So, as long as the action for a "fatal" error is to mark a link down and offline the device, that seems a pretty reasonable course of action. The argument gets a lot more marginal if you simply reset the link and re-enable the device to "fix" it. That might be enough, but I don't think the OS has enough data to make the call. -Tony P.S. I deliberately put "fatal" in quotes above because to quote "The Princess Bride" -- "that word, I do not think it means what you think it means". :-) -- To unsubscribe from this list: send the line "unsubscribe linux-acpi" in the body of a message to majordomo@vger.kernel.org More majordomo info at http://vger.kernel.org/majordomo-info.html
On Tue, May 22, 2018 at 7:57 PM, Luck, Tony <tony.luck@intel.com> wrote: > On Tue, May 22, 2018 at 04:54:26PM +0200, Borislav Petkov wrote: >> I especially don't want to have the case where a PCIe error is *really* >> fatal and then we noodle in some handlers debating about the severity >> because it got marked as recoverable intermittently and end up causing >> data corruption on the storage device. Here's a real no-no for ya. > > All that we have is a message from the BIOS that this is a "fatal" > error. When did we start trusting the BIOS to give us accurate > information? Some time ago, actually. This is about changing the existing behavior which has been to treat "fatal" errors reported by the BIOS as good enough reasons for a panic for quite a while AFAICS. > PCIe fatal means that the link or the device is broken. And that may really mean that the component in question is on fire. We just don't know. > But that seems a poor reason to take down a large server that may have > dozens of devices (some of them set up specifically to handle > errors ... e.g. mirrored disks on separate controllers, or NIC > devices that have been "bonded" together). > > So, as long as the action for a "fatal" error is to mark a link > down and offline the device, that seems a pretty reasonable course > of action. > > The argument gets a lot more marginal if you simply reset the > link and re-enable the device to "fix" it. That might be enough, > but I don't think the OS has enough data to make the call. Again, that's about changing the existing behavior or the existing policy even. What exactly has changed to make us consider this now? Thanks, Rafael -- To unsubscribe from this list: send the line "unsubscribe linux-acpi" in the body of a message to majordomo@vger.kernel.org More majordomo info at http://vger.kernel.org/majordomo-info.html
On 05/22/2018 12:57 PM, Luck, Tony wrote: > On Tue, May 22, 2018 at 04:54:26PM +0200, Borislav Petkov wrote: >> I especially don't want to have the case where a PCIe error is *really* >> fatal and then we noodle in some handlers debating about the severity >> because it got marked as recoverable intermittently and end up causing >> data corruption on the storage device. Here's a real no-no for ya. > > All that we have is a message from the BIOS that this is a "fatal" > error. When did we start trusting the BIOS to give us accurate > information? When we merged ACPI handling. > PCIe fatal means that the link or the device is broken. But that > seems a poor reason to take down a large server that may have > dozens of devices (some of them set up specifically to handle > errors ... e.g. mirrored disks on separate controllers, or NIC > devices that have been "bonded" together). > > So, as long as the action for a "fatal" error is to mark a link > down and offline the device, that seems a pretty reasonable course > of action. > > The argument gets a lot more marginal if you simply reset the > link and re-enable the device to "fix" it. That might be enough, > but I don't think the OS has enough data to make the call. I'm not 100% satisfied with how AER handler works, and how certain drivers (nvme!!!) interface with AER handling. But this is an arguments that belongs in PCI code, and a fight I will fight with Bjorn and Keith. The issue we're having with Borislav and Rafael's estate is that we can't make it to PCI land. I'm seeing here the same fight that I saw with firmware vs OS, where fw wants to have control, and OS wants to have control. I saw the same with ME/CSE/CSME team vs ChromeOS team, where ME team did everything possible to make sure only they can access the boot vector and boot the processor, and ChromeOS team couldn't use this approach because they wanted their own root of trust. I've seen this in other places as well, though confidentiality agreements prevent me from talking about it. It's the issue of control, and it's a fact of life. Borislav and Rafael don't want to relinquish control until they can be 100% certain that going further will result in 100% recovery. That is a goal I aspire to as well, but an unachievable ideal nonetheless. I thought the best compromise would be to be as close as possible to native handling. That is, if AER can't recover, we don't recover the device, but the machine keeps running. I think there's some deeper history to GHES handling, which I didn't take into consideration. The fight is to convince appropriate parties to share the responsibility in a way which doesn't kill the machine. We still have a ways to go until we get there. Alex > -Tony > > P.S. I deliberately put "fatal" in quotes above because to > quote "The Princess Bride" -- "that word, I do not think it > means what you think it means". :-) > -- To unsubscribe from this list: send the line "unsubscribe linux-acpi" in the body of a message to majordomo@vger.kernel.org More majordomo info at http://vger.kernel.org/majordomo-info.html
On Tue, May 22, 2018 at 3:38 PM, Alex G. <mr.nuke.me@gmail.com> wrote: > > > On 05/22/2018 03:55 AM, Rafael J. Wysocki wrote: >> On Mon, May 21, 2018 at 3:49 PM, Alexandru Gagniuc <mr.nuke.me@gmail.com> wrote: >>> ghes_severity() is a misnomer in this case, as it implies the severity >>> of the entire GHES structure. Instead, it maps one CPER value to a >>> GHES_SEV* value. ghes_cper_severity() is clearer. >> >> It looks like the *real* reason for this change is that you >> re-introduce ghes_severity() as a different function in the second >> patch. > > /me holds fist at Borislav > >> There are a couple of reasons to avoid that, one of them being that >> people will now have to remember what the function did in which kernel >> versions. > > So? > >> Also, the current name is good enough IMO, > > Two other reviewers were extremely confused by the vague name, so no, > this is not good enough. Of course, you are free to have a differing opinion and I don't have to convince you about my point. You need to convince me about your point to get the patch in through my tree, which you haven't done so far. >> so I'm not going to apply this patch. -- To unsubscribe from this list: send the line "unsubscribe linux-acpi" in the body of a message to majordomo@vger.kernel.org More majordomo info at http://vger.kernel.org/majordomo-info.html
On 05/22/2018 01:10 PM, Rafael J. Wysocki wrote: > On Tue, May 22, 2018 at 7:57 PM, Luck, Tony <tony.luck@intel.com> wrote: >> On Tue, May 22, 2018 at 04:54:26PM +0200, Borislav Petkov wrote: >>> I especially don't want to have the case where a PCIe error is *really* >>> fatal and then we noodle in some handlers debating about the severity >>> because it got marked as recoverable intermittently and end up causing >>> data corruption on the storage device. Here's a real no-no for ya. >> >> All that we have is a message from the BIOS that this is a "fatal" >> error. When did we start trusting the BIOS to give us accurate >> information? > > Some time ago, actually. > > This is about changing the existing behavior which has been to treat > "fatal" errors reported by the BIOS as good enough reasons for a panic > for quite a while AFAICS. Yes, you are correct. I'd actually like to go deeper, and remove the policy to panic() on fatal errors. Now whether we blacklist or whitelist which errors can go through is up for debate, but the current policy seems broken. >> PCIe fatal means that the link or the device is broken. > > And that may really mean that the component in question is on fire. > We just don't know. Should there be a physical fire, we have much bigger issues. At the same time, we could retrain the link, call the driver, and release freon gas to put out the fire. That's something we don't currently have the option to do. >> But that seems a poor reason to take down a large server that may have >> dozens of devices (some of them set up specifically to handle >> errors ... e.g. mirrored disks on separate controllers, or NIC >> devices that have been "bonded" together). >> >> So, as long as the action for a "fatal" error is to mark a link >> down and offline the device, that seems a pretty reasonable course >> of action. >> >> The argument gets a lot more marginal if you simply reset the >> link and re-enable the device to "fix" it. That might be enough, >> but I don't think the OS has enough data to make the call. > > Again, that's about changing the existing behavior or the existing policy even. > > What exactly has changed to make us consider this now? Firmware started passing "fatal" GHES headers with the explicit intent of crashing an OS. At the same time, we've learnt how to handle these errors in a number of cases. With DPC (coming soon to firmware-first) the error is contained, and a non-issue. As specs and hardware implementations evolve, we have to adapt. I'm here until November, and one of my goals is to involve linux upstream in the development of these features so that when the hardware hits the market, we're ready. That does mean we have to drop some of the silly things we're doing. Alex > Thanks, > Rafael > -- To unsubscribe from this list: send the line "unsubscribe linux-acpi" in the body of a message to majordomo@vger.kernel.org More majordomo info at http://vger.kernel.org/majordomo-info.html
On 05/22/2018 01:13 PM, Rafael J. Wysocki wrote: (snip) > Of course, you are free to have a differing opinion and I don't have > to convince you about my point. You need to convince me about your > point to get the patch in through my tree, which you haven't done so > far. My point is that crossing your arms and saying "impress me" doesn't give me a good idea of how you'd like to see the problem resolved. Alex -- To unsubscribe from this list: send the line "unsubscribe linux-acpi" in the body of a message to majordomo@vger.kernel.org More majordomo info at http://vger.kernel.org/majordomo-info.html
On Tue, May 22, 2018 at 08:10:47PM +0200, Rafael J. Wysocki wrote: > > PCIe fatal means that the link or the device is broken. > > And that may really mean that the component in question is on fire. > We just don't know. Components on fire could be the root cause of many errors. If we really believe that is a problem we should power the system off rather than just calling panic() [not just for PCIe errors, but also for machine checks, and perhaps a bunch of other places in the kernel]. True story: I used to work for Stratus Computer on fault tolerant systems. A customer once called in with a "my computer is on fire" report and asked what to do. The support person told them to power it off. Customer asked "Isn't there something else? It's still running just fine". -Tony -- To unsubscribe from this list: send the line "unsubscribe linux-acpi" in the body of a message to majordomo@vger.kernel.org More majordomo info at http://vger.kernel.org/majordomo-info.html
On Tue, May 22, 2018 at 01:19:34PM -0500, Alex G. wrote: > Firmware started passing "fatal" GHES headers with the explicit intent of > crashing an OS. At the same time, we've learnt how to handle these errors in > a number of cases. With DPC (coming soon to firmware-first) the error is > contained, and a non-issue. Perhaps DPC is the change that you need to emphasize as to why things are different now, so we can change the default Linux behavior. With the h/w guaranteeing that corrupt data is contained, we should be safe to disregard BIOS indications of "fatal" problems that could be anything and might show up in unknown ways some time later if we keep running. -Tony -- To unsubscribe from this list: send the line "unsubscribe linux-acpi" in the body of a message to majordomo@vger.kernel.org More majordomo info at http://vger.kernel.org/majordomo-info.html
On 05/22/2018 01:45 PM, Luck, Tony wrote: > On Tue, May 22, 2018 at 01:19:34PM -0500, Alex G. wrote: >> Firmware started passing "fatal" GHES headers with the explicit intent of >> crashing an OS. At the same time, we've learnt how to handle these errors in >> a number of cases. With DPC (coming soon to firmware-first) the error is >> contained, and a non-issue. > > Perhaps DPC is the change that you need to emphasize as to > why things are different now, so we can change the default > Linux behavior. > > With the h/w guaranteeing that corrupt data is contained, we > should be safe to disregard BIOS indications of "fatal" problems > that could be anything and might show up in unknown ways some > time later if we keep running. Sure. DPC is much harder to contest as a reason. However, the AER path benefits as well from this change in behavior. I'm certain there are other classes of errors that benefit as well from the change, though I haven't had the time or the inclination to look for them. Alex -- To unsubscribe from this list: send the line "unsubscribe linux-acpi" in the body of a message to majordomo@vger.kernel.org More majordomo info at http://vger.kernel.org/majordomo-info.html
On Tue, May 22, 2018 at 8:20 PM, Alex G. <mr.nuke.me@gmail.com> wrote: > On 05/22/2018 01:13 PM, Rafael J. Wysocki wrote: > (snip) >> >> Of course, you are free to have a differing opinion and I don't have >> to convince you about my point. You need to convince me about your >> point to get the patch in through my tree, which you haven't done so >> far. > > > My point is that crossing your arms and saying "impress me" doesn't give me > a good idea of how you'd like to see the problem resolved. I do not recall having said "impress me" at any time during our conversation so far. Also this problem space is not the one I'm focusing on and I can't tell you how to address the problem you want to address in that space from the top of my head. I can only look at what you post and decide whether or not I agree with that. It has to be well documented and convincing enough, however, and it hasn't been so far in my view. As a rule, I'd rather not apply patches that I don't feel confident of. Tony seems to agree with you, so maybe work with him on making your case stronger. -- To unsubscribe from this list: send the line "unsubscribe linux-acpi" in the body of a message to majordomo@vger.kernel.org More majordomo info at http://vger.kernel.org/majordomo-info.html
diff --git a/drivers/acpi/apei/ghes.c b/drivers/acpi/apei/ghes.c index 1efefe919555..7c1a16b106ba 100644 --- a/drivers/acpi/apei/ghes.c +++ b/drivers/acpi/apei/ghes.c @@ -271,7 +271,7 @@ static void ghes_fini(struct ghes *ghes) unmap_gen_v2(ghes); } -static inline int ghes_severity(int severity) +static inline int ghes_cper_severity(int severity) { switch (severity) { case CPER_SEV_INFORMATIONAL: @@ -388,7 +388,7 @@ static void ghes_handle_memory_failure(struct acpi_hest_generic_data *gdata, int #ifdef CONFIG_ACPI_APEI_MEMORY_FAILURE unsigned long pfn; int flags = -1; - int sec_sev = ghes_severity(gdata->error_severity); + int sec_sev = ghes_cper_severity(gdata->error_severity); struct cper_sec_mem_err *mem_err = acpi_hest_get_payload(gdata); if (!(mem_err->validation_bits & CPER_MEM_VALID_PA)) @@ -468,10 +468,10 @@ static void ghes_do_proc(struct ghes *ghes, guid_t *fru_id = &NULL_UUID_LE; char *fru_text = ""; - sev = ghes_severity(estatus->error_severity); + sev = ghes_cper_severity(estatus->error_severity); apei_estatus_for_each_section(estatus, gdata) { sec_type = (guid_t *)gdata->section_type; - sec_sev = ghes_severity(gdata->error_severity); + sec_sev = ghes_cper_severity(gdata->error_severity); if (gdata->validation_bits & CPER_SEC_VALID_FRU_ID) fru_id = (guid_t *)gdata->fru_id; @@ -512,7 +512,7 @@ static void __ghes_print_estatus(const char *pfx, char pfx_seq[64]; if (pfx == NULL) { - if (ghes_severity(estatus->error_severity) <= + if (ghes_cper_severity(estatus->error_severity) <= GHES_SEV_CORRECTED) pfx = KERN_WARNING; else @@ -534,7 +534,7 @@ static int ghes_print_estatus(const char *pfx, static DEFINE_RATELIMIT_STATE(ratelimit_uncorrected, 5*HZ, 2); struct ratelimit_state *ratelimit; - if (ghes_severity(estatus->error_severity) <= GHES_SEV_CORRECTED) + if (ghes_cper_severity(estatus->error_severity) <= GHES_SEV_CORRECTED) ratelimit = &ratelimit_corrected; else ratelimit = &ratelimit_uncorrected; @@ -705,9 +705,8 @@ static int ghes_proc(struct ghes *ghes) if (rc) goto out; - if (ghes_severity(ghes->estatus->error_severity) >= GHES_SEV_PANIC) { + if (ghes_cper_severity(ghes->estatus->error_severity) >= GHES_SEV_PANIC) __ghes_panic(ghes); - } if (!ghes_estatus_cached(ghes->estatus)) { if (ghes_print_estatus(NULL, ghes->generic, ghes->estatus)) @@ -945,7 +944,7 @@ static int ghes_notify_nmi(unsigned int cmd, struct pt_regs *regs) ret = NMI_HANDLED; } - sev = ghes_severity(ghes->estatus->error_severity); + sev = ghes_cper_severity(ghes->estatus->error_severity); if (sev >= GHES_SEV_PANIC) { oops_begin(); ghes_print_queued_estatus();
ghes_severity() is a misnomer in this case, as it implies the severity of the entire GHES structure. Instead, it maps one CPER value to a GHES_SEV* value. ghes_cper_severity() is clearer. Signed-off-by: Alexandru Gagniuc <mr.nuke.me@gmail.com> --- drivers/acpi/apei/ghes.c | 17 ++++++++--------- 1 file changed, 8 insertions(+), 9 deletions(-)