Message ID | 20170607184947.18733-1-toshi.kani@hpe.com (mailing list archive) |
---|---|
State | Accepted |
Commit | 56b47fe65792 |
Headers | show |
On Wed, Jun 7, 2017 at 11:49 AM, Toshi Kani <toshi.kani@hpe.com> wrote: > ACPI 6.2 defines a new ACPI notification value to NVDIMM Root Device > in Table 5-169. > > 0x81 Unconsumed Uncorrectable Memory Error Detected > Used to pro-actively notify OSPM of uncorrectable memory errors > detected (for example a memory scrubbing engine that continuously > scans the NVDIMMs memory). This is an optional notification. Only > locations that were mapped in to SPA by the platform will generate > a notification. > > Add support of this notification value by initiating an ARS scan. This > will find new error locations and add their badblocks information. > > Link: http://www.uefi.org/sites/default/files/resources/ACPI_6_2.pdf > Signed-off-by: Toshi Kani <toshi.kani@hpe.com> > Cc: Dan Williams <dan.j.williams@intel.com> > Cc: Rafael J. Wysocki <rjw@rjwysocki.net> > Cc: Vishal Verma <vishal.l.verma@intel.com> > --- > drivers/acpi/nfit/core.c | 28 ++++++++++++++++++++++------ > drivers/acpi/nfit/nfit.h | 1 + > 2 files changed, 23 insertions(+), 6 deletions(-) > > diff --git a/drivers/acpi/nfit/core.c b/drivers/acpi/nfit/core.c > index 656acb5..cc22778 100644 > --- a/drivers/acpi/nfit/core.c > +++ b/drivers/acpi/nfit/core.c > @@ -2967,7 +2967,7 @@ static int acpi_nfit_remove(struct acpi_device *adev) > return 0; > } > > -void __acpi_nfit_notify(struct device *dev, acpi_handle handle, u32 event) > +static void acpi_nfit_update_notify(struct device *dev, acpi_handle handle) > { > struct acpi_nfit_desc *acpi_desc = dev_get_drvdata(dev); > struct acpi_buffer buf = { ACPI_ALLOCATE_BUFFER, NULL }; > @@ -2975,11 +2975,6 @@ void __acpi_nfit_notify(struct device *dev, acpi_handle handle, u32 event) > acpi_status status; > int ret; > > - dev_dbg(dev, "%s: event: %d\n", __func__, event); > - > - if (event != NFIT_NOTIFY_UPDATE) > - return; > - > if (!dev->driver) { > /* dev->driver may be null if we're being removed */ > dev_dbg(dev, "%s: no driver found for dev\n", __func__); > @@ -3016,6 +3011,27 @@ void __acpi_nfit_notify(struct device *dev, acpi_handle handle, u32 event) > dev_err(dev, "Invalid _FIT\n"); > kfree(buf.pointer); > } > + > +static void acpi_nfit_uc_error_notify(struct device *dev, acpi_handle handle) > +{ > + struct acpi_nfit_desc *acpi_desc = dev_get_drvdata(dev); > + > + acpi_nfit_ars_rescan(acpi_desc); I wonder if we should gate re-scanning with a similar: if (acpi_desc->scrub_mode == HW_ERROR_SCRUB_ON) ...check that we do in the mce notification case? Maybe not since we don't get an indication of where the error is without a rescan. However, at a minimum I think we need support for the new Start ARS flag ("If set to 1 the firmware shall return data from a previous scrub, if any, without starting a new scrub") and use that for this case. Another thing that seems to be missing in both this and the mce case is a notification to userspace that something changed. We have calls to sysfs_notify_dirent() to notify scrub completion events and DIMM health status change events, I think we need a similar notifier mechanism for new un-correctable errors.
On Wed, 2017-06-07 at 12:09 -0700, Dan Williams wrote: > On Wed, Jun 7, 2017 at 11:49 AM, Toshi Kani <toshi.kani@hpe.com> > wrote: : > > + > > +static void acpi_nfit_uc_error_notify(struct device *dev, > > acpi_handle handle) > > +{ > > + struct acpi_nfit_desc *acpi_desc = dev_get_drvdata(dev); > > + > > + acpi_nfit_ars_rescan(acpi_desc); > > I wonder if we should gate re-scanning with a similar: > > if (acpi_desc->scrub_mode == HW_ERROR_SCRUB_ON) > > ...check that we do in the mce notification case? Maybe not since we > don't get an indication of where the error is without a rescan. I think this mce case is different since the MCE handler already knows where the new poison location is and can update badblocks information for it. Starting ARS is an optional precaution. > However, at a minimum I think we need support for the new Start ARS > flag ("If set to 1 the firmware shall return data from a previous > scrub, if any, without starting a new scrub") and use that for this > case. That's an interesting idea. But I wonder how users know if it is OK to set this flag as it relies on BIOS implementation that is not described in ACPI... > Another thing that seems to be missing in both this and the mce case > is a notification to userspace that something changed. We have calls > to sysfs_notify_dirent() to notify scrub completion events and DIMM > health status change events, I think we need a similar notifier > mechanism for new un-correctable errors. Good point. I think this can be a badblocks population event, which gets generated when badblocks information is updated at boot-time and run-time via this notification and MCE. Thanks, -Toshi
On Wed, Jun 7, 2017 at 1:57 PM, Kani, Toshimitsu <toshi.kani@hpe.com> wrote: > On Wed, 2017-06-07 at 12:09 -0700, Dan Williams wrote: >> On Wed, Jun 7, 2017 at 11:49 AM, Toshi Kani <toshi.kani@hpe.com> >> wrote: > : >> > + >> > +static void acpi_nfit_uc_error_notify(struct device *dev, >> > acpi_handle handle) >> > +{ >> > + struct acpi_nfit_desc *acpi_desc = dev_get_drvdata(dev); >> > + >> > + acpi_nfit_ars_rescan(acpi_desc); >> >> I wonder if we should gate re-scanning with a similar: >> >> if (acpi_desc->scrub_mode == HW_ERROR_SCRUB_ON) >> >> ...check that we do in the mce notification case? Maybe not since we >> don't get an indication of where the error is without a rescan. > > I think this mce case is different since the MCE handler already knows > where the new poison location is and can update badblocks information > for it. Starting ARS is an optional precaution. > >> However, at a minimum I think we need support for the new Start ARS >> flag ("If set to 1 the firmware shall return data from a previous >> scrub, if any, without starting a new scrub") and use that for this >> case. > > That's an interesting idea. But I wonder how users know if it is OK to > set this flag as it relies on BIOS implementation that is not described > in ACPI... Ugh, you're right. We might need a revision-id=2 version of Start ARS so software knows that the BIOS is aware of the new flag.
On Wed, 2017-06-07 at 14:06 -0700, Dan Williams wrote: > On Wed, Jun 7, 2017 at 1:57 PM, Kani, Toshimitsu <toshi.kani@hpe.com> > wrote: > > On Wed, 2017-06-07 at 12:09 -0700, Dan Williams wrote: > > > On Wed, Jun 7, 2017 at 11:49 AM, Toshi Kani <toshi.kani@hpe.com> > > > wrote: > > > > : > > > > + > > > > +static void acpi_nfit_uc_error_notify(struct device *dev, > > > > acpi_handle handle) > > > > +{ > > > > + struct acpi_nfit_desc *acpi_desc = > > > > dev_get_drvdata(dev); > > > > + > > > > + acpi_nfit_ars_rescan(acpi_desc); > > > > > > I wonder if we should gate re-scanning with a similar: > > > > > > if (acpi_desc->scrub_mode == HW_ERROR_SCRUB_ON) > > > > > > ...check that we do in the mce notification case? Maybe not since > > > we > > > don't get an indication of where the error is without a rescan. > > > > I think this mce case is different since the MCE handler already > > knows where the new poison location is and can update badblocks > > information for it. Starting ARS is an optional precaution. > > > > > However, at a minimum I think we need support for the new Start > > > ARS flag ("If set to 1 the firmware shall return data from a > > > previous scrub, if any, without starting a new scrub") and use > > > that for this case. > > > > That's an interesting idea. But I wonder how users know if it is > > OK to set this flag as it relies on BIOS implementation that is not > > described in ACPI... > > Ugh, you're right. We might need a revision-id=2 version of Start ARS > so software knows that the BIOS is aware of the new flag. My bad. Looking at ACPI 6.2, it actually defines what you described. Start ARS now defines bit[1] in Flags which can be set to avoid scanning for this notification. I will update the patch to set this flag when HW_ERROR_SCRUB_ON is not set. Thanks, -Toshi
On 6/7/2017 5:33 PM, Kani, Toshimitsu wrote: > On Wed, 2017-06-07 at 14:06 -0700, Dan Williams wrote: >> On Wed, Jun 7, 2017 at 1:57 PM, Kani, Toshimitsu <toshi.kani@hpe.com> >> wrote: >>> On Wed, 2017-06-07 at 12:09 -0700, Dan Williams wrote: >>>> On Wed, Jun 7, 2017 at 11:49 AM, Toshi Kani <toshi.kani@hpe.com> >>>> wrote: >>> >>> : >>>>> + >>>>> +static void acpi_nfit_uc_error_notify(struct device *dev, >>>>> acpi_handle handle) >>>>> +{ >>>>> + struct acpi_nfit_desc *acpi_desc = >>>>> dev_get_drvdata(dev); >>>>> + >>>>> + acpi_nfit_ars_rescan(acpi_desc); >>>> >>>> I wonder if we should gate re-scanning with a similar: >>>> >>>> if (acpi_desc->scrub_mode == HW_ERROR_SCRUB_ON) >>>> >>>> ...check that we do in the mce notification case? Maybe not since >>>> we >>>> don't get an indication of where the error is without a rescan. >>> >>> I think this mce case is different since the MCE handler already >>> knows where the new poison location is and can update badblocks >>> information for it. Starting ARS is an optional precaution. >>> >>>> However, at a minimum I think we need support for the new Start >>>> ARS flag ("If set to 1 the firmware shall return data from a >>>> previous scrub, if any, without starting a new scrub") and use >>>> that for this case. >>> >>> That's an interesting idea. But I wonder how users know if it is >>> OK to set this flag as it relies on BIOS implementation that is not >>> described in ACPI... >> >> Ugh, you're right. We might need a revision-id=2 version of Start ARS >> so software knows that the BIOS is aware of the new flag. > > My bad. Looking at ACPI 6.2, it actually defines what you described. > Start ARS now defines bit[1] in Flags which can be set to avoid > scanning for this notification. I will update the patch to set this > flag when HW_ERROR_SCRUB_ON is not set. Wasn't Dan concerned about how the OS can know whether the FW supports that bit in the Start ARS? The Query ARS Capabilities DSM has a bit that tells the OS whether the platform supports the notification and the point of the notification was to tell the OS it could do a Start ARS with bit 1 set. Of course, if you get the notification then that means the platform has the capability to deliver it, but it might not hurt to check the flag from the Query Capabilities bit. If the spec isn't clear about the relationship between the bits, we could clarify that in the spec without bumping the revision number for the call. We could also check to see what a platform does if it gets a flag it doesn't know about. Would that be an Invalid Input Parameter? -- ljk > > Thanks, > -Toshi��칻�&�~�&���+-��ݶ��w��˛���m�b��Zr����^n�r���z���h����&���G���h�(�階�ݢj"���m�����z�ޖ���f���h���~�m� >
On Thu, Jun 8, 2017 at 10:30 AM, Linda Knippers <linda.knippers@hpe.com> wrote: [..] > Wasn't Dan concerned about how the OS can know whether the FW supports > that bit in the Start ARS? > > The Query ARS Capabilities DSM has a bit that tells the OS whether the > platform supports the notification and the point of the notification was > to tell the OS it could do a Start ARS with bit 1 set. Of course, if > you get the notification then that means the platform has the capability > to deliver it, but it might not hurt to check the flag from the Query > Capabilities bit. Good point, yes, I think it is safe to assume that a BIOS that claims to support un-correctable error notification also supports this Start ARS flag.
On Thu, 2017-06-08 at 10:34 -0700, Dan Williams wrote: > On Thu, Jun 8, 2017 at 10:30 AM, Linda Knippers <linda.knippers@hpe.c > om> wrote: > [..] > > Wasn't Dan concerned about how the OS can know whether the FW > > supports that bit in the Start ARS? > > > > The Query ARS Capabilities DSM has a bit that tells the OS whether > > the platform supports the notification and the point of the > > notification was to tell the OS it could do a Start ARS with bit 1 > > set. Of course, if you get the notification then that means the > > platform has the capability to deliver it, but it might not hurt to > > check the flag from the Query Capabilities bit. > > Good point, yes, I think it is safe to assume that a BIOS that claims > to support un-correctable error notification also supports this Start > ARS flag. Yes, ACPI 6.2, section 9.20.7.2, defines that: Upon receiving the notification, the OSPM may decide to issue a Start ARS with Flags Bit [1] set to prepare for the retrieval of existing records and issue the Query ARS Status function to retrieve the records. So, I believe it is safe to assume that BIOS supporting 0x81 also supports flags Big [1]. Sorry, this is what I should have said in my previous email... Thanks, -Toshi
On Thu, 2017-06-08 at 11:37 -0600, Toshi Kani wrote: > On Thu, 2017-06-08 at 10:34 -0700, Dan Williams wrote: > > On Thu, Jun 8, 2017 at 10:30 AM, Linda Knippers <linda.knippers@hpe > > .c > > om> wrote: > > [..] > > > Wasn't Dan concerned about how the OS can know whether the FW > > > supports that bit in the Start ARS? > > > > > > The Query ARS Capabilities DSM has a bit that tells the OS > > > whether the platform supports the notification and the point of > > > the notification was to tell the OS it could do a Start ARS with > > > bit 1 set. Of course, if you get the notification then that > > > means the platform has the capability to deliver it, but it might > > > not hurt to check the flag from the Query Capabilities bit. > > > > Good point, yes, I think it is safe to assume that a BIOS that > > claims to support un-correctable error notification also supports > > this Start ARS flag. > > Yes, ACPI 6.2, section 9.20.7.2, defines that: > > Upon receiving the notification, the OSPM may decide to issue > a Start ARS with Flags Bit [1] set to prepare for the retrieval > of existing records and issue the Query ARS Status function to > retrieve the records. > > So, I believe it is safe to assume that BIOS supporting 0x81 also > supports flags Bit [1]. Sorry, this is what I should have said in my > previous email... To reiterate my thinking, I believe the statement above clarifies that the OS can assume BIOS support of Flags Bit[1] upon receiving a 0x81 notification. Since BIOS may also support Flags Bit[1] without supporting this 0x81 (in which case I do not know how to detect it, but BIOS should simply ignore this bit when not supporting it), I am not going to add a check/restriction that 0x81 support is necessary to set Bit[1] in the scan function. Thanks, -Toshi
diff --git a/drivers/acpi/nfit/core.c b/drivers/acpi/nfit/core.c index 656acb5..cc22778 100644 --- a/drivers/acpi/nfit/core.c +++ b/drivers/acpi/nfit/core.c @@ -2967,7 +2967,7 @@ static int acpi_nfit_remove(struct acpi_device *adev) return 0; } -void __acpi_nfit_notify(struct device *dev, acpi_handle handle, u32 event) +static void acpi_nfit_update_notify(struct device *dev, acpi_handle handle) { struct acpi_nfit_desc *acpi_desc = dev_get_drvdata(dev); struct acpi_buffer buf = { ACPI_ALLOCATE_BUFFER, NULL }; @@ -2975,11 +2975,6 @@ void __acpi_nfit_notify(struct device *dev, acpi_handle handle, u32 event) acpi_status status; int ret; - dev_dbg(dev, "%s: event: %d\n", __func__, event); - - if (event != NFIT_NOTIFY_UPDATE) - return; - if (!dev->driver) { /* dev->driver may be null if we're being removed */ dev_dbg(dev, "%s: no driver found for dev\n", __func__); @@ -3016,6 +3011,27 @@ void __acpi_nfit_notify(struct device *dev, acpi_handle handle, u32 event) dev_err(dev, "Invalid _FIT\n"); kfree(buf.pointer); } + +static void acpi_nfit_uc_error_notify(struct device *dev, acpi_handle handle) +{ + struct acpi_nfit_desc *acpi_desc = dev_get_drvdata(dev); + + acpi_nfit_ars_rescan(acpi_desc); +} + +void __acpi_nfit_notify(struct device *dev, acpi_handle handle, u32 event) +{ + dev_dbg(dev, "%s: event: 0x%x\n", __func__, event); + + switch (event) { + case NFIT_NOTIFY_UPDATE: + return acpi_nfit_update_notify(dev, handle); + case NFIT_NOTIFY_UC_MEMORY_ERROR: + return acpi_nfit_uc_error_notify(dev, handle); + default: + return; + } +} EXPORT_SYMBOL_GPL(__acpi_nfit_notify); static void acpi_nfit_notify(struct acpi_device *adev, u32 event) diff --git a/drivers/acpi/nfit/nfit.h b/drivers/acpi/nfit/nfit.h index 58fb7d6..6cf9d21 100644 --- a/drivers/acpi/nfit/nfit.h +++ b/drivers/acpi/nfit/nfit.h @@ -80,6 +80,7 @@ enum { enum nfit_root_notifiers { NFIT_NOTIFY_UPDATE = 0x80, + NFIT_NOTIFY_UC_MEMORY_ERROR = 0x81, }; enum nfit_dimm_notifiers {
ACPI 6.2 defines a new ACPI notification value to NVDIMM Root Device in Table 5-169. 0x81 Unconsumed Uncorrectable Memory Error Detected Used to pro-actively notify OSPM of uncorrectable memory errors detected (for example a memory scrubbing engine that continuously scans the NVDIMMs memory). This is an optional notification. Only locations that were mapped in to SPA by the platform will generate a notification. Add support of this notification value by initiating an ARS scan. This will find new error locations and add their badblocks information. Link: http://www.uefi.org/sites/default/files/resources/ACPI_6_2.pdf Signed-off-by: Toshi Kani <toshi.kani@hpe.com> Cc: Dan Williams <dan.j.williams@intel.com> Cc: Rafael J. Wysocki <rjw@rjwysocki.net> Cc: Vishal Verma <vishal.l.verma@intel.com> --- drivers/acpi/nfit/core.c | 28 ++++++++++++++++++++++------ drivers/acpi/nfit/nfit.h | 1 + 2 files changed, 23 insertions(+), 6 deletions(-)