Message ID | 20170607184947.18733-1-toshi.kani@hpe.com (mailing list archive) |
---|---|
State | Not Applicable, archived |
Headers | show |
On Wed, Jun 7, 2017 at 11:49 AM, Toshi Kani <toshi.kani@hpe.com> wrote: > ACPI 6.2 defines a new ACPI notification value to NVDIMM Root Device > in Table 5-169. > > 0x81 Unconsumed Uncorrectable Memory Error Detected > Used to pro-actively notify OSPM of uncorrectable memory errors > detected (for example a memory scrubbing engine that continuously > scans the NVDIMMs memory). This is an optional notification. Only > locations that were mapped in to SPA by the platform will generate > a notification. > > Add support of this notification value by initiating an ARS scan. This > will find new error locations and add their badblocks information. > > Link: http://www.uefi.org/sites/default/files/resources/ACPI_6_2.pdf > Signed-off-by: Toshi Kani <toshi.kani@hpe.com> > Cc: Dan Williams <dan.j.williams@intel.com> > Cc: Rafael J. Wysocki <rjw@rjwysocki.net> > Cc: Vishal Verma <vishal.l.verma@intel.com> > --- > drivers/acpi/nfit/core.c | 28 ++++++++++++++++++++++------ > drivers/acpi/nfit/nfit.h | 1 + > 2 files changed, 23 insertions(+), 6 deletions(-) > > diff --git a/drivers/acpi/nfit/core.c b/drivers/acpi/nfit/core.c > index 656acb5..cc22778 100644 > --- a/drivers/acpi/nfit/core.c > +++ b/drivers/acpi/nfit/core.c > @@ -2967,7 +2967,7 @@ static int acpi_nfit_remove(struct acpi_device *adev) > return 0; > } > > -void __acpi_nfit_notify(struct device *dev, acpi_handle handle, u32 event) > +static void acpi_nfit_update_notify(struct device *dev, acpi_handle handle) > { > struct acpi_nfit_desc *acpi_desc = dev_get_drvdata(dev); > struct acpi_buffer buf = { ACPI_ALLOCATE_BUFFER, NULL }; > @@ -2975,11 +2975,6 @@ void __acpi_nfit_notify(struct device *dev, acpi_handle handle, u32 event) > acpi_status status; > int ret; > > - dev_dbg(dev, "%s: event: %d\n", __func__, event); > - > - if (event != NFIT_NOTIFY_UPDATE) > - return; > - > if (!dev->driver) { > /* dev->driver may be null if we're being removed */ > dev_dbg(dev, "%s: no driver found for dev\n", __func__); > @@ -3016,6 +3011,27 @@ void __acpi_nfit_notify(struct device *dev, acpi_handle handle, u32 event) > dev_err(dev, "Invalid _FIT\n"); > kfree(buf.pointer); > } > + > +static void acpi_nfit_uc_error_notify(struct device *dev, acpi_handle handle) > +{ > + struct acpi_nfit_desc *acpi_desc = dev_get_drvdata(dev); > + > + acpi_nfit_ars_rescan(acpi_desc); I wonder if we should gate re-scanning with a similar: if (acpi_desc->scrub_mode == HW_ERROR_SCRUB_ON) ...check that we do in the mce notification case? Maybe not since we don't get an indication of where the error is without a rescan. However, at a minimum I think we need support for the new Start ARS flag ("If set to 1 the firmware shall return data from a previous scrub, if any, without starting a new scrub") and use that for this case. Another thing that seems to be missing in both this and the mce case is a notification to userspace that something changed. We have calls to sysfs_notify_dirent() to notify scrub completion events and DIMM health status change events, I think we need a similar notifier mechanism for new un-correctable errors. -- To unsubscribe from this list: send the line "unsubscribe linux-acpi" in the body of a message to majordomo@vger.kernel.org More majordomo info at http://vger.kernel.org/majordomo-info.html
On Wed, 2017-06-07 at 12:09 -0700, Dan Williams wrote: > On Wed, Jun 7, 2017 at 11:49 AM, Toshi Kani <toshi.kani@hpe.com> > wrote: : > > + > > +static void acpi_nfit_uc_error_notify(struct device *dev, > > acpi_handle handle) > > +{ > > + struct acpi_nfit_desc *acpi_desc = dev_get_drvdata(dev); > > + > > + acpi_nfit_ars_rescan(acpi_desc); > > I wonder if we should gate re-scanning with a similar: > > if (acpi_desc->scrub_mode == HW_ERROR_SCRUB_ON) > > ...check that we do in the mce notification case? Maybe not since we > don't get an indication of where the error is without a rescan. I think this mce case is different since the MCE handler already knows where the new poison location is and can update badblocks information for it. Starting ARS is an optional precaution. > However, at a minimum I think we need support for the new Start ARS > flag ("If set to 1 the firmware shall return data from a previous > scrub, if any, without starting a new scrub") and use that for this > case. That's an interesting idea. But I wonder how users know if it is OK to set this flag as it relies on BIOS implementation that is not described in ACPI... > Another thing that seems to be missing in both this and the mce case > is a notification to userspace that something changed. We have calls > to sysfs_notify_dirent() to notify scrub completion events and DIMM > health status change events, I think we need a similar notifier > mechanism for new un-correctable errors. Good point. I think this can be a badblocks population event, which gets generated when badblocks information is updated at boot-time and run-time via this notification and MCE. Thanks, -Toshi
On Wed, Jun 7, 2017 at 1:57 PM, Kani, Toshimitsu <toshi.kani@hpe.com> wrote: > On Wed, 2017-06-07 at 12:09 -0700, Dan Williams wrote: >> On Wed, Jun 7, 2017 at 11:49 AM, Toshi Kani <toshi.kani@hpe.com> >> wrote: > : >> > + >> > +static void acpi_nfit_uc_error_notify(struct device *dev, >> > acpi_handle handle) >> > +{ >> > + struct acpi_nfit_desc *acpi_desc = dev_get_drvdata(dev); >> > + >> > + acpi_nfit_ars_rescan(acpi_desc); >> >> I wonder if we should gate re-scanning with a similar: >> >> if (acpi_desc->scrub_mode == HW_ERROR_SCRUB_ON) >> >> ...check that we do in the mce notification case? Maybe not since we >> don't get an indication of where the error is without a rescan. > > I think this mce case is different since the MCE handler already knows > where the new poison location is and can update badblocks information > for it. Starting ARS is an optional precaution. > >> However, at a minimum I think we need support for the new Start ARS >> flag ("If set to 1 the firmware shall return data from a previous >> scrub, if any, without starting a new scrub") and use that for this >> case. > > That's an interesting idea. But I wonder how users know if it is OK to > set this flag as it relies on BIOS implementation that is not described > in ACPI... Ugh, you're right. We might need a revision-id=2 version of Start ARS so software knows that the BIOS is aware of the new flag. -- To unsubscribe from this list: send the line "unsubscribe linux-acpi" in the body of a message to majordomo@vger.kernel.org More majordomo info at http://vger.kernel.org/majordomo-info.html
On Wed, 2017-06-07 at 14:06 -0700, Dan Williams wrote: > On Wed, Jun 7, 2017 at 1:57 PM, Kani, Toshimitsu <toshi.kani@hpe.com> > wrote: > > On Wed, 2017-06-07 at 12:09 -0700, Dan Williams wrote: > > > On Wed, Jun 7, 2017 at 11:49 AM, Toshi Kani <toshi.kani@hpe.com> > > > wrote: > > > > : > > > > + > > > > +static void acpi_nfit_uc_error_notify(struct device *dev, > > > > acpi_handle handle) > > > > +{ > > > > + struct acpi_nfit_desc *acpi_desc = > > > > dev_get_drvdata(dev); > > > > + > > > > + acpi_nfit_ars_rescan(acpi_desc); > > > > > > I wonder if we should gate re-scanning with a similar: > > > > > > if (acpi_desc->scrub_mode == HW_ERROR_SCRUB_ON) > > > > > > ...check that we do in the mce notification case? Maybe not since > > > we > > > don't get an indication of where the error is without a rescan. > > > > I think this mce case is different since the MCE handler already > > knows where the new poison location is and can update badblocks > > information for it. Starting ARS is an optional precaution. > > > > > However, at a minimum I think we need support for the new Start > > > ARS flag ("If set to 1 the firmware shall return data from a > > > previous scrub, if any, without starting a new scrub") and use > > > that for this case. > > > > That's an interesting idea. But I wonder how users know if it is > > OK to set this flag as it relies on BIOS implementation that is not > > described in ACPI... > > Ugh, you're right. We might need a revision-id=2 version of Start ARS > so software knows that the BIOS is aware of the new flag. My bad. Looking at ACPI 6.2, it actually defines what you described. Start ARS now defines bit[1] in Flags which can be set to avoid scanning for this notification. I will update the patch to set this flag when HW_ERROR_SCRUB_ON is not set. Thanks, -Toshi
On 6/7/2017 5:33 PM, Kani, Toshimitsu wrote: > On Wed, 2017-06-07 at 14:06 -0700, Dan Williams wrote: >> On Wed, Jun 7, 2017 at 1:57 PM, Kani, Toshimitsu <toshi.kani@hpe.com> >> wrote: >>> On Wed, 2017-06-07 at 12:09 -0700, Dan Williams wrote: >>>> On Wed, Jun 7, 2017 at 11:49 AM, Toshi Kani <toshi.kani@hpe.com> >>>> wrote: >>> >>> : >>>>> + >>>>> +static void acpi_nfit_uc_error_notify(struct device *dev, >>>>> acpi_handle handle) >>>>> +{ >>>>> + struct acpi_nfit_desc *acpi_desc = >>>>> dev_get_drvdata(dev); >>>>> + >>>>> + acpi_nfit_ars_rescan(acpi_desc); >>>> >>>> I wonder if we should gate re-scanning with a similar: >>>> >>>> if (acpi_desc->scrub_mode == HW_ERROR_SCRUB_ON) >>>> >>>> ...check that we do in the mce notification case? Maybe not since >>>> we >>>> don't get an indication of where the error is without a rescan. >>> >>> I think this mce case is different since the MCE handler already >>> knows where the new poison location is and can update badblocks >>> information for it. Starting ARS is an optional precaution. >>> >>>> However, at a minimum I think we need support for the new Start >>>> ARS flag ("If set to 1 the firmware shall return data from a >>>> previous scrub, if any, without starting a new scrub") and use >>>> that for this case. >>> >>> That's an interesting idea. But I wonder how users know if it is >>> OK to set this flag as it relies on BIOS implementation that is not >>> described in ACPI... >> >> Ugh, you're right. We might need a revision-id=2 version of Start ARS >> so software knows that the BIOS is aware of the new flag. > > My bad. Looking at ACPI 6.2, it actually defines what you described. > Start ARS now defines bit[1] in Flags which can be set to avoid > scanning for this notification. I will update the patch to set this > flag when HW_ERROR_SCRUB_ON is not set. Wasn't Dan concerned about how the OS can know whether the FW supports that bit in the Start ARS? The Query ARS Capabilities DSM has a bit that tells the OS whether the platform supports the notification and the point of the notification was to tell the OS it could do a Start ARS with bit 1 set. Of course, if you get the notification then that means the platform has the capability to deliver it, but it might not hurt to check the flag from the Query Capabilities bit. If the spec isn't clear about the relationship between the bits, we could clarify that in the spec without bumping the revision number for the call. We could also check to see what a platform does if it gets a flag it doesn't know about. Would that be an Invalid Input Parameter? -- ljk > > Thanks, > -Toshi��칻�&�~�&���+-��ݶ��w��˛���m�b��Zr����^n�r���z���h����&���G���h�(�階�ݢj"���m�����z�ޖ���f���h���~�m� > -- To unsubscribe from this list: send the line "unsubscribe linux-acpi" in the body of a message to majordomo@vger.kernel.org More majordomo info at http://vger.kernel.org/majordomo-info.html
On Thu, Jun 8, 2017 at 10:30 AM, Linda Knippers <linda.knippers@hpe.com> wrote: [..] > Wasn't Dan concerned about how the OS can know whether the FW supports > that bit in the Start ARS? > > The Query ARS Capabilities DSM has a bit that tells the OS whether the > platform supports the notification and the point of the notification was > to tell the OS it could do a Start ARS with bit 1 set. Of course, if > you get the notification then that means the platform has the capability > to deliver it, but it might not hurt to check the flag from the Query > Capabilities bit. Good point, yes, I think it is safe to assume that a BIOS that claims to support un-correctable error notification also supports this Start ARS flag. -- To unsubscribe from this list: send the line "unsubscribe linux-acpi" in the body of a message to majordomo@vger.kernel.org More majordomo info at http://vger.kernel.org/majordomo-info.html
On Thu, 2017-06-08 at 10:34 -0700, Dan Williams wrote: > On Thu, Jun 8, 2017 at 10:30 AM, Linda Knippers <linda.knippers@hpe.c > om> wrote: > [..] > > Wasn't Dan concerned about how the OS can know whether the FW > > supports that bit in the Start ARS? > > > > The Query ARS Capabilities DSM has a bit that tells the OS whether > > the platform supports the notification and the point of the > > notification was to tell the OS it could do a Start ARS with bit 1 > > set. Of course, if you get the notification then that means the > > platform has the capability to deliver it, but it might not hurt to > > check the flag from the Query Capabilities bit. > > Good point, yes, I think it is safe to assume that a BIOS that claims > to support un-correctable error notification also supports this Start > ARS flag. Yes, ACPI 6.2, section 9.20.7.2, defines that: Upon receiving the notification, the OSPM may decide to issue a Start ARS with Flags Bit [1] set to prepare for the retrieval of existing records and issue the Query ARS Status function to retrieve the records. So, I believe it is safe to assume that BIOS supporting 0x81 also supports flags Big [1]. Sorry, this is what I should have said in my previous email... Thanks, -Toshi
T24gVGh1LCAyMDE3LTA2LTA4IGF0IDExOjM3IC0wNjAwLCBUb3NoaSBLYW5pIHdyb3RlOg0KPiBP biBUaHUsIDIwMTctMDYtMDggYXQgMTA6MzQgLTA3MDAsIERhbiBXaWxsaWFtcyB3cm90ZToNCj4g PiBPbiBUaHUsIEp1biA4LCAyMDE3IGF0IDEwOjMwIEFNLCBMaW5kYSBLbmlwcGVycyA8bGluZGEu a25pcHBlcnNAaHBlDQo+ID4gLmMNCj4gPiBvbT4gd3JvdGU6DQo+ID4gWy4uXQ0KPiA+ID4gV2Fz bid0IERhbiBjb25jZXJuZWQgYWJvdXQgaG93IHRoZSBPUyBjYW4ga25vdyB3aGV0aGVyIHRoZSBG Vw0KPiA+ID4gc3VwcG9ydHMgdGhhdCBiaXQgaW4gdGhlIFN0YXJ0IEFSUz8NCj4gPiA+IA0KPiA+ ID4gVGhlIFF1ZXJ5IEFSUyBDYXBhYmlsaXRpZXMgRFNNIGhhcyBhIGJpdCB0aGF0IHRlbGxzIHRo ZSBPUw0KPiA+ID4gd2hldGhlciB0aGUgcGxhdGZvcm0gc3VwcG9ydHMgdGhlIG5vdGlmaWNhdGlv biBhbmQgdGhlIHBvaW50IG9mDQo+ID4gPiB0aGUgbm90aWZpY2F0aW9uIHdhcyB0byB0ZWxsIHRo ZSBPUyBpdCBjb3VsZCBkbyBhIFN0YXJ0IEFSUyB3aXRoDQo+ID4gPiBiaXQgMSBzZXQuwqDCoE9m IGNvdXJzZSwgaWYgeW91IGdldCB0aGUgbm90aWZpY2F0aW9uIHRoZW4gdGhhdA0KPiA+ID4gbWVh bnMgdGhlIHBsYXRmb3JtIGhhcyB0aGUgY2FwYWJpbGl0eSB0byBkZWxpdmVyIGl0LCBidXQgaXQg bWlnaHQNCj4gPiA+IG5vdCBodXJ0IHRvIGNoZWNrIHRoZSBmbGFnIGZyb20gdGhlIFF1ZXJ5IENh cGFiaWxpdGllcyBiaXQuDQo+ID4gDQo+ID4gR29vZCBwb2ludCwgeWVzLCBJIHRoaW5rIGl0IGlz IHNhZmUgdG8gYXNzdW1lIHRoYXQgYSBCSU9TIHRoYXQNCj4gPiBjbGFpbXMgdG8gc3VwcG9ydCB1 bi1jb3JyZWN0YWJsZSBlcnJvciBub3RpZmljYXRpb24gYWxzbyBzdXBwb3J0cw0KPiA+IHRoaXMg U3RhcnQgQVJTIGZsYWcuDQo+IA0KPiBZZXMsIEFDUEkgNi4yLCBzZWN0aW9uIDkuMjAuNy4yLCBk ZWZpbmVzIHRoYXQ6DQo+IA0KPiDCoMKgwqDCoMKgwqBVcG9uIHJlY2VpdmluZyB0aGUgbm90aWZp Y2F0aW9uLCB0aGUgT1NQTSBtYXkgZGVjaWRlIHRvIGlzc3VlDQo+IMKgwqDCoMKgwqDCoGEgU3Rh cnQgQVJTIHdpdGggRmxhZ3MgQml0IFsxXSBzZXQgdG8gcHJlcGFyZSBmb3IgdGhlIHJldHJpZXZh bA0KPiDCoMKgwqDCoMKgwqBvZiBleGlzdGluZyByZWNvcmRzIGFuZCBpc3N1ZSB0aGUgUXVlcnkg QVJTIFN0YXR1cyBmdW5jdGlvbiB0bw0KPiDCoMKgwqDCoMKgwqByZXRyaWV2ZSB0aGUgcmVjb3Jk cy4NCj4gDQo+IFNvLMKgSSBiZWxpZXZlIGl0IGlzIHNhZmUgdG8gYXNzdW1lIHRoYXQgQklPUyBz dXBwb3J0aW5nIDB4ODEgYWxzbw0KPiBzdXBwb3J0cyBmbGFncyBCaXQgWzFdLsKgwqBTb3JyeSwg dGhpcyBpcyB3aGF0IEkgc2hvdWxkIGhhdmUgc2FpZCBpbiBteQ0KPiBwcmV2aW91cyBlbWFpbC4u Lg0KDQpUbyByZWl0ZXJhdGUgbXkgdGhpbmtpbmcsIEkgYmVsaWV2ZSB0aGUgc3RhdGVtZW50IGFi b3ZlIGNsYXJpZmllcyB0aGF0DQp0aGUgT1MgY2FuIGFzc3VtZSBCSU9TIHN1cHBvcnQgb2YgRmxh Z3MgQml0WzFdIHVwb24gcmVjZWl2aW5nIGEgMHg4MQ0Kbm90aWZpY2F0aW9uLiAgU2luY2UgQklP UyBtYXkgYWxzbyBzdXBwb3J0IEZsYWdzIEJpdFsxXSB3aXRob3V0DQpzdXBwb3J0aW5nIHRoaXMg MHg4MSAoaW4gd2hpY2ggY2FzZSBJIGRvIG5vdCBrbm93IGhvdyB0byBkZXRlY3QgaXQsIGJ1dA0K QklPUyBzaG91bGQgc2ltcGx5IGlnbm9yZSB0aGlzIGJpdCB3aGVuIG5vdCBzdXBwb3J0aW5nIGl0 KSwgSSBhbSBub3QNCmdvaW5nIHRvIGFkZCBhIGNoZWNrL3Jlc3RyaWN0aW9uIHRoYXQgMHg4MSBz dXBwb3J0IGlzIG5lY2Vzc2FyeSB0byBzZXQNCkJpdFsxXSBpbiB0aGUgc2NhbiBmdW5jdGlvbi4N Cg0KVGhhbmtzLA0KLVRvc2hp -- To unsubscribe from this list: send the line "unsubscribe linux-acpi" in the body of a message to majordomo@vger.kernel.org More majordomo info at http://vger.kernel.org/majordomo-info.html
diff --git a/drivers/acpi/nfit/core.c b/drivers/acpi/nfit/core.c index 656acb5..cc22778 100644 --- a/drivers/acpi/nfit/core.c +++ b/drivers/acpi/nfit/core.c @@ -2967,7 +2967,7 @@ static int acpi_nfit_remove(struct acpi_device *adev) return 0; } -void __acpi_nfit_notify(struct device *dev, acpi_handle handle, u32 event) +static void acpi_nfit_update_notify(struct device *dev, acpi_handle handle) { struct acpi_nfit_desc *acpi_desc = dev_get_drvdata(dev); struct acpi_buffer buf = { ACPI_ALLOCATE_BUFFER, NULL }; @@ -2975,11 +2975,6 @@ void __acpi_nfit_notify(struct device *dev, acpi_handle handle, u32 event) acpi_status status; int ret; - dev_dbg(dev, "%s: event: %d\n", __func__, event); - - if (event != NFIT_NOTIFY_UPDATE) - return; - if (!dev->driver) { /* dev->driver may be null if we're being removed */ dev_dbg(dev, "%s: no driver found for dev\n", __func__); @@ -3016,6 +3011,27 @@ void __acpi_nfit_notify(struct device *dev, acpi_handle handle, u32 event) dev_err(dev, "Invalid _FIT\n"); kfree(buf.pointer); } + +static void acpi_nfit_uc_error_notify(struct device *dev, acpi_handle handle) +{ + struct acpi_nfit_desc *acpi_desc = dev_get_drvdata(dev); + + acpi_nfit_ars_rescan(acpi_desc); +} + +void __acpi_nfit_notify(struct device *dev, acpi_handle handle, u32 event) +{ + dev_dbg(dev, "%s: event: 0x%x\n", __func__, event); + + switch (event) { + case NFIT_NOTIFY_UPDATE: + return acpi_nfit_update_notify(dev, handle); + case NFIT_NOTIFY_UC_MEMORY_ERROR: + return acpi_nfit_uc_error_notify(dev, handle); + default: + return; + } +} EXPORT_SYMBOL_GPL(__acpi_nfit_notify); static void acpi_nfit_notify(struct acpi_device *adev, u32 event) diff --git a/drivers/acpi/nfit/nfit.h b/drivers/acpi/nfit/nfit.h index 58fb7d6..6cf9d21 100644 --- a/drivers/acpi/nfit/nfit.h +++ b/drivers/acpi/nfit/nfit.h @@ -80,6 +80,7 @@ enum { enum nfit_root_notifiers { NFIT_NOTIFY_UPDATE = 0x80, + NFIT_NOTIFY_UC_MEMORY_ERROR = 0x81, }; enum nfit_dimm_notifiers {
ACPI 6.2 defines a new ACPI notification value to NVDIMM Root Device in Table 5-169. 0x81 Unconsumed Uncorrectable Memory Error Detected Used to pro-actively notify OSPM of uncorrectable memory errors detected (for example a memory scrubbing engine that continuously scans the NVDIMMs memory). This is an optional notification. Only locations that were mapped in to SPA by the platform will generate a notification. Add support of this notification value by initiating an ARS scan. This will find new error locations and add their badblocks information. Link: http://www.uefi.org/sites/default/files/resources/ACPI_6_2.pdf Signed-off-by: Toshi Kani <toshi.kani@hpe.com> Cc: Dan Williams <dan.j.williams@intel.com> Cc: Rafael J. Wysocki <rjw@rjwysocki.net> Cc: Vishal Verma <vishal.l.verma@intel.com> --- drivers/acpi/nfit/core.c | 28 ++++++++++++++++++++++------ drivers/acpi/nfit/nfit.h | 1 + 2 files changed, 23 insertions(+), 6 deletions(-) -- To unsubscribe from this list: send the line "unsubscribe linux-acpi" in the body of a message to majordomo@vger.kernel.org More majordomo info at http://vger.kernel.org/majordomo-info.html