[v3] scsi: ufs: critical health condition

Message ID	20250210135814.50783-1-avri.altman@wdc.com (mailing list archive)
State	Superseded
Headers	Received: from esa4.hgst.iphmx.com (esa4.hgst.iphmx.com [216.71.154.42]) (using TLSv1.2 with cipher ECDHE-RSA-AES256-GCM-SHA384 (256/256 bits)) (No client certificate requested) by smtp.subspace.kernel.org (Postfix) with ESMTPS id 303472C9A; Mon, 10 Feb 2025 14:01:34 +0000 (UTC) IronPort-SDR: 67a9f908_4iXPJZcxQIqqfbEAaHZQFzl+4mDkycTC/WgMLf6CAKuBYfU OB4ZUmlGqRtbkVQ1Zgmxh0d8G4iWjgv271X5NeA== WDCIronportException: Internal From: Avri Altman <avri.altman@wdc.com> To: "Martin K . Petersen" <martin.petersen@oracle.com> Cc: linux-scsi@vger.kernel.org, linux-kernel@vger.kernel.org, Bart Van Assche <bvanassche@acm.org>, Avri Altman <avri.altman@wdc.com> Subject: [PATCH v3] scsi: ufs: critical health condition Date: Mon, 10 Feb 2025 15:58:14 +0200 Message-Id: <20250210135814.50783-1-avri.altman@wdc.com> Precedence: bulk MIME-Version: 1.0 Content-Transfer-Encoding: 8bit
Series	[v3] scsi: ufs: critical health condition

Commit Message

Avri Altman Feb. 10, 2025, 1:58 p.m. UTC

Martin hi,

The UFS4.1 standard, released on January 8, 2025, added a new exception
event: HEALTH_CRITICAL, which notifies the host of a device's critical
health condition. This notification implies that the device is
approaching the end of its lifetime based on the amount of performed
program/erase cycles.

Once an EOL (End-of-Life) exception event is received, we increment a
designated member, which is exposed via a `sysfs` entry. This new entry
reports the number of times a critical health event has been
reported by a UFS device.
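
The counting behaviour described above can be modelled in a few lines of userspace C. This is only a sketch under stated assumptions: `struct fake_hba` and `handle_exception_status()` are hypothetical stand-ins invented for illustration, not the driver's types or functions. The bit position mirrors `MASK_EE_HEALTH_CRITICAL = BIT(9)` from the patch.

```c
#include <stdint.h>

/* Mirrors the patch: HEALTH_CRITICAL is bit 9 of the exception event status. */
#define MASK_EE_HEALTH_CRITICAL (1u << 9)

/* Hypothetical stand-in for the two struct ufs_hba fields that matter here. */
struct fake_hba {
	uint16_t ee_drv_mask;   /* exception events the driver has enabled */
	int critical_health;    /* number of HEALTH_CRITICAL events seen */
};

/*
 * Count the event only if it is both reported by the device (status)
 * and enabled by the driver (ee_drv_mask). Returns 1 if it was counted.
 */
static int handle_exception_status(struct fake_hba *hba, uint16_t status)
{
	if (status & hba->ee_drv_mask & MASK_EE_HEALTH_CRITICAL) {
		hba->critical_health++;
		return 1; /* the real handler also calls sysfs_notify() here */
	}
	return 0;
}
```

Because the value is a counter rather than a flag, a reader that samples the attribute twice can tell whether new events arrived in between.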

To handle this new `sysfs` entry, either `udev` rules or some other
polling code can be configured to monitor changes in the
`critical_health` attribute.

The host can gain further insight into the specific issue by reading one
of the following attributes: bPreEOLInfo, bDeviceLifeTimeEstA,
bDeviceLifeTimeEstB, bWriteBoosterBufferLifeTimeEst, and
bRPMBLifeTimeEst. All those are available for reading via the driver's
sysfs entries or through an applicable utility. It is up to user-space
to read these attributes if needed.
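
As a rough illustration of what user-space might do with those attributes, here is a small C sketch. The value encodings are my reading of the JEDEC health descriptor (bPreEOLInfo: 0x01 normal, 0x02 warning, 0x03 critical; bDeviceLifeTimeEstA/B: 0x01 to 0x0A in 10% steps, 0x0B exceeded) and should be checked against the spec. All function names are hypothetical, and the sysfs file passed to `read_hex_attr()` (for example an `eol_info` file under a `health_descriptor` directory) is an assumption about the existing driver layout, not something this patch adds.

```c
#include <stdio.h>

/* Read one hexadecimal value (with or without a 0x prefix) from a sysfs file. */
static int read_hex_attr(const char *path, unsigned int *out)
{
	FILE *f = fopen(path, "r");
	int ok;

	if (!f)
		return -1;
	ok = (fscanf(f, "%x", out) == 1);
	fclose(f);
	return ok ? 0 : -1;
}

/* Assumed bPreEOLInfo encoding: 0x01 normal, 0x02 warning, 0x03 critical. */
static const char *pre_eol_to_str(unsigned int v)
{
	switch (v) {
	case 0x01: return "normal";
	case 0x02: return "warning";
	case 0x03: return "critical";
	default:   return "undefined";
	}
}

/*
 * Assumed bDeviceLifeTimeEstA/B encoding: 0x01 means 0-10% of the estimated
 * lifetime used, up to 0x0A for 90-100%, and 0x0B for "exceeded".
 * Returns the lower bound in percent, 100 for exceeded, -1 if undefined.
 */
static int life_time_est_lower_pct(unsigned int v)
{
	if (v >= 0x01 && v <= 0x0A)
		return (int)(v - 1) * 10;
	if (v == 0x0B)
		return 100;
	return -1;
}
```

A monitoring tool would typically call these only after `critical_health` has changed, since the counter itself does not say which lifetime estimate triggered the event.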

Please consider this for the next merge window.

Signed-off-by: Avri Altman <avri.altman@wdc.com>

---
Changes in v3:
 - Report a counter instead of a Boolean (Bart)
 - Support polling (Bart)

Changes in v2:
 - withdraw from using hw-monitor subsystem (Guenter)
---
 Documentation/ABI/testing/sysfs-driver-ufs | 13 +++++++++++++
 drivers/ufs/core/ufs-sysfs.c               | 10 ++++++++++
 drivers/ufs/core/ufshcd.c                  | 10 ++++++++++
 include/ufs/ufs.h                          |  1 +
 include/ufs/ufshcd.h                       |  4 ++++
 5 files changed, 38 insertions(+)

Comments

Bart Van Assche Feb. 10, 2025, 6:30 p.m. UTC | #1

On 2/10/25 5:58 AM, Avri Altman wrote:
> To handle this new `sysfs` entry, either `udev` rules or some other
> polling code can be configured to monitor changes in the
> `critical_health` attribute.

Hmm ... I'm not aware of any support in udevd to poll on sysfs
attributes? I think that calling select(), poll() or epoll() is required
to wait for a sysfs_notify() call.

> +Description:	Report the number of times a critical health event has been
> +		reported by a UFS device. further insight into the specific

further -> Further?

> +static ssize_t critical_health_show(struct device *dev,
> +				    struct device_attribute *attr, char *buf)
> +{
> +	struct ufs_hba *hba = dev_get_drvdata(dev);
> +
> +	return sysfs_emit(buf, "%d\n", hba->critical_health);
> +}

Now that the data type of hba->critical_health has been changed from
boolean into integer, should its name perhaps be changed into
hba->critical_health_count?

> @@ -1130,6 +1131,9 @@ struct ufs_hba {
>   	struct delayed_work ufs_rtc_update_work;
>   	struct pm_qos_request pm_qos_req;
>   	bool pm_qos_enabled;
> +
> +	/* HEALTH_CRITICAL exception reported */
> +	int critical_health;
>   };

Please leave out the inline comment since @critical_health already
has a kernel-doc comment.

Thanks,

Bart.

Avri Altman Feb. 10, 2025, 7:57 p.m. UTC | #2

> On 2/10/25 5:58 AM, Avri Altman wrote:
> > To handle this new `sysfs` entry, either `udev` rules or some other
> > polling code can be configured to monitor changes in the
> > `critical_health` attribute.
> 
> Hmm ... I'm not aware of any support in udevd to poll on sysfs attributes? I
> think that calling select(), poll() or epoll() is required to wait for a sysfs_notify()
> call.
Done.
It’s a leftover from the previous commit log.
Btw I tested it with a udev rule, and it works fine as well.

> 
> > +Description:	Report the number of times a critical health event has been
> > +		reported by a UFS device. further insight into the specific
> 
> further -> Further?
Done.

> 
> > +static ssize_t critical_health_show(struct device *dev,
> > +				    struct device_attribute *attr, char *buf)
> > +{
> > +	struct ufs_hba *hba = dev_get_drvdata(dev);
> > +
> > +	return sysfs_emit(buf, "%d\n", hba->critical_health);
> > +}
> 
> Now that the data type of hba->critical_health has been changed from
> boolean into integer, should its name perhaps be changed into
> hba->critical_health_count?
Done.

> 
> > @@ -1130,6 +1131,9 @@ struct ufs_hba {
> >   	struct delayed_work ufs_rtc_update_work;
> >   	struct pm_qos_request pm_qos_req;
> >   	bool pm_qos_enabled;
> > +
> > +	/* HEALTH_CRITICAL exception reported */
> > +	int critical_health;
> >   };
> 
> Please leave out the inline comment since @critical_health already has a
> kernel-doc comment.
Done.

Thanks,
Avri
> 
> Thanks,
> 
> Bart.

diff --git a/Documentation/ABI/testing/sysfs-driver-ufs b/Documentation/ABI/testing/sysfs-driver-ufs
index 5fa6655aee84..565d281a7dcd 100644
--- a/Documentation/ABI/testing/sysfs-driver-ufs
+++ b/Documentation/ABI/testing/sysfs-driver-ufs
@@ -1559,3 +1559,16 @@  Description:
 		Symbol - HCMID. This file shows the UFSHCD manufacturer id.
 		The Manufacturer ID is defined by JEDEC in JEDEC-JEP106.
 		The file is read only.
+
+What:		/sys/bus/platform/drivers/ufshcd/*/critical_health
+What:		/sys/bus/platform/devices/*.ufs/critical_health
+Date:		February 2025
+Contact:	Avri Altman <avri.altman@wdc.com>
+Description:	Report the number of times a critical health event has been
+		reported by a UFS device. further insight into the specific
+		issue can be gained by reading one of: bPreEOLInfo,
+		bDeviceLifeTimeEstA, bDeviceLifeTimeEstB,
+		bWriteBoosterBufferLifeTimeEst, and bRPMBLifeTimeEst.
+
+		The file is read only.
+
diff --git a/drivers/ufs/core/ufs-sysfs.c b/drivers/ufs/core/ufs-sysfs.c
index 3438269a5440..3899e34f6eae 100644
--- a/drivers/ufs/core/ufs-sysfs.c
+++ b/drivers/ufs/core/ufs-sysfs.c
@@ -458,6 +458,14 @@  static ssize_t pm_qos_enable_store(struct device *dev,
 	return count;
 }
 
+static ssize_t critical_health_show(struct device *dev,
+				    struct device_attribute *attr, char *buf)
+{
+	struct ufs_hba *hba = dev_get_drvdata(dev);
+
+	return sysfs_emit(buf, "%d\n", hba->critical_health);
+}
+
 static DEVICE_ATTR_RW(rpm_lvl);
 static DEVICE_ATTR_RO(rpm_target_dev_state);
 static DEVICE_ATTR_RO(rpm_target_link_state);
@@ -470,6 +478,7 @@  static DEVICE_ATTR_RW(enable_wb_buf_flush);
 static DEVICE_ATTR_RW(wb_flush_threshold);
 static DEVICE_ATTR_RW(rtc_update_ms);
 static DEVICE_ATTR_RW(pm_qos_enable);
+static DEVICE_ATTR_RO(critical_health);
 
 static struct attribute *ufs_sysfs_ufshcd_attrs[] = {
 	&dev_attr_rpm_lvl.attr,
@@ -484,6 +493,7 @@  static struct attribute *ufs_sysfs_ufshcd_attrs[] = {
 	&dev_attr_wb_flush_threshold.attr,
 	&dev_attr_rtc_update_ms.attr,
 	&dev_attr_pm_qos_enable.attr,
+	&dev_attr_critical_health.attr,
 	NULL
 };
 
diff --git a/drivers/ufs/core/ufshcd.c b/drivers/ufs/core/ufshcd.c
index cd404ade48dc..ad4034fea6cc 100644
--- a/drivers/ufs/core/ufshcd.c
+++ b/drivers/ufs/core/ufshcd.c
@@ -6216,6 +6216,11 @@  static void ufshcd_exception_event_handler(struct work_struct *work)
 	if (status & hba->ee_drv_mask & MASK_EE_URGENT_TEMP)
 		ufshcd_temp_exception_event_handler(hba, status);
 
+	if (status & hba->ee_drv_mask & MASK_EE_HEALTH_CRITICAL) {
+		hba->critical_health++;
+		sysfs_notify(&hba->dev->kobj, NULL, "critical_health");
+	}
+
 	ufs_debugfs_exception_event(hba, status);
 }
 
@@ -8308,6 +8313,11 @@  static int ufs_get_device_desc(struct ufs_hba *hba)
 
 	ufshcd_temp_notif_probe(hba, desc_buf);
 
+	if (dev_info->wspecversion >= 0x410) {
+		hba->critical_health = 0;
+		ufshcd_enable_ee(hba, MASK_EE_HEALTH_CRITICAL);
+	}
+
 	ufs_init_rtc(hba, desc_buf);
 
 	/*
diff --git a/include/ufs/ufs.h b/include/ufs/ufs.h
index 89672ad8c3bb..d335bff1a310 100644
--- a/include/ufs/ufs.h
+++ b/include/ufs/ufs.h
@@ -419,6 +419,7 @@  enum {
 	MASK_EE_TOO_LOW_TEMP		= BIT(4),
 	MASK_EE_WRITEBOOSTER_EVENT	= BIT(5),
 	MASK_EE_PERFORMANCE_THROTTLING	= BIT(6),
+	MASK_EE_HEALTH_CRITICAL		= BIT(9),
 };
 #define MASK_EE_URGENT_TEMP (MASK_EE_TOO_HIGH_TEMP | MASK_EE_TOO_LOW_TEMP)
 
diff --git a/include/ufs/ufshcd.h b/include/ufs/ufshcd.h
index 650ff238cd74..81e35129ded0 100644
--- a/include/ufs/ufshcd.h
+++ b/include/ufs/ufshcd.h
@@ -962,6 +962,7 @@  enum ufshcd_mcq_opr {
  * @ufs_rtc_update_work: A work for UFS RTC periodic update
  * @pm_qos_req: PM QoS request handle
  * @pm_qos_enabled: flag to check if pm qos is enabled
+ * @critical_health: count of critical health exceptions
  */
 struct ufs_hba {
 	void __iomem *mmio_base;
@@ -1130,6 +1131,9 @@  struct ufs_hba {
 	struct delayed_work ufs_rtc_update_work;
 	struct pm_qos_request pm_qos_req;
 	bool pm_qos_enabled;
+
+	/* HEALTH_CRITICAL exception reported */
+	int critical_health;
 };
 
 /**
