diff mbox series

[v2,2/3] PCI/NPEM: Add Native PCIe Enclosure Management support

Message ID 20240528131940.16924-3-mariusz.tkaczyk@linux.intel.com (mailing list archive)
State Superseded
Delegated to: Bjorn Helgaas
Headers show
Series PCIe Enclosure LED Management | expand

Commit Message

Mariusz Tkaczyk May 28, 2024, 1:19 p.m. UTC
Native PCIe Enclosure Management (NPEM, PCIe r6.1 sec 6.28) allows
managing LED in storage enclosures. NPEM is indication oriented
and it does not give direct access to LED. Although each of
the indications *could* represent an individual LED, multiple
indications could also be represented as a single,
multi-color LED or a single LED blinking in a specific interval.
The specification leaves that open.

Each enabled indication (capability register bit on) is represented as a
ledclass_dev which can be controlled through sysfs. For every ledclass
device only 2 brightness states are allowed: LED_ON (1) or LED_OFF (0).
It is corresponding to NPEM control register (Indication bit on/off).

Ledclass devices appear in sysfs as child devices (subdirectory) of PCI
device which has an NPEM Extended Capability and indication is enabled
in NPEM capability register. For example, these are leds created for
pcieport "10000:02:05.0" on my setup:

leds/
├── 10000:02:05.0:enclosure:fail
├── 10000:02:05.0:enclosure:locate
├── 10000:02:05.0:enclosure:ok
└── 10000:02:05.0:enclosure:rebuild

They can be also found in "/sys/class/leds" directory. Parent PCIe device
bdf is used to guarantee uniqueness across leds subsystem.

To enable/disable fail indication "brightness" file can be edited:
echo 1 > ./leds/10000:02:05.0:enclosure:fail/brightness
echo 0 > ./leds/10000:02:05.0:enclosure:fail/brightness

PCIe r6.1, sec 7.9.19.2 defines the possible indications.

Multiple indications for same parent PCIe device can conflict and
hardware may update them when processing new request. To avoid issues,
driver refresh all indications by reading back control register.

Driver is projected to be exclusive NPEM extended capability manager.
It waits up to 1 second after imposing new request, it doesn't verify if
controller is busy before write, assuming that mutex lock gives protection
from concurrent updates. Driver is not registered if _DSM LED management
is available.

NPEM is a PCIe extended capability so it should be registered in
pcie_init_capabilities() but it is not possible due to LED dependency.
Parent pci_device must be added earlier for led_classdev_register()
to be successful. NPEM does not require configuration on kernel side, it
is safe to register LED devices later.

Link: https://members.pcisig.com/wg/PCI-SIG/document/19849 [1]
Suggested-by: Lukas Wunner <lukas@wunner.de>
Signed-off-by: Mariusz Tkaczyk <mariusz.tkaczyk@linux.intel.com>
---
 drivers/pci/Kconfig           |   9 +
 drivers/pci/Makefile          |   1 +
 drivers/pci/npem.c            | 410 ++++++++++++++++++++++++++++++++++
 drivers/pci/pci.h             |   8 +
 drivers/pci/probe.c           |   2 +
 drivers/pci/remove.c          |   2 +
 include/linux/pci.h           |   3 +
 include/uapi/linux/pci_regs.h |  34 +++
 8 files changed, 469 insertions(+)
 create mode 100644 drivers/pci/npem.c

Comments

Ilpo Järvinen May 28, 2024, 1:55 p.m. UTC | #1
On Tue, 28 May 2024, Mariusz Tkaczyk wrote:

> Native PCIe Enclosure Management (NPEM, PCIe r6.1 sec 6.28) allows
> managing LED in storage enclosures. NPEM is indication oriented
> and it does not give direct access to LED. Although each of
> the indications *could* represent an individual LED, multiple
> indications could also be represented as a single,
> multi-color LED or a single LED blinking in a specific interval.
> The specification leaves that open.
> 
> Each enabled indication (capability register bit on) is represented as a
> ledclass_dev which can be controlled through sysfs. For every ledclass
> device only 2 brightness states are allowed: LED_ON (1) or LED_OFF (0).
> It is corresponding to NPEM control register (Indication bit on/off).
> 
> Ledclass devices appear in sysfs as child devices (subdirectory) of PCI
> device which has an NPEM Extended Capability and indication is enabled
> in NPEM capability register. For example, these are leds created for
> pcieport "10000:02:05.0" on my setup:
> 
> leds/
> ├── 10000:02:05.0:enclosure:fail
> ├── 10000:02:05.0:enclosure:locate
> ├── 10000:02:05.0:enclosure:ok
> └── 10000:02:05.0:enclosure:rebuild
> 
> They can be also found in "/sys/class/leds" directory. Parent PCIe device
> bdf is used to guarantee uniqueness across leds subsystem.
> 
> To enable/disable fail indication "brightness" file can be edited:
> echo 1 > ./leds/10000:02:05.0:enclosure:fail/brightness
> echo 0 > ./leds/10000:02:05.0:enclosure:fail/brightness
> 
> PCIe r6.1, sec 7.9.19.2 defines the possible indications.
> 
> Multiple indications for same parent PCIe device can conflict and
> hardware may update them when processing new request. To avoid issues,
> driver refresh all indications by reading back control register.
> 
> Driver is projected to be exclusive NPEM extended capability manager.
> It waits up to 1 second after imposing new request, it doesn't verify if
> controller is busy before write, assuming that mutex lock gives protection
> from concurrent updates. Driver is not registered if _DSM LED management
> is available.
> 
> NPEM is a PCIe extended capability so it should be registered in
> pcie_init_capabilities() but it is not possible due to LED dependency.
> Parent pci_device must be added earlier for led_classdev_register()
> to be successful. NPEM does not require configuration on kernel side, it
> is safe to register LED devices later.
> 
> Link: https://members.pcisig.com/wg/PCI-SIG/document/19849 [1]
> Suggested-by: Lukas Wunner <lukas@wunner.de>
> Signed-off-by: Mariusz Tkaczyk <mariusz.tkaczyk@linux.intel.com>
> ---
>  drivers/pci/Kconfig           |   9 +
>  drivers/pci/Makefile          |   1 +
>  drivers/pci/npem.c            | 410 ++++++++++++++++++++++++++++++++++
>  drivers/pci/pci.h             |   8 +
>  drivers/pci/probe.c           |   2 +
>  drivers/pci/remove.c          |   2 +
>  include/linux/pci.h           |   3 +
>  include/uapi/linux/pci_regs.h |  34 +++
>  8 files changed, 469 insertions(+)
>  create mode 100644 drivers/pci/npem.c
> 
> diff --git a/drivers/pci/Kconfig b/drivers/pci/Kconfig
> index d35001589d88..e696e69ad516 100644
> --- a/drivers/pci/Kconfig
> +++ b/drivers/pci/Kconfig
> @@ -143,6 +143,15 @@ config PCI_IOV
>  
>  	  If unsure, say N.
>  
> +config PCI_NPEM
> +	bool "Native PCIe Enclosure Management"
> +	depends on LEDS_CLASS=y
> +	help
> +	  Support for Native PCIe Enclosure Management. It allows managing LED
> +	  indications in storage enclosures. Enclosure must support following
> +	  indications: OK, Locate, Fail, Rebuild, other indications are
> +	  optional.
> +
>  config PCI_PRI
>  	bool "PCI PRI support"
>  	select PCI_ATS
> diff --git a/drivers/pci/Makefile b/drivers/pci/Makefile
> index 175302036890..cd5f655d4be9 100644
> --- a/drivers/pci/Makefile
> +++ b/drivers/pci/Makefile
> @@ -34,6 +34,7 @@ obj-$(CONFIG_XEN_PCIDEV_FRONTEND) += xen-pcifront.o
>  obj-$(CONFIG_VGA_ARB)		+= vgaarb.o
>  obj-$(CONFIG_PCI_DOE)		+= doe.o
>  obj-$(CONFIG_PCI_DYNAMIC_OF_NODES) += of_property.o
> +obj-$(CONFIG_PCI_NPEM)		+= npem.o
>  
>  # Endpoint library must be initialized before its users
>  obj-$(CONFIG_PCI_ENDPOINT)	+= endpoint/
> diff --git a/drivers/pci/npem.c b/drivers/pci/npem.c
> new file mode 100644
> index 000000000000..a76a2044dab2
> --- /dev/null
> +++ b/drivers/pci/npem.c
> @@ -0,0 +1,410 @@
> +// SPDX-License-Identifier: GPL-2.0
> +/*
> + * PCIe Enclosure management driver created for LED interfaces based on
> + * indications. It says *what indications* blink but does not specify *how*
> + * they blink - it is hardware defined.
> + *
> + * The driver name refers to Native PCIe Enclosure Management. It is
> + * first indication oriented standard with specification.
> + *
> + * Native PCIe Enclosure Management (NPEM)
> + *	PCIe Base Specification r6.1 sec 6.28
> + *	PCIe Base Specification r6.1 sec 7.9.19
> + *
> + * Copyright (c) 2023-2024 Intel Corporation
> + *	Mariusz Tkaczyk <mariusz.tkaczyk@linux.intel.com>
> + */
> +
> +#include <linux/acpi.h>
> +#include <linux/bitops.h>
> +#include <linux/errno.h>
> +#include <linux/iopoll.h>
> +#include <linux/leds.h>
> +#include <linux/mutex.h>
> +#include <linux/pci.h>
> +#include <linux/pci_regs.h>
> +#include <linux/types.h>
> +#include <linux/uleds.h>
> +
> +#include "pci.h"
> +
> +struct indication {
> +	u32 bit;
> +	const char *name;
> +};
> +
> +static const struct indication npem_indications[] = {
> +	{PCI_NPEM_IND_OK,	"enclosure:ok"},
> +	{PCI_NPEM_IND_LOCATE,	"enclosure:locate"},
> +	{PCI_NPEM_IND_FAIL,	"enclosure:fail"},
> +	{PCI_NPEM_IND_REBUILD,	"enclosure:rebuild"},
> +	{PCI_NPEM_IND_PFA,	"enclosure:pfa"},
> +	{PCI_NPEM_IND_HOTSPARE,	"enclosure:hotspare"},
> +	{PCI_NPEM_IND_ICA,	"enclosure:ica"},
> +	{PCI_NPEM_IND_IFA,	"enclosure:ifa"},
> +	{PCI_NPEM_IND_IDT,	"enclosure:idt"},
> +	{PCI_NPEM_IND_DISABLED,	"enclosure:disabled"},
> +	{PCI_NPEM_IND_SPEC_0,	"enclosure:specific_0"},
> +	{PCI_NPEM_IND_SPEC_1,	"enclosure:specific_1"},
> +	{PCI_NPEM_IND_SPEC_2,	"enclosure:specific_2"},
> +	{PCI_NPEM_IND_SPEC_3,	"enclosure:specific_3"},
> +	{PCI_NPEM_IND_SPEC_4,	"enclosure:specific_4"},
> +	{PCI_NPEM_IND_SPEC_5,	"enclosure:specific_5"},
> +	{PCI_NPEM_IND_SPEC_6,	"enclosure:specific_6"},
> +	{PCI_NPEM_IND_SPEC_7,	"enclosure:specific_7"},
> +	{0,			NULL}
> +};
> +
> +#define for_each_indication(ind, inds) \
> +	for (ind = inds; ind->bit; ind++)
> +
> +/* To avoid confusion, do not keep any special bits in indications */
> +static u32 reg_to_indications(u32 caps, const struct indication *inds)
> +{
> +	const struct indication *ind;
> +	u32 supported_indications = 0;
> +
> +	for_each_indication(ind, inds)
> +		supported_indications |= ind->bit;
> +
> +	return caps & supported_indications;
> +}
> +
> +/**
> + * struct npem_led - LED details
> + * @indication: indication details
> + * @npem: npem device
> + * @name: LED name
> + * @led: LED device
> + */
> +struct npem_led {
> +	const struct indication *indication;
> +	struct npem *npem;
> +	char name[LED_MAX_NAME_SIZE];
> +	struct led_classdev led;
> +};
> +
> +/**
> + * struct npem_ops - backend specific callbacks
> + * @inds: supported indications array
> + * @get_active_indications: get active indications
> + *	npem: npem device
> + *	buf: response buffer
> + * @set_active_indications: set new indications
> + *	npem: npem device
> + *	val: bit mask to set
> + *
> + * Handle communication with hardware. set_active_indications updates
> + * npem->active_indications.
> + */
> +struct npem_ops {
> +	const struct indication *inds;
> +	int (*get_active_indications)(struct npem *npem, u32 *buf);
> +	int (*set_active_indications)(struct npem *npem, u32 val);
> +};
> +
> +/**
> + * struct npem - NPEM device properties
> + * @dev: PCIe device this driver is attached to
> + * @ops: Backend specific callbacks
> + * @npem_lock: to keep concurrent updates serialized
> + * @pos: NPEM capability offset (only relevant for NPEM direct register access,
> + *	 not DSM access method)
> + * @supported_indications: bit mask of supported indications
> + *			   non-indication and reserved bits are cleared
> + * @active_indications: bit mask of active indications
> + *			non-indication and reserved bits are cleared
> + * @led_cnt: Supported LEDs count
> + * @leds: supported LEDs
> + */
> +struct npem {
> +	struct pci_dev *dev;
> +	const struct npem_ops *ops;
> +	struct mutex npem_lock;
> +	u16 pos;
> +	u32 supported_indications;
> +	u32 active_indications;
> +	int led_cnt;
> +	struct npem_led leds[];
> +};
> +
> +static int npem_read_reg(struct npem *npem, u16 reg, u32 *val)
> +{
> +	int ret = pci_read_config_dword(npem->dev, npem->pos + reg, val);
> +
> +	return pcibios_err_to_errno(ret);
> +}
> +
> +static int npem_write_ctrl(struct npem *npem, u32 reg)
> +{
> +	int pos = npem->pos + PCI_NPEM_CTRL;
> +	int ret = pci_write_config_dword(npem->dev, pos, reg);
> +
> +	return pcibios_err_to_errno(ret);
> +}
> +
> +static int npem_get_active_indications(struct npem *npem, u32 *buf)
> +{
> +	int ret;
> +
> +	ret = npem_read_reg(npem, PCI_NPEM_CTRL, buf);
> +	if (ret)
> +		return ret;
> +
> +	/* If PCI_NPEM_CTRL_ENABLE is not set then no indication should blink */
> +	if (!(*buf & PCI_NPEM_CTRL_ENABLE))
> +		*buf = 0;
> +
> +	/* Filter out not supported indications in response */
> +	*buf &= npem->supported_indications;
> +	return 0;
> +}
> +
> +static int npem_set_active_indications(struct npem *npem, u32 val)
> +{
> +	int ret, ret_val;
> +	u32 cc_status;
> +
> +	lockdep_assert_held(&npem->npem_lock);
> +
> +	/* This bit is always required */
> +	val |= PCI_NPEM_CTRL_ENABLE;
> +	ret = npem_write_ctrl(npem, val);
> +	if (ret)
> +		return ret;
> +
> +	/*
> +	 * For the case where a NPEM command has not completed immediately,
> +	 * it is recommended that software not continuously “spin” on polling
> +	 * the status register, but rather poll under interrupt at a reduced
> +	 * rate; for example at 10 ms intervals.
> +	 *
> +	 * PCIe r6.1 sec 6.28 "Implementation Note: Software Polling of NPEM
> +	 * Command Completed"
> +	 */
> +	ret = read_poll_timeout(npem_read_reg, ret_val,
> +				ret_val || (cc_status & PCI_NPEM_STATUS_CC),
> +				10 * USEC_PER_MSEC, USEC_PER_SEC, false, npem,
> +				PCI_NPEM_STATUS, &cc_status);
> +	if (ret)
> +		return ret_val;
> +
> +	/*
> +	 * All writes to control register, including writes that do not change
> +	 * the register value, are NPEM commands and should eventually result
> +	 * in a command completion indication in the NPEM Status Register.
> +	 *
> +	 * PCIe Base Specification r6.1 sec 7.9.19.3
> +	 *
> +	 * Register may not be updated, or other conflicting bits may be
> +	 * cleared. Spec is not strict here. Read NPEM Control register after
> +	 * write to keep cache in-sync.
> +	 */
> +	return npem_get_active_indications(npem, &npem->active_indications);
> +}
> +
> +static const struct npem_ops npem_ops = {
> +	.inds = npem_indications,
> +	.get_active_indications = npem_get_active_indications,
> +	.set_active_indications = npem_set_active_indications,
> +};
> +
> +#define DSM_GUID GUID_INIT(0x5d524d9d, 0xfff9, 0x4d4b,  0x8c, 0xb7, 0x74, 0x7e,\
> +			   0xd5, 0x1e, 0x19, 0x4d)
> +#define GET_SUPPORTED_STATES_DSM	BIT(1)
> +#define GET_STATE_DSM			BIT(2)
> +#define SET_STATE_DSM			BIT(3)
> +
> +static const guid_t dsm_guid = DSM_GUID;
> +
> +static bool npem_has_dsm(struct pci_dev *pdev)
> +{
> +	acpi_handle handle;
> +
> +	handle = ACPI_HANDLE(&pdev->dev);
> +	if (!handle)
> +		return false;
> +
> +	return acpi_check_dsm(handle, &dsm_guid, 0x1, GET_SUPPORTED_STATES_DSM |
> +			      GET_STATE_DSM | SET_STATE_DSM);
> +}
> +
> +/*
> + * This function does not use ops->get_active_indications().
> + * It returns cached value, npem->npem_lock is held and it is safe.
> + */
> +static enum led_brightness brightness_get(struct led_classdev *led)
> +{
> +	struct npem_led *nled = container_of(led, struct npem_led, led);
> +	struct npem *npem = nled->npem;
> +	int ret, val = LED_OFF;
> +
> +	ret = mutex_lock_interruptible(&npem->npem_lock);
> +	if (ret)
> +		return ret;
> +
> +	if (npem->active_indications & nled->indication->bit)
> +		val = LED_ON;
> +
> +	mutex_unlock(&npem->npem_lock);
> +
> +	return val;
> +}
> +
> +static int brightness_set(struct led_classdev *led,
> +			  enum led_brightness brightness)
> +{
> +	struct npem_led *nled = container_of(led, struct npem_led, led);
> +	struct npem *npem = nled->npem;
> +	u32 indications;
> +	int ret;
> +
> +	ret = mutex_lock_interruptible(&npem->npem_lock);
> +	if (ret)
> +		return ret;
> +
> +	if (brightness == LED_OFF)
> +		indications = npem->active_indications & ~(nled->indication->bit);
> +	else
> +		indications = npem->active_indications | nled->indication->bit;
> +
> +	ret = npem->ops->set_active_indications(npem, indications);
> +
> +	mutex_unlock(&npem->npem_lock);
> +	return ret;
> +}
> +
> +static void npem_free(struct npem *npem)
> +{
> +	struct npem_led *nled;
> +	int cnt;
> +
> +	for (cnt = 0; cnt < npem->led_cnt; cnt++) {
> +		nled = &npem->leds[cnt];
> +
> +		if (nled->name[0])
> +			led_classdev_unregister(&nled->led);
> +	}
> +
> +	mutex_destroy(&npem->npem_lock);
> +	kfree(npem);
> +}
> +
> +static int pci_npem_set_led_classdev(struct npem *npem, struct npem_led *nled)
> +{
> +	struct led_classdev *led = &nled->led;
> +	struct led_init_data init_data = {};
> +	char *name = nled->name;
> +	int ret;
> +
> +	init_data.devicename = pci_name(npem->dev);
> +	init_data.default_label = nled->indication->name;
> +
> +	ret = led_compose_name(&npem->dev->dev, &init_data, name);
> +	if (ret)
> +		return ret;
> +
> +	led->name = name;
> +	led->brightness_set_blocking = brightness_set;
> +	led->brightness_get = brightness_get;
> +	led->max_brightness = LED_ON;
> +	led->default_trigger = "none";
> +	led->flags = 0;
> +
> +	ret = led_classdev_register(&npem->dev->dev, led);
> +	if (ret)
> +		/* Clear the name to indicate that it is not registered. */
> +		name[0] = 0;
> +	return ret;
> +}
> +
> +static int pci_npem_init(struct pci_dev *dev, const struct npem_ops *ops,
> +			 int pos, u32 caps)
> +{
> +	u32 supported = reg_to_indications(caps, ops->inds);
> +	int supported_cnt = hweight32(supported);
> +	const struct indication *indication;
> +	struct npem_led *nled;
> +	struct npem *npem;
> +	int led_idx = 0;
> +	u32 active;
> +	int ret;
> +
> +	npem = kzalloc(struct_size(npem, leds, supported_cnt), GFP_KERNEL);
> +
> +	if (!npem)
> +		return -ENOMEM;

Don't leave empty line between func and it's error handling.

> +
> +	npem->supported_indications = supported;
> +	npem->led_cnt = supported_cnt;
> +	npem->pos = pos;
> +	npem->dev = dev;
> +	npem->ops = ops;
> +
> +	ret = ops->get_active_indications(npem, &active);
> +	if (ret) {
> +		npem_free(npem);

Just call kfree() directly here. As is, you're calling mutex_destroy() 
before mutex_init().

> +		return -EACCES;
> +	}
> +
> +	npem->active_indications = reg_to_indications(active, ops->inds);
> +
> +	/*
> +	 * Do not take npem->npem_lock, get_brightness() is called on
> +	 * registration path.
> +	 */
> +	mutex_init(&npem->npem_lock);
> +
> +	for_each_indication(indication, npem_indications) {
> +		if ((npem->supported_indications & indication->bit) == 0)

if (!(...))

> +			/* Do not register unsupported indication */

This sounds quite obvious comment, I'd drop it.

> +			continue;
> +
> +		nled = &npem->leds[led_idx++];
> +		nled->indication = indication;
> +		nled->npem = npem;
> +
> +		ret = pci_npem_set_led_classdev(npem, nled);
> +		if (ret) {
> +			npem_free(npem);
> +			return ret;
> +		}
> +	}
> +
> +	dev->npem = npem;
> +	return 0;
> +}
> +
> +void pci_npem_remove(struct pci_dev *dev)
> +{
> +	if (dev->npem)
> +		npem_free(dev->npem);
> +}
> +
> +void pci_npem_create(struct pci_dev *dev)
> +{
> +	const struct npem_ops *ops = &npem_ops;
> +	int pos = 0, ret;
> +	u32 cap;
> +
> +	if (!npem_has_dsm(dev)) {
> +		pos = pci_find_ext_capability(dev, PCI_EXT_CAP_ID_NPEM);
> +		if (pos == 0)
> +			return;
> +
> +		if (pci_read_config_dword(dev, pos + PCI_NPEM_CAP, &cap) != 0 ||
> +		    (cap & PCI_NPEM_CAP_CAPABLE) == 0)
> +			return;
> +	} else {
> +		/*
> +		 * OS should use the DSM for LED control if it is available
> +		 * PCI Firmware Spec r3.3 sec 4.7.
> +		 */
> +		return;
> +	}
> +
> +	ret = pci_npem_init(dev, ops, pos, cap);
> +	if (ret)
> +		pci_err(dev, "Failed to register PCIe Enclosure Management driver, err: %d\n",
> +			ret);
> +}
> diff --git a/drivers/pci/pci.h b/drivers/pci/pci.h
> index fd44565c4756..9dea8c7353ab 100644
> --- a/drivers/pci/pci.h
> +++ b/drivers/pci/pci.h
> @@ -333,6 +333,14 @@ static inline void pci_doe_destroy(struct pci_dev *pdev) { }
>  static inline void pci_doe_disconnected(struct pci_dev *pdev) { }
>  #endif
>  
> +#ifdef CONFIG_PCI_NPEM
> +void pci_npem_create(struct pci_dev *dev);
> +void pci_npem_remove(struct pci_dev *dev);
> +#else
> +static inline void pci_npem_create(struct pci_dev *dev) { }
> +static inline void pci_npem_remove(struct pci_dev *dev) { }
> +#endif
> +
>  /**
>   * pci_dev_set_io_state - Set the new error state if possible.
>   *
> diff --git a/drivers/pci/probe.c b/drivers/pci/probe.c
> index 8e696e547565..b8ea6353e27a 100644
> --- a/drivers/pci/probe.c
> +++ b/drivers/pci/probe.c
> @@ -2582,6 +2582,8 @@ void pci_device_add(struct pci_dev *dev, struct pci_bus *bus)
>  	dev->match_driver = false;
>  	ret = device_add(&dev->dev);
>  	WARN_ON(ret < 0);
> +
> +	pci_npem_create(dev);
>  }
>  
>  struct pci_dev *pci_scan_single_device(struct pci_bus *bus, int devfn)
> diff --git a/drivers/pci/remove.c b/drivers/pci/remove.c
> index d749ea8250d6..1436f9cf1fea 100644
> --- a/drivers/pci/remove.c
> +++ b/drivers/pci/remove.c
> @@ -33,6 +33,8 @@ static void pci_destroy_dev(struct pci_dev *dev)
>  	if (!dev->dev.kobj.parent)
>  		return;
>  
> +	pci_npem_remove(dev);
> +
>  	device_del(&dev->dev);
>  
>  	down_write(&pci_bus_sem);
> diff --git a/include/linux/pci.h b/include/linux/pci.h
> index fb004fd4e889..c327c2dd4527 100644
> --- a/include/linux/pci.h
> +++ b/include/linux/pci.h
> @@ -517,6 +517,9 @@ struct pci_dev {
>  #endif
>  #ifdef CONFIG_PCI_DOE
>  	struct xarray	doe_mbs;	/* Data Object Exchange mailboxes */
> +#endif
> +#ifdef CONFIG_PCI_NPEM
> +	struct npem	*npem;		/* Native PCIe Enclosure Management */
>  #endif
>  	u16		acs_cap;	/* ACS Capability offset */
>  	phys_addr_t	rom;		/* Physical address if not from BAR */
> diff --git a/include/uapi/linux/pci_regs.h b/include/uapi/linux/pci_regs.h
> index 94c00996e633..97e8b7ed9998 100644
> --- a/include/uapi/linux/pci_regs.h
> +++ b/include/uapi/linux/pci_regs.h
> @@ -740,6 +740,7 @@
>  #define PCI_EXT_CAP_ID_DVSEC	0x23	/* Designated Vendor-Specific */
>  #define PCI_EXT_CAP_ID_DLF	0x25	/* Data Link Feature */
>  #define PCI_EXT_CAP_ID_PL_16GT	0x26	/* Physical Layer 16.0 GT/s */
> +#define PCI_EXT_CAP_ID_NPEM	0x29	/* Native PCIe Enclosure Management */
>  #define PCI_EXT_CAP_ID_PL_32GT  0x2A    /* Physical Layer 32.0 GT/s */
>  #define PCI_EXT_CAP_ID_DOE	0x2E	/* Data Object Exchange */
>  #define PCI_EXT_CAP_ID_MAX	PCI_EXT_CAP_ID_DOE
> @@ -1121,6 +1122,39 @@
>  #define  PCI_PL_16GT_LE_CTRL_USP_TX_PRESET_MASK		0x000000F0
>  #define  PCI_PL_16GT_LE_CTRL_USP_TX_PRESET_SHIFT	4
>  
> +/* Native PCIe Enclosure Management */
> +#define PCI_NPEM_CAP	0x04 /* NPEM capability register */
> +#define	 PCI_NPEM_CAP_CAPABLE		0x00000001 /* NPEM Capable */
> +
> +#define PCI_NPEM_CTRL	0x08 /* NPEM control register */
> +#define	 PCI_NPEM_CTRL_ENABLE		0x00000001 /* NPEM Enable */

I think the rest of the bits would belong here or after 
PCI_NPEM_CAP_CAPABLE.

> +#define PCI_NPEM_STATUS	0x0c /* NPEM status register */
> +#define	 PCI_NPEM_STATUS_CC		0x00000001 /* NPEM Command completed */
> +/*
> + * Native PCIe Enclosure Management indication bits and Reset command bit
> + * are corresponding for capability and control registers.
> + */
> +#define  PCI_NPEM_CMD_RESET		0x00000002 /* NPEM Reset Command */
> +#define  PCI_NPEM_IND_OK		0x00000004 /* NPEM indication OK */
> +#define  PCI_NPEM_IND_LOCATE		0x00000008 /* NPEM indication Locate */
> +#define  PCI_NPEM_IND_FAIL		0x00000010 /* NPEM indication Fail */
> +#define  PCI_NPEM_IND_REBUILD		0x00000020 /* NPEM indication Rebuild */
> +#define  PCI_NPEM_IND_PFA		0x00000040 /* NPEM indication Predicted Failure Analysis */
> +#define  PCI_NPEM_IND_HOTSPARE		0x00000080 /* NPEM indication Hot Spare */
> +#define  PCI_NPEM_IND_ICA		0x00000100 /* NPEM indication In Critical Array */
> +#define  PCI_NPEM_IND_IFA		0x00000200 /* NPEM indication In Failed Array */
> +#define  PCI_NPEM_IND_IDT		0x00000400 /* NPEM indication Invalid Device Type */
> +#define  PCI_NPEM_IND_DISABLED		0x00000800 /* NPEM indication Disabled */

> +#define  PCI_NPEM_IND_SPEC_0		0x00800000
> +#define  PCI_NPEM_IND_SPEC_1		0x01000000
> +#define  PCI_NPEM_IND_SPEC_2		0x02000000
> +#define  PCI_NPEM_IND_SPEC_3		0x04000000
> +#define  PCI_NPEM_IND_SPEC_4		0x08000000
> +#define  PCI_NPEM_IND_SPEC_5		0x10000000
> +#define  PCI_NPEM_IND_SPEC_6		0x20000000
> +#define  PCI_NPEM_IND_SPEC_7		0x40000000

Are these off-by-one, shouldn't that field go all the way to the last bit 
which is 0x80000000 (the field is 31:24)??
Dan Williams May 29, 2024, 5:21 a.m. UTC | #2
Mariusz Tkaczyk wrote:
> Native PCIe Enclosure Management (NPEM, PCIe r6.1 sec 6.28) allows
> managing LED in storage enclosures. NPEM is indication oriented
> and it does not give direct access to LED. Although each of
> the indications *could* represent an individual LED, multiple
> indications could also be represented as a single,
> multi-color LED or a single LED blinking in a specific interval.
> The specification leaves that open.
> 
> Each enabled indication (capability register bit on) is represented as a
> ledclass_dev which can be controlled through sysfs. For every ledclass
> device only 2 brightness states are allowed: LED_ON (1) or LED_OFF (0).
> It is corresponding to NPEM control register (Indication bit on/off).
> 
> Ledclass devices appear in sysfs as child devices (subdirectory) of PCI
> device which has an NPEM Extended Capability and indication is enabled
> in NPEM capability register. For example, these are leds created for
> pcieport "10000:02:05.0" on my setup:
> 
> leds/
> ├── 10000:02:05.0:enclosure:fail
> ├── 10000:02:05.0:enclosure:locate
> ├── 10000:02:05.0:enclosure:ok
> └── 10000:02:05.0:enclosure:rebuild
> 
> They can be also found in "/sys/class/leds" directory. Parent PCIe device
> bdf is used to guarantee uniqueness across leds subsystem.
> 
> To enable/disable fail indication "brightness" file can be edited:
> echo 1 > ./leds/10000:02:05.0:enclosure:fail/brightness
> echo 0 > ./leds/10000:02:05.0:enclosure:fail/brightness
> 
> PCIe r6.1, sec 7.9.19.2 defines the possible indications.
> 
> Multiple indications for same parent PCIe device can conflict and
> hardware may update them when processing new request. To avoid issues,
> driver refresh all indications by reading back control register.
> 
> Driver is projected to be exclusive NPEM extended capability manager.
> It waits up to 1 second after imposing new request, it doesn't verify if
> controller is busy before write, assuming that mutex lock gives protection
> from concurrent updates. Driver is not registered if _DSM LED management
> is available.
> 
> NPEM is a PCIe extended capability so it should be registered in
> pcie_init_capabilities() but it is not possible due to LED dependency.
> Parent pci_device must be added earlier for led_classdev_register()
> to be successful. NPEM does not require configuration on kernel side, it
> is safe to register LED devices later.
> 
> Link: https://members.pcisig.com/wg/PCI-SIG/document/19849 [1]
> Suggested-by: Lukas Wunner <lukas@wunner.de>
> Signed-off-by: Mariusz Tkaczyk <mariusz.tkaczyk@linux.intel.com>
> ---
>  drivers/pci/Kconfig           |   9 +
>  drivers/pci/Makefile          |   1 +
>  drivers/pci/npem.c            | 410 ++++++++++++++++++++++++++++++++++
>  drivers/pci/pci.h             |   8 +
>  drivers/pci/probe.c           |   2 +
>  drivers/pci/remove.c          |   2 +
>  include/linux/pci.h           |   3 +
>  include/uapi/linux/pci_regs.h |  34 +++
>  8 files changed, 469 insertions(+)
>  create mode 100644 drivers/pci/npem.c
> 
> diff --git a/drivers/pci/Kconfig b/drivers/pci/Kconfig
> index d35001589d88..e696e69ad516 100644
> --- a/drivers/pci/Kconfig
> +++ b/drivers/pci/Kconfig
> @@ -143,6 +143,15 @@ config PCI_IOV
>  
>  	  If unsure, say N.
>  
> +config PCI_NPEM
> +	bool "Native PCIe Enclosure Management"
> +	depends on LEDS_CLASS=y

I would have expected

    depends on NEW_LEDS
    select LEDS_CLASS

> +	help
> +	  Support for Native PCIe Enclosure Management. It allows managing LED
> +	  indications in storage enclosures. Enclosure must support following
> +	  indications: OK, Locate, Fail, Rebuild, other indications are
> +	  optional.
> +
>  config PCI_PRI
>  	bool "PCI PRI support"
>  	select PCI_ATS
> diff --git a/drivers/pci/Makefile b/drivers/pci/Makefile
> index 175302036890..cd5f655d4be9 100644
> --- a/drivers/pci/Makefile
> +++ b/drivers/pci/Makefile
> @@ -34,6 +34,7 @@ obj-$(CONFIG_XEN_PCIDEV_FRONTEND) += xen-pcifront.o
>  obj-$(CONFIG_VGA_ARB)		+= vgaarb.o
>  obj-$(CONFIG_PCI_DOE)		+= doe.o
>  obj-$(CONFIG_PCI_DYNAMIC_OF_NODES) += of_property.o
> +obj-$(CONFIG_PCI_NPEM)		+= npem.o
>  
>  # Endpoint library must be initialized before its users
>  obj-$(CONFIG_PCI_ENDPOINT)	+= endpoint/
> diff --git a/drivers/pci/npem.c b/drivers/pci/npem.c
> new file mode 100644
> index 000000000000..a76a2044dab2
> --- /dev/null
> +++ b/drivers/pci/npem.c
> @@ -0,0 +1,410 @@
> +// SPDX-License-Identifier: GPL-2.0
> +/*
> + * PCIe Enclosure management driver created for LED interfaces based on
> + * indications. It says *what indications* blink but does not specify *how*
> + * they blink - it is hardware defined.
> + *
> + * The driver name refers to Native PCIe Enclosure Management. It is
> + * first indication oriented standard with specification.
> + *
> + * Native PCIe Enclosure Management (NPEM)
> + *	PCIe Base Specification r6.1 sec 6.28
> + *	PCIe Base Specification r6.1 sec 7.9.19
> + *
> + * Copyright (c) 2023-2024 Intel Corporation
> + *	Mariusz Tkaczyk <mariusz.tkaczyk@linux.intel.com>
> + */
> +
> +#include <linux/acpi.h>
> +#include <linux/bitops.h>
> +#include <linux/errno.h>
> +#include <linux/iopoll.h>
> +#include <linux/leds.h>
> +#include <linux/mutex.h>
> +#include <linux/pci.h>
> +#include <linux/pci_regs.h>
> +#include <linux/types.h>
> +#include <linux/uleds.h>
> +
> +#include "pci.h"
> +
> +struct indication {
> +	u32 bit;
> +	const char *name;
> +};
> +
> +static const struct indication npem_indications[] = {
> +	{PCI_NPEM_IND_OK,	"enclosure:ok"},
> +	{PCI_NPEM_IND_LOCATE,	"enclosure:locate"},
> +	{PCI_NPEM_IND_FAIL,	"enclosure:fail"},
> +	{PCI_NPEM_IND_REBUILD,	"enclosure:rebuild"},
> +	{PCI_NPEM_IND_PFA,	"enclosure:pfa"},
> +	{PCI_NPEM_IND_HOTSPARE,	"enclosure:hotspare"},
> +	{PCI_NPEM_IND_ICA,	"enclosure:ica"},
> +	{PCI_NPEM_IND_IFA,	"enclosure:ifa"},
> +	{PCI_NPEM_IND_IDT,	"enclosure:idt"},
> +	{PCI_NPEM_IND_DISABLED,	"enclosure:disabled"},
> +	{PCI_NPEM_IND_SPEC_0,	"enclosure:specific_0"},
> +	{PCI_NPEM_IND_SPEC_1,	"enclosure:specific_1"},
> +	{PCI_NPEM_IND_SPEC_2,	"enclosure:specific_2"},
> +	{PCI_NPEM_IND_SPEC_3,	"enclosure:specific_3"},
> +	{PCI_NPEM_IND_SPEC_4,	"enclosure:specific_4"},
> +	{PCI_NPEM_IND_SPEC_5,	"enclosure:specific_5"},
> +	{PCI_NPEM_IND_SPEC_6,	"enclosure:specific_6"},
> +	{PCI_NPEM_IND_SPEC_7,	"enclosure:specific_7"},
> +	{0,			NULL}
> +};
> +
> +#define for_each_indication(ind, inds) \
> +	for (ind = inds; ind->bit; ind++)
> +
> +/* To avoid confusion, do not keep any special bits in indications */

I am confused by this comment. What "special bits" is this referring to?

Make sure comments are something that can help you remember 5 to 10
years what this function is doing.

> +static u32 reg_to_indications(u32 caps, const struct indication *inds)
> +{
> +	const struct indication *ind;
> +	u32 supported_indications = 0;
> +
> +	for_each_indication(ind, inds)
> +		supported_indications |= ind->bit;
> +
> +	return caps & supported_indications;
> +}
> +
> +/**
> + * struct npem_led - LED details
> + * @indication: indication details
> + * @npem: npem device
> + * @name: LED name
> + * @led: LED device
> + */
> +struct npem_led {
> +	const struct indication *indication;
> +	struct npem *npem;
> +	char name[LED_MAX_NAME_SIZE];
> +	struct led_classdev led;
> +};
> +
> +/**
> + * struct npem_ops - backend specific callbacks
> + * @inds: supported indications array
> + * @get_active_indications: get active indications
> + *	npem: npem device
> + *	buf: response buffer
> + * @set_active_indications: set new indications
> + *	npem: npem device
> + *	val: bit mask to set
> + *
> + * Handle communication with hardware. set_active_indications updates
> + * npem->active_indications.
> + */
> +struct npem_ops {
> +	const struct indication *inds;

@inds is not an operation, it feels like something that belongs as
another member in 'struct npem'. What drove this data to join 'struct
npem_ops'?

> +	int (*get_active_indications)(struct npem *npem, u32 *buf);
> +	int (*set_active_indications)(struct npem *npem, u32 val);
> +};
> +
> +/**
> + * struct npem - NPEM device properties
> + * @dev: PCIe device this driver is attached to
> + * @ops: Backend specific callbacks
> + * @npem_lock: to keep concurrent updates serialized
> + * @pos: NPEM capability offset (only relevant for NPEM direct register access,
> + *	 not DSM access method)
> + * @supported_indications: bit mask of supported indications
> + *			   non-indication and reserved bits are cleared
> + * @active_indications: bit mask of active indications
> + *			non-indication and reserved bits are cleared
> + * @led_cnt: Supported LEDs count
> + * @leds: supported LEDs

It would help to describe the locking model here since the comment for
@npem_lock is not adding anything useful.

> + */
> +struct npem {
> +	struct pci_dev *dev;
> +	const struct npem_ops *ops;
> +	struct mutex npem_lock;
> +	u16 pos;
> +	u32 supported_indications;
> +	u32 active_indications;
> +	int led_cnt;
> +	struct npem_led leds[];
> +};
> +
> +static int npem_read_reg(struct npem *npem, u16 reg, u32 *val)
> +{
> +	int ret = pci_read_config_dword(npem->dev, npem->pos + reg, val);
> +
> +	return pcibios_err_to_errno(ret);
> +}
> +
> +static int npem_write_ctrl(struct npem *npem, u32 reg)
> +{
> +	int pos = npem->pos + PCI_NPEM_CTRL;
> +	int ret = pci_write_config_dword(npem->dev, pos, reg);
> +
> +	return pcibios_err_to_errno(ret);
> +}
> +
> +static int npem_get_active_indications(struct npem *npem, u32 *buf)
> +{
> +	int ret;
> +
> +	ret = npem_read_reg(npem, PCI_NPEM_CTRL, buf);
> +	if (ret)
> +		return ret;
> +
> +	/* If PCI_NPEM_CTRL_ENABLE is not set then no indication should blink */
> +	if (!(*buf & PCI_NPEM_CTRL_ENABLE))

It feels odd to reuse @buf like this.

I would have a local @ctrl variable and before returning do:

   *buf = ctrl;

> +		*buf = 0;
> +
> +	/* Filter out not supported indications in response */
> +	*buf &= npem->supported_indications;
> +	return 0;
> +}
> +
> +static int npem_set_active_indications(struct npem *npem, u32 val)
> +{
> +	int ret, ret_val;
> +	u32 cc_status;
> +
> +	lockdep_assert_held(&npem->npem_lock);
> +
> +	/* This bit is always required */
> +	val |= PCI_NPEM_CTRL_ENABLE;
> +	ret = npem_write_ctrl(npem, val);

@val is too generic a name when it is known that this the control
register value, how about call it @ctrl directly?

> +	if (ret)
> +		return ret;
> +
> +	/*
> +	 * For the case where a NPEM command has not completed immediately,
> +	 * it is recommended that software not continuously “spin” on polling
> +	 * the status register, but rather poll under interrupt at a reduced
> +	 * rate; for example at 10 ms intervals.

I think the use of read_poll_timeout() obviates the need for this
comment. I.e. read_poll_timeout() is self documenting.

> +	 *
> +	 * PCIe r6.1 sec 6.28 "Implementation Note: Software Polling of NPEM
> +	 * Command Completed"
> +	 */
> +	ret = read_poll_timeout(npem_read_reg, ret_val,
> +				ret_val || (cc_status & PCI_NPEM_STATUS_CC),
> +				10 * USEC_PER_MSEC, USEC_PER_SEC, false, npem,
> +				PCI_NPEM_STATUS, &cc_status);
> +	if (ret)
> +		return ret_val;
> +
> +	/*
> +	 * All writes to control register, including writes that do not change
> +	 * the register value, are NPEM commands and should eventually result
> +	 * in a command completion indication in the NPEM Status Register.
> +	 *
> +	 * PCIe Base Specification r6.1 sec 7.9.19.3
> +	 *
> +	 * Register may not be updated, or other conflicting bits may be
> +	 * cleared. Spec is not strict here. Read NPEM Control register after
> +	 * write to keep cache in-sync.
> +	 */
> +	return npem_get_active_indications(npem, &npem->active_indications);
> +}
> +
> +static const struct npem_ops npem_ops = {
> +	.inds = npem_indications,
> +	.get_active_indications = npem_get_active_indications,
> +	.set_active_indications = npem_set_active_indications,
> +};
> +
> +#define DSM_GUID GUID_INIT(0x5d524d9d, 0xfff9, 0x4d4b,  0x8c, 0xb7, 0x74, 0x7e,\
> +			   0xd5, 0x1e, 0x19, 0x4d)
> +#define GET_SUPPORTED_STATES_DSM	BIT(1)
> +#define GET_STATE_DSM			BIT(2)
> +#define SET_STATE_DSM			BIT(3)
> +
> +static const guid_t dsm_guid = DSM_GUID;
> +
> +static bool npem_has_dsm(struct pci_dev *pdev)
> +{
> +	acpi_handle handle;
> +
> +	handle = ACPI_HANDLE(&pdev->dev);
> +	if (!handle)
> +		return false;
> +
> +	return acpi_check_dsm(handle, &dsm_guid, 0x1, GET_SUPPORTED_STATES_DSM |
> +			      GET_STATE_DSM | SET_STATE_DSM);
> +}
> +
> +/*
> + * This function does not use ops->get_active_indications().
> + * It returns cached value, npem->npem_lock is held and it is safe.

What about this function would make a reader think it was unsafe? It is
evident that this function does not call ops->get_active_indications(),
just as it is evident that it does not call kfree().

Would a better description be:

"The status of each indicator is cached at init time and only updated at
write time. brightness_get() is only responsible for reflecting the last
written/cached value."

> + */
> +static enum led_brightness brightness_get(struct led_classdev *led)
> +{
> +	struct npem_led *nled = container_of(led, struct npem_led, led);
> +	struct npem *npem = nled->npem;
> +	int ret, val = LED_OFF;
> +
> +	ret = mutex_lock_interruptible(&npem->npem_lock);
> +	if (ret)
> +		return ret;
> +
> +	if (npem->active_indications & nled->indication->bit)
> +		val = LED_ON;
> +
> +	mutex_unlock(&npem->npem_lock);
> +
> +	return val;
> +}
> +
> +static int brightness_set(struct led_classdev *led,
> +			  enum led_brightness brightness)
> +{
> +	struct npem_led *nled = container_of(led, struct npem_led, led);
> +	struct npem *npem = nled->npem;
> +	u32 indications;
> +	int ret;
> +
> +	ret = mutex_lock_interruptible(&npem->npem_lock);
> +	if (ret)
> +		return ret;
> +
> +	if (brightness == LED_OFF)
> +		indications = npem->active_indications & ~(nled->indication->bit);
> +	else
> +		indications = npem->active_indications | nled->indication->bit;
> +
> +	ret = npem->ops->set_active_indications(npem, indications);
> +
> +	mutex_unlock(&npem->npem_lock);
> +	return ret;
> +}
> +
> +static void npem_free(struct npem *npem)
> +{
> +	struct npem_led *nled;
> +	int cnt;
> +
> +	for (cnt = 0; cnt < npem->led_cnt; cnt++) {
> +		nled = &npem->leds[cnt];
> +
> +		if (nled->name[0])
> +			led_classdev_unregister(&nled->led);
> +	}
> +
> +	mutex_destroy(&npem->npem_lock);
> +	kfree(npem);
> +}
> +
> +static int pci_npem_set_led_classdev(struct npem *npem, struct npem_led *nled)
> +{
> +	struct led_classdev *led = &nled->led;
> +	struct led_init_data init_data = {};
> +	char *name = nled->name;
> +	int ret;
> +
> +	init_data.devicename = pci_name(npem->dev);
> +	init_data.default_label = nled->indication->name;
> +
> +	ret = led_compose_name(&npem->dev->dev, &init_data, name);
> +	if (ret)
> +		return ret;
> +
> +	led->name = name;
> +	led->brightness_set_blocking = brightness_set;
> +	led->brightness_get = brightness_get;
> +	led->max_brightness = LED_ON;
> +	led->default_trigger = "none";
> +	led->flags = 0;
> +
> +	ret = led_classdev_register(&npem->dev->dev, led);
> +	if (ret)
> +		/* Clear the name to indicate that it is not registered. */
> +		name[0] = 0;
> +	return ret;
> +}
> +
> +static int pci_npem_init(struct pci_dev *dev, const struct npem_ops *ops,
> +			 int pos, u32 caps)
> +{
> +	u32 supported = reg_to_indications(caps, ops->inds);
> +	int supported_cnt = hweight32(supported);
> +	const struct indication *indication;
> +	struct npem_led *nled;
> +	struct npem *npem;
> +	int led_idx = 0;
> +	u32 active;
> +	int ret;
> +
> +	npem = kzalloc(struct_size(npem, leds, supported_cnt), GFP_KERNEL);
> +
> +	if (!npem)
> +		return -ENOMEM;
> +
> +	npem->supported_indications = supported;
> +	npem->led_cnt = supported_cnt;
> +	npem->pos = pos;
> +	npem->dev = dev;
> +	npem->ops = ops;
> +
> +	ret = ops->get_active_indications(npem, &active);
> +	if (ret) {
> +		npem_free(npem);
> +		return -EACCES;

get_active_indications() returns an errno, why translate to "Permission
denied" which is likely to make someone go cross-eyed wondering why root
is getting "Permission denied".

> +	}
> +
> +	npem->active_indications = reg_to_indications(active, ops->inds);
> +
> +	/*
> +	 * Do not take npem->npem_lock, get_brightness() is called on
> +	 * registration path.
> +	 */

Delete this comment. Just put a lockdep_assert_held() in
get_active_indications() and take the lock for caching the initial
state.

> +	mutex_init(&npem->npem_lock);
> +
> +	for_each_indication(indication, npem_indications) {
> +		if ((npem->supported_indications & indication->bit) == 0)
> +			/* Do not register unsupported indication */
> +			continue;
> +
> +		nled = &npem->leds[led_idx++];
> +		nled->indication = indication;
> +		nled->npem = npem;
> +
> +		ret = pci_npem_set_led_classdev(npem, nled);
> +		if (ret) {
> +			npem_free(npem);
> +			return ret;
> +		}
> +	}
> +
> +	dev->npem = npem;
> +	return 0;
> +}
> +
> +void pci_npem_remove(struct pci_dev *dev)
> +{
> +	if (dev->npem)
> +		npem_free(dev->npem);

Just put a NULL check in npem_free().

> +}
> +
> +void pci_npem_create(struct pci_dev *dev)
> +{
> +	const struct npem_ops *ops = &npem_ops;
> +	int pos = 0, ret;
> +	u32 cap;
> +
> +	if (!npem_has_dsm(dev)) {
> +		pos = pci_find_ext_capability(dev, PCI_EXT_CAP_ID_NPEM);
> +		if (pos == 0)
> +			return;
> +
> +		if (pci_read_config_dword(dev, pos + PCI_NPEM_CAP, &cap) != 0 ||
> +		    (cap & PCI_NPEM_CAP_CAPABLE) == 0)
> +			return;
> +	} else {
> +		/*
> +		 * OS should use the DSM for LED control if it is available
> +		 * PCI Firmware Spec r3.3 sec 4.7.
> +		 */
> +		return;

I assume this is a TODO for the next patch that add DSM support?

> +	}
> +
> +	ret = pci_npem_init(dev, ops, pos, cap);
> +	if (ret)
> +		pci_err(dev, "Failed to register PCIe Enclosure Management driver, err: %d\n",
> +			ret);
> +}
> diff --git a/drivers/pci/pci.h b/drivers/pci/pci.h
> index fd44565c4756..9dea8c7353ab 100644
> --- a/drivers/pci/pci.h
> +++ b/drivers/pci/pci.h
> @@ -333,6 +333,14 @@ static inline void pci_doe_destroy(struct pci_dev *pdev) { }
>  static inline void pci_doe_disconnected(struct pci_dev *pdev) { }
>  #endif
>  
> +#ifdef CONFIG_PCI_NPEM
> +void pci_npem_create(struct pci_dev *dev);
> +void pci_npem_remove(struct pci_dev *dev);
> +#else
> +static inline void pci_npem_create(struct pci_dev *dev) { }
> +static inline void pci_npem_remove(struct pci_dev *dev) { }
> +#endif
> +
>  /**
>   * pci_dev_set_io_state - Set the new error state if possible.
>   *
> diff --git a/drivers/pci/probe.c b/drivers/pci/probe.c
> index 8e696e547565..b8ea6353e27a 100644
> --- a/drivers/pci/probe.c
> +++ b/drivers/pci/probe.c
> @@ -2582,6 +2582,8 @@ void pci_device_add(struct pci_dev *dev, struct pci_bus *bus)
>  	dev->match_driver = false;
>  	ret = device_add(&dev->dev);
>  	WARN_ON(ret < 0);
> +
> +	pci_npem_create(dev);
>  }
>  
>  struct pci_dev *pci_scan_single_device(struct pci_bus *bus, int devfn)
> diff --git a/drivers/pci/remove.c b/drivers/pci/remove.c
> index d749ea8250d6..1436f9cf1fea 100644
> --- a/drivers/pci/remove.c
> +++ b/drivers/pci/remove.c
> @@ -33,6 +33,8 @@ static void pci_destroy_dev(struct pci_dev *dev)
>  	if (!dev->dev.kobj.parent)
>  		return;
>  
> +	pci_npem_remove(dev);
> +
>  	device_del(&dev->dev);
>  
>  	down_write(&pci_bus_sem);
> diff --git a/include/linux/pci.h b/include/linux/pci.h
> index fb004fd4e889..c327c2dd4527 100644
> --- a/include/linux/pci.h
> +++ b/include/linux/pci.h
> @@ -517,6 +517,9 @@ struct pci_dev {
>  #endif
>  #ifdef CONFIG_PCI_DOE
>  	struct xarray	doe_mbs;	/* Data Object Exchange mailboxes */
> +#endif
> +#ifdef CONFIG_PCI_NPEM
> +	struct npem	*npem;		/* Native PCIe Enclosure Management */
>  #endif
>  	u16		acs_cap;	/* ACS Capability offset */
>  	phys_addr_t	rom;		/* Physical address if not from BAR */
> diff --git a/include/uapi/linux/pci_regs.h b/include/uapi/linux/pci_regs.h
> index 94c00996e633..97e8b7ed9998 100644
> --- a/include/uapi/linux/pci_regs.h
> +++ b/include/uapi/linux/pci_regs.h
> @@ -740,6 +740,7 @@
>  #define PCI_EXT_CAP_ID_DVSEC	0x23	/* Designated Vendor-Specific */
>  #define PCI_EXT_CAP_ID_DLF	0x25	/* Data Link Feature */
>  #define PCI_EXT_CAP_ID_PL_16GT	0x26	/* Physical Layer 16.0 GT/s */
> +#define PCI_EXT_CAP_ID_NPEM	0x29	/* Native PCIe Enclosure Management */
>  #define PCI_EXT_CAP_ID_PL_32GT  0x2A    /* Physical Layer 32.0 GT/s */
>  #define PCI_EXT_CAP_ID_DOE	0x2E	/* Data Object Exchange */
>  #define PCI_EXT_CAP_ID_MAX	PCI_EXT_CAP_ID_DOE
> @@ -1121,6 +1122,39 @@
>  #define  PCI_PL_16GT_LE_CTRL_USP_TX_PRESET_MASK		0x000000F0
>  #define  PCI_PL_16GT_LE_CTRL_USP_TX_PRESET_SHIFT	4
>  
> +/* Native PCIe Enclosure Management */
> +#define PCI_NPEM_CAP	0x04 /* NPEM capability register */
> +#define	 PCI_NPEM_CAP_CAPABLE		0x00000001 /* NPEM Capable */
> +
> +#define PCI_NPEM_CTRL	0x08 /* NPEM control register */
> +#define	 PCI_NPEM_CTRL_ENABLE		0x00000001 /* NPEM Enable */
> +
> +#define PCI_NPEM_STATUS	0x0c /* NPEM status register */
> +#define	 PCI_NPEM_STATUS_CC		0x00000001 /* NPEM Command completed */
> +/*
> + * Native PCIe Enclosure Management indication bits and Reset command bit
> + * are corresponding for capability and control registers.
> + */
> +#define  PCI_NPEM_CMD_RESET		0x00000002 /* NPEM Reset Command */
> +#define  PCI_NPEM_IND_OK		0x00000004 /* NPEM indication OK */
> +#define  PCI_NPEM_IND_LOCATE		0x00000008 /* NPEM indication Locate */
> +#define  PCI_NPEM_IND_FAIL		0x00000010 /* NPEM indication Fail */
> +#define  PCI_NPEM_IND_REBUILD		0x00000020 /* NPEM indication Rebuild */
> +#define  PCI_NPEM_IND_PFA		0x00000040 /* NPEM indication Predicted Failure Analysis */
> +#define  PCI_NPEM_IND_HOTSPARE		0x00000080 /* NPEM indication Hot Spare */
> +#define  PCI_NPEM_IND_ICA		0x00000100 /* NPEM indication In Critical Array */
> +#define  PCI_NPEM_IND_IFA		0x00000200 /* NPEM indication In Failed Array */
> +#define  PCI_NPEM_IND_IDT		0x00000400 /* NPEM indication Invalid Device Type */
> +#define  PCI_NPEM_IND_DISABLED		0x00000800 /* NPEM indication Disabled */
> +#define  PCI_NPEM_IND_SPEC_0		0x00800000
> +#define  PCI_NPEM_IND_SPEC_1		0x01000000
> +#define  PCI_NPEM_IND_SPEC_2		0x02000000
> +#define  PCI_NPEM_IND_SPEC_3		0x04000000
> +#define  PCI_NPEM_IND_SPEC_4		0x08000000
> +#define  PCI_NPEM_IND_SPEC_5		0x10000000
> +#define  PCI_NPEM_IND_SPEC_6		0x20000000
> +#define  PCI_NPEM_IND_SPEC_7		0x40000000

Given no other driver needs this, I would define them locally in
drivers/pci/npem.c.
Lukas Wunner May 29, 2024, 9:38 a.m. UTC | #3
On Tue, May 28, 2024 at 10:21:10PM -0700, Dan Williams wrote:
> Mariusz Tkaczyk wrote:
> > +config PCI_NPEM
> > +	bool "Native PCIe Enclosure Management"
> > +	depends on LEDS_CLASS=y
> 
> I would have expected
> 
>     depends on NEW_LEDS
>     select LEDS_CLASS

Hm, a quick "git grep -C 2 'depends on NEW_LEDS'" shows that noone else
does that.  Everyone else either selects both NEW_LEDS and LEDS_CLASS
or depends on both or depends on just LEDS_CLASS.

(Since LEDS_CLASS is constrained to "if NEW_LEDS", depending on both
seems pointless, so I'm not sure why some people do that.)

I guess it would be good to get guidance from leds maintainers what
the preferred modus operandi is.


> > +#define for_each_indication(ind, inds) \
> > +	for (ind = inds; ind->bit; ind++)
> > +
> > +/* To avoid confusion, do not keep any special bits in indications */
> 
> I am confused by this comment. What "special bits" is this referring to?

I think it's referring to bit 0 in the Status and Control register,
which is a master "NPEM Capable" and "NPEM Enable" bit.


> > +struct npem_ops {
> > +	const struct indication *inds;
> 
> @inds is not an operation, it feels like something that belongs as
> another member in 'struct npem'. What drove this data to join 'struct
> npem_ops'?

The native NPEM register interface supports enclosure-specific indications
which the DSM interface does not support.  So those indications are
present in the native npem_ops->inds and not present in the DSM
npem_ops->inds.


> > --- a/include/uapi/linux/pci_regs.h
> > +++ b/include/uapi/linux/pci_regs.h
[...]
> > +#define  PCI_NPEM_IND_SPEC_0		0x00800000
> > +#define  PCI_NPEM_IND_SPEC_1		0x01000000
> > +#define  PCI_NPEM_IND_SPEC_2		0x02000000
> > +#define  PCI_NPEM_IND_SPEC_3		0x04000000
> > +#define  PCI_NPEM_IND_SPEC_4		0x08000000
> > +#define  PCI_NPEM_IND_SPEC_5		0x10000000
> > +#define  PCI_NPEM_IND_SPEC_6		0x20000000
> > +#define  PCI_NPEM_IND_SPEC_7		0x40000000
> 
> Given no other driver needs this, I would define them locally in
> drivers/pci/npem.c.

This is a uapi header, so could be used not just by other drivers
but by user space.

It's common to add spec-defined register bits to this header file
even if they're only used by a single source file in the kernel.

Thanks,

Lukas
Mariusz Tkaczyk June 12, 2024, 11:40 a.m. UTC | #4
Hi,
Thanks for feedback Dan!

On Wed, 29 May 2024 11:38:12 +0200
Lukas Wunner <lukas@wunner.de> wrote:

> On Tue, May 28, 2024 at 10:21:10PM -0700, Dan Williams wrote:
> > Mariusz Tkaczyk wrote:  
> > > +config PCI_NPEM
> > > +	bool "Native PCIe Enclosure Management"
> > > +	depends on LEDS_CLASS=y  
> > 
> > I would have expected
> > 
> >     depends on NEW_LEDS
> >     select LEDS_CLASS  
> 
> Hm, a quick "git grep -C 2 'depends on NEW_LEDS'" shows that noone else
> does that.  Everyone else either selects both NEW_LEDS and LEDS_CLASS
> or depends on both or depends on just LEDS_CLASS.
> 
> (Since LEDS_CLASS is constrained to "if NEW_LEDS", depending on both
> seems pointless, so I'm not sure why some people do that.)
> 
> I guess it would be good to get guidance from leds maintainers what
> the preferred modus operandi is.

Pavel, could you please advice?
I have no clue which way I should take so I prefer to keep current approach.

> 
> 
> > > +#define for_each_indication(ind, inds) \
> > > +	for (ind = inds; ind->bit; ind++)
> > > +
> > > +/* To avoid confusion, do not keep any special bits in indications */  
> > 
> > I am confused by this comment. What "special bits" is this referring to?  
> 
> I think it's referring to bit 0 in the Status and Control register,
> which is a master "NPEM Capable" and "NPEM Enable" bit.

Yes, there are 2 special bits for capability/control
NPEM_CAP_CAPABLE/NPEM_ENABLE and NPEM_CAP_RESET/NPEM_RESET.

I wanted to highlight that these bits are not included in the cache. I will try
to make it more precise in v3.

> 
> 
> > > +struct npem_ops {
> > > +	const struct indication *inds;  
> > 
> > @inds is not an operation, it feels like something that belongs as
> > another member in 'struct npem'. What drove this data to join 'struct
> > npem_ops'?  
> 
> The native NPEM register interface supports enclosure-specific indications
> which the DSM interface does not support.  So those indications are
> present in the native npem_ops->inds and not present in the DSM
> npem_ops->inds.

Yes, I need to differentiate DSM and NPEM indications. DSM has own indications
list.

> 
> 
> > > --- a/include/uapi/linux/pci_regs.h
> > > +++ b/include/uapi/linux/pci_regs.h  
> [...]
> > > +#define  PCI_NPEM_IND_SPEC_0		0x00800000
> > > +#define  PCI_NPEM_IND_SPEC_1		0x01000000
> > > +#define  PCI_NPEM_IND_SPEC_2		0x02000000
> > > +#define  PCI_NPEM_IND_SPEC_3		0x04000000
> > > +#define  PCI_NPEM_IND_SPEC_4		0x08000000
> > > +#define  PCI_NPEM_IND_SPEC_5		0x10000000
> > > +#define  PCI_NPEM_IND_SPEC_6		0x20000000
> > > +#define  PCI_NPEM_IND_SPEC_7		0x40000000  
> > 
> > Given no other driver needs this, I would define them locally in
> > drivers/pci/npem.c.  
> 
> This is a uapi header, so could be used not just by other drivers
> but by user space.
> 
> It's common to add spec-defined register bits to this header file
> even if they're only used by a single source file in the kernel.
> 

I will stay with current state while waiting for Bjorn's voice here.

I will send v3 with fixes requested by Dan and Ilpo but I still need Stuart
feedback on DSM patch.

Thanks,
Mariusz
stuart hayes June 13, 2024, 4:11 p.m. UTC | #5
On 6/12/2024 6:40 AM, Mariusz Tkaczyk wrote:
> Hi,
> Thanks for feedback Dan!
> 
> On Wed, 29 May 2024 11:38:12 +0200
> Lukas Wunner <lukas@wunner.de> wrote:
> 
>> On Tue, May 28, 2024 at 10:21:10PM -0700, Dan Williams wrote:
>>> Mariusz Tkaczyk wrote:
>>>> +config PCI_NPEM
>>>> +	bool "Native PCIe Enclosure Management"
>>>> +	depends on LEDS_CLASS=y
>>>
>>> I would have expected
>>>
>>>      depends on NEW_LEDS
>>>      select LEDS_CLASS
>>
>> Hm, a quick "git grep -C 2 'depends on NEW_LEDS'" shows that noone else
>> does that.  Everyone else either selects both NEW_LEDS and LEDS_CLASS
>> or depends on both or depends on just LEDS_CLASS.
>>
>> (Since LEDS_CLASS is constrained to "if NEW_LEDS", depending on both
>> seems pointless, so I'm not sure why some people do that.)
>>
>> I guess it would be good to get guidance from leds maintainers what
>> the preferred modus operandi is.
> 
> Pavel, could you please advice?
> I have no clue which way I should take so I prefer to keep current approach.
> 
>>
>>
>>>> +#define for_each_indication(ind, inds) \
>>>> +	for (ind = inds; ind->bit; ind++)
>>>> +
>>>> +/* To avoid confusion, do not keep any special bits in indications */
>>>
>>> I am confused by this comment. What "special bits" is this referring to?
>>
>> I think it's referring to bit 0 in the Status and Control register,
>> which is a master "NPEM Capable" and "NPEM Enable" bit.
> 
> Yes, there are 2 special bits for capability/control
> NPEM_CAP_CAPABLE/NPEM_ENABLE and NPEM_CAP_RESET/NPEM_RESET.
> 
> I wanted to highlight that these bits are not included in the cache. I will try
> to make it more precise in v3.
> 
>>
>>
>>>> +struct npem_ops {
>>>> +	const struct indication *inds;
>>>
>>> @inds is not an operation, it feels like something that belongs as
>>> another member in 'struct npem'. What drove this data to join 'struct
>>> npem_ops'?
>>
>> The native NPEM register interface supports enclosure-specific indications
>> which the DSM interface does not support.  So those indications are
>> present in the native npem_ops->inds and not present in the DSM
>> npem_ops->inds.
> 
> Yes, I need to differentiate DSM and NPEM indications. DSM has own indications
> list.
> 
>>
>>
>>>> --- a/include/uapi/linux/pci_regs.h
>>>> +++ b/include/uapi/linux/pci_regs.h
>> [...]
>>>> +#define  PCI_NPEM_IND_SPEC_0		0x00800000
>>>> +#define  PCI_NPEM_IND_SPEC_1		0x01000000
>>>> +#define  PCI_NPEM_IND_SPEC_2		0x02000000
>>>> +#define  PCI_NPEM_IND_SPEC_3		0x04000000
>>>> +#define  PCI_NPEM_IND_SPEC_4		0x08000000
>>>> +#define  PCI_NPEM_IND_SPEC_5		0x10000000
>>>> +#define  PCI_NPEM_IND_SPEC_6		0x20000000
>>>> +#define  PCI_NPEM_IND_SPEC_7		0x40000000
>>>
>>> Given no other driver needs this, I would define them locally in
>>> drivers/pci/npem.c.
>>
>> This is a uapi header, so could be used not just by other drivers
>> but by user space.
>>
>> It's common to add spec-defined register bits to this header file
>> even if they're only used by a single source file in the kernel.
>>
> 
> I will stay with current state while waiting for Bjorn's voice here.
> 
> I will send v3 with fixes requested by Dan and Ilpo but I still need Stuart
> feedback on DSM patch.
> 
> Thanks,
> Mariusz

I'm working on testing this now, sorry for the delay.
Thanks!  Stuart
Mariusz Tkaczyk June 14, 2024, 1:40 p.m. UTC | #6
On Tue, 28 May 2024 16:55:16 +0300 (EEST)
Ilpo Järvinen <ilpo.jarvinen@linux.intel.com> wrote:

> > +#define  PCI_NPEM_IND_SPEC_0		0x00800000
> > +#define  PCI_NPEM_IND_SPEC_1		0x01000000
> > +#define  PCI_NPEM_IND_SPEC_2		0x02000000
> > +#define  PCI_NPEM_IND_SPEC_3		0x04000000
> > +#define  PCI_NPEM_IND_SPEC_4		0x08000000
> > +#define  PCI_NPEM_IND_SPEC_5		0x10000000
> > +#define  PCI_NPEM_IND_SPEC_6		0x20000000
> > +#define  PCI_NPEM_IND_SPEC_7		0x40000000  
> 
> Are these off-by-one, shouldn't that field go all the way to the last bit 
> which is 0x80000000 (the field is 31:24)??

Yes, I made a mistake. Good catch!

Mariusz
stuart hayes June 14, 2024, 9:06 p.m. UTC | #7
On 5/28/2024 8:19 AM, Mariusz Tkaczyk wrote:
> Native PCIe Enclosure Management (NPEM, PCIe r6.1 sec 6.28) allows
> managing LED in storage enclosures. NPEM is indication oriented
> and it does not give direct access to LED. Although each of
> the indications *could* represent an individual LED, multiple
> indications could also be represented as a single,
> multi-color LED or a single LED blinking in a specific interval.
> The specification leaves that open.
> 
> Each enabled indication (capability register bit on) is represented as a
> ledclass_dev which can be controlled through sysfs. For every ledclass
> device only 2 brightness states are allowed: LED_ON (1) or LED_OFF (0).
> It is corresponding to NPEM control register (Indication bit on/off).
> 
> Ledclass devices appear in sysfs as child devices (subdirectory) of PCI
> device which has an NPEM Extended Capability and indication is enabled
> in NPEM capability register. For example, these are leds created for
> pcieport "10000:02:05.0" on my setup:
> 
> leds/
> ├── 10000:02:05.0:enclosure:fail
> ├── 10000:02:05.0:enclosure:locate
> ├── 10000:02:05.0:enclosure:ok
> └── 10000:02:05.0:enclosure:rebuild
> 
> They can be also found in "/sys/class/leds" directory. Parent PCIe device
> bdf is used to guarantee uniqueness across leds subsystem.
> 
> To enable/disable fail indication "brightness" file can be edited:
> echo 1 > ./leds/10000:02:05.0:enclosure:fail/brightness
> echo 0 > ./leds/10000:02:05.0:enclosure:fail/brightness
> 
> PCIe r6.1, sec 7.9.19.2 defines the possible indications.
> 
> Multiple indications for same parent PCIe device can conflict and
> hardware may update them when processing new request. To avoid issues,
> driver refresh all indications by reading back control register.
> 
> Driver is projected to be exclusive NPEM extended capability manager.
> It waits up to 1 second after imposing new request, it doesn't verify if
> controller is busy before write, assuming that mutex lock gives protection
> from concurrent updates. Driver is not registered if _DSM LED management
> is available.
> 
> NPEM is a PCIe extended capability so it should be registered in
> pcie_init_capabilities() but it is not possible due to LED dependency.
> Parent pci_device must be added earlier for led_classdev_register()
> to be successful. NPEM does not require configuration on kernel side, it
> is safe to register LED devices later.
> 
> Link: https://members.pcisig.com/wg/PCI-SIG/document/19849 [1]
> Suggested-by: Lukas Wunner <lukas@wunner.de>
> Signed-off-by: Mariusz Tkaczyk <mariusz.tkaczyk@linux.intel.com>
> ---
>   drivers/pci/Kconfig           |   9 +
>   drivers/pci/Makefile          |   1 +
>   drivers/pci/npem.c            | 410 ++++++++++++++++++++++++++++++++++
>   drivers/pci/pci.h             |   8 +
>   drivers/pci/probe.c           |   2 +
>   drivers/pci/remove.c          |   2 +
>   include/linux/pci.h           |   3 +
>   include/uapi/linux/pci_regs.h |  34 +++
>   8 files changed, 469 insertions(+)
>   create mode 100644 drivers/pci/npem.c
> 
> diff --git a/drivers/pci/Kconfig b/drivers/pci/Kconfig
> index d35001589d88..e696e69ad516 100644
> --- a/drivers/pci/Kconfig
> +++ b/drivers/pci/Kconfig
> @@ -143,6 +143,15 @@ config PCI_IOV
>   
>   	  If unsure, say N.
>   
> +config PCI_NPEM
> +	bool "Native PCIe Enclosure Management"
> +	depends on LEDS_CLASS=y
> +	help
> +	  Support for Native PCIe Enclosure Management. It allows managing LED
> +	  indications in storage enclosures. Enclosure must support following
> +	  indications: OK, Locate, Fail, Rebuild, other indications are
> +	  optional.
> +
>   config PCI_PRI
>   	bool "PCI PRI support"
>   	select PCI_ATS
> diff --git a/drivers/pci/Makefile b/drivers/pci/Makefile
> index 175302036890..cd5f655d4be9 100644
> --- a/drivers/pci/Makefile
> +++ b/drivers/pci/Makefile
> @@ -34,6 +34,7 @@ obj-$(CONFIG_XEN_PCIDEV_FRONTEND) += xen-pcifront.o
>   obj-$(CONFIG_VGA_ARB)		+= vgaarb.o
>   obj-$(CONFIG_PCI_DOE)		+= doe.o
>   obj-$(CONFIG_PCI_DYNAMIC_OF_NODES) += of_property.o
> +obj-$(CONFIG_PCI_NPEM)		+= npem.o
>   
>   # Endpoint library must be initialized before its users
>   obj-$(CONFIG_PCI_ENDPOINT)	+= endpoint/
> diff --git a/drivers/pci/npem.c b/drivers/pci/npem.c
> new file mode 100644
> index 000000000000..a76a2044dab2
> --- /dev/null
> +++ b/drivers/pci/npem.c
> @@ -0,0 +1,410 @@
> +// SPDX-License-Identifier: GPL-2.0
> +/*
> + * PCIe Enclosure management driver created for LED interfaces based on
> + * indications. It says *what indications* blink but does not specify *how*
> + * they blink - it is hardware defined.
> + *
> + * The driver name refers to Native PCIe Enclosure Management. It is
> + * first indication oriented standard with specification.
> + *
> + * Native PCIe Enclosure Management (NPEM)
> + *	PCIe Base Specification r6.1 sec 6.28
> + *	PCIe Base Specification r6.1 sec 7.9.19
> + *
> + * Copyright (c) 2023-2024 Intel Corporation
> + *	Mariusz Tkaczyk <mariusz.tkaczyk@linux.intel.com>
> + */
> +
> +#include <linux/acpi.h>
> +#include <linux/bitops.h>
> +#include <linux/errno.h>
> +#include <linux/iopoll.h>
> +#include <linux/leds.h>
> +#include <linux/mutex.h>
> +#include <linux/pci.h>
> +#include <linux/pci_regs.h>
> +#include <linux/types.h>
> +#include <linux/uleds.h>
> +
> +#include "pci.h"
> +
> +struct indication {
> +	u32 bit;
> +	const char *name;
> +};
> +
> +static const struct indication npem_indications[] = {
> +	{PCI_NPEM_IND_OK,	"enclosure:ok"},
> +	{PCI_NPEM_IND_LOCATE,	"enclosure:locate"},
> +	{PCI_NPEM_IND_FAIL,	"enclosure:fail"},
> +	{PCI_NPEM_IND_REBUILD,	"enclosure:rebuild"},
> +	{PCI_NPEM_IND_PFA,	"enclosure:pfa"},
> +	{PCI_NPEM_IND_HOTSPARE,	"enclosure:hotspare"},
> +	{PCI_NPEM_IND_ICA,	"enclosure:ica"},
> +	{PCI_NPEM_IND_IFA,	"enclosure:ifa"},
> +	{PCI_NPEM_IND_IDT,	"enclosure:idt"},
> +	{PCI_NPEM_IND_DISABLED,	"enclosure:disabled"},
> +	{PCI_NPEM_IND_SPEC_0,	"enclosure:specific_0"},
> +	{PCI_NPEM_IND_SPEC_1,	"enclosure:specific_1"},
> +	{PCI_NPEM_IND_SPEC_2,	"enclosure:specific_2"},
> +	{PCI_NPEM_IND_SPEC_3,	"enclosure:specific_3"},
> +	{PCI_NPEM_IND_SPEC_4,	"enclosure:specific_4"},
> +	{PCI_NPEM_IND_SPEC_5,	"enclosure:specific_5"},
> +	{PCI_NPEM_IND_SPEC_6,	"enclosure:specific_6"},
> +	{PCI_NPEM_IND_SPEC_7,	"enclosure:specific_7"},
> +	{0,			NULL}
> +};
> +
> +#define for_each_indication(ind, inds) \
> +	for (ind = inds; ind->bit; ind++)
> +
> +/* To avoid confusion, do not keep any special bits in indications */
> +static u32 reg_to_indications(u32 caps, const struct indication *inds)
> +{
> +	const struct indication *ind;
> +	u32 supported_indications = 0;
> +
> +	for_each_indication(ind, inds)
> +		supported_indications |= ind->bit;
> +
> +	return caps & supported_indications;
> +}
> +
> +/**
> + * struct npem_led - LED details
> + * @indication: indication details
> + * @npem: npem device
> + * @name: LED name
> + * @led: LED device
> + */
> +struct npem_led {
> +	const struct indication *indication;
> +	struct npem *npem;
> +	char name[LED_MAX_NAME_SIZE];
> +	struct led_classdev led;
> +};
> +
> +/**
> + * struct npem_ops - backend specific callbacks
> + * @inds: supported indications array
> + * @get_active_indications: get active indications
> + *	npem: npem device
> + *	buf: response buffer
> + * @set_active_indications: set new indications
> + *	npem: npem device
> + *	val: bit mask to set
> + *
> + * Handle communication with hardware. set_active_indications updates
> + * npem->active_indications.
> + */
> +struct npem_ops {
> +	const struct indication *inds;
> +	int (*get_active_indications)(struct npem *npem, u32 *buf);
> +	int (*set_active_indications)(struct npem *npem, u32 val);
> +};
> +
> +/**
> + * struct npem - NPEM device properties
> + * @dev: PCIe device this driver is attached to
> + * @ops: Backend specific callbacks
> + * @npem_lock: to keep concurrent updates serialized
> + * @pos: NPEM capability offset (only relevant for NPEM direct register access,
> + *	 not DSM access method)
> + * @supported_indications: bit mask of supported indications
> + *			   non-indication and reserved bits are cleared
> + * @active_indications: bit mask of active indications
> + *			non-indication and reserved bits are cleared
> + * @led_cnt: Supported LEDs count
> + * @leds: supported LEDs
> + */
> +struct npem {
> +	struct pci_dev *dev;
> +	const struct npem_ops *ops;
> +	struct mutex npem_lock;
> +	u16 pos;
> +	u32 supported_indications;
> +	u32 active_indications;
> +	int led_cnt;
> +	struct npem_led leds[];
> +};
> +
> +static int npem_read_reg(struct npem *npem, u16 reg, u32 *val)
> +{
> +	int ret = pci_read_config_dword(npem->dev, npem->pos + reg, val);
> +
> +	return pcibios_err_to_errno(ret);
> +}
> +
> +static int npem_write_ctrl(struct npem *npem, u32 reg)
> +{
> +	int pos = npem->pos + PCI_NPEM_CTRL;
> +	int ret = pci_write_config_dword(npem->dev, pos, reg);
> +
> +	return pcibios_err_to_errno(ret);
> +}
> +
> +static int npem_get_active_indications(struct npem *npem, u32 *buf)
> +{
> +	int ret;
> +
> +	ret = npem_read_reg(npem, PCI_NPEM_CTRL, buf);
> +	if (ret)
> +		return ret;
> +
> +	/* If PCI_NPEM_CTRL_ENABLE is not set then no indication should blink */
> +	if (!(*buf & PCI_NPEM_CTRL_ENABLE))
> +		*buf = 0;
> +
> +	/* Filter out not supported indications in response */
> +	*buf &= npem->supported_indications;
> +	return 0;
> +}
> +
> +static int npem_set_active_indications(struct npem *npem, u32 val)
> +{
> +	int ret, ret_val;
> +	u32 cc_status;
> +
> +	lockdep_assert_held(&npem->npem_lock);
> +
> +	/* This bit is always required */
> +	val |= PCI_NPEM_CTRL_ENABLE;
> +	ret = npem_write_ctrl(npem, val);
> +	if (ret)
> +		return ret;
> +
> +	/*
> +	 * For the case where a NPEM command has not completed immediately,
> +	 * it is recommended that software not continuously “spin” on polling
> +	 * the status register, but rather poll under interrupt at a reduced
> +	 * rate; for example at 10 ms intervals.
> +	 *
> +	 * PCIe r6.1 sec 6.28 "Implementation Note: Software Polling of NPEM
> +	 * Command Completed"
> +	 */
> +	ret = read_poll_timeout(npem_read_reg, ret_val,
> +				ret_val || (cc_status & PCI_NPEM_STATUS_CC),
> +				10 * USEC_PER_MSEC, USEC_PER_SEC, false, npem,
> +				PCI_NPEM_STATUS, &cc_status);
> +	if (ret)
> +		return ret_val;
> +
> +	/*
> +	 * All writes to control register, including writes that do not change
> +	 * the register value, are NPEM commands and should eventually result
> +	 * in a command completion indication in the NPEM Status Register.
> +	 *
> +	 * PCIe Base Specification r6.1 sec 7.9.19.3
> +	 *
> +	 * Register may not be updated, or other conflicting bits may be
> +	 * cleared. Spec is not strict here. Read NPEM Control register after
> +	 * write to keep cache in-sync.
> +	 */
> +	return npem_get_active_indications(npem, &npem->active_indications);
> +}
> +
> +static const struct npem_ops npem_ops = {
> +	.inds = npem_indications,
> +	.get_active_indications = npem_get_active_indications,
> +	.set_active_indications = npem_set_active_indications,
> +};
> +
> +#define DSM_GUID GUID_INIT(0x5d524d9d, 0xfff9, 0x4d4b,  0x8c, 0xb7, 0x74, 0x7e,\
> +			   0xd5, 0x1e, 0x19, 0x4d)
> +#define GET_SUPPORTED_STATES_DSM	BIT(1)
> +#define GET_STATE_DSM			BIT(2)
> +#define SET_STATE_DSM			BIT(3)
> +
> +static const guid_t dsm_guid = DSM_GUID;
> +
> +static bool npem_has_dsm(struct pci_dev *pdev)
> +{
> +	acpi_handle handle;
> +
> +	handle = ACPI_HANDLE(&pdev->dev);
> +	if (!handle)
> +		return false;
> +
> +	return acpi_check_dsm(handle, &dsm_guid, 0x1, GET_SUPPORTED_STATES_DSM |
> +			      GET_STATE_DSM | SET_STATE_DSM);
> +}
> +

I had to change the DSM definitions to be 1/2/3 rather than BIT(1)/BIT(2)/BIT(3), since
the numbers 1/2/3 (not 2/4/8) need to be passed to acpi_evaluate_dsm_typed()... I just
changed it to:

#define GET_SUPPORTED_STATES_DSM	1
#define GET_STATE_DSM			2
#define SET_STATE_DSM			3

...and

	return acpi_check_dsm(handle, &dsm_guid, 0x1,
			      BIT(GET_SUPPORTED_STATES_DSM) |
			      BIT(GET_STATE_DSM) | BIT(SET_STATE_DSM));

After I changed this (and one other thing, below), it worked great.

> +/*
> + * This function does not use ops->get_active_indications().
> + * It returns cached value, npem->npem_lock is held and it is safe.
> + */
> +static enum led_brightness brightness_get(struct led_classdev *led)
> +{
> +	struct npem_led *nled = container_of(led, struct npem_led, led);
> +	struct npem *npem = nled->npem;
> +	int ret, val = LED_OFF;
> +
> +	ret = mutex_lock_interruptible(&npem->npem_lock);
> +	if (ret)
> +		return ret;
> +
> +	if (npem->active_indications & nled->indication->bit)
> +		val = LED_ON;
> +
> +	mutex_unlock(&npem->npem_lock);
> +
> +	return val;
> +}
> +
> +static int brightness_set(struct led_classdev *led,
> +			  enum led_brightness brightness)
> +{
> +	struct npem_led *nled = container_of(led, struct npem_led, led);
> +	struct npem *npem = nled->npem;
> +	u32 indications;
> +	int ret;
> +
> +	ret = mutex_lock_interruptible(&npem->npem_lock);
> +	if (ret)
> +		return ret;
> +
> +	if (brightness == LED_OFF)
> +		indications = npem->active_indications & ~(nled->indication->bit);
> +	else
> +		indications = npem->active_indications | nled->indication->bit;
> +
> +	ret = npem->ops->set_active_indications(npem, indications);
> +
> +	mutex_unlock(&npem->npem_lock);
> +	return ret;
> +}
> +
> +static void npem_free(struct npem *npem)
> +{
> +	struct npem_led *nled;
> +	int cnt;
> +
> +	for (cnt = 0; cnt < npem->led_cnt; cnt++) {
> +		nled = &npem->leds[cnt];
> +
> +		if (nled->name[0])
> +			led_classdev_unregister(&nled->led);
> +	}
> +
> +	mutex_destroy(&npem->npem_lock);
> +	kfree(npem);
> +}
> +
> +static int pci_npem_set_led_classdev(struct npem *npem, struct npem_led *nled)
> +{
> +	struct led_classdev *led = &nled->led;
> +	struct led_init_data init_data = {};
> +	char *name = nled->name;
> +	int ret;
> +
> +	init_data.devicename = pci_name(npem->dev);
> +	init_data.default_label = nled->indication->name;
> +
> +	ret = led_compose_name(&npem->dev->dev, &init_data, name);
> +	if (ret)
> +		return ret;
> +
> +	led->name = name;
> +	led->brightness_set_blocking = brightness_set;
> +	led->brightness_get = brightness_get;
> +	led->max_brightness = LED_ON;
> +	led->default_trigger = "none";
> +	led->flags = 0;
> +
> +	ret = led_classdev_register(&npem->dev->dev, led);
> +	if (ret)
> +		/* Clear the name to indicate that it is not registered. */
> +		name[0] = 0;
> +	return ret;
> +}
> +
> +static int pci_npem_init(struct pci_dev *dev, const struct npem_ops *ops,
> +			 int pos, u32 caps)
> +{
> +	u32 supported = reg_to_indications(caps, ops->inds);
> +	int supported_cnt = hweight32(supported);
> +	const struct indication *indication;
> +	struct npem_led *nled;
> +	struct npem *npem;
> +	int led_idx = 0;
> +	u32 active;
> +	int ret;
> +
> +	npem = kzalloc(struct_size(npem, leds, supported_cnt), GFP_KERNEL);
> +
> +	if (!npem)
> +		return -ENOMEM;
> +
> +	npem->supported_indications = supported;
> +	npem->led_cnt = supported_cnt;
> +	npem->pos = pos;
> +	npem->dev = dev;
> +	npem->ops = ops;
> +
> +	ret = ops->get_active_indications(npem, &active);
> +	if (ret) {
> +		npem_free(npem);
> +		return -EACCES;
> +	}

Failing pci_npem_init() if this ops->get_active_indications() fails will keep this
from working on most (all?) Dell servers, because the _DSM get/set functions use an
IPMI operation region to get/set the active LEDs, and this is getting run before the
IPMI drivers and acpi_ipmi module (which provides ACPI access to IPMI operation
regions) get loaded.  (GET_SUPPORTED_STATES works without IPMI.)

For testing, I just changed this to ignore the error returned from
get_active_indications() if it is using DSM ops.  That still results in a string of
errors ("ACPI Error: Region IPMI (ID=7) has no handler" etc), but once the system is
up and I can log in, everything is loaded, and i can see & change brightness in sysfs.

I'm not sure if ignoring the error is the best fix for that.  Solutions I
can think of are:
(1) ignoring an error result from get_active_indications during init (as mentioned)
(2) providing a mechanism to trigger this driver to rescan a PCI device later from
     user space
(3) don't cache the active LEDs--just get the active states using NPEM/DSM when
     brightness is read, and do a get/modify/set when setting the brightness... then
     get_active_indications() wouldn't need to be called during init.


Thank you for including DSM support, Mariusz!

> +
> +	npem->active_indications = reg_to_indications(active, ops->inds);
> +
> +	/*
> +	 * Do not take npem->npem_lock, get_brightness() is called on
> +	 * registration path.
> +	 */
> +	mutex_init(&npem->npem_lock);
> +
> +	for_each_indication(indication, npem_indications) {
> +		if ((npem->supported_indications & indication->bit) == 0)
> +			/* Do not register unsupported indication */
> +			continue;
> +
> +		nled = &npem->leds[led_idx++];
> +		nled->indication = indication;
> +		nled->npem = npem;
> +
> +		ret = pci_npem_set_led_classdev(npem, nled);
> +		if (ret) {
> +			npem_free(npem);
> +			return ret;
> +		}
> +	}
> +
> +	dev->npem = npem;
> +	return 0;
> +}
> +
> +void pci_npem_remove(struct pci_dev *dev)
> +{
> +	if (dev->npem)
> +		npem_free(dev->npem);
> +}
> +
> +void pci_npem_create(struct pci_dev *dev)
> +{
> +	const struct npem_ops *ops = &npem_ops;
> +	int pos = 0, ret;
> +	u32 cap;
> +
> +	if (!npem_has_dsm(dev)) {
> +		pos = pci_find_ext_capability(dev, PCI_EXT_CAP_ID_NPEM);
> +		if (pos == 0)
> +			return;
> +
> +		if (pci_read_config_dword(dev, pos + PCI_NPEM_CAP, &cap) != 0 ||
> +		    (cap & PCI_NPEM_CAP_CAPABLE) == 0)
> +			return;
> +	} else {
> +		/*
> +		 * OS should use the DSM for LED control if it is available
> +		 * PCI Firmware Spec r3.3 sec 4.7.
> +		 */
> +		return;
> +	}
> +
> +	ret = pci_npem_init(dev, ops, pos, cap);
> +	if (ret)
> +		pci_err(dev, "Failed to register PCIe Enclosure Management driver, err: %d\n",
> +			ret);
> +}
> diff --git a/drivers/pci/pci.h b/drivers/pci/pci.h
> index fd44565c4756..9dea8c7353ab 100644
> --- a/drivers/pci/pci.h
> +++ b/drivers/pci/pci.h
> @@ -333,6 +333,14 @@ static inline void pci_doe_destroy(struct pci_dev *pdev) { }
>   static inline void pci_doe_disconnected(struct pci_dev *pdev) { }
>   #endif
>   
> +#ifdef CONFIG_PCI_NPEM
> +void pci_npem_create(struct pci_dev *dev);
> +void pci_npem_remove(struct pci_dev *dev);
> +#else
> +static inline void pci_npem_create(struct pci_dev *dev) { }
> +static inline void pci_npem_remove(struct pci_dev *dev) { }
> +#endif
> +
>   /**
>    * pci_dev_set_io_state - Set the new error state if possible.
>    *
> diff --git a/drivers/pci/probe.c b/drivers/pci/probe.c
> index 8e696e547565..b8ea6353e27a 100644
> --- a/drivers/pci/probe.c
> +++ b/drivers/pci/probe.c
> @@ -2582,6 +2582,8 @@ void pci_device_add(struct pci_dev *dev, struct pci_bus *bus)
>   	dev->match_driver = false;
>   	ret = device_add(&dev->dev);
>   	WARN_ON(ret < 0);
> +
> +	pci_npem_create(dev);
>   }
>   
>   struct pci_dev *pci_scan_single_device(struct pci_bus *bus, int devfn)
> diff --git a/drivers/pci/remove.c b/drivers/pci/remove.c
> index d749ea8250d6..1436f9cf1fea 100644
> --- a/drivers/pci/remove.c
> +++ b/drivers/pci/remove.c
> @@ -33,6 +33,8 @@ static void pci_destroy_dev(struct pci_dev *dev)
>   	if (!dev->dev.kobj.parent)
>   		return;
>   
> +	pci_npem_remove(dev);
> +
>   	device_del(&dev->dev);
>   
>   	down_write(&pci_bus_sem);
> diff --git a/include/linux/pci.h b/include/linux/pci.h
> index fb004fd4e889..c327c2dd4527 100644
> --- a/include/linux/pci.h
> +++ b/include/linux/pci.h
> @@ -517,6 +517,9 @@ struct pci_dev {
>   #endif
>   #ifdef CONFIG_PCI_DOE
>   	struct xarray	doe_mbs;	/* Data Object Exchange mailboxes */
> +#endif
> +#ifdef CONFIG_PCI_NPEM
> +	struct npem	*npem;		/* Native PCIe Enclosure Management */
>   #endif
>   	u16		acs_cap;	/* ACS Capability offset */
>   	phys_addr_t	rom;		/* Physical address if not from BAR */
> diff --git a/include/uapi/linux/pci_regs.h b/include/uapi/linux/pci_regs.h
> index 94c00996e633..97e8b7ed9998 100644
> --- a/include/uapi/linux/pci_regs.h
> +++ b/include/uapi/linux/pci_regs.h
> @@ -740,6 +740,7 @@
>   #define PCI_EXT_CAP_ID_DVSEC	0x23	/* Designated Vendor-Specific */
>   #define PCI_EXT_CAP_ID_DLF	0x25	/* Data Link Feature */
>   #define PCI_EXT_CAP_ID_PL_16GT	0x26	/* Physical Layer 16.0 GT/s */
> +#define PCI_EXT_CAP_ID_NPEM	0x29	/* Native PCIe Enclosure Management */
>   #define PCI_EXT_CAP_ID_PL_32GT  0x2A    /* Physical Layer 32.0 GT/s */
>   #define PCI_EXT_CAP_ID_DOE	0x2E	/* Data Object Exchange */
>   #define PCI_EXT_CAP_ID_MAX	PCI_EXT_CAP_ID_DOE
> @@ -1121,6 +1122,39 @@
>   #define  PCI_PL_16GT_LE_CTRL_USP_TX_PRESET_MASK		0x000000F0
>   #define  PCI_PL_16GT_LE_CTRL_USP_TX_PRESET_SHIFT	4
>   
> +/* Native PCIe Enclosure Management */
> +#define PCI_NPEM_CAP	0x04 /* NPEM capability register */
> +#define	 PCI_NPEM_CAP_CAPABLE		0x00000001 /* NPEM Capable */
> +
> +#define PCI_NPEM_CTRL	0x08 /* NPEM control register */
> +#define	 PCI_NPEM_CTRL_ENABLE		0x00000001 /* NPEM Enable */
> +
> +#define PCI_NPEM_STATUS	0x0c /* NPEM status register */
> +#define	 PCI_NPEM_STATUS_CC		0x00000001 /* NPEM Command completed */
> +/*
> + * Native PCIe Enclosure Management indication bits and Reset command bit
> + * are corresponding for capability and control registers.
> + */
> +#define  PCI_NPEM_CMD_RESET		0x00000002 /* NPEM Reset Command */
> +#define  PCI_NPEM_IND_OK		0x00000004 /* NPEM indication OK */
> +#define  PCI_NPEM_IND_LOCATE		0x00000008 /* NPEM indication Locate */
> +#define  PCI_NPEM_IND_FAIL		0x00000010 /* NPEM indication Fail */
> +#define  PCI_NPEM_IND_REBUILD		0x00000020 /* NPEM indication Rebuild */
> +#define  PCI_NPEM_IND_PFA		0x00000040 /* NPEM indication Predicted Failure Analysis */
> +#define  PCI_NPEM_IND_HOTSPARE		0x00000080 /* NPEM indication Hot Spare */
> +#define  PCI_NPEM_IND_ICA		0x00000100 /* NPEM indication In Critical Array */
> +#define  PCI_NPEM_IND_IFA		0x00000200 /* NPEM indication In Failed Array */
> +#define  PCI_NPEM_IND_IDT		0x00000400 /* NPEM indication Invalid Device Type */
> +#define  PCI_NPEM_IND_DISABLED		0x00000800 /* NPEM indication Disabled */
> +#define  PCI_NPEM_IND_SPEC_0		0x00800000
> +#define  PCI_NPEM_IND_SPEC_1		0x01000000
> +#define  PCI_NPEM_IND_SPEC_2		0x02000000
> +#define  PCI_NPEM_IND_SPEC_3		0x04000000
> +#define  PCI_NPEM_IND_SPEC_4		0x08000000
> +#define  PCI_NPEM_IND_SPEC_5		0x10000000
> +#define  PCI_NPEM_IND_SPEC_6		0x20000000
> +#define  PCI_NPEM_IND_SPEC_7		0x40000000
> +
>   /* Data Object Exchange */
>   #define PCI_DOE_CAP		0x04    /* DOE Capabilities Register */
>   #define  PCI_DOE_CAP_INT_SUP			0x00000001  /* Interrupt Support */
Lukas Wunner June 15, 2024, 10:33 a.m. UTC | #8
On Fri, Jun 14, 2024 at 04:06:14PM -0500, stuart hayes wrote:
> On 5/28/2024 8:19 AM, Mariusz Tkaczyk wrote:
> > +static int pci_npem_init(struct pci_dev *dev, const struct npem_ops *ops,
> > +			 int pos, u32 caps)
> > +{
[...]
> > +	ret = ops->get_active_indications(npem, &active);
> > +	if (ret) {
> > +		npem_free(npem);
> > +		return -EACCES;
> > +	}
> 
> Failing pci_npem_init() if this ops->get_active_indications() fails
> will keep this from working on most (all?) Dell servers, because the
> _DSM get/set functions use an IPMI operation region to get/set the
> active LEDs, and this is getting run before the IPMI drivers and
> acpi_ipmi module (which provides ACPI access to IPMI operation
> regions) get loaded.  (GET_SUPPORTED_STATES works without IPMI.)

CONFIG_ACPI_IPMI is tristate.  Even if it's built-in, the
module_initcall() becomes a device_initcall().

PCI enumeration happens from a subsys_initcall(), way earlier
than device_initcall().

If you set CONFIG_ACPI_IPMI=y and change the module_initcall() in
drivers/acpi/acpi_ipmi.c to arch_initcall(), does the issue go away?


> (2) providing a mechanism to trigger this driver to rescan a PCI
>     device later from user space

If this was a regular device driver and -EPROBE_DEFER was returned if
IPMI drivers aren't loaded yet, then this would be easy to solve.
But neither is the case here.

Of course it's possible to exercise the "remove" and "rescan" attributes
in sysfs to re-enumerate the device but that's not a great solution.


> (3) don't cache the active LEDs--just get the active states using
>     NPEM/DSM when brightness is read, and do a get/modify/set when
>     setting the brightness... then get_active_indications() wouldn't
>     need to be called during init.

Not good.  The LEDs are published in sysfs from a subsys_initcall().
Brightness changes through sysfs could in theory immediately happen
once they're published.  If acpi_ipmi is a module and gets loaded way
later, we'd still have to cope with Set State or Get State DSMs going
nowhere.

Thanks,

Lukas
Mariusz Tkaczyk June 18, 2024, 8:56 a.m. UTC | #9
On Sat, 15 Jun 2024 12:33:45 +0200
Lukas Wunner <lukas@wunner.de> wrote:

> On Fri, Jun 14, 2024 at 04:06:14PM -0500, stuart hayes wrote:
> > On 5/28/2024 8:19 AM, Mariusz Tkaczyk wrote:  
> > > +static int pci_npem_init(struct pci_dev *dev, const struct npem_ops *ops,
> > > +			 int pos, u32 caps)
> > > +{  
> [...]
> > > +	ret = ops->get_active_indications(npem, &active);
> > > +	if (ret) {
> > > +		npem_free(npem);
> > > +		return -EACCES;
> > > +	}  
> > 
> > Failing pci_npem_init() if this ops->get_active_indications() fails
> > will keep this from working on most (all?) Dell servers, because the
> > _DSM get/set functions use an IPMI operation region to get/set the
> > active LEDs, and this is getting run before the IPMI drivers and
> > acpi_ipmi module (which provides ACPI access to IPMI operation
> > regions) get loaded.  (GET_SUPPORTED_STATES works without IPMI.)  
> 
> CONFIG_ACPI_IPMI is tristate.  Even if it's built-in, the
> module_initcall() becomes a device_initcall().
> 
> PCI enumeration happens from a subsys_initcall(), way earlier
> than device_initcall().
> 
> If you set CONFIG_ACPI_IPMI=y and change the module_initcall() in
> drivers/acpi/acpi_ipmi.c to arch_initcall(), does the issue go away?

That seems to be the best option. Please test Lukas proposal and let me know.
Shouldn't I make a dependency to ACPI_IPMI in Kconfig (with optional comment
about initcall)?

+config PCI_NPEM
+	bool "Native PCIe Enclosure Management"
+	depends on LEDS_CLASS=y
+	depends on ACPI_IPMI=y

> 
> 
> > (2) providing a mechanism to trigger this driver to rescan a PCI
> >     device later from user space  
> 
> If this was a regular device driver and -EPROBE_DEFER was returned if
> IPMI drivers aren't loaded yet, then this would be easy to solve.
> But neither is the case here.
> 
> Of course it's possible to exercise the "remove" and "rescan" attributes
> in sysfs to re-enumerate the device but that's not a great solution.

We cannot expect from users to know and do that. If we cannot configure driver,
we should not register it. We have to guarantee that IMPI commands are
available at the point we are using them.

There is not better place to add _DSM device than its enumeration and I have a
feeling than sooner or later someone else will reach this problem so it would
be better for community to solve it now.

> 
> 
> > (3) don't cache the active LEDs--just get the active states using
> >     NPEM/DSM when brightness is read, and do a get/modify/set when
> >     setting the brightness... then get_active_indications() wouldn't
> >     need to be called during init.  
> 
> Not good.  The LEDs are published in sysfs from a subsys_initcall().
> Brightness changes through sysfs could in theory immediately happen
> once they're published.  If acpi_ipmi is a module and gets loaded way
> later, we'd still have to cope with Set State or Get State DSMs going
> nowhere.
> 

Agree. I can do it and it should be safe but it is not addressing the issue.
We would limit time race window but we will not close it.

Thanks,
Mariusz
Lukas Wunner June 18, 2024, 5:13 p.m. UTC | #10
On Tue, Jun 18, 2024 at 10:56:53AM +0200, Mariusz Tkaczyk wrote:
> On Sat, 15 Jun 2024 12:33:45 +0200 Lukas Wunner <lukas@wunner.de> wrote:
>> > On Fri, Jun 14, 2024 at 04:06:14PM -0500, stuart hayes wrote:
> > > Failing pci_npem_init() if this ops->get_active_indications() fails
> > > will keep this from working on most (all?) Dell servers, because the
> > > _DSM get/set functions use an IPMI operation region to get/set the
> > > active LEDs, and this is getting run before the IPMI drivers and
> > > acpi_ipmi module (which provides ACPI access to IPMI operation
> > > regions) get loaded.  (GET_SUPPORTED_STATES works without IPMI.)  
> > 
> > CONFIG_ACPI_IPMI is tristate.  Even if it's built-in, the
> > module_initcall() becomes a device_initcall().
> > 
> > PCI enumeration happens from a subsys_initcall(), way earlier
> > than device_initcall().
> > 
> > If you set CONFIG_ACPI_IPMI=y and change the module_initcall() in
> > drivers/acpi/acpi_ipmi.c to arch_initcall(), does the issue go away?
> 
> That seems to be the best option. Please test Lukas proposal and let me know.
> Shouldn't I make a dependency to ACPI_IPMI in Kconfig (with optional comment
> about initcall)?
> 
> +config PCI_NPEM
> +	bool "Native PCIe Enclosure Management"
> +	depends on LEDS_CLASS=y
> +	depends on ACPI_IPMI=y

This would effectively disallow NPEM on non-ACPI systems.

I think what you want instead is to allow either ACPI_IPMI=y or
ACPI_IPMI=n, but not ACPI_IPMI=m, so:

	depends on ACPI_IPMI!=m

Thanks,

Lukas
stuart hayes June 18, 2024, 6:50 p.m. UTC | #11
On 6/18/2024 3:56 AM, Mariusz Tkaczyk wrote:
> On Sat, 15 Jun 2024 12:33:45 +0200
> Lukas Wunner <lukas@wunner.de> wrote:
> 
>> On Fri, Jun 14, 2024 at 04:06:14PM -0500, stuart hayes wrote:
>>> On 5/28/2024 8:19 AM, Mariusz Tkaczyk wrote:
>>>> +static int pci_npem_init(struct pci_dev *dev, const struct npem_ops *ops,
>>>> +			 int pos, u32 caps)
>>>> +{
>> [...]
>>>> +	ret = ops->get_active_indications(npem, &active);
>>>> +	if (ret) {
>>>> +		npem_free(npem);
>>>> +		return -EACCES;
>>>> +	}
>>>
>>> Failing pci_npem_init() if this ops->get_active_indications() fails
>>> will keep this from working on most (all?) Dell servers, because the
>>> _DSM get/set functions use an IPMI operation region to get/set the
>>> active LEDs, and this is getting run before the IPMI drivers and
>>> acpi_ipmi module (which provides ACPI access to IPMI operation
>>> regions) get loaded.  (GET_SUPPORTED_STATES works without IPMI.)
>>
>> CONFIG_ACPI_IPMI is tristate.  Even if it's built-in, the
>> module_initcall() becomes a device_initcall().
>>
>> PCI enumeration happens from a subsys_initcall(), way earlier
>> than device_initcall().
>>
>> If you set CONFIG_ACPI_IPMI=y and change the module_initcall() in
>> drivers/acpi/acpi_ipmi.c to arch_initcall(), does the issue go away?
> 
> That seems to be the best option. Please test Lukas proposal and let me know.
> Shouldn't I make a dependency to ACPI_IPMI in Kconfig (with optional comment
> about initcall)?
> 
> +config PCI_NPEM
> +	bool "Native PCIe Enclosure Management"
> +	depends on LEDS_CLASS=y
> +	depends on ACPI_IPMI=y
> 

I tried it just to be sure, but changing the module_initcall() to an
arch_initcall() in acpi_ipmi.c (and compiling it into the kernel) doesn't
help... acpi_ipmi is loaded before npem, but the underlying IPMI drivers
that acpi_ipmi uses to provide the IPMI operation region in ACPI aren't
loaded until later... acpi_ipmi needs the IPMI device.

I'll try to think of another solution.  I don't see how to get the IPMI
drivers to load before PCI devices are enumerated, so it seems to me that
the only way to get it to work from the moment the LEDs show up in sysfs
is to somehow delay the npem driver initialization of the LEDs (at least
for _DSM) or use something to trigger a rescan later.

I notice that acpi_ipmi provides a function to wait for it to have an
IPMI device registered (acpi_wait_for_acpi_ipmi()), which is used by
acpi_power_meter.c.  It would be kind of awkward to use that... it just
just waits for a completion (with a 2 second timeout)--it isn't a notifier
or callback.  On my system, the npem driver failed to run a _DSM at
0.72 seconds, and ipmi_si driver didn't initialize the IPMI controller
until 9.54 seconds.

>>
>>> (2) providing a mechanism to trigger this driver to rescan a PCI
>>>      device later from user space
>>
>> If this was a regular device driver and -EPROBE_DEFER was returned if
>> IPMI drivers aren't loaded yet, then this would be easy to solve.
>> But neither is the case here.
>>
>> Of course it's possible to exercise the "remove" and "rescan" attributes
>> in sysfs to re-enumerate the device but that's not a great solution.
> 
> We cannot expect from users to know and do that. If we cannot configure driver,
> we should not register it. We have to guarantee that IMPI commands are
> available at the point we are using them.
> 
> There is not better place to add _DSM device than its enumeration and I have a
> feeling than sooner or later someone else will reach this problem so it would
> be better for community to solve it now.
> 
>>
>>
>>> (3) don't cache the active LEDs--just get the active states using
>>>      NPEM/DSM when brightness is read, and do a get/modify/set when
>>>      setting the brightness... then get_active_indications() wouldn't
>>>      need to be called during init.
>>
>> Not good.  The LEDs are published in sysfs from a subsys_initcall().
>> Brightness changes through sysfs could in theory immediately happen
>> once they're published.  If acpi_ipmi is a module and gets loaded way
>> later, we'd still have to cope with Set State or Get State DSMs going
>> nowhere.
>>
> 
> Agree. I can do it and it should be safe but it is not addressing the issue.
> We would limit time race window but we will not close it.
> 
> Thanks,
> Mariusz
Dan Williams June 18, 2024, 7:32 p.m. UTC | #12
stuart hayes wrote:
> 
> 
> On 6/18/2024 3:56 AM, Mariusz Tkaczyk wrote:
> > On Sat, 15 Jun 2024 12:33:45 +0200
> > Lukas Wunner <lukas@wunner.de> wrote:
> > 
> >> On Fri, Jun 14, 2024 at 04:06:14PM -0500, stuart hayes wrote:
> >>> On 5/28/2024 8:19 AM, Mariusz Tkaczyk wrote:
> >>>> +static int pci_npem_init(struct pci_dev *dev, const struct npem_ops *ops,
> >>>> +			 int pos, u32 caps)
> >>>> +{
> >> [...]
> >>>> +	ret = ops->get_active_indications(npem, &active);
> >>>> +	if (ret) {
> >>>> +		npem_free(npem);
> >>>> +		return -EACCES;
> >>>> +	}
> >>>
> >>> Failing pci_npem_init() if this ops->get_active_indications() fails
> >>> will keep this from working on most (all?) Dell servers, because the
> >>> _DSM get/set functions use an IPMI operation region to get/set the
> >>> active LEDs, and this is getting run before the IPMI drivers and
> >>> acpi_ipmi module (which provides ACPI access to IPMI operation
> >>> regions) get loaded.  (GET_SUPPORTED_STATES works without IPMI.)
> >>
> >> CONFIG_ACPI_IPMI is tristate.  Even if it's built-in, the
> >> module_initcall() becomes a device_initcall().
> >>
> >> PCI enumeration happens from a subsys_initcall(), way earlier
> >> than device_initcall().
> >>
> >> If you set CONFIG_ACPI_IPMI=y and change the module_initcall() in
> >> drivers/acpi/acpi_ipmi.c to arch_initcall(), does the issue go away?
> > 
> > That seems to be the best option. Please test Lukas proposal and let me know.
> > Shouldn't I make a dependency to ACPI_IPMI in Kconfig (with optional comment
> > about initcall)?
> > 
> > +config PCI_NPEM
> > +	bool "Native PCIe Enclosure Management"
> > +	depends on LEDS_CLASS=y
> > +	depends on ACPI_IPMI=y
> > 
> 
> I tried it just to be sure, but changing the module_initcall() to an
> arch_initcall() in acpi_ipmi.c (and compiling it into the kernel) doesn't
> help... acpi_ipmi is loaded before npem, but the underlying IPMI drivers
> that acpi_ipmi uses to provide the IPMI operation region in ACPI aren't
> loaded until later... acpi_ipmi needs the IPMI device.
> 
> I'll try to think of another solution.  I don't see how to get the IPMI
> drivers to load before PCI devices are enumerated, so it seems to me that
> the only way to get it to work from the moment the LEDs show up in sysfs
> is to somehow delay the npem driver initialization of the LEDs (at least
> for _DSM) or use something to trigger a rescan later.
> 
> I notice that acpi_ipmi provides a function to wait for it to have an
> IPMI device registered (acpi_wait_for_acpi_ipmi()), which is used by
> acpi_power_meter.c.  It would be kind of awkward to use that... it just
> just waits for a completion (with a 2 second timeout)--it isn't a notifier
> or callback.  On my system, the npem driver failed to run a _DSM at
> 0.72 seconds, and ipmi_si driver didn't initialize the IPMI controller
> until 9.54 seconds.

It strikes me that playing these initcall games is a losing battle and
that this case would be best served by late loading of NPEM
functionality.

Something similar is happening with PCI device security where the
enabling depends on a third-party driver for a platform
"security-manager" (TSM) to arrive.

The approach there is to make the functionality independent of
device-discovery vs TSM driver load order. So, if the TSM driver is
loaded early then pci_init_capabilities() can immediately enable the
functionality. If the TSM driver is loaded *after* some devices have already
gone through pci_init_capabilities(), then that event is responsible for
doing for_each_pci_dev() to catch up on devices that missed their
initial chance to turn on device security details.

So, for NPEM, the thought would be to implement the same rendezvous
flow, i.e. s/TSM/NPEM/.

I am an overdue for a refresh of the TSM patches, but you can see the
last posting here:

http://lore.kernel.org/171291190324.3532867.13480405752065082171.stgit@dwillia2-xfh.jf.intel.com
Lukas Wunner June 19, 2024, 9:08 a.m. UTC | #13
On Tue, Jun 18, 2024 at 12:32:33PM -0700, Dan Williams wrote:
> It strikes me that playing these initcall games is a losing battle and
> that this case would be best served by late loading of NPEM
> functionality.
> 
> Something similar is happening with PCI device security where the
> enabling depends on a third-party driver for a platform
> "security-manager" (TSM) to arrive.
> 
> The approach there is to make the functionality independent of
> device-discovery vs TSM driver load order. So, if the TSM driver is
> loaded early then pci_init_capabilities() can immediately enable the
> functionality. If the TSM driver is loaded *after* some devices have already
> gone through pci_init_capabilities(), then that event is responsible for
> doing for_each_pci_dev() to catch up on devices that missed their
> initial chance to turn on device security details.
> 
> So, for NPEM, the thought would be to implement the same rendezvous
> flow, i.e. s/TSM/NPEM/.

A different viewpoint is that these issues are caused by the
"division of labor" between OS kernel and platform firmware.

In the NPEM case, Dell servers require the OS to call firmware
to change LEDs.  But before OS can do that, OS has to initialize
a certain other interface with firmware.

In the TSM case, Intel TDX Connect or AMD SEV-TIO require OS to
ask firmware to perform certain authentication steps with devices,
wherefore OS has to provide another interface to facilitate
communication with the device.

It's a complexity nightmare exacerbated by vendor-specific quirks.

Which is why I'm arguing that firmware functionality (e.g. TDX module)
should be constrained to the absolute minimum and the OS should be
in control of as much as possible.  That's the approach Apple has
been following as it's the only way to achieve their close interplay
between hardware and software without making things too complex.

It seems what's keeping this series from working on Dell servers is
primarily that the driver wants to read out LED status on probe.
So I've recommended to Mariusz off-list to do that lazily if possible,
i.e. on first read of a LED's status.

Then if users do try to read or write LED status on Dell servers without
loading IPMI modules first, they get to keep the pieces, sorry. :(


> I am an overdue for a refresh of the TSM patches

No hurry, there's a refresh of the OS-owned PCI device authentication
coming up before end of this month.

I'm taking my "TDX Connect heretic" hat off now. :)

Thanks,

Lukas
Mariusz Tkaczyk June 19, 2024, 12:07 p.m. UTC | #14
On Wed, 19 Jun 2024 11:08:26 +0200
Lukas Wunner <lukas@wunner.de> wrote:

> On Tue, Jun 18, 2024 at 12:32:33PM -0700, Dan Williams wrote:
> > It strikes me that playing these initcall games is a losing battle and
> > that this case would be best served by late loading of NPEM
> > functionality.
> > 
> > Something similar is happening with PCI device security where the
> > enabling depends on a third-party driver for a platform
> > "security-manager" (TSM) to arrive.
> > 
> > The approach there is to make the functionality independent of
> > device-discovery vs TSM driver load order. So, if the TSM driver is
> > loaded early then pci_init_capabilities() can immediately enable the
> > functionality. If the TSM driver is loaded *after* some devices have already
> > gone through pci_init_capabilities(), then that event is responsible for
> > doing for_each_pci_dev() to catch up on devices that missed their
> > initial chance to turn on device security details.
> > 
> > So, for NPEM, the thought would be to implement the same rendezvous
> > flow, i.e. s/TSM/NPEM/.  
> 
> A different viewpoint is that these issues are caused by the
> "division of labor" between OS kernel and platform firmware.
> 
> In the NPEM case, Dell servers require the OS to call firmware
> to change LEDs.  But before OS can do that, OS has to initialize
> a certain other interface with firmware.
> 
> In the TSM case, Intel TDX Connect or AMD SEV-TIO require OS to
> ask firmware to perform certain authentication steps with devices,
> wherefore OS has to provide another interface to facilitate
> communication with the device.
> 
> It's a complexity nightmare exacerbated by vendor-specific quirks.
> 
> Which is why I'm arguing that firmware functionality (e.g. TDX module)
> should be constrained to the absolute minimum and the OS should be
> in control of as much as possible.  That's the approach Apple has
> been following as it's the only way to achieve their close interplay
> between hardware and software without making things too complex.
> 
> It seems what's keeping this series from working on Dell servers is
> primarily that the driver wants to read out LED status on probe.
> So I've recommended to Mariusz off-list to do that lazily if possible,
> i.e. on first read of a LED's status.
> 
> Then if users do try to read or write LED status on Dell servers without
> loading IPMI modules first, they get to keep the pieces, sorry. :(

> 
Initially, I thought that Dan suggestion is the best option but after taking
into account use cases of the driver and times provided by Stuart - lazy
loading wins.

As a led application maintainer, I can accept fact that I cannot impose led for
a while and errors will be reported, that is fine. I can left a hint why it is
happening to user.

I would be a nightmare to get new LED controller after some time if LED
interface appearance is delayed. It is much worse from user perspective because
no device means that I have no information in userland. I cannot determine if
something is going to be up soon so I will report disks as not supported -
unnecessary maintenance hell. I may receive a lot of issues.

Stuart, please give me some time to apply suggestions and introduce lazy
approach. I'm working on it!

Thanks,
Mariusz
Mariusz Tkaczyk June 19, 2024, 2:28 p.m. UTC | #15
On Tue, 28 May 2024 22:21:10 -0700
Dan Williams <dan.j.williams@intel.com> wrote:

> > +
> > +	/*
> > +	 * For the case where a NPEM command has not completed immediately,
> > +	 * it is recommended that software not continuously “spin” on
> > polling
> > +	 * the status register, but rather poll under interrupt at a
> > reduced
> > +	 * rate; for example at 10 ms intervals.  
> 
> I think the use of read_poll_timeout() obviates the need for this
> comment. I.e. read_poll_timeout() is self documenting.
> 

This is not describing how read_poll_timeout works but why I need it at all.
I don't think that there is and ever will be a controller that needs a time to
complete the NPEM request. From my experience this is immediate action. This is
explanation why I have to keep it this way even if it has almost no sense :)

Mariusz

> > +	 *
> > +	 * PCIe r6.1 sec 6.28 "Implementation Note: Software Polling of
> > NPEM
> > +	 * Command Completed"
> > +	 */
> > +	ret = read_poll_timeout(npem_read_reg, ret_val,
> > +				ret_val || (cc_status &
> > PCI_NPEM_STATUS_CC),
> > +				10 * USEC_PER_MSEC, USEC_PER_SEC, false,
> > npem,
> > +				PCI_NPEM_STATUS, &cc_status);
> > +	if (ret)
> > +		return ret_val;
diff mbox series

Patch

diff --git a/drivers/pci/Kconfig b/drivers/pci/Kconfig
index d35001589d88..e696e69ad516 100644
--- a/drivers/pci/Kconfig
+++ b/drivers/pci/Kconfig
@@ -143,6 +143,15 @@  config PCI_IOV
 
 	  If unsure, say N.
 
+config PCI_NPEM
+	bool "Native PCIe Enclosure Management"
+	depends on LEDS_CLASS=y
+	help
+	  Support for Native PCIe Enclosure Management. It allows managing LED
+	  indications in storage enclosures. Enclosure must support following
+	  indications: OK, Locate, Fail, Rebuild, other indications are
+	  optional.
+
 config PCI_PRI
 	bool "PCI PRI support"
 	select PCI_ATS
diff --git a/drivers/pci/Makefile b/drivers/pci/Makefile
index 175302036890..cd5f655d4be9 100644
--- a/drivers/pci/Makefile
+++ b/drivers/pci/Makefile
@@ -34,6 +34,7 @@  obj-$(CONFIG_XEN_PCIDEV_FRONTEND) += xen-pcifront.o
 obj-$(CONFIG_VGA_ARB)		+= vgaarb.o
 obj-$(CONFIG_PCI_DOE)		+= doe.o
 obj-$(CONFIG_PCI_DYNAMIC_OF_NODES) += of_property.o
+obj-$(CONFIG_PCI_NPEM)		+= npem.o
 
 # Endpoint library must be initialized before its users
 obj-$(CONFIG_PCI_ENDPOINT)	+= endpoint/
diff --git a/drivers/pci/npem.c b/drivers/pci/npem.c
new file mode 100644
index 000000000000..a76a2044dab2
--- /dev/null
+++ b/drivers/pci/npem.c
@@ -0,0 +1,410 @@ 
+// SPDX-License-Identifier: GPL-2.0
+/*
+ * PCIe Enclosure management driver created for LED interfaces based on
+ * indications. It says *what indications* blink but does not specify *how*
+ * they blink - it is hardware defined.
+ *
+ * The driver name refers to Native PCIe Enclosure Management. It is
+ * first indication oriented standard with specification.
+ *
+ * Native PCIe Enclosure Management (NPEM)
+ *	PCIe Base Specification r6.1 sec 6.28
+ *	PCIe Base Specification r6.1 sec 7.9.19
+ *
+ * Copyright (c) 2023-2024 Intel Corporation
+ *	Mariusz Tkaczyk <mariusz.tkaczyk@linux.intel.com>
+ */
+
+#include <linux/acpi.h>
+#include <linux/bitops.h>
+#include <linux/errno.h>
+#include <linux/iopoll.h>
+#include <linux/leds.h>
+#include <linux/mutex.h>
+#include <linux/pci.h>
+#include <linux/pci_regs.h>
+#include <linux/types.h>
+#include <linux/uleds.h>
+
+#include "pci.h"
+
+struct indication {
+	u32 bit;
+	const char *name;
+};
+
+static const struct indication npem_indications[] = {
+	{PCI_NPEM_IND_OK,	"enclosure:ok"},
+	{PCI_NPEM_IND_LOCATE,	"enclosure:locate"},
+	{PCI_NPEM_IND_FAIL,	"enclosure:fail"},
+	{PCI_NPEM_IND_REBUILD,	"enclosure:rebuild"},
+	{PCI_NPEM_IND_PFA,	"enclosure:pfa"},
+	{PCI_NPEM_IND_HOTSPARE,	"enclosure:hotspare"},
+	{PCI_NPEM_IND_ICA,	"enclosure:ica"},
+	{PCI_NPEM_IND_IFA,	"enclosure:ifa"},
+	{PCI_NPEM_IND_IDT,	"enclosure:idt"},
+	{PCI_NPEM_IND_DISABLED,	"enclosure:disabled"},
+	{PCI_NPEM_IND_SPEC_0,	"enclosure:specific_0"},
+	{PCI_NPEM_IND_SPEC_1,	"enclosure:specific_1"},
+	{PCI_NPEM_IND_SPEC_2,	"enclosure:specific_2"},
+	{PCI_NPEM_IND_SPEC_3,	"enclosure:specific_3"},
+	{PCI_NPEM_IND_SPEC_4,	"enclosure:specific_4"},
+	{PCI_NPEM_IND_SPEC_5,	"enclosure:specific_5"},
+	{PCI_NPEM_IND_SPEC_6,	"enclosure:specific_6"},
+	{PCI_NPEM_IND_SPEC_7,	"enclosure:specific_7"},
+	{0,			NULL}
+};
+
+#define for_each_indication(ind, inds) \
+	for (ind = inds; ind->bit; ind++)
+
+/* To avoid confusion, do not keep any special bits in indications */
+static u32 reg_to_indications(u32 caps, const struct indication *inds)
+{
+	const struct indication *ind;
+	u32 supported_indications = 0;
+
+	for_each_indication(ind, inds)
+		supported_indications |= ind->bit;
+
+	return caps & supported_indications;
+}
+
+/**
+ * struct npem_led - LED details
+ * @indication: indication details
+ * @npem: npem device
+ * @name: LED name
+ * @led: LED device
+ */
+struct npem_led {
+	const struct indication *indication;
+	struct npem *npem;
+	char name[LED_MAX_NAME_SIZE];
+	struct led_classdev led;
+};
+
+/**
+ * struct npem_ops - backend specific callbacks
+ * @inds: supported indications array
+ * @get_active_indications: get active indications
+ *	npem: npem device
+ *	buf: response buffer
+ * @set_active_indications: set new indications
+ *	npem: npem device
+ *	val: bit mask to set
+ *
+ * Handle communication with hardware. set_active_indications updates
+ * npem->active_indications.
+ */
+struct npem_ops {
+	const struct indication *inds;
+	int (*get_active_indications)(struct npem *npem, u32 *buf);
+	int (*set_active_indications)(struct npem *npem, u32 val);
+};
+
+/**
+ * struct npem - NPEM device properties
+ * @dev: PCIe device this driver is attached to
+ * @ops: Backend specific callbacks
+ * @npem_lock: to keep concurrent updates serialized
+ * @pos: NPEM capability offset (only relevant for NPEM direct register access,
+ *	 not DSM access method)
+ * @supported_indications: bit mask of supported indications
+ *			   non-indication and reserved bits are cleared
+ * @active_indications: bit mask of active indications
+ *			non-indication and reserved bits are cleared
+ * @led_cnt: Supported LEDs count
+ * @leds: supported LEDs
+ */
+struct npem {
+	struct pci_dev *dev;
+	const struct npem_ops *ops;
+	struct mutex npem_lock;
+	u16 pos;
+	u32 supported_indications;
+	u32 active_indications;
+	int led_cnt;
+	struct npem_led leds[];
+};
+
+static int npem_read_reg(struct npem *npem, u16 reg, u32 *val)
+{
+	int ret = pci_read_config_dword(npem->dev, npem->pos + reg, val);
+
+	return pcibios_err_to_errno(ret);
+}
+
+static int npem_write_ctrl(struct npem *npem, u32 reg)
+{
+	int pos = npem->pos + PCI_NPEM_CTRL;
+	int ret = pci_write_config_dword(npem->dev, pos, reg);
+
+	return pcibios_err_to_errno(ret);
+}
+
+static int npem_get_active_indications(struct npem *npem, u32 *buf)
+{
+	int ret;
+
+	ret = npem_read_reg(npem, PCI_NPEM_CTRL, buf);
+	if (ret)
+		return ret;
+
+	/* If PCI_NPEM_CTRL_ENABLE is not set then no indication should blink */
+	if (!(*buf & PCI_NPEM_CTRL_ENABLE))
+		*buf = 0;
+
+	/* Filter out not supported indications in response */
+	*buf &= npem->supported_indications;
+	return 0;
+}
+
+static int npem_set_active_indications(struct npem *npem, u32 val)
+{
+	int ret, ret_val;
+	u32 cc_status;
+
+	lockdep_assert_held(&npem->npem_lock);
+
+	/* This bit is always required */
+	val |= PCI_NPEM_CTRL_ENABLE;
+	ret = npem_write_ctrl(npem, val);
+	if (ret)
+		return ret;
+
+	/*
+	 * For the case where a NPEM command has not completed immediately,
+	 * it is recommended that software not continuously “spin” on polling
+	 * the status register, but rather poll under interrupt at a reduced
+	 * rate; for example at 10 ms intervals.
+	 *
+	 * PCIe r6.1 sec 6.28 "Implementation Note: Software Polling of NPEM
+	 * Command Completed"
+	 */
+	ret = read_poll_timeout(npem_read_reg, ret_val,
+				ret_val || (cc_status & PCI_NPEM_STATUS_CC),
+				10 * USEC_PER_MSEC, USEC_PER_SEC, false, npem,
+				PCI_NPEM_STATUS, &cc_status);
+	if (ret)
+		return ret_val;
+
+	/*
+	 * All writes to control register, including writes that do not change
+	 * the register value, are NPEM commands and should eventually result
+	 * in a command completion indication in the NPEM Status Register.
+	 *
+	 * PCIe Base Specification r6.1 sec 7.9.19.3
+	 *
+	 * Register may not be updated, or other conflicting bits may be
+	 * cleared. Spec is not strict here. Read NPEM Control register after
+	 * write to keep cache in-sync.
+	 */
+	return npem_get_active_indications(npem, &npem->active_indications);
+}
+
+static const struct npem_ops npem_ops = {
+	.inds = npem_indications,
+	.get_active_indications = npem_get_active_indications,
+	.set_active_indications = npem_set_active_indications,
+};
+
+#define DSM_GUID GUID_INIT(0x5d524d9d, 0xfff9, 0x4d4b,  0x8c, 0xb7, 0x74, 0x7e,\
+			   0xd5, 0x1e, 0x19, 0x4d)
+#define GET_SUPPORTED_STATES_DSM	BIT(1)
+#define GET_STATE_DSM			BIT(2)
+#define SET_STATE_DSM			BIT(3)
+
+static const guid_t dsm_guid = DSM_GUID;
+
+static bool npem_has_dsm(struct pci_dev *pdev)
+{
+	acpi_handle handle;
+
+	handle = ACPI_HANDLE(&pdev->dev);
+	if (!handle)
+		return false;
+
+	return acpi_check_dsm(handle, &dsm_guid, 0x1, GET_SUPPORTED_STATES_DSM |
+			      GET_STATE_DSM | SET_STATE_DSM);
+}
+
+/*
+ * This function does not use ops->get_active_indications().
+ * It returns cached value, npem->npem_lock is held and it is safe.
+ */
+static enum led_brightness brightness_get(struct led_classdev *led)
+{
+	struct npem_led *nled = container_of(led, struct npem_led, led);
+	struct npem *npem = nled->npem;
+	int ret, val = LED_OFF;
+
+	ret = mutex_lock_interruptible(&npem->npem_lock);
+	if (ret)
+		return ret;
+
+	if (npem->active_indications & nled->indication->bit)
+		val = LED_ON;
+
+	mutex_unlock(&npem->npem_lock);
+
+	return val;
+}
+
+static int brightness_set(struct led_classdev *led,
+			  enum led_brightness brightness)
+{
+	struct npem_led *nled = container_of(led, struct npem_led, led);
+	struct npem *npem = nled->npem;
+	u32 indications;
+	int ret;
+
+	ret = mutex_lock_interruptible(&npem->npem_lock);
+	if (ret)
+		return ret;
+
+	if (brightness == LED_OFF)
+		indications = npem->active_indications & ~(nled->indication->bit);
+	else
+		indications = npem->active_indications | nled->indication->bit;
+
+	ret = npem->ops->set_active_indications(npem, indications);
+
+	mutex_unlock(&npem->npem_lock);
+	return ret;
+}
+
+static void npem_free(struct npem *npem)
+{
+	struct npem_led *nled;
+	int cnt;
+
+	for (cnt = 0; cnt < npem->led_cnt; cnt++) {
+		nled = &npem->leds[cnt];
+
+		if (nled->name[0])
+			led_classdev_unregister(&nled->led);
+	}
+
+	mutex_destroy(&npem->npem_lock);
+	kfree(npem);
+}
+
+static int pci_npem_set_led_classdev(struct npem *npem, struct npem_led *nled)
+{
+	struct led_classdev *led = &nled->led;
+	struct led_init_data init_data = {};
+	char *name = nled->name;
+	int ret;
+
+	init_data.devicename = pci_name(npem->dev);
+	init_data.default_label = nled->indication->name;
+
+	ret = led_compose_name(&npem->dev->dev, &init_data, name);
+	if (ret)
+		return ret;
+
+	led->name = name;
+	led->brightness_set_blocking = brightness_set;
+	led->brightness_get = brightness_get;
+	led->max_brightness = LED_ON;
+	led->default_trigger = "none";
+	led->flags = 0;
+
+	ret = led_classdev_register(&npem->dev->dev, led);
+	if (ret)
+		/* Clear the name to indicate that it is not registered. */
+		name[0] = 0;
+	return ret;
+}
+
+static int pci_npem_init(struct pci_dev *dev, const struct npem_ops *ops,
+			 int pos, u32 caps)
+{
+	u32 supported = reg_to_indications(caps, ops->inds);
+	int supported_cnt = hweight32(supported);
+	const struct indication *indication;
+	struct npem_led *nled;
+	struct npem *npem;
+	int led_idx = 0;
+	u32 active;
+	int ret;
+
+	npem = kzalloc(struct_size(npem, leds, supported_cnt), GFP_KERNEL);
+
+	if (!npem)
+		return -ENOMEM;
+
+	npem->supported_indications = supported;
+	npem->led_cnt = supported_cnt;
+	npem->pos = pos;
+	npem->dev = dev;
+	npem->ops = ops;
+
+	ret = ops->get_active_indications(npem, &active);
+	if (ret) {
+		npem_free(npem);
+		return -EACCES;
+	}
+
+	npem->active_indications = reg_to_indications(active, ops->inds);
+
+	/*
+	 * Do not take npem->npem_lock, get_brightness() is called on
+	 * registration path.
+	 */
+	mutex_init(&npem->npem_lock);
+
+	for_each_indication(indication, npem_indications) {
+		if ((npem->supported_indications & indication->bit) == 0)
+			/* Do not register unsupported indication */
+			continue;
+
+		nled = &npem->leds[led_idx++];
+		nled->indication = indication;
+		nled->npem = npem;
+
+		ret = pci_npem_set_led_classdev(npem, nled);
+		if (ret) {
+			npem_free(npem);
+			return ret;
+		}
+	}
+
+	dev->npem = npem;
+	return 0;
+}
+
+void pci_npem_remove(struct pci_dev *dev)
+{
+	if (dev->npem)
+		npem_free(dev->npem);
+}
+
+void pci_npem_create(struct pci_dev *dev)
+{
+	const struct npem_ops *ops = &npem_ops;
+	int pos = 0, ret;
+	u32 cap;
+
+	if (!npem_has_dsm(dev)) {
+		pos = pci_find_ext_capability(dev, PCI_EXT_CAP_ID_NPEM);
+		if (pos == 0)
+			return;
+
+		if (pci_read_config_dword(dev, pos + PCI_NPEM_CAP, &cap) != 0 ||
+		    (cap & PCI_NPEM_CAP_CAPABLE) == 0)
+			return;
+	} else {
+		/*
+		 * OS should use the DSM for LED control if it is available
+		 * PCI Firmware Spec r3.3 sec 4.7.
+		 */
+		return;
+	}
+
+	ret = pci_npem_init(dev, ops, pos, cap);
+	if (ret)
+		pci_err(dev, "Failed to register PCIe Enclosure Management driver, err: %d\n",
+			ret);
+}
diff --git a/drivers/pci/pci.h b/drivers/pci/pci.h
index fd44565c4756..9dea8c7353ab 100644
--- a/drivers/pci/pci.h
+++ b/drivers/pci/pci.h
@@ -333,6 +333,14 @@  static inline void pci_doe_destroy(struct pci_dev *pdev) { }
 static inline void pci_doe_disconnected(struct pci_dev *pdev) { }
 #endif
 
+#ifdef CONFIG_PCI_NPEM
+void pci_npem_create(struct pci_dev *dev);
+void pci_npem_remove(struct pci_dev *dev);
+#else
+static inline void pci_npem_create(struct pci_dev *dev) { }
+static inline void pci_npem_remove(struct pci_dev *dev) { }
+#endif
+
 /**
  * pci_dev_set_io_state - Set the new error state if possible.
  *
diff --git a/drivers/pci/probe.c b/drivers/pci/probe.c
index 8e696e547565..b8ea6353e27a 100644
--- a/drivers/pci/probe.c
+++ b/drivers/pci/probe.c
@@ -2582,6 +2582,8 @@  void pci_device_add(struct pci_dev *dev, struct pci_bus *bus)
 	dev->match_driver = false;
 	ret = device_add(&dev->dev);
 	WARN_ON(ret < 0);
+
+	pci_npem_create(dev);
 }
 
 struct pci_dev *pci_scan_single_device(struct pci_bus *bus, int devfn)
diff --git a/drivers/pci/remove.c b/drivers/pci/remove.c
index d749ea8250d6..1436f9cf1fea 100644
--- a/drivers/pci/remove.c
+++ b/drivers/pci/remove.c
@@ -33,6 +33,8 @@  static void pci_destroy_dev(struct pci_dev *dev)
 	if (!dev->dev.kobj.parent)
 		return;
 
+	pci_npem_remove(dev);
+
 	device_del(&dev->dev);
 
 	down_write(&pci_bus_sem);
diff --git a/include/linux/pci.h b/include/linux/pci.h
index fb004fd4e889..c327c2dd4527 100644
--- a/include/linux/pci.h
+++ b/include/linux/pci.h
@@ -517,6 +517,9 @@  struct pci_dev {
 #endif
 #ifdef CONFIG_PCI_DOE
 	struct xarray	doe_mbs;	/* Data Object Exchange mailboxes */
+#endif
+#ifdef CONFIG_PCI_NPEM
+	struct npem	*npem;		/* Native PCIe Enclosure Management */
 #endif
 	u16		acs_cap;	/* ACS Capability offset */
 	phys_addr_t	rom;		/* Physical address if not from BAR */
diff --git a/include/uapi/linux/pci_regs.h b/include/uapi/linux/pci_regs.h
index 94c00996e633..97e8b7ed9998 100644
--- a/include/uapi/linux/pci_regs.h
+++ b/include/uapi/linux/pci_regs.h
@@ -740,6 +740,7 @@ 
 #define PCI_EXT_CAP_ID_DVSEC	0x23	/* Designated Vendor-Specific */
 #define PCI_EXT_CAP_ID_DLF	0x25	/* Data Link Feature */
 #define PCI_EXT_CAP_ID_PL_16GT	0x26	/* Physical Layer 16.0 GT/s */
+#define PCI_EXT_CAP_ID_NPEM	0x29	/* Native PCIe Enclosure Management */
 #define PCI_EXT_CAP_ID_PL_32GT  0x2A    /* Physical Layer 32.0 GT/s */
 #define PCI_EXT_CAP_ID_DOE	0x2E	/* Data Object Exchange */
 #define PCI_EXT_CAP_ID_MAX	PCI_EXT_CAP_ID_DOE
@@ -1121,6 +1122,39 @@ 
 #define  PCI_PL_16GT_LE_CTRL_USP_TX_PRESET_MASK		0x000000F0
 #define  PCI_PL_16GT_LE_CTRL_USP_TX_PRESET_SHIFT	4
 
+/* Native PCIe Enclosure Management */
+#define PCI_NPEM_CAP	0x04 /* NPEM capability register */
+#define	 PCI_NPEM_CAP_CAPABLE		0x00000001 /* NPEM Capable */
+
+#define PCI_NPEM_CTRL	0x08 /* NPEM control register */
+#define	 PCI_NPEM_CTRL_ENABLE		0x00000001 /* NPEM Enable */
+
+#define PCI_NPEM_STATUS	0x0c /* NPEM status register */
+#define	 PCI_NPEM_STATUS_CC		0x00000001 /* NPEM Command completed */
+/*
+ * Native PCIe Enclosure Management indication bits and Reset command bit
+ * are corresponding for capability and control registers.
+ */
+#define  PCI_NPEM_CMD_RESET		0x00000002 /* NPEM Reset Command */
+#define  PCI_NPEM_IND_OK		0x00000004 /* NPEM indication OK */
+#define  PCI_NPEM_IND_LOCATE		0x00000008 /* NPEM indication Locate */
+#define  PCI_NPEM_IND_FAIL		0x00000010 /* NPEM indication Fail */
+#define  PCI_NPEM_IND_REBUILD		0x00000020 /* NPEM indication Rebuild */
+#define  PCI_NPEM_IND_PFA		0x00000040 /* NPEM indication Predicted Failure Analysis */
+#define  PCI_NPEM_IND_HOTSPARE		0x00000080 /* NPEM indication Hot Spare */
+#define  PCI_NPEM_IND_ICA		0x00000100 /* NPEM indication In Critical Array */
+#define  PCI_NPEM_IND_IFA		0x00000200 /* NPEM indication In Failed Array */
+#define  PCI_NPEM_IND_IDT		0x00000400 /* NPEM indication Invalid Device Type */
+#define  PCI_NPEM_IND_DISABLED		0x00000800 /* NPEM indication Disabled */
+#define  PCI_NPEM_IND_SPEC_0		0x00800000
+#define  PCI_NPEM_IND_SPEC_1		0x01000000
+#define  PCI_NPEM_IND_SPEC_2		0x02000000
+#define  PCI_NPEM_IND_SPEC_3		0x04000000
+#define  PCI_NPEM_IND_SPEC_4		0x08000000
+#define  PCI_NPEM_IND_SPEC_5		0x10000000
+#define  PCI_NPEM_IND_SPEC_6		0x20000000
+#define  PCI_NPEM_IND_SPEC_7		0x40000000
+
 /* Data Object Exchange */
 #define PCI_DOE_CAP		0x04    /* DOE Capabilities Register */
 #define  PCI_DOE_CAP_INT_SUP			0x00000001  /* Interrupt Support */