diff mbox series

[V11,5/8] cxl/port: Read CDAT table

Message ID 20220610202259.3544623-6-ira.weiny@intel.com
State Superseded
Headers show
Series CXL: Read CDAT and DSMAS data | expand

Commit Message

Ira Weiny June 10, 2022, 8:22 p.m. UTC
From: Ira Weiny <ira.weiny@intel.com>

The OS will need CDAT data from CXL devices to properly set up
interleave sets.  Currently this is supported through a DOE mailbox
which supports CDAT.

Search the DOE mailboxes available, query CDAT data, and cache the data
for later parsing.

Provide a sysfs binary attribute to allow dumping of the CDAT.

Binary dumping is modeled on /sys/firmware/ACPI/tables/

The ability to dump this table will be very useful for emulation of real
devices once they become available as QEMU CXL type 3 device emulation will
be able to load this file in.

This does not support table updates at runtime. It will always provide
whatever was there when first cached. Handling of table updates can be
implemented later.

Finally create a complete list of CDAT defines within cdat.h for code
wishing to decode the CDAT table.

Signed-off-by: Jonathan Cameron <Jonathan.Cameron@huawei.com>
Co-developed-by: Ira Weiny <ira.weiny@intel.com>
Signed-off-by: Ira Weiny <ira.weiny@intel.com>

---
Changes from V10:
	Ben Widawsky
		Failure to find CDAT should be a debug message not error
		Remove reference to cdat_mb from the port object
	Dropped [PATCH V10 5/9] cxl/port: Find a DOE mailbox which supports
		CDAT
		Iterate the mailboxes for the CDAT one each time.
	Define CXL_DOE_TABLE_ACCESS_LAST_ENTRY and add comment about
		it's use.

Changes from V9:
	Add debugging output
	Jonathan Cameron
		Move read_cdat to port probe by using dev_groups for the
		sysfs attributes.  This avoids issues with using devm
		before the driver is loaded while making sure the CDAT
		binary is available.

Changes from V8:
	Fix length print format
	Incorporate feedback from Jonathan
	Move all this to cxl_port which can help support switches when
	the time comes.

Changes from V6:
	Fix issue with devm use
		Move cached cdat data to cxl_dev_state
	Use new pci_doe_submit_task()
	Ensure the aux driver is locked while processing tasks
	Rebased on cxl-pending

Changes from V5:
	Add proper guards around cdat.h
	Split out finding the CDAT DOE mailbox
	Use cxl_cdat to group CDAT data together
	Adjust to use auxiliary_find_device() to find the DOE device
		which supplies the CDAT protocol.
	Rebased to latest
	Remove dev_dbg(length)
	Remove unneeded DOE Table access defines
	Move CXL_DOE_PROTOCOL_TABLE_ACCESS define into this patch where
		it is used

Changes from V4:
	Split this into it's own patch
	Rearchitect this such that the memdev driver calls into the DOE
	driver via the cxl_mem state object.  This allows CDAT data to
	come from any type of cxl_mem object not just PCI DOE.
	Rebase on new struct cxl_dev_state
---
 drivers/cxl/cdat.h     | 100 ++++++++++++++++++++++++
 drivers/cxl/core/pci.c | 172 +++++++++++++++++++++++++++++++++++++++++
 drivers/cxl/cxl.h      |   3 +
 drivers/cxl/cxlpci.h   |   1 +
 drivers/cxl/port.c     |  51 ++++++++++++
 5 files changed, 327 insertions(+)
 create mode 100644 drivers/cxl/cdat.h

Comments

Dan Williams June 18, 2022, 12:43 a.m. UTC | #1
ira.weiny@ wrote:
> From: Ira Weiny <ira.weiny@intel.com>
> 
> The OS will need CDAT data from CXL devices to properly set up
> interleave sets.  Currently this is supported through a DOE mailbox
> which supports CDAT.
> 
> Search the DOE mailboxes available, query CDAT data, and cache the data
> for later parsing.
> 
> Provide a sysfs binary attribute to allow dumping of the CDAT.
> 
> Binary dumping is modeled on /sys/firmware/ACPI/tables/
> 
> The ability to dump this table will be very useful for emulation of real
> devices once they become available as QEMU CXL type 3 device emulation will
> be able to load this file in.
> 
> This does not support table updates at runtime. It will always provide
> whatever was there when first cached. Handling of table updates can be
> implemented later.
> 
> Finally create a complete list of CDAT defines within cdat.h for code
> wishing to decode the CDAT table.
> 
> Signed-off-by: Jonathan Cameron <Jonathan.Cameron@huawei.com>
> Co-developed-by: Ira Weiny <ira.weiny@intel.com>
> Signed-off-by: Ira Weiny <ira.weiny@intel.com>
> 
> ---
> Changes from V10:
> 	Ben Widawsky
> 		Failure to find CDAT should be a debug message not error
> 		Remove reference to cdat_mb from the port object
> 	Dropped [PATCH V10 5/9] cxl/port: Find a DOE mailbox which supports
> 		CDAT
> 		Iterate the mailboxes for the CDAT one each time.
> 	Define CXL_DOE_TABLE_ACCESS_LAST_ENTRY and add comment about
> 		it's use.
> 
> Changes from V9:
> 	Add debugging output
> 	Jonathan Cameron
> 		Move read_cdat to port probe by using dev_groups for the
> 		sysfs attributes.  This avoids issues with using devm
> 		before the driver is loaded while making sure the CDAT
> 		binary is available.
> 
> Changes from V8:
> 	Fix length print format
> 	Incorporate feedback from Jonathan
> 	Move all this to cxl_port which can help support switches when
> 	the time comes.
> 
> Changes from V6:
> 	Fix issue with devm use
> 		Move cached cdat data to cxl_dev_state
> 	Use new pci_doe_submit_task()
> 	Ensure the aux driver is locked while processing tasks
> 	Rebased on cxl-pending
> 
> Changes from V5:
> 	Add proper guards around cdat.h
> 	Split out finding the CDAT DOE mailbox
> 	Use cxl_cdat to group CDAT data together
> 	Adjust to use auxiliary_find_device() to find the DOE device
> 		which supplies the CDAT protocol.
> 	Rebased to latest
> 	Remove dev_dbg(length)
> 	Remove unneeded DOE Table access defines
> 	Move CXL_DOE_PROTOCOL_TABLE_ACCESS define into this patch where
> 		it is used
> 
> Changes from V4:
> 	Split this into it's own patch
> 	Rearchitect this such that the memdev driver calls into the DOE
> 	driver via the cxl_mem state object.  This allows CDAT data to
> 	come from any type of cxl_mem object not just PCI DOE.
> 	Rebase on new struct cxl_dev_state
> ---
>  drivers/cxl/cdat.h     | 100 ++++++++++++++++++++++++
>  drivers/cxl/core/pci.c | 172 +++++++++++++++++++++++++++++++++++++++++
>  drivers/cxl/cxl.h      |   3 +
>  drivers/cxl/cxlpci.h   |   1 +
>  drivers/cxl/port.c     |  51 ++++++++++++
>  5 files changed, 327 insertions(+)
>  create mode 100644 drivers/cxl/cdat.h
> 
> diff --git a/drivers/cxl/cdat.h b/drivers/cxl/cdat.h
> new file mode 100644
> index 000000000000..c6a48ab326bf
> --- /dev/null
> +++ b/drivers/cxl/cdat.h
> @@ -0,0 +1,100 @@
> +/* SPDX-License-Identifier: GPL-2.0 */
> +#ifndef __CXL_CDAT_H__
> +#define __CXL_CDAT_H__
> +
> +/*
> + * Coherent Device Attribute table (CDAT)
> + *
> + * Specification available from UEFI.org
> + *
> + * Whilst CDAT is defined as a single table, the access via DOE maiboxes is
> + * done one entry at a time, where the first entry is the header.
> + */
> +
> +#define CXL_DOE_TABLE_ACCESS_REQ_CODE		0x000000ff
> +#define   CXL_DOE_TABLE_ACCESS_REQ_CODE_READ	0
> +#define CXL_DOE_TABLE_ACCESS_TABLE_TYPE		0x0000ff00
> +#define   CXL_DOE_TABLE_ACCESS_TABLE_TYPE_CDATA	0
> +#define CXL_DOE_TABLE_ACCESS_ENTRY_HANDLE	0xffff0000
> +#define CXL_DOE_TABLE_ACCESS_LAST_ENTRY		0xffff
> +
> +/*
> + * CDAT entries are little endian and are read from PCI config space which
> + * is also little endian.
> + * As such, on a big endian system these will have been reversed.
> + * This prevents us from making easy use of packed structures.
> + * Style form pci_regs.h
> + */
> +
> +#define CDAT_HEADER_LENGTH_DW 4
> +#define CDAT_HEADER_LENGTH_BYTES (CDAT_HEADER_LENGTH_DW * sizeof(u32))
> +#define CDAT_HEADER_DW0_LENGTH		0xffffffff
> +#define CDAT_HEADER_DW1_REVISION	0x000000ff
> +#define CDAT_HEADER_DW1_CHECKSUM	0x0000ff00
> +/* CDAT_HEADER_DW2_RESERVED	*/
> +#define CDAT_HEADER_DW3_SEQUENCE	0xffffffff
> +
> +/* All structures have a common first DW */
> +#define CDAT_STRUCTURE_DW0_TYPE		0x000000ff
> +#define   CDAT_STRUCTURE_DW0_TYPE_DSMAS 0
> +#define   CDAT_STRUCTURE_DW0_TYPE_DSLBIS 1
> +#define   CDAT_STRUCTURE_DW0_TYPE_DSMSCIS 2
> +#define   CDAT_STRUCTURE_DW0_TYPE_DSIS 3
> +#define   CDAT_STRUCTURE_DW0_TYPE_DSEMTS 4
> +#define   CDAT_STRUCTURE_DW0_TYPE_SSLBIS 5
> +
> +#define CDAT_STRUCTURE_DW0_LENGTH	0xffff0000
> +
> +/* Device Scoped Memory Affinity Structure */
> +#define CDAT_DSMAS_DW1_DSMAD_HANDLE	0x000000ff
> +#define CDAT_DSMAS_DW1_FLAGS		0x0000ff00
> +#define CDAT_DSMAS_DPA_OFFSET(entry) ((u64)((entry)[3]) << 32 | (entry)[2])
> +#define CDAT_DSMAS_DPA_LEN(entry) ((u64)((entry)[5]) << 32 | (entry)[4])
> +#define CDAT_DSMAS_NON_VOLATILE(flags)  ((flags & 0x04) >> 2)
> +
> +/* Device Scoped Latency and Bandwidth Information Structure */
> +#define CDAT_DSLBIS_DW1_HANDLE		0x000000ff
> +#define CDAT_DSLBIS_DW1_FLAGS		0x0000ff00
> +#define CDAT_DSLBIS_DW1_DATA_TYPE	0x00ff0000
> +#define CDAT_DSLBIS_BASE_UNIT(entry) ((u64)((entry)[3]) << 32 | (entry)[2])
> +#define CDAT_DSLBIS_DW4_ENTRY_0		0x0000ffff
> +#define CDAT_DSLBIS_DW4_ENTRY_1		0xffff0000
> +#define CDAT_DSLBIS_DW5_ENTRY_2		0x0000ffff
> +
> +/* Device Scoped Memory Side Cache Information Structure */
> +#define CDAT_DSMSCIS_DW1_HANDLE		0x000000ff
> +#define CDAT_DSMSCIS_MEMORY_SIDE_CACHE_SIZE(entry) \
> +	((u64)((entry)[3]) << 32 | (entry)[2])
> +#define CDAT_DSMSCIS_DW4_MEMORY_SIDE_CACHE_ATTRS 0xffffffff
> +
> +/* Device Scoped Initiator Structure */
> +#define CDAT_DSIS_DW1_FLAGS		0x000000ff
> +#define CDAT_DSIS_DW1_HANDLE		0x0000ff00
> +
> +/* Device Scoped EFI Memory Type Structure */
> +#define CDAT_DSEMTS_DW1_HANDLE		0x000000ff
> +#define CDAT_DSEMTS_DW1_EFI_MEMORY_TYPE_ATTR	0x0000ff00
> +#define CDAT_DSEMTS_DPA_OFFSET(entry)	((u64)((entry)[3]) << 32 | (entry)[2])
> +#define CDAT_DSEMTS_DPA_LENGTH(entry)	((u64)((entry)[5]) << 32 | (entry)[4])
> +
> +/* Switch Scoped Latency and Bandwidth Information Structure */
> +#define CDAT_SSLBIS_DW1_DATA_TYPE	0x000000ff
> +#define CDAT_SSLBIS_BASE_UNIT(entry)	((u64)((entry)[3]) << 32 | (entry)[2])
> +#define CDAT_SSLBIS_ENTRY_PORT_X(entry, i) ((entry)[4 + (i) * 2] & 0x0000ffff)
> +#define CDAT_SSLBIS_ENTRY_PORT_Y(entry, i) (((entry)[4 + (i) * 2] & 0xffff0000) >> 16)
> +#define CDAT_SSLBIS_ENTRY_LAT_OR_BW(entry, i) ((entry)[4 + (i) * 2 + 1] & 0x0000ffff)
> +
> +#define CXL_DOE_PROTOCOL_TABLE_ACCESS 2
> +
> +/**
> + * struct cxl_cdat - CXL CDAT data
> + *
> + * @table: cache of CDAT table
> + * @length: length of cached CDAT table
> + */
> +struct cxl_cdat {
> +	void *table;
> +	size_t length;
> +};
> +
> +#endif /* !__CXL_CDAT_H__ */
> diff --git a/drivers/cxl/core/pci.c b/drivers/cxl/core/pci.c
> index c4c99ff7b55e..84dc82f7dff0 100644
> --- a/drivers/cxl/core/pci.c
> +++ b/drivers/cxl/core/pci.c
> @@ -4,10 +4,12 @@
>  #include <linux/device.h>
>  #include <linux/delay.h>
>  #include <linux/pci.h>
> +#include <linux/pci-doe.h>
>  #include <cxlpci.h>
>  #include <cxlmem.h>
>  #include <cxl.h>
>  #include "core.h"
> +#include "cdat.h"
>  
>  /**
>   * DOC: cxl core pci
> @@ -458,3 +460,173 @@ int cxl_hdm_decode_init(struct cxl_dev_state *cxlds, struct cxl_hdm *cxlhdm)
>  	return 0;
>  }
>  EXPORT_SYMBOL_NS_GPL(cxl_hdm_decode_init, CXL);
> +
> +static struct pci_doe_mb *find_cdat_mb(struct device *uport)
> +{
> +	struct cxl_memdev *cxlmd;
> +	struct cxl_dev_state *cxlds;
> +	int i;
> +
> +	if (!is_cxl_memdev(uport))
> +		return NULL;
> +
> +	cxlmd = to_cxl_memdev(uport);
> +	cxlds = cxlmd->cxlds;

This feels stuck between 2 worlds. Either cxl_port_probe() needs to do
the enumeration, or the attribute needs to move to be memdev relative.
Given that CXL switches are going to also have CDAT data, then the
former path needs to happen. Yes, cxl_pci still needs to do the vector
allocation, but it does not need to do the PCI DOE probing.

> +
> +	for (i = 0; i < cxlds->num_mbs; i++) {
> +		struct pci_doe_mb *cur = cxlds->doe_mbs[i];
> +
> +		if (pci_doe_supports_prot(cur, PCI_DVSEC_VENDOR_ID_CXL,
> +					  CXL_DOE_PROTOCOL_TABLE_ACCESS))
> +			return cur;
> +	}
> +
> +	return NULL;
> +}
> +
> +#define CDAT_DOE_REQ(entry_handle)					\
> +	(FIELD_PREP(CXL_DOE_TABLE_ACCESS_REQ_CODE,			\
> +		    CXL_DOE_TABLE_ACCESS_REQ_CODE_READ) |		\
> +	 FIELD_PREP(CXL_DOE_TABLE_ACCESS_TABLE_TYPE,			\
> +		    CXL_DOE_TABLE_ACCESS_TABLE_TYPE_CDATA) |		\
> +	 FIELD_PREP(CXL_DOE_TABLE_ACCESS_ENTRY_HANDLE, (entry_handle)))
> +
> +static void cxl_doe_task_complete(struct pci_doe_task *task)
> +{
> +	complete(task->private);
> +}
> +
> +static int cxl_cdat_get_length(struct device *dev,
> +			       struct pci_doe_mb *cdat_mb,
> +			       size_t *length)
> +{
> +	u32 cdat_request_pl = CDAT_DOE_REQ(0);
> +	u32 cdat_response_pl[32];
> +	DECLARE_COMPLETION_ONSTACK(c);
> +	struct pci_doe_task task = {
> +		.prot.vid = PCI_DVSEC_VENDOR_ID_CXL,
> +		.prot.type = CXL_DOE_PROTOCOL_TABLE_ACCESS,
> +		.request_pl = &cdat_request_pl,
> +		.request_pl_sz = sizeof(cdat_request_pl),
> +		.response_pl = cdat_response_pl,
> +		.response_pl_sz = sizeof(cdat_response_pl),
> +		.complete = cxl_doe_task_complete,
> +		.private = &c,
> +	};
> +	int rc = 0;
> +
> +	rc = pci_doe_submit_task(cdat_mb, &task);
> +	if (rc < 0) {
> +		dev_err(dev, "DOE submit failed: %d", rc);
> +		return rc;
> +	}
> +	wait_for_completion(&c);
> +
> +	if (task.rv < 1)
> +		return -EIO;
> +
> +	*length = cdat_response_pl[1];
> +	dev_dbg(dev, "CDAT length %zu\n", *length);
> +
> +	return rc;
> +}
> +
> +static int cxl_cdat_read_table(struct device *dev,
> +			       struct pci_doe_mb *cdat_mb,
> +			       struct cxl_cdat *cdat)
> +{
> +	size_t length = cdat->length;
> +	u32 *data = cdat->table;
> +	int entry_handle = 0;
> +	int rc = 0;
> +
> +	do {
> +		u32 cdat_request_pl = CDAT_DOE_REQ(entry_handle);
> +		u32 cdat_response_pl[32];
> +		DECLARE_COMPLETION_ONSTACK(c);
> +		struct pci_doe_task task = {
> +			.prot.vid = PCI_DVSEC_VENDOR_ID_CXL,
> +			.prot.type = CXL_DOE_PROTOCOL_TABLE_ACCESS,
> +			.request_pl = &cdat_request_pl,
> +			.request_pl_sz = sizeof(cdat_request_pl),
> +			.response_pl = cdat_response_pl,
> +			.response_pl_sz = sizeof(cdat_response_pl),
> +			.complete = cxl_doe_task_complete,
> +			.private = &c,
> +		};
> +		size_t entry_dw;
> +		u32 *entry;
> +
> +		rc = pci_doe_submit_task(cdat_mb, &task);
> +		if (rc < 0) {
> +			dev_err(dev, "DOE submit failed: %d", rc);
> +			return rc;
> +		}
> +		wait_for_completion(&c);
> +
> +		/* Get the CXL table access header entry handle */
> +		entry_handle = FIELD_GET(CXL_DOE_TABLE_ACCESS_ENTRY_HANDLE,
> +					 cdat_response_pl[0]);
> +		entry = cdat_response_pl + 1;
> +		entry_dw = task.rv / sizeof(u32);
> +		/* Skip Header */
> +		entry_dw -= 1;
> +		entry_dw = min(length / 4, entry_dw);
> +		memcpy(data, entry, entry_dw * sizeof(u32));
> +		length -= entry_dw * sizeof(u32);
> +		data += entry_dw;
> +
> +	} while (entry_handle != CXL_DOE_TABLE_ACCESS_LAST_ENTRY);
> +
> +	return rc;
> +}
> +
> +/**
> + * read_cdat_data - Read the CDAT data on this port
> + * @port: Port to read data from
> + *
> + * This call will sleep waiting for responses from the DOE mailbox.
> + */
> +void read_cdat_data(struct cxl_port *port)
> +{
> +	static struct pci_doe_mb *cdat_mb;
> +	struct device *dev = &port->dev;
> +	struct device *uport = port->uport;
> +	size_t cdat_length;
> +	int ret;
> +
> +	/*
> +	 * Ensure a reference on the underlying uport device which has the
> +	 * mailboxes in it
> +	 */
> +	get_device(uport);

I had written up a long description grumbling about "just in case"
acquistions of device references, but then I lost that draft.

So I'll do the shorter response, but give you more homework in the
process. How / Why is that get_device() needed? What are the guarantees that
ensure you that the last reference has not been dropped just before that
call? Hint: what context is this code running?

> +
> +	cdat_mb = find_cdat_mb(uport);
> +	if (!cdat_mb) {
> +		dev_dbg(dev, "No CDAT mailbox\n");
> +		goto out;
> +	}
> +
> +	if (cxl_cdat_get_length(dev, cdat_mb, &cdat_length)) {
> +		dev_dbg(dev, "No CDAT length\n");
> +		goto out;
> +	}
> +
> +	port->cdat.table = devm_kzalloc(dev, cdat_length, GFP_KERNEL);
> +	if (!port->cdat.table)
> +		goto out;
> +
> +	port->cdat.length = cdat_length;
> +	ret = cxl_cdat_read_table(dev, cdat_mb, &port->cdat);
> +	if (ret) {
> +		/* Don't leave table data allocated on error */
> +		devm_kfree(dev, port->cdat.table);
> +		port->cdat.table = NULL;
> +		port->cdat.length = 0;
> +		dev_err(dev, "CDAT data read error\n");

Rather than a chatty / ephemeral error message I think this wants some
indication in userspace, likely the 0-length CDAT binary attribute, so
that userspace can debug why the kernel is picking sub-optimal QTG ids
for newly provisioned CXL regions.

> +	}
> +
> +out:
> +	put_device(uport);
> +}
> +EXPORT_SYMBOL_NS_GPL(read_cdat_data, CXL);
> diff --git a/drivers/cxl/cxl.h b/drivers/cxl/cxl.h
> index 140dc3278cde..f0f5c24789bc 100644
> --- a/drivers/cxl/cxl.h
> +++ b/drivers/cxl/cxl.h
> @@ -8,6 +8,7 @@
>  #include <linux/bitfield.h>
>  #include <linux/bitops.h>
>  #include <linux/io.h>
> +#include "cdat.h"
>  
>  /**
>   * DOC: cxl objects
> @@ -267,6 +268,7 @@ struct cxl_nvdimm {
>   * @component_reg_phys: component register capability base address (optional)
>   * @dead: last ep has been removed, force port re-creation
>   * @depth: How deep this port is relative to the root. depth 0 is the root.
> + * @cdat: Cached CDAT data
>   */
>  struct cxl_port {
>  	struct device dev;
> @@ -278,6 +280,7 @@ struct cxl_port {
>  	resource_size_t component_reg_phys;
>  	bool dead;
>  	unsigned int depth;
> +	struct cxl_cdat cdat;
>  };
>  
>  /**
> diff --git a/drivers/cxl/cxlpci.h b/drivers/cxl/cxlpci.h
> index fce1c11729c2..eec597dbe763 100644
> --- a/drivers/cxl/cxlpci.h
> +++ b/drivers/cxl/cxlpci.h
> @@ -74,4 +74,5 @@ static inline resource_size_t cxl_regmap_to_base(struct pci_dev *pdev,
>  int devm_cxl_port_enumerate_dports(struct cxl_port *port);
>  struct cxl_dev_state;
>  int cxl_hdm_decode_init(struct cxl_dev_state *cxlds, struct cxl_hdm *cxlhdm);
> +void read_cdat_data(struct cxl_port *port);
>  #endif /* __CXL_PCI_H__ */
> diff --git a/drivers/cxl/port.c b/drivers/cxl/port.c
> index 3cf308f114c4..7881ce66a5de 100644
> --- a/drivers/cxl/port.c
> +++ b/drivers/cxl/port.c
> @@ -49,6 +49,9 @@ static int cxl_port_probe(struct device *dev)
>  	if (IS_ERR(cxlhdm))
>  		return PTR_ERR(cxlhdm);
>  
> +	/* Cache the data early to ensure is_visible() works */
> +	read_cdat_data(port);
> +
>  	if (is_cxl_endpoint(port)) {
>  		struct cxl_memdev *cxlmd = to_cxl_memdev(port->uport);
>  		struct cxl_dev_state *cxlds = cxlmd->cxlds;
> @@ -78,10 +81,58 @@ static int cxl_port_probe(struct device *dev)
>  	return 0;
>  }
>  
> +static ssize_t cdat_read(struct file *filp, struct kobject *kobj,
> +			 struct bin_attribute *bin_attr, char *buf,
> +			 loff_t offset, size_t count)
> +{
> +	struct device *dev = kobj_to_dev(kobj);
> +	struct cxl_port *port = to_cxl_port(dev);
> +
> +	if (!port->cdat.table)
> +		return 0;
> +
> +	return memory_read_from_buffer(buf, count, &offset,
> +				       port->cdat.table,
> +				       port->cdat.length);
> +}
> +
> +static BIN_ATTR_RO(cdat, 0);

This should be BIN_ATTR_ADMIN_RO(), see:

3022c6a1b4b7 driver-core: Introduce DEVICE_ATTR_ADMIN_{RO,RW}

> +
> +static umode_t cxl_port_bin_attr_is_visible(struct kobject *kobj,
> +					      struct bin_attribute *attr, int i)
> +{
> +	struct device *dev = kobj_to_dev(kobj);
> +	struct cxl_port *port = to_cxl_port(dev);
> +
> +	if ((attr == &bin_attr_cdat) && port->cdat.table)
> +		return 0400;

Per above change you only need to manage visibility and not permissions,
also per the comment about the error message in the CDAT table read case
I think it makes sense to show an empty attribute. Only if the device
does not claim to have a CDAT should the attribute not be visible.

> +
> +	return 0;
> +}
> +
> +static struct bin_attribute *cxl_cdat_bin_attributes[] = {
> +	&bin_attr_cdat,
> +	NULL,
> +};
> +
> +static struct attribute_group cxl_cdat_attribute_group = {
> +	.name = "CDAT",
> +	.bin_attrs = cxl_cdat_bin_attributes,
> +	.is_bin_visible = cxl_port_bin_attr_is_visible,
> +};
> +
> +static const struct attribute_group *cxl_port_attribute_groups[] = {
> +	&cxl_cdat_attribute_group,
> +	NULL,
> +};
> +
>  static struct cxl_driver cxl_port_driver = {
>  	.name = "cxl_port",
>  	.probe = cxl_port_probe,
>  	.id = CXL_DEVICE_PORT,
> +	.drv = {
> +		.dev_groups = cxl_port_attribute_groups,
> +	},
>  };
>  
>  module_cxl_driver(cxl_port_driver);
> -- 
> 2.35.1
>
Dan Williams June 21, 2022, 7:10 p.m. UTC | #2
Dan Williams wrote:
> ira.weiny@ wrote:
> > From: Ira Weiny <ira.weiny@intel.com>
[..]
> > diff --git a/drivers/cxl/core/pci.c b/drivers/cxl/core/pci.c
> > index c4c99ff7b55e..84dc82f7dff0 100644
> > --- a/drivers/cxl/core/pci.c
> > +++ b/drivers/cxl/core/pci.c
> > @@ -4,10 +4,12 @@
> >  #include <linux/device.h>
> >  #include <linux/delay.h>
> >  #include <linux/pci.h>
> > +#include <linux/pci-doe.h>
> >  #include <cxlpci.h>
> >  #include <cxlmem.h>
> >  #include <cxl.h>
> >  #include "core.h"
> > +#include "cdat.h"
> >  
> >  /**
> >   * DOC: cxl core pci
> > @@ -458,3 +460,173 @@ int cxl_hdm_decode_init(struct cxl_dev_state *cxlds, struct cxl_hdm *cxlhdm)
> >  	return 0;
> >  }
> >  EXPORT_SYMBOL_NS_GPL(cxl_hdm_decode_init, CXL);
> > +
> > +static struct pci_doe_mb *find_cdat_mb(struct device *uport)
> > +{
> > +	struct cxl_memdev *cxlmd;
> > +	struct cxl_dev_state *cxlds;
> > +	int i;
> > +
> > +	if (!is_cxl_memdev(uport))
> > +		return NULL;
> > +
> > +	cxlmd = to_cxl_memdev(uport);
> > +	cxlds = cxlmd->cxlds;
> 
> This feels stuck between 2 worlds. Either cxl_port_probe() needs to do
> the enumeration, or the attribute needs to move to be memdev relative.
> Given that CXL switches are going to also have CDAT data, then the
> former path needs to happen. Yes, cxl_pci still needs to do the vector
> allocation, but it does not need to do the PCI DOE probing.

It is really the interrupt setup that makes this an awkward fit all
around. The PCI core knows how to handle capabilities with interrupts,
but only for PCIe port services. DOE is both a PCIe port service *and*
and "endpoint service" like VPD (pci_vpd_init()). The more I think about
this the closer I get to the recommendation from Lukas which is that
DOE is more like pci_vpd_init() than pci_aer_init(), or a custom
enabling per driver.

If the DOE enumeration moves to a sub-function of
pci_init_capabilities() then the cxl_pci and/or cxl_port drivers just
look those up and use them. The DOE instances would remain in polled
mode unless and until a PCI driver added interrupt support late. In
other words, DOE can follow the VPD init model as long as interrupts are
not involved, and if interrupts are desired it requires late allocation
of IRQ vectors.
Lukas Wunner June 21, 2022, 7:34 p.m. UTC | #3
On Tue, Jun 21, 2022 at 12:10:03PM -0700, Dan Williams wrote:
> It is really the interrupt setup that makes this an awkward fit all
> around. The PCI core knows how to handle capabilities with interrupts,
> but only for PCIe port services. DOE is both a PCIe port service *and*
> and "endpoint service" like VPD (pci_vpd_init()). The more I think about
> this the closer I get to the recommendation from Lukas which is that
> DOE is more like pci_vpd_init() than pci_aer_init(), or a custom
> enabling per driver.
> 
> If the DOE enumeration moves to a sub-function of
> pci_init_capabilities() then the cxl_pci and/or cxl_port drivers just
> look those up and use them. The DOE instances would remain in polled
> mode unless and until a PCI driver added interrupt support late. In
> other words, DOE can follow the VPD init model as long as interrupts are
> not involved, and if interrupts are desired it requires late allocation
> of IRQ vectors.

Thomas Gleixner has been working on dynamic allocation of MSI-X vectors.
We should probably just build on that and let the PCI core allocate
vectors for DOE mailboxes independently from drivers.

To conserve vectors, I'd delay allocation for a DOE mailbox until
it is first used.  There may be mailboxes that are never used.

DOE requires MSI-X or MSI.  We could probably leave MSI unsupported
until a device with broken MSI-X support shows up.  I envision that
with MSI, the onus is on the driver to allocate vectors for mailboxes
it intends to use and it would then have to "donate" those vectors
to the PCI core via a library function.

As for portdrv, that's a driver but Bjorn has expressed a desire
for a long time to move its functionality into the PCI core.
It shouldn't be allowed to unbind portdrv via sysfs and thus break
DPC etc, as is currently possible.

The question with regards to this series is, do we get *something*
merged and perfect it over time once it's in the tree, or do we
keep iterating on the mailing list.  I deliberately only provided
a single, comprehensive review and then stayed mum because I feel
bad for Ira having to keep reacting to more and more feedback
despite being at v11 already (or v12?  I've lost count).
Particularly because I suspect (I might be mistaken) that Ira's
natural habitat is actually CXL not PCI, so it might be a burden for him.
I'd be fine to implement suggestions I've made myself after Ira's
series lands.  No need for him to keep iterating ad infinitum.

Thanks,

Lukas
Dan Williams June 21, 2022, 7:41 p.m. UTC | #4
Lukas Wunner wrote:
> On Tue, Jun 21, 2022 at 12:10:03PM -0700, Dan Williams wrote:
> > It is really the interrupt setup that makes this an awkward fit all
> > around. The PCI core knows how to handle capabilities with interrupts,
> > but only for PCIe port services. DOE is both a PCIe port service *and*
> > and "endpoint service" like VPD (pci_vpd_init()). The more I think about
> > this the closer I get to the recommendation from Lukas which is that
> > DOE is more like pci_vpd_init() than pci_aer_init(), or a custom
> > enabling per driver.
> > 
> > If the DOE enumeration moves to a sub-function of
> > pci_init_capabilities() then the cxl_pci and/or cxl_port drivers just
> > look those up and use them. The DOE instances would remain in polled
> > mode unless and until a PCI driver added interrupt support late. In
> > other words, DOE can follow the VPD init model as long as interrupts are
> > not involved, and if interrupts are desired it requires late allocation
> > of IRQ vectors.
> 
> Thomas Gleixner has been working on dynamic allocation of MSI-X vectors.
> We should probably just build on that and let the PCI core allocate
> vectors for DOE mailboxes independently from drivers.
> 
> To conserve vectors, I'd delay allocation for a DOE mailbox until
> it is first used.  There may be mailboxes that are never used.
> 
> DOE requires MSI-X or MSI.  We could probably leave MSI unsupported
> until a device with broken MSI-X support shows up.  I envision that
> with MSI, the onus is on the driver to allocate vectors for mailboxes
> it intends to use and it would then have to "donate" those vectors
> to the PCI core via a library function.
> 
> As for portdrv, that's a driver but Bjorn has expressed a desire
> for a long time to move its functionality into the PCI core.
> It shouldn't be allowed to unbind portdrv via sysfs and thus break
> DPC etc, as is currently possible.
> 
> The question with regards to this series is, do we get *something*
> merged and perfect it over time once it's in the tree, or do we
> keep iterating on the mailing list.  I deliberately only provided
> a single, comprehensive review and then stayed mum because I feel
> bad for Ira having to keep reacting to more and more feedback
> despite being at v11 already (or v12?  I've lost count).
> Particularly because I suspect (I might be mistaken) that Ira's
> natural habitat is actually CXL not PCI, so it might be a burden for him.
> I'd be fine to implement suggestions I've made myself after Ira's
> series lands.  No need for him to keep iterating ad infinitum.

Yeah, sounds good. If the dynamic IRQ allocation support is on its way
then lets leave interrupt support out of the current DOE series and just
focus on getting polled mode going with the enumeration coming from the
PCI core.  That seems the shortest path to get something landed and
enables incremental improvement. Then the messiness of DOE interrupt
allocation and pcie_port_drv reworks can be saved for PCI core folks.
Ira Weiny June 21, 2022, 8:38 p.m. UTC | #5
On Tue, Jun 21, 2022 at 12:41:35PM -0700, Dan Williams wrote:
> Lukas Wunner wrote:
> > On Tue, Jun 21, 2022 at 12:10:03PM -0700, Dan Williams wrote:
> > > It is really the interrupt setup that makes this an awkward fit all
> > > around. The PCI core knows how to handle capabilities with interrupts,
> > > but only for PCIe port services. DOE is both a PCIe port service *and*
> > > and "endpoint service" like VPD (pci_vpd_init()). The more I think about
> > > this the closer I get to the recommendation from Lukas which is that
> > > DOE is more like pci_vpd_init() than pci_aer_init(), or a custom
> > > enabling per driver.
> > > 
> > > If the DOE enumeration moves to a sub-function of
> > > pci_init_capabilities() then the cxl_pci and/or cxl_port drivers just
> > > look those up and use them. The DOE instances would remain in polled
> > > mode unless and until a PCI driver added interrupt support late. In
> > > other words, DOE can follow the VPD init model as long as interrupts are
> > > not involved, and if interrupts are desired it requires late allocation
> > > of IRQ vectors.
> > 
> > Thomas Gleixner has been working on dynamic allocation of MSI-X vectors.
> > We should probably just build on that and let the PCI core allocate
> > vectors for DOE mailboxes independently from drivers.
> > 
> > To conserve vectors, I'd delay allocation for a DOE mailbox until
> > it is first used.  There may be mailboxes that are never used.
> > 
> > DOE requires MSI-X or MSI.  We could probably leave MSI unsupported
> > until a device with broken MSI-X support shows up.  I envision that
> > with MSI, the onus is on the driver to allocate vectors for mailboxes
> > it intends to use and it would then have to "donate" those vectors
> > to the PCI core via a library function.
> > 
> > As for portdrv, that's a driver but Bjorn has expressed a desire
> > for a long time to move its functionality into the PCI core.
> > It shouldn't be allowed to unbind portdrv via sysfs and thus break
> > DPC etc, as is currently possible.
> > 
> > The question with regards to this series is, do we get *something*
> > merged and perfect it over time once it's in the tree, or do we
> > keep iterating on the mailing list.

That is what I was going for by allocating the vectors in the CXL driver where
they happen to be used right now.  But I mentioned in the commit message that I
don't think that is where they should live long term.

> > I deliberately only provided
> > a single, comprehensive review and then stayed mum because I feel
> > bad for Ira having to keep reacting to more and more feedback
> > despite being at v11 already (or v12?  I've lost count).
> > Particularly because I suspect (I might be mistaken) that Ira's
> > natural habitat is actually CXL not PCI, so it might be a burden for him.

I'm always willing to learn more (PCI) but yes CXL is my focus.  I've never
modified the PCI layers so I am out of my element there.

> > I'd be fine to implement suggestions I've made myself after Ira's
> > series lands.  No need for him to keep iterating ad infinitum.
> 
> Yeah, sounds good. If the dynamic IRQ allocation support is on its way
> then lets leave interrupt support out of the current DOE series and just
> focus on getting polled mode going with the enumeration coming from the
> PCI core.

I've taken some time to look at the dynamic allocation stuff but I've not
internalized it nor figured out how to integrate it into the PCI/CXL layers.

> That seems the shortest path to get something landed and
> enables incremental improvement. Then the messiness of DOE interrupt
> allocation and pcie_port_drv reworks can be saved for PCI core folks.

Yes I think getting rid of the IRQ stuff for this round would help.  There has
been a lot of discussion around it and I'm still not fully happy with it where
it stands.  So I'll split it out for now.  If someone wants to pick it up that
would be great or I can work on a more central place.

Ira
Ira Weiny June 21, 2022, 9:14 p.m. UTC | #6
On Fri, Jun 17, 2022 at 05:43:34PM -0700, Dan Williams wrote:
> ira.weiny@ wrote:
> > From: Ira Weiny <ira.weiny@intel.com>
> > 

[snip]

> > +
> > +/**
> > + * read_cdat_data - Read the CDAT data on this port
> > + * @port: Port to read data from
> > + *
> > + * This call will sleep waiting for responses from the DOE mailbox.
> > + */
> > +void read_cdat_data(struct cxl_port *port)
> > +{
> > +	static struct pci_doe_mb *cdat_mb;
> > +	struct device *dev = &port->dev;
> > +	struct device *uport = port->uport;
> > +	size_t cdat_length;
> > +	int ret;
> > +
> > +	/*
> > +	 * Ensure a reference on the underlying uport device which has the
> > +	 * mailboxes in it
> > +	 */
> > +	get_device(uport);
> 
> I had written up a long description grumbling about "just in case"
> acquistions of device references, but then I lost that draft.
> 
> So I'll do the shorter response, but give you more homework in the
> process. How / Why is that get_device() needed? What are the guarantees that
> ensure you that the last reference has not been dropped just before that
> call? Hint: what context is this code running?

I'll check it out.  I suspect there is some reference on uport already taken
such that if port is valid uport must be valid.

> 
> > +
> > +	cdat_mb = find_cdat_mb(uport);
> > +	if (!cdat_mb) {
> > +		dev_dbg(dev, "No CDAT mailbox\n");
> > +		goto out;
> > +	}
> > +
> > +	if (cxl_cdat_get_length(dev, cdat_mb, &cdat_length)) {
> > +		dev_dbg(dev, "No CDAT length\n");
> > +		goto out;
> > +	}
> > +
> > +	port->cdat.table = devm_kzalloc(dev, cdat_length, GFP_KERNEL);
> > +	if (!port->cdat.table)
> > +		goto out;
> > +
> > +	port->cdat.length = cdat_length;
> > +	ret = cxl_cdat_read_table(dev, cdat_mb, &port->cdat);
> > +	if (ret) {
> > +		/* Don't leave table data allocated on error */
> > +		devm_kfree(dev, port->cdat.table);
> > +		port->cdat.table = NULL;
> > +		port->cdat.length = 0;
> > +		dev_err(dev, "CDAT data read error\n");
> 
> Rather than a chatty / ephemeral error message I think this wants some
> indication in userspace, likely the 0-length CDAT binary attribute, so
> that userspace can debug why the kernel is picking sub-optimal QTG ids
> for newly provisioned CXL regions.

I thought we agreed that 0-length or CDAT query failure would result in no
sysfs entry?

This message was to alert that a CDAT query was attempted but the read failed
vs finding no mailbox with CDAT capabilities for example.

[snip]

> >  
> > +static ssize_t cdat_read(struct file *filp, struct kobject *kobj,
> > +			 struct bin_attribute *bin_attr, char *buf,
> > +			 loff_t offset, size_t count)
> > +{
> > +	struct device *dev = kobj_to_dev(kobj);
> > +	struct cxl_port *port = to_cxl_port(dev);
> > +
> > +	if (!port->cdat.table)
> > +		return 0;
> > +
> > +	return memory_read_from_buffer(buf, count, &offset,
> > +				       port->cdat.table,
> > +				       port->cdat.length);
> > +}
> > +
> > +static BIN_ATTR_RO(cdat, 0);
> 
> This should be BIN_ATTR_ADMIN_RO(), see:
> 
> 3022c6a1b4b7 driver-core: Introduce DEVICE_ATTR_ADMIN_{RO,RW}

Are you suggesting I add BIN_ATTR_ADMIN_* macros?

> 
> > +
> > +static umode_t cxl_port_bin_attr_is_visible(struct kobject *kobj,
> > +					      struct bin_attribute *attr, int i)
> > +{
> > +	struct device *dev = kobj_to_dev(kobj);
> > +	struct cxl_port *port = to_cxl_port(dev);
> > +
> > +	if ((attr == &bin_attr_cdat) && port->cdat.table)
> > +		return 0400;
> 
> Per above change you only need to manage visibility and not permissions,

But the permissions indicate visibility (In the kdoc for struct
attribute_group).


 *              ...  Must
 *              return 0 if a binary attribute is not visible. The returned
 *              value will replace static permissions defined in
 *              struct bin_attribute.

And the value returned overrides the mode.

fs/sysfs/group.c:

create_files()

 82                         if (grp->is_bin_visible) {
 83                                 mode = grp->is_bin_visible(kobj, *bin_attr, i);
 84                                 if (!mode)
 85                                         continue;
 86                         }
 87 
 88                         WARN(mode & ~(SYSFS_PREALLOC | 0664),
 89                              "Attribute %s: Invalid permissions 0%o\n",
 90                              (*bin_attr)->attr.name, mode);
 91 
 92                         mode &= SYSFS_PREALLOC | 0664;


So I'm willing to add the macro but I'm not sure it is going to change anything
in this case.  I think to make those _ADMIN_ macros work with is_visible()
create_files() needs to be changed.  :-/  I'm not sure if the addition of
DEVICE_ATTR_ADMIN_{RO,RW} intended for is_visible() to be able to override the
mode?

> also per the comment about the error message in the CDAT table read case
> I think it makes sense to show an empty attribute. Only if the device
> does not claim to have a CDAT should the attribute not be visible.

Ok I can do that.

Thanks for the review,
Ira

> 
> > +
> > +	return 0;
> > +}
> > +
> > +static struct bin_attribute *cxl_cdat_bin_attributes[] = {
> > +	&bin_attr_cdat,
> > +	NULL,
> > +};
> > +
> > +static struct attribute_group cxl_cdat_attribute_group = {
> > +	.name = "CDAT",
> > +	.bin_attrs = cxl_cdat_bin_attributes,
> > +	.is_bin_visible = cxl_port_bin_attr_is_visible,
> > +};
> > +
> > +static const struct attribute_group *cxl_port_attribute_groups[] = {
> > +	&cxl_cdat_attribute_group,
> > +	NULL,
> > +};
> > +
> >  static struct cxl_driver cxl_port_driver = {
> >  	.name = "cxl_port",
> >  	.probe = cxl_port_probe,
> >  	.id = CXL_DEVICE_PORT,
> > +	.drv = {
> > +		.dev_groups = cxl_port_attribute_groups,
> > +	},
> >  };
> >  
> >  module_cxl_driver(cxl_port_driver);
> > -- 
> > 2.35.1
> > 
> 
>
Dan Williams June 21, 2022, 9:48 p.m. UTC | #7
Ira Weiny wrote:
> On Fri, Jun 17, 2022 at 05:43:34PM -0700, Dan Williams wrote:
> > ira.weiny@ wrote:
> > > From: Ira Weiny <ira.weiny@intel.com>
> > > 
> 
> [snip]
> 
> > > +
> > > +/**
> > > + * read_cdat_data - Read the CDAT data on this port
> > > + * @port: Port to read data from
> > > + *
> > > + * This call will sleep waiting for responses from the DOE mailbox.
> > > + */
> > > +void read_cdat_data(struct cxl_port *port)
> > > +{
> > > +	static struct pci_doe_mb *cdat_mb;
> > > +	struct device *dev = &port->dev;
> > > +	struct device *uport = port->uport;
> > > +	size_t cdat_length;
> > > +	int ret;
> > > +
> > > +	/*
> > > +	 * Ensure a reference on the underlying uport device which has the
> > > +	 * mailboxes in it
> > > +	 */
> > > +	get_device(uport);
> > 
> > I had written up a long description grumbling about "just in case"
> > acquistions of device references, but then I lost that draft.
> > 
> > So I'll do the shorter response, but give you more homework in the
> > process. How / Why is that get_device() needed? What are the guarantees that
> > ensure you that the last reference has not been dropped just before that
> > call? Hint: what context is this code running?
> 
> I'll check it out.  I suspect there is some reference on uport already taken
> such that if port is valid uport must be valid.

Right, this routine is called from cxl_port_probe() which holds the
device_lock() and prevents the port from being unregistered before it
has gone through device_release_driver(). The port was registered by the
driver for @uport i.e. cxl_mem. That driver is guaranteed to unregister
the port before it is unregistered itself.

As long as you know that the code path is within the lifespan of
cxl_port_probe() and cxl_port_remove() you are guaranteed to not need to
manage reference counts on the associated cxl_memdev.

> 
> > 
> > > +
> > > +	cdat_mb = find_cdat_mb(uport);
> > > +	if (!cdat_mb) {
> > > +		dev_dbg(dev, "No CDAT mailbox\n");
> > > +		goto out;
> > > +	}
> > > +
> > > +	if (cxl_cdat_get_length(dev, cdat_mb, &cdat_length)) {
> > > +		dev_dbg(dev, "No CDAT length\n");
> > > +		goto out;
> > > +	}
> > > +
> > > +	port->cdat.table = devm_kzalloc(dev, cdat_length, GFP_KERNEL);
> > > +	if (!port->cdat.table)
> > > +		goto out;
> > > +
> > > +	port->cdat.length = cdat_length;
> > > +	ret = cxl_cdat_read_table(dev, cdat_mb, &port->cdat);
> > > +	if (ret) {
> > > +		/* Don't leave table data allocated on error */
> > > +		devm_kfree(dev, port->cdat.table);
> > > +		port->cdat.table = NULL;
> > > +		port->cdat.length = 0;
> > > +		dev_err(dev, "CDAT data read error\n");
> > 
> > Rather than a chatty / ephemeral error message I think this wants some
> > indication in userspace, likely the 0-length CDAT binary attribute, so
> > that userspace can debug why the kernel is picking sub-optimal QTG ids
> > for newly provisioned CXL regions.
> 
> I thought we agreed that 0-length or CDAT query failure would result in no
> sysfs entry?

Oh, I forgot about that, but some new rationale below...

> 
> This message was to alert that a CDAT query was attempted but the read failed
> vs finding no mailbox with CDAT capabilities for example.

...right, but that's an error message buried in the kernel log. I was
hoping for something where tooling can query and say "oh, by the way,
the driver tried and failed to get CDAT from this device that claimed to
support CDAT, remedy that situation if you are seeing unexpected
performance / behavior".

> 
> [snip]
> 
> > >  
> > > +static ssize_t cdat_read(struct file *filp, struct kobject *kobj,
> > > +			 struct bin_attribute *bin_attr, char *buf,
> > > +			 loff_t offset, size_t count)
> > > +{
> > > +	struct device *dev = kobj_to_dev(kobj);
> > > +	struct cxl_port *port = to_cxl_port(dev);
> > > +
> > > +	if (!port->cdat.table)
> > > +		return 0;
> > > +
> > > +	return memory_read_from_buffer(buf, count, &offset,
> > > +				       port->cdat.table,
> > > +				       port->cdat.length);
> > > +}
> > > +
> > > +static BIN_ATTR_RO(cdat, 0);
> > 
> > This should be BIN_ATTR_ADMIN_RO(), see:
> > 
> > 3022c6a1b4b7 driver-core: Introduce DEVICE_ATTR_ADMIN_{RO,RW}
> 
> Are you suggesting I add BIN_ATTR_ADMIN_* macros?

Yes.

> 
> > 
> > > +
> > > +static umode_t cxl_port_bin_attr_is_visible(struct kobject *kobj,
> > > +					      struct bin_attribute *attr, int i)
> > > +{
> > > +	struct device *dev = kobj_to_dev(kobj);
> > > +	struct cxl_port *port = to_cxl_port(dev);
> > > +
> > > +	if ((attr == &bin_attr_cdat) && port->cdat.table)
> > > +		return 0400;
> > 
> > Per above change you only need to manage visibility and not permissions,
> 
> But the permissions indicate visibility (In the kdoc for struct
> attribute_group).
> 
> 
>  *              ...  Must
>  *              return 0 if a binary attribute is not visible. The returned
>  *              value will replace static permissions defined in
>  *              struct bin_attribute.
> 
> And the value returned overrides the mode.
> 
> fs/sysfs/group.c:
> 
> create_files()
> 
>  82                         if (grp->is_bin_visible) {
>  83                                 mode = grp->is_bin_visible(kobj, *bin_attr, i);
>  84                                 if (!mode)
>  85                                         continue;
>  86                         }
>  87 
>  88                         WARN(mode & ~(SYSFS_PREALLOC | 0664),
>  89                              "Attribute %s: Invalid permissions 0%o\n",
>  90                              (*bin_attr)->attr.name, mode);
>  91 
>  92                         mode &= SYSFS_PREALLOC | 0664;
> 
> 
> So I'm willing to add the macro but I'm not sure it is going to change anything
> in this case.

The change I was expecting is that with BIN_ATTR_ADMIN_RO() this
implementation changes from:

	if ((attr == &bin_attr_cdat) && port->cdat.table)
	         return 0400;

...to:

	if ((attr == &bin_attr_cdat) && port->cdat.table)
	         return attr->mode;

...i.e. this routine only modifies visibility, you do not also need it
to enforce the root-read-only permission change since that's already
statically defined at attribute creation time.

> I think to make those _ADMIN_ macros work with is_visible()
> create_files() needs to be changed.  :-/  I'm not sure if the addition of
> DEVICE_ATTR_ADMIN_{RO,RW} intended for is_visible() to be able to override the
> mode?

The intent was that one only needs to look in one place to read the
permission, and is_visible() is (mostly*) only left to change the mode to
0.

* changes from read-only to/from writable would still need is_visble()
to manipulate permissions, but you get the idea.
Ira Weiny June 28, 2022, 3:24 a.m. UTC | #8
On Tue, Jun 21, 2022 at 02:48:11PM -0700, Dan Williams wrote:
> Ira Weiny wrote:
> > On Fri, Jun 17, 2022 at 05:43:34PM -0700, Dan Williams wrote:
> > > ira.weiny@ wrote:
> > > > From: Ira Weiny <ira.weiny@intel.com>
> > > > 
> > > 

[snip]

> > > Rather than a chatty / ephemeral error message I think this wants some
> > > indication in userspace, likely the 0-length CDAT binary attribute, so
> > > that userspace can debug why the kernel is picking sub-optimal QTG ids
> > > for newly provisioned CXL regions.
> > 
> > I thought we agreed that 0-length or CDAT query failure would result in no
> > sysfs entry?
> 
> Oh, I forgot about that, but some new rationale below...
> 
> > 
> > This message was to alert that a CDAT query was attempted but the read failed
> > vs finding no mailbox with CDAT capabilities for example.
> 
> ...right, but that's an error message buried in the kernel log. I was
> hoping for something where tooling can query and say "oh, by the way,
> the driver tried and failed to get CDAT from this device that claimed to
> support CDAT, remedy that situation if you are seeing unexpected
> performance / behavior".
> 

Ok I've added a flag which indicates if the device supported CDAT or not.  If
so the sysfs will be visible but the data may be 0 length.  Which means there
was some error in reading it.

> > 
> > [snip]
> > 
> > > >  
> > > > +static ssize_t cdat_read(struct file *filp, struct kobject *kobj,
> > > > +			 struct bin_attribute *bin_attr, char *buf,
> > > > +			 loff_t offset, size_t count)
> > > > +{
> > > > +	struct device *dev = kobj_to_dev(kobj);
> > > > +	struct cxl_port *port = to_cxl_port(dev);
> > > > +
> > > > +	if (!port->cdat.table)
> > > > +		return 0;
> > > > +
> > > > +	return memory_read_from_buffer(buf, count, &offset,
> > > > +				       port->cdat.table,
> > > > +				       port->cdat.length);
> > > > +}
> > > > +
> > > > +static BIN_ATTR_RO(cdat, 0);
> > > 
> > > This should be BIN_ATTR_ADMIN_RO(), see:
> > > 
> > > 3022c6a1b4b7 driver-core: Introduce DEVICE_ATTR_ADMIN_{RO,RW}
> > 
> > Are you suggesting I add BIN_ATTR_ADMIN_* macros?
> 
> Yes.

Done.

> 
> > 
> > > 
> > > > +
> > > > +static umode_t cxl_port_bin_attr_is_visible(struct kobject *kobj,
> > > > +					      struct bin_attribute *attr, int i)
> > > > +{
> > > > +	struct device *dev = kobj_to_dev(kobj);
> > > > +	struct cxl_port *port = to_cxl_port(dev);
> > > > +
> > > > +	if ((attr == &bin_attr_cdat) && port->cdat.table)
> > > > +		return 0400;
> > > 
> > > Per above change you only need to manage visibility and not permissions,
> > 
> > But the permissions indicate visibility (In the kdoc for struct
> > attribute_group).
> > 
> > 
> >  *              ...  Must
> >  *              return 0 if a binary attribute is not visible. The returned
> >  *              value will replace static permissions defined in
> >  *              struct bin_attribute.
> > 
> > And the value returned overrides the mode.
> > 
> > fs/sysfs/group.c:
> > 
> > create_files()
> > 
> >  82                         if (grp->is_bin_visible) {
> >  83                                 mode = grp->is_bin_visible(kobj, *bin_attr, i);
> >  84                                 if (!mode)
> >  85                                         continue;
> >  86                         }
> >  87 
> >  88                         WARN(mode & ~(SYSFS_PREALLOC | 0664),
> >  89                              "Attribute %s: Invalid permissions 0%o\n",
> >  90                              (*bin_attr)->attr.name, mode);
> >  91 
> >  92                         mode &= SYSFS_PREALLOC | 0664;
> > 
> > 
> > So I'm willing to add the macro but I'm not sure it is going to change anything
> > in this case.
> 
> The change I was expecting is that with BIN_ATTR_ADMIN_RO() this
> implementation changes from:
> 
> 	if ((attr == &bin_attr_cdat) && port->cdat.table)
> 	         return 0400;
> 
> ...to:
> 
> 	if ((attr == &bin_attr_cdat) && port->cdat.table)
> 	         return attr->mode;
> 
> ...i.e. this routine only modifies visibility, you do not also need it
> to enforce the root-read-only permission change since that's already
> statically defined at attribute creation time.

Ok.

> 
> > I think to make those _ADMIN_ macros work with is_visible()
> > create_files() needs to be changed.  :-/  I'm not sure if the addition of
> > DEVICE_ATTR_ADMIN_{RO,RW} intended for is_visible() to be able to override the
> > mode?
> 
> The intent was that one only needs to look in one place to read the
> permission, and is_visible() is (mostly*) only left to change the mode to
> 0.
> 
> * changes from read-only to/from writable would still need is_visble()
> to manipulate permissions, but you get the idea.

Yep, done.
Ira
diff mbox series

Patch

diff --git a/drivers/cxl/cdat.h b/drivers/cxl/cdat.h
new file mode 100644
index 000000000000..c6a48ab326bf
--- /dev/null
+++ b/drivers/cxl/cdat.h
@@ -0,0 +1,100 @@ 
+/* SPDX-License-Identifier: GPL-2.0 */
+#ifndef __CXL_CDAT_H__
+#define __CXL_CDAT_H__
+
+/*
+ * Coherent Device Attribute table (CDAT)
+ *
+ * Specification available from UEFI.org
+ *
+ * Whilst CDAT is defined as a single table, the access via DOE maiboxes is
+ * done one entry at a time, where the first entry is the header.
+ */
+
+#define CXL_DOE_TABLE_ACCESS_REQ_CODE		0x000000ff
+#define   CXL_DOE_TABLE_ACCESS_REQ_CODE_READ	0
+#define CXL_DOE_TABLE_ACCESS_TABLE_TYPE		0x0000ff00
+#define   CXL_DOE_TABLE_ACCESS_TABLE_TYPE_CDATA	0
+#define CXL_DOE_TABLE_ACCESS_ENTRY_HANDLE	0xffff0000
+#define CXL_DOE_TABLE_ACCESS_LAST_ENTRY		0xffff
+
+/*
+ * CDAT entries are little endian and are read from PCI config space which
+ * is also little endian.
+ * As such, on a big endian system these will have been reversed.
+ * This prevents us from making easy use of packed structures.
+ * Style form pci_regs.h
+ */
+
+#define CDAT_HEADER_LENGTH_DW 4
+#define CDAT_HEADER_LENGTH_BYTES (CDAT_HEADER_LENGTH_DW * sizeof(u32))
+#define CDAT_HEADER_DW0_LENGTH		0xffffffff
+#define CDAT_HEADER_DW1_REVISION	0x000000ff
+#define CDAT_HEADER_DW1_CHECKSUM	0x0000ff00
+/* CDAT_HEADER_DW2_RESERVED	*/
+#define CDAT_HEADER_DW3_SEQUENCE	0xffffffff
+
+/* All structures have a common first DW */
+#define CDAT_STRUCTURE_DW0_TYPE		0x000000ff
+#define   CDAT_STRUCTURE_DW0_TYPE_DSMAS 0
+#define   CDAT_STRUCTURE_DW0_TYPE_DSLBIS 1
+#define   CDAT_STRUCTURE_DW0_TYPE_DSMSCIS 2
+#define   CDAT_STRUCTURE_DW0_TYPE_DSIS 3
+#define   CDAT_STRUCTURE_DW0_TYPE_DSEMTS 4
+#define   CDAT_STRUCTURE_DW0_TYPE_SSLBIS 5
+
+#define CDAT_STRUCTURE_DW0_LENGTH	0xffff0000
+
+/* Device Scoped Memory Affinity Structure */
+#define CDAT_DSMAS_DW1_DSMAD_HANDLE	0x000000ff
+#define CDAT_DSMAS_DW1_FLAGS		0x0000ff00
+#define CDAT_DSMAS_DPA_OFFSET(entry) ((u64)((entry)[3]) << 32 | (entry)[2])
+#define CDAT_DSMAS_DPA_LEN(entry) ((u64)((entry)[5]) << 32 | (entry)[4])
+#define CDAT_DSMAS_NON_VOLATILE(flags)  ((flags & 0x04) >> 2)
+
+/* Device Scoped Latency and Bandwidth Information Structure */
+#define CDAT_DSLBIS_DW1_HANDLE		0x000000ff
+#define CDAT_DSLBIS_DW1_FLAGS		0x0000ff00
+#define CDAT_DSLBIS_DW1_DATA_TYPE	0x00ff0000
+#define CDAT_DSLBIS_BASE_UNIT(entry) ((u64)((entry)[3]) << 32 | (entry)[2])
+#define CDAT_DSLBIS_DW4_ENTRY_0		0x0000ffff
+#define CDAT_DSLBIS_DW4_ENTRY_1		0xffff0000
+#define CDAT_DSLBIS_DW5_ENTRY_2		0x0000ffff
+
+/* Device Scoped Memory Side Cache Information Structure */
+#define CDAT_DSMSCIS_DW1_HANDLE		0x000000ff
+#define CDAT_DSMSCIS_MEMORY_SIDE_CACHE_SIZE(entry) \
+	((u64)((entry)[3]) << 32 | (entry)[2])
+#define CDAT_DSMSCIS_DW4_MEMORY_SIDE_CACHE_ATTRS 0xffffffff
+
+/* Device Scoped Initiator Structure */
+#define CDAT_DSIS_DW1_FLAGS		0x000000ff
+#define CDAT_DSIS_DW1_HANDLE		0x0000ff00
+
+/* Device Scoped EFI Memory Type Structure */
+#define CDAT_DSEMTS_DW1_HANDLE		0x000000ff
+#define CDAT_DSEMTS_DW1_EFI_MEMORY_TYPE_ATTR	0x0000ff00
+#define CDAT_DSEMTS_DPA_OFFSET(entry)	((u64)((entry)[3]) << 32 | (entry)[2])
+#define CDAT_DSEMTS_DPA_LENGTH(entry)	((u64)((entry)[5]) << 32 | (entry)[4])
+
+/* Switch Scoped Latency and Bandwidth Information Structure */
+#define CDAT_SSLBIS_DW1_DATA_TYPE	0x000000ff
+#define CDAT_SSLBIS_BASE_UNIT(entry)	((u64)((entry)[3]) << 32 | (entry)[2])
+#define CDAT_SSLBIS_ENTRY_PORT_X(entry, i) ((entry)[4 + (i) * 2] & 0x0000ffff)
+#define CDAT_SSLBIS_ENTRY_PORT_Y(entry, i) (((entry)[4 + (i) * 2] & 0xffff0000) >> 16)
+#define CDAT_SSLBIS_ENTRY_LAT_OR_BW(entry, i) ((entry)[4 + (i) * 2 + 1] & 0x0000ffff)
+
+#define CXL_DOE_PROTOCOL_TABLE_ACCESS 2
+
+/**
+ * struct cxl_cdat - CXL CDAT data
+ *
+ * @table: cache of CDAT table
+ * @length: length of cached CDAT table
+ */
+struct cxl_cdat {
+	void *table;
+	size_t length;
+};
+
+#endif /* !__CXL_CDAT_H__ */
diff --git a/drivers/cxl/core/pci.c b/drivers/cxl/core/pci.c
index c4c99ff7b55e..84dc82f7dff0 100644
--- a/drivers/cxl/core/pci.c
+++ b/drivers/cxl/core/pci.c
@@ -4,10 +4,12 @@ 
 #include <linux/device.h>
 #include <linux/delay.h>
 #include <linux/pci.h>
+#include <linux/pci-doe.h>
 #include <cxlpci.h>
 #include <cxlmem.h>
 #include <cxl.h>
 #include "core.h"
+#include "cdat.h"
 
 /**
  * DOC: cxl core pci
@@ -458,3 +460,173 @@  int cxl_hdm_decode_init(struct cxl_dev_state *cxlds, struct cxl_hdm *cxlhdm)
 	return 0;
 }
 EXPORT_SYMBOL_NS_GPL(cxl_hdm_decode_init, CXL);
+
+static struct pci_doe_mb *find_cdat_mb(struct device *uport)
+{
+	struct cxl_memdev *cxlmd;
+	struct cxl_dev_state *cxlds;
+	int i;
+
+	if (!is_cxl_memdev(uport))
+		return NULL;
+
+	cxlmd = to_cxl_memdev(uport);
+	cxlds = cxlmd->cxlds;
+
+	for (i = 0; i < cxlds->num_mbs; i++) {
+		struct pci_doe_mb *cur = cxlds->doe_mbs[i];
+
+		if (pci_doe_supports_prot(cur, PCI_DVSEC_VENDOR_ID_CXL,
+					  CXL_DOE_PROTOCOL_TABLE_ACCESS))
+			return cur;
+	}
+
+	return NULL;
+}
+
+#define CDAT_DOE_REQ(entry_handle)					\
+	(FIELD_PREP(CXL_DOE_TABLE_ACCESS_REQ_CODE,			\
+		    CXL_DOE_TABLE_ACCESS_REQ_CODE_READ) |		\
+	 FIELD_PREP(CXL_DOE_TABLE_ACCESS_TABLE_TYPE,			\
+		    CXL_DOE_TABLE_ACCESS_TABLE_TYPE_CDATA) |		\
+	 FIELD_PREP(CXL_DOE_TABLE_ACCESS_ENTRY_HANDLE, (entry_handle)))
+
+static void cxl_doe_task_complete(struct pci_doe_task *task)
+{
+	complete(task->private);
+}
+
+static int cxl_cdat_get_length(struct device *dev,
+			       struct pci_doe_mb *cdat_mb,
+			       size_t *length)
+{
+	u32 cdat_request_pl = CDAT_DOE_REQ(0);
+	u32 cdat_response_pl[32];
+	DECLARE_COMPLETION_ONSTACK(c);
+	struct pci_doe_task task = {
+		.prot.vid = PCI_DVSEC_VENDOR_ID_CXL,
+		.prot.type = CXL_DOE_PROTOCOL_TABLE_ACCESS,
+		.request_pl = &cdat_request_pl,
+		.request_pl_sz = sizeof(cdat_request_pl),
+		.response_pl = cdat_response_pl,
+		.response_pl_sz = sizeof(cdat_response_pl),
+		.complete = cxl_doe_task_complete,
+		.private = &c,
+	};
+	int rc = 0;
+
+	rc = pci_doe_submit_task(cdat_mb, &task);
+	if (rc < 0) {
+		dev_err(dev, "DOE submit failed: %d", rc);
+		return rc;
+	}
+	wait_for_completion(&c);
+
+	if (task.rv < 1)
+		return -EIO;
+
+	*length = cdat_response_pl[1];
+	dev_dbg(dev, "CDAT length %zu\n", *length);
+
+	return rc;
+}
+
+static int cxl_cdat_read_table(struct device *dev,
+			       struct pci_doe_mb *cdat_mb,
+			       struct cxl_cdat *cdat)
+{
+	size_t length = cdat->length;
+	u32 *data = cdat->table;
+	int entry_handle = 0;
+	int rc = 0;
+
+	do {
+		u32 cdat_request_pl = CDAT_DOE_REQ(entry_handle);
+		u32 cdat_response_pl[32];
+		DECLARE_COMPLETION_ONSTACK(c);
+		struct pci_doe_task task = {
+			.prot.vid = PCI_DVSEC_VENDOR_ID_CXL,
+			.prot.type = CXL_DOE_PROTOCOL_TABLE_ACCESS,
+			.request_pl = &cdat_request_pl,
+			.request_pl_sz = sizeof(cdat_request_pl),
+			.response_pl = cdat_response_pl,
+			.response_pl_sz = sizeof(cdat_response_pl),
+			.complete = cxl_doe_task_complete,
+			.private = &c,
+		};
+		size_t entry_dw;
+		u32 *entry;
+
+		rc = pci_doe_submit_task(cdat_mb, &task);
+		if (rc < 0) {
+			dev_err(dev, "DOE submit failed: %d", rc);
+			return rc;
+		}
+		wait_for_completion(&c);
+
+		/* Get the CXL table access header entry handle */
+		entry_handle = FIELD_GET(CXL_DOE_TABLE_ACCESS_ENTRY_HANDLE,
+					 cdat_response_pl[0]);
+		entry = cdat_response_pl + 1;
+		entry_dw = task.rv / sizeof(u32);
+		/* Skip Header */
+		entry_dw -= 1;
+		entry_dw = min(length / 4, entry_dw);
+		memcpy(data, entry, entry_dw * sizeof(u32));
+		length -= entry_dw * sizeof(u32);
+		data += entry_dw;
+
+	} while (entry_handle != CXL_DOE_TABLE_ACCESS_LAST_ENTRY);
+
+	return rc;
+}
+
+/**
+ * read_cdat_data - Read the CDAT data on this port
+ * @port: Port to read data from
+ *
+ * This call will sleep waiting for responses from the DOE mailbox.
+ */
+void read_cdat_data(struct cxl_port *port)
+{
+	static struct pci_doe_mb *cdat_mb;
+	struct device *dev = &port->dev;
+	struct device *uport = port->uport;
+	size_t cdat_length;
+	int ret;
+
+	/*
+	 * Ensure a reference on the underlying uport device which has the
+	 * mailboxes in it
+	 */
+	get_device(uport);
+
+	cdat_mb = find_cdat_mb(uport);
+	if (!cdat_mb) {
+		dev_dbg(dev, "No CDAT mailbox\n");
+		goto out;
+	}
+
+	if (cxl_cdat_get_length(dev, cdat_mb, &cdat_length)) {
+		dev_dbg(dev, "No CDAT length\n");
+		goto out;
+	}
+
+	port->cdat.table = devm_kzalloc(dev, cdat_length, GFP_KERNEL);
+	if (!port->cdat.table)
+		goto out;
+
+	port->cdat.length = cdat_length;
+	ret = cxl_cdat_read_table(dev, cdat_mb, &port->cdat);
+	if (ret) {
+		/* Don't leave table data allocated on error */
+		devm_kfree(dev, port->cdat.table);
+		port->cdat.table = NULL;
+		port->cdat.length = 0;
+		dev_err(dev, "CDAT data read error\n");
+	}
+
+out:
+	put_device(uport);
+}
+EXPORT_SYMBOL_NS_GPL(read_cdat_data, CXL);
diff --git a/drivers/cxl/cxl.h b/drivers/cxl/cxl.h
index 140dc3278cde..f0f5c24789bc 100644
--- a/drivers/cxl/cxl.h
+++ b/drivers/cxl/cxl.h
@@ -8,6 +8,7 @@ 
 #include <linux/bitfield.h>
 #include <linux/bitops.h>
 #include <linux/io.h>
+#include "cdat.h"
 
 /**
  * DOC: cxl objects
@@ -267,6 +268,7 @@  struct cxl_nvdimm {
  * @component_reg_phys: component register capability base address (optional)
  * @dead: last ep has been removed, force port re-creation
  * @depth: How deep this port is relative to the root. depth 0 is the root.
+ * @cdat: Cached CDAT data
  */
 struct cxl_port {
 	struct device dev;
@@ -278,6 +280,7 @@  struct cxl_port {
 	resource_size_t component_reg_phys;
 	bool dead;
 	unsigned int depth;
+	struct cxl_cdat cdat;
 };
 
 /**
diff --git a/drivers/cxl/cxlpci.h b/drivers/cxl/cxlpci.h
index fce1c11729c2..eec597dbe763 100644
--- a/drivers/cxl/cxlpci.h
+++ b/drivers/cxl/cxlpci.h
@@ -74,4 +74,5 @@  static inline resource_size_t cxl_regmap_to_base(struct pci_dev *pdev,
 int devm_cxl_port_enumerate_dports(struct cxl_port *port);
 struct cxl_dev_state;
 int cxl_hdm_decode_init(struct cxl_dev_state *cxlds, struct cxl_hdm *cxlhdm);
+void read_cdat_data(struct cxl_port *port);
 #endif /* __CXL_PCI_H__ */
diff --git a/drivers/cxl/port.c b/drivers/cxl/port.c
index 3cf308f114c4..7881ce66a5de 100644
--- a/drivers/cxl/port.c
+++ b/drivers/cxl/port.c
@@ -49,6 +49,9 @@  static int cxl_port_probe(struct device *dev)
 	if (IS_ERR(cxlhdm))
 		return PTR_ERR(cxlhdm);
 
+	/* Cache the data early to ensure is_visible() works */
+	read_cdat_data(port);
+
 	if (is_cxl_endpoint(port)) {
 		struct cxl_memdev *cxlmd = to_cxl_memdev(port->uport);
 		struct cxl_dev_state *cxlds = cxlmd->cxlds;
@@ -78,10 +81,58 @@  static int cxl_port_probe(struct device *dev)
 	return 0;
 }
 
+static ssize_t cdat_read(struct file *filp, struct kobject *kobj,
+			 struct bin_attribute *bin_attr, char *buf,
+			 loff_t offset, size_t count)
+{
+	struct device *dev = kobj_to_dev(kobj);
+	struct cxl_port *port = to_cxl_port(dev);
+
+	if (!port->cdat.table)
+		return 0;
+
+	return memory_read_from_buffer(buf, count, &offset,
+				       port->cdat.table,
+				       port->cdat.length);
+}
+
+static BIN_ATTR_RO(cdat, 0);
+
+static umode_t cxl_port_bin_attr_is_visible(struct kobject *kobj,
+					      struct bin_attribute *attr, int i)
+{
+	struct device *dev = kobj_to_dev(kobj);
+	struct cxl_port *port = to_cxl_port(dev);
+
+	if ((attr == &bin_attr_cdat) && port->cdat.table)
+		return 0400;
+
+	return 0;
+}
+
+static struct bin_attribute *cxl_cdat_bin_attributes[] = {
+	&bin_attr_cdat,
+	NULL,
+};
+
+static struct attribute_group cxl_cdat_attribute_group = {
+	.name = "CDAT",
+	.bin_attrs = cxl_cdat_bin_attributes,
+	.is_bin_visible = cxl_port_bin_attr_is_visible,
+};
+
+static const struct attribute_group *cxl_port_attribute_groups[] = {
+	&cxl_cdat_attribute_group,
+	NULL,
+};
+
 static struct cxl_driver cxl_port_driver = {
 	.name = "cxl_port",
 	.probe = cxl_port_probe,
 	.id = CXL_DEVICE_PORT,
+	.drv = {
+		.dev_groups = cxl_port_attribute_groups,
+	},
 };
 
 module_cxl_driver(cxl_port_driver);