
[V8,03/10] PCI: Create PCI library functions in support of DOE mailboxes.

Message ID 20220414203237.2198665-4-ira.weiny@intel.com
State New, archived
Series CXL: Read CDAT and DSMAS data from the device

Commit Message

Ira Weiny April 14, 2022, 8:32 p.m. UTC
From: Jonathan Cameron <Jonathan.Cameron@huawei.com>

Introduced in PCIe r6.0, sec 6.30[1], DOE provides a config space based mailbox
with standard protocol discovery.  Each mailbox is accessed through a
DOE Extended Capability.

Each DOE mailbox must support the DOE discovery protocol in addition to
any number of additional protocols.

Define core PCI functionality to manage a single PCI DOE mailbox at a
defined config space offset.  Functionality includes creating the mailbox
state objects, querying supported protocols, submitting tasks, and
destroying the objects.

If interrupts are desired, interrupt vectors should be allocated prior
to asking for IRQs when creating a mailbox object.

[1] https://members.pcisig.com/wg/PCI-SIG/document/16609

Cc: Christoph Hellwig <hch@infradead.org>
Signed-off-by: Jonathan Cameron <Jonathan.Cameron@huawei.com>
Co-developed-by: Ira Weiny <ira.weiny@intel.com>
Signed-off-by: Ira Weiny <ira.weiny@intel.com>

---
Changes from V7
	Add a Kconfig for this functionality
	Fix bug in pci_doe_supports_prot()
	Rebased on cxl-pending

Changes from V6
	Clean up signed off by lines
	Make this functionality all PCI library functions
	Clean up header files
	s/pci_doe_irq/pci_doe_irq_handler
	Use pci_{request,free}_irq
		Remove irq_name (maintained by pci_request_irq)
	Fix checks to use an irq
	Consistently use u16 for cap_offset
	Cleanup kdocs and comments
	Create a helper retire_cur_task() to handle locking of the
		current task pointer.
	Remove devm_ calls from PCI layer.
		The devm_ calls do not allow for the pci_doe_mb objects
		to be tied to an auxiliary device.  Leave it to the
		caller to use devm_ if desired.
	From Dan Williams
		s/cb/end_task/; Pass pci_doe_task to end_task
		Clarify exchange/task/request/response.
			Merge pci_doe_task and pci_doe_exchange into
			pci_doe_task which represents a single
			request/response task for the state machine to
			process.
		Simplify submitting work to the mailbox
			Replace pci_doe_exchange_sync() with
			pci_doe_submit_task() Consumers of the mailbox
			are now responsible for setting up callbacks
			within a task object and submitting them to the
			mailbox to be processed.
		Remove WARN_ON when task != NULL and be sure to abort that task.
		Convert abort/dead to atomic flags
		s/state_lock/task_lock to better define what the lock is
			protecting
		Remove all the auxiliary bus code from the PCI layer
			The PCI layer provides helpers to use the DOE
			Mailboxes.  Each subsystem can then use the
			helpers as they see fit.  The CXL layer in this
			series uses aux devices to manage the new
			pci_doe_mb objects.

	From Bjorn
		Clarify the fact that DOE mailboxes are capabilities of
			the device.
		Code clean ups
		Cleanup Makefile
		Update references to PCI SIG spec v6.0
		Move this attribution here:
		This code is based on Jonathan's V4 series here:
		https://lore.kernel.org/linux-cxl/20210524133938.2815206-1-Jonathan.Cameron@huawei.com/

Changes from V5
	From Bjorn
		s/pci_WARN/pci_warn
			Add timeout period to print
		Trim to 80 chars
		Use Tabs for DOE define spacing
		Use %#x for clarity
	From Jonathan
		Addresses concerns about the order of unwinding stuff
		s/doe/doe_dev in pci_doe_exchange_sync
		Correct kernel Doc comment
		Move pci_doe_task_complete() down in the file.
		Rework pci_doe_irq()
			process STATUS_ERROR first
			Return IRQ_NONE if the irq is not processed
			Use PCI_DOE_STATUS_INT_STATUS explicitly to
				clear the irq
	Clean up goto label s/err_free_irqs/err_free_irq
	use devm_kzalloc for doe struct
	clean up error paths in pci_doe_probe
	s/pci_doe_drv/pci_doe
	remove include mutex.h
	remove device name and define, move it in the next patch which uses it
	use devm_kasprintf() for irq_name
	use devm_request_irq()
	remove pci_doe_unregister()
		[get/put]_device() were unneeded and with the use of
		devm_* this function can be removed completely.
	refactor pci_doe_register and s/pci_doe_register/pci_doe_reg_irq
		make this function just a registration of the irq and
		move pci_doe_abort() into pci_doe_probe()
	use devm_* to allocate the protocol array

Changes from Jonathan's V4
	Move the DOE MB code into the DOE auxiliary driver
	Remove Task List in favor of a wait queue

Changes from Ben
	remove CXL references
	propagate rc from pci functions on error

squash: make PCI_DOE optional, selected by those who need it.

squash: Create PCI library functions

Fix bug in protocol support check
---
 drivers/pci/Kconfig           |   3 +
 drivers/pci/Makefile          |   1 +
 drivers/pci/pci-doe.c         | 607 ++++++++++++++++++++++++++++++++++
 include/linux/pci-doe.h       | 119 +++++++
 include/uapi/linux/pci_regs.h |  29 +-
 5 files changed, 758 insertions(+), 1 deletion(-)
 create mode 100644 drivers/pci/pci-doe.c
 create mode 100644 include/linux/pci-doe.h

Comments

Bjorn Helgaas April 28, 2022, 9:27 p.m. UTC | #1
On Thu, Apr 14, 2022 at 01:32:30PM -0700, ira.weiny@intel.com wrote:
> From: Jonathan Cameron <Jonathan.Cameron@huawei.com>
> 
> Introduced in a PCI v6.0[1], DOE provides a config space based mailbox
 
  Introduced in PCIe r6.0, sec 6.30, DOE ...

> with standard protocol discovery.  Each mailbox is accessed through a
> DOE Extended Capability.
> 
> Each DOE mailbox must support the DOE discovery protocol in addition to
> any number of additional protocols.
> 
> Define core PCI functionality to manage a single PCI DOE mailbox at a
> defined config space offset.  Functionality includes creating, supported
> protocol queries, submit tasks to, and destroying the new state objects.
> 
> If interrupts are desired, interrupts vectors should be allocated prior
> to asking for irq's when creating a mailbox object.
                IRQs

> [1] https://members.pcisig.com/wg/PCI-SIG/document/16609

The link is only useful for PCI-SIG members, and only as long as the
SIG maintains this structure.  I think "PCIe r6.0" is sufficient.

> Cc: Christoph Hellwig <hch@infradead.org>
> Signed-off-by: Jonathan Cameron <Jonathan.Cameron@huawei.com>
> Co-developed-by: Ira Weiny <ira.weiny@intel.com>
> Signed-off-by: Ira Weiny <ira.weiny@intel.com>

Acked-by: Bjorn Helgaas <bhelgaas@google.com>

Minor comments below.

> [...]
> diff --git a/drivers/pci/Kconfig b/drivers/pci/Kconfig
> index 133c73207782..b2f2e588a817 100644
> --- a/drivers/pci/Kconfig
> +++ b/drivers/pci/Kconfig
> @@ -121,6 +121,9 @@ config XEN_PCIDEV_FRONTEND
>  config PCI_ATS
>  	bool
>  
> +config PCI_DOE
> +	bool
> +
>  config PCI_ECAM
>  	bool
>  
> diff --git a/drivers/pci/Makefile b/drivers/pci/Makefile
> index 0da6b1ebc694..b609d8ad813d 100644
> --- a/drivers/pci/Makefile
> +++ b/drivers/pci/Makefile
> @@ -31,6 +31,7 @@ obj-$(CONFIG_PCI_ECAM)		+= ecam.o
>  obj-$(CONFIG_PCI_P2PDMA)	+= p2pdma.o
>  obj-$(CONFIG_XEN_PCIDEV_FRONTEND) += xen-pcifront.o
>  obj-$(CONFIG_VGA_ARB)		+= vgaarb.o
> +obj-$(CONFIG_PCI_DOE)		+= pci-doe.o

Is there value in the "pci-" prefix?  There are a few other files with
the prefix, but I can't remember why.  Most things are of the form
"doe.c".

>  # Endpoint library must be initialized before its users
>  obj-$(CONFIG_PCI_ENDPOINT)	+= endpoint/
> diff --git a/drivers/pci/pci-doe.c b/drivers/pci/pci-doe.c
> new file mode 100644
> index 000000000000..ccf936421d2a
> --- /dev/null
> +++ b/drivers/pci/pci-doe.c
> @@ -0,0 +1,607 @@
> +// SPDX-License-Identifier: GPL-2.0
> +/*
> + * Data Object Exchange
> + * https://members.pcisig.com/wg/PCI-SIG/document/16609

I think "PCIe r6.0, sec 6.30" is more generally useful.

> + * Copyright (C) 2021 Huawei
> + *	Jonathan Cameron <Jonathan.Cameron@huawei.com>
> + *
> + * Copyright (C) 2022 Intel Corporation
> + *	Ira Weiny <ira.weiny@intel.com>
> + */
> +
> +#include <linux/bitfield.h>
> +#include <linux/delay.h>
> +#include <linux/jiffies.h>
> +#include <linux/mutex.h>
> +#include <linux/pci.h>
> +#include <linux/pci-doe.h>
> +#include <linux/workqueue.h>
> +
> +#define PCI_DOE_PROTOCOL_DISCOVERY 0
> +
> +#define PCI_DOE_BUSY_MAX_RETRIES 16
> +#define PCI_DOE_POLL_INTERVAL (HZ / 128)
> +
> +/* Timeout of 1 second from 6.30.2 Operation, PCI Spec v6.0 */

Pedantic: the PCI specs use "r6.0", not "v6.0".  Reverse for ACPI, of
course.  Even worse, PCI uses "Revision" for the major, "Version" for
the minor.

> +#define PCI_DOE_TIMEOUT HZ
> +
> +static irqreturn_t pci_doe_irq_handler(int irq, void *data)
> +{
> +	struct pci_doe_mb *doe_mb = data;
> +	struct pci_dev *pdev = doe_mb->pdev;
> +	int offset = doe_mb->cap_offset;
> +	u32 val;
> +
> +	pci_read_config_dword(pdev, offset + PCI_DOE_STATUS, &val);
> +
> +	/* Leave the error case to be handled outside IRQ */

s/irq/IRQ/ for similar comment and log message uses below.

> +	if (FIELD_GET(PCI_DOE_STATUS_ERROR, val)) {
> +		mod_delayed_work(system_wq, &doe_mb->statemachine, 0);
> +		return IRQ_HANDLED;
> +	}
> +
> +	if (FIELD_GET(PCI_DOE_STATUS_INT_STATUS, val)) {
> +		pci_write_config_dword(pdev, offset + PCI_DOE_STATUS,
> +					PCI_DOE_STATUS_INT_STATUS);
> +		mod_delayed_work(system_wq, &doe_mb->statemachine, 0);
> +		return IRQ_HANDLED;
> +	}
> +
> +	return IRQ_NONE;
> +}
> +
> +/*
> + * Only called when safe to directly access the DOE from
> + * doe_statemachine_work().  Outside access is not protected.  Users who
> + * perform such access are left with the pieces.
> + */
> +static void pci_doe_abort_start(struct pci_doe_mb *doe_mb)
> +{
> +	struct pci_dev *pdev = doe_mb->pdev;
> +	int offset = doe_mb->cap_offset;
> +	u32 val;
> +
> +	val = PCI_DOE_CTRL_ABORT;
> +	if (doe_mb->irq >= 0)
> +		val |= PCI_DOE_CTRL_INT_EN;
> +	pci_write_config_dword(pdev, offset + PCI_DOE_CTRL, val);
> +
> +	doe_mb->timeout_jiffies = jiffies + HZ;
> +	schedule_delayed_work(&doe_mb->statemachine, HZ);
> +}
> +
> +static int pci_doe_send_req(struct pci_doe_mb *doe_mb, struct pci_doe_task *task)

Wrap to fit in 80 columns, like the rest of drivers/pci/*

> +{
> +	struct pci_dev *pdev = doe_mb->pdev;
> +	int offset = doe_mb->cap_offset;
> +	u32 val;
> +	int i;
> +
> +	/*
> +	 * Check the DOE busy bit is not set. If it is set, this could indicate
> +	 * someone other than Linux (e.g. firmware) is using the mailbox. Note
> +	 * it is expected that firmware and OS will negotiate access rights via
> +	 * an, as yet to be defined method.
> +	 */
> +	pci_read_config_dword(pdev, offset + PCI_DOE_STATUS, &val);
> +	if (FIELD_GET(PCI_DOE_STATUS_BUSY, val))
> +		return -EBUSY;
> +
> +	if (FIELD_GET(PCI_DOE_STATUS_ERROR, val))
> +		return -EIO;
> +
> +	/* Write DOE Header */
> +	val = FIELD_PREP(PCI_DOE_DATA_OBJECT_HEADER_1_VID, task->prot.vid) |
> +		FIELD_PREP(PCI_DOE_DATA_OBJECT_HEADER_1_TYPE, task->prot.type);
> +	pci_write_config_dword(pdev, offset + PCI_DOE_WRITE, val);
> +	/* Length is 2 DW of header + length of payload in DW */
> +	pci_write_config_dword(pdev, offset + PCI_DOE_WRITE,
> +			       FIELD_PREP(PCI_DOE_DATA_OBJECT_HEADER_2_LENGTH,
> +					  2 + task->request_pl_sz /
> +						sizeof(u32)));
> +	for (i = 0; i < task->request_pl_sz / sizeof(u32); i++)
> +		pci_write_config_dword(pdev, offset + PCI_DOE_WRITE,
> +				       task->request_pl[i]);
> +
> +	val = PCI_DOE_CTRL_GO;
> +	if (doe_mb->irq >= 0)
> +		val |= PCI_DOE_CTRL_INT_EN;
> +
> +	pci_write_config_dword(pdev, offset + PCI_DOE_CTRL, val);
> +	/* Request is sent - now wait for poll or IRQ */
> +	return 0;
> +}
> +
> +static int pci_doe_recv_resp(struct pci_doe_mb *doe_mb, struct pci_doe_task *task)
> +{
> +	struct pci_dev *pdev = doe_mb->pdev;
> +	int offset = doe_mb->cap_offset;
> +	size_t length;
> +	u32 val;
> +	int i;
> +
> +	/* Read the first dword to get the protocol */
> +	pci_read_config_dword(pdev, offset + PCI_DOE_READ, &val);
> +	if ((FIELD_GET(PCI_DOE_DATA_OBJECT_HEADER_1_VID, val) != task->prot.vid) ||
> +	    (FIELD_GET(PCI_DOE_DATA_OBJECT_HEADER_1_TYPE, val) != task->prot.type)) {
> +		pci_err(pdev,
> +			"Expected [VID, Protocol] = [%#x, %#x], got [%#x, %#x]\n",

Needs a "DOE" or similar hint for where to look when you see this
error.

"%04x" is the typical way we print a PCI vendor ID.  I guess a leading
"0x" isn't a disaster, but I do think it should be 4 hex digits so it
doesn't look like a length or something.

It looks like you anticipate multiple DOE capabilities.  Maybe
messages should include the capability offset to be specific?
I see you print with %u in pci_doe_create_mb().  I'm pretty sure
"lspci -vv" prints them in hex, and it's probably useful to match.

> +			task->prot.vid, task->prot.type,
> +			FIELD_GET(PCI_DOE_DATA_OBJECT_HEADER_1_VID, val),
> +			FIELD_GET(PCI_DOE_DATA_OBJECT_HEADER_1_TYPE, val));
> +		return -EIO;
> +	}
> +
> +	pci_write_config_dword(pdev, offset + PCI_DOE_READ, 0);
> +	/* Read the second dword to get the length */
> +	pci_read_config_dword(pdev, offset + PCI_DOE_READ, &val);
> +	pci_write_config_dword(pdev, offset + PCI_DOE_READ, 0);
> +
> +	length = FIELD_GET(PCI_DOE_DATA_OBJECT_HEADER_2_LENGTH, val);
> +	if (length > SZ_1M || length < 2)
> +		return -EIO;
> +
> +	/* First 2 dwords have already been read */
> +	length -= 2;
> +	/* Read the rest of the response payload */
> +	for (i = 0; i < min(length, task->response_pl_sz / sizeof(u32)); i++) {
> +		pci_read_config_dword(pdev, offset + PCI_DOE_READ,
> +				      &task->response_pl[i]);
> +		pci_write_config_dword(pdev, offset + PCI_DOE_READ, 0);
> +	}
> +
> +	/* Flush excess length */
> +	for (; i < length; i++) {
> +		pci_read_config_dword(pdev, offset + PCI_DOE_READ, &val);
> +		pci_write_config_dword(pdev, offset + PCI_DOE_READ, 0);
> +	}
> +	/* Final error check to pick up on any since Data Object Ready */
> +	pci_read_config_dword(pdev, offset + PCI_DOE_STATUS, &val);
> +	if (FIELD_GET(PCI_DOE_STATUS_ERROR, val))
> +		return -EIO;
> +
> +	return min(length, task->response_pl_sz / sizeof(u32)) * sizeof(u32);
> +}
> +
> +static void signal_task_complete(struct pci_doe_task *task, int rv)
> +{
> +	task->rv = rv;
> +	task->complete(task);
> +}
> +
> +static void retire_cur_task(struct pci_doe_mb *doe_mb)
> +{
> +	mutex_lock(&doe_mb->task_lock);
> +	doe_mb->cur_task = NULL;
> +	mutex_unlock(&doe_mb->task_lock);
> +	wake_up_interruptible(&doe_mb->wq);
> +}
> +
> +static void doe_statemachine_work(struct work_struct *work)
> +{
> +	struct delayed_work *w = to_delayed_work(work);
> +	struct pci_doe_mb *doe_mb = container_of(w, struct pci_doe_mb,
> +						 statemachine);
> +	struct pci_dev *pdev = doe_mb->pdev;
> +	int offset = doe_mb->cap_offset;
> +	struct pci_doe_task *task;
> +	u32 val;
> +	int rc;
> +
> +	mutex_lock(&doe_mb->task_lock);
> +	task = doe_mb->cur_task;
> +	mutex_unlock(&doe_mb->task_lock);
> +
> +	if (test_and_clear_bit(PCI_DOE_FLAG_ABORT, &doe_mb->flags)) {
> +		/*
> +		 * Currently only used during init - care needed if
> +		 * pci_doe_abort() is generally exposed as it would impact
> +		 * queries in flight.
> +		 */
> +		if (task)
> +			pr_err("Aborting with active task!\n");

This should be a dev_err() and probably include "DOE" or some kind of
hint about what this applies to.

> +		doe_mb->state = DOE_WAIT_ABORT;
> +		pci_doe_abort_start(doe_mb);
> +		return;
> +	}
> +
> +	switch (doe_mb->state) {
> +	case DOE_IDLE:
> +		if (task == NULL)
> +			return;
> +
> +		rc = pci_doe_send_req(doe_mb, task);

Add blank line before block comment.

> +		/*
> +		 * The specification does not provide any guidance on how long
> +		 * some other entity could keep the DOE busy, so try for 1
> +		 * second then fail. Busy handling is best effort only, because
> +		 * there is no way of avoiding racing against another user of
> +		 * the DOE.
> +		 */
> +		if (rc == -EBUSY) {
> +			doe_mb->busy_retries++;
> +			if (doe_mb->busy_retries == PCI_DOE_BUSY_MAX_RETRIES) {
> +				/* Long enough, fail this request */
> +				pci_warn(pdev,
> +					"DOE busy for too long (> 1 sec)\n");
> +				doe_mb->busy_retries = 0;
> +				goto err_busy;
> +			}
> +			schedule_delayed_work(w, HZ / PCI_DOE_BUSY_MAX_RETRIES);
> +			return;
> +		}
> +		if (rc)
> +			goto err_abort;
> +		doe_mb->busy_retries = 0;
> +
> +		doe_mb->state = DOE_WAIT_RESP;
> +		doe_mb->timeout_jiffies = jiffies + HZ;
> +		/* Now poll or wait for IRQ with timeout */
> +		if (doe_mb->irq >= 0)
> +			schedule_delayed_work(w, PCI_DOE_TIMEOUT);
> +		else
> +			schedule_delayed_work(w, PCI_DOE_POLL_INTERVAL);
> +		return;
> +
> +	case DOE_WAIT_RESP:
> +		/* Not possible to get here with NULL task */
> +		pci_read_config_dword(pdev, offset + PCI_DOE_STATUS, &val);
> +		if (FIELD_GET(PCI_DOE_STATUS_ERROR, val)) {
> +			rc = -EIO;
> +			goto err_abort;
> +		}
> +
> +		if (!FIELD_GET(PCI_DOE_STATUS_DATA_OBJECT_READY, val)) {
> +			/* If not yet at timeout reschedule otherwise abort */
> +			if (time_after(jiffies, doe_mb->timeout_jiffies)) {
> +				rc = -ETIMEDOUT;
> +				goto err_abort;
> +			}
> +			schedule_delayed_work(w, PCI_DOE_POLL_INTERVAL);
> +			return;
> +		}
> +
> +		rc  = pci_doe_recv_resp(doe_mb, task);
> +		if (rc < 0)
> +			goto err_abort;
> +
> +		doe_mb->state = DOE_IDLE;
> +
> +		retire_cur_task(doe_mb);
> +		/* Set the return value to the length of received payload */
> +		signal_task_complete(task, rc);
> +
> +		return;
> +
> +	case DOE_WAIT_ABORT:
> +	case DOE_WAIT_ABORT_ON_ERR:
> +		pci_read_config_dword(pdev, offset + PCI_DOE_STATUS, &val);
> +
> +		if (!FIELD_GET(PCI_DOE_STATUS_ERROR, val) &&
> +		    !FIELD_GET(PCI_DOE_STATUS_BUSY, val)) {
> +			/* Back to normal state - carry on */
> +			retire_cur_task(doe_mb);
> +
> +			/*
> +			 * For deliberately triggered abort, someone is
> +			 * waiting.
> +			 */
> +			if (doe_mb->state == DOE_WAIT_ABORT) {
> +				if (task)
> +					signal_task_complete(task, -EFAULT);
> +				complete(&doe_mb->abort_c);
> +			}
> +
> +			doe_mb->state = DOE_IDLE;
> +			return;
> +		}
> +		if (time_after(jiffies, doe_mb->timeout_jiffies)) {
> +			/* Task has timed out and is dead - abort */
> +			pci_err(pdev, "DOE ABORT timed out\n");
> +			set_bit(PCI_DOE_FLAG_DEAD, &doe_mb->flags);
> +			retire_cur_task(doe_mb);
> +
> +			if (doe_mb->state == DOE_WAIT_ABORT) {
> +				if (task)
> +					signal_task_complete(task, -EFAULT);
> +				complete(&doe_mb->abort_c);
> +			}
> +		}
> +		return;
> +	}
> +
> +err_abort:
> +	doe_mb->state = DOE_WAIT_ABORT_ON_ERR;
> +	pci_doe_abort_start(doe_mb);
> +err_busy:
> +	signal_task_complete(task, rc);
> +	if (doe_mb->state == DOE_IDLE)
> +		retire_cur_task(doe_mb);
> +}
> +
> +static void pci_doe_task_complete(struct pci_doe_task *task)
> +{
> +	complete(task->private);
> +}
> +
> +static int pci_doe_discovery(struct pci_doe_mb *doe_mb, u8 *index, u16 *vid,
> +			     u8 *protocol)
> +{
> +	u32 request_pl = FIELD_PREP(PCI_DOE_DATA_OBJECT_DISC_REQ_3_INDEX,
> +				    *index);
> +	u32 response_pl;
> +	DECLARE_COMPLETION_ONSTACK(c);
> +	struct pci_doe_task task = {
> +		.prot.vid = PCI_VENDOR_ID_PCI_SIG,
> +		.prot.type = PCI_DOE_PROTOCOL_DISCOVERY,
> +		.request_pl = &request_pl,
> +		.request_pl_sz = sizeof(request_pl),
> +		.response_pl = &response_pl,
> +		.response_pl_sz = sizeof(response_pl),
> +		.complete = pci_doe_task_complete,
> +		.private = &c,
> +	};
> +	int ret;
> +
> +	ret = pci_doe_submit_task(doe_mb, &task);
> +	if (ret < 0)
> +		return ret;
> +
> +	wait_for_completion(&c);
> +
> +	if (task.rv != sizeof(response_pl))
> +		return -EIO;
> +
> +	*vid = FIELD_GET(PCI_DOE_DATA_OBJECT_DISC_RSP_3_VID, response_pl);
> +	*protocol = FIELD_GET(PCI_DOE_DATA_OBJECT_DISC_RSP_3_PROTOCOL,
> +			      response_pl);
> +	*index = FIELD_GET(PCI_DOE_DATA_OBJECT_DISC_RSP_3_NEXT_INDEX,
> +			   response_pl);
> +
> +	return 0;
> +}
> +
> +static int pci_doe_cache_protocols(struct pci_doe_mb *doe_mb)
> +{
> +	u8 index = 0;
> +	int num_prots;
> +	int rc;
> +
> +	/* Discovery protocol must always be supported and must report itself */
> +	num_prots = 1;
> +
> +	doe_mb->prots = kcalloc(num_prots, sizeof(*doe_mb->prots), GFP_KERNEL);
> +	if (!doe_mb->prots)
> +		return -ENOMEM;
> +
> +	/*
> +	 * NOTE: doe_mb_prots is freed by pci_doe_free_mb automatically on

"pci_doe_free_mb()" to match other function name references.

> +	 * error if pci_doe_cache_protocols() fails past this point.
> +	 */
> +	do {
> +		struct pci_doe_protocol *prot;
> +
> +		prot = &doe_mb->prots[num_prots - 1];
> +		rc = pci_doe_discovery(doe_mb, &index, &prot->vid, &prot->type);
> +		if (rc)
> +			return rc;
> +
> +		if (index) {
> +			struct pci_doe_protocol *prot_new;
> +
> +			num_prots++;
> +			prot_new = krealloc(doe_mb->prots,
> +					    sizeof(*doe_mb->prots) * num_prots,
> +					    GFP_KERNEL);
> +			if (!prot_new)
> +				return -ENOMEM;
> +
> +			doe_mb->prots = prot_new;
> +		}
> +	} while (index);
> +
> +	doe_mb->num_prots = num_prots;
> +	return 0;
> +}
> +
> +static int pci_doe_abort(struct pci_doe_mb *doe_mb)
> +{
> +	reinit_completion(&doe_mb->abort_c);
> +	set_bit(PCI_DOE_FLAG_ABORT, &doe_mb->flags);
> +	schedule_delayed_work(&doe_mb->statemachine, 0);
> +	wait_for_completion(&doe_mb->abort_c);
> +
> +	if (test_bit(PCI_DOE_FLAG_DEAD, &doe_mb->flags))
> +		return -EIO;
> +
> +	return 0;
> +}
> +
> +static int pci_doe_request_irq(struct pci_doe_mb *doe_mb)
> +{
> +	struct pci_dev *pdev = doe_mb->pdev;
> +	int offset = doe_mb->cap_offset;
> +	int doe_irq, rc;
> +	u32 val;
> +
> +	pci_read_config_dword(pdev, offset + PCI_DOE_CAP, &val);
> +
> +	if (!FIELD_GET(PCI_DOE_CAP_INT, val))
> +		return -EOPNOTSUPP;
> +
> +	doe_irq = FIELD_GET(PCI_DOE_CAP_IRQ, val);
> +	rc = pci_request_irq(pdev, doe_irq, pci_doe_irq_handler,
> +			     NULL, doe_mb,
> +			     "DOE[%d:%s]", doe_irq, pci_name(pdev));
> +	if (rc)
> +		return rc;
> +
> +	doe_mb->irq = doe_irq;
> +	pci_write_config_dword(pdev, offset + PCI_DOE_CTRL,
> +			       PCI_DOE_CTRL_INT_EN);
> +	return 0;
> +}
> +
> +static void pci_doe_free_mb(struct pci_doe_mb *doe_mb)
> +{
> +	if (doe_mb->irq >= 0)
> +		pci_free_irq(doe_mb->pdev, doe_mb->irq, doe_mb);
> +	kfree(doe_mb->prots);
> +	kfree(doe_mb);
> +}
> +
> +/**
> + * pci_doe_create_mb() - Create a DOE mailbox object
> + *
> + * @pdev: PCI device to create the DOE mailbox for
> + * @cap_offset: Offset of the DOE mailbox
> + * @use_irq: Should the state machine use an irq
> + *
> + * Create a single mailbox object to manage the mailbox protocol at the
> + * cap_offset specified.
> + *
> + * Caller should allocate PCI irq vectors before setting use_irq.
> + *
> + * RETURNS: created mailbox object on success
> + *	    ERR_PTR(-errno) on failure
> + */
> +struct pci_doe_mb *pci_doe_create_mb(struct pci_dev *pdev, u16 cap_offset,
> +				     bool use_irq)
> +{
> +	struct pci_doe_mb *doe_mb;
> +	int rc;
> +
> +	doe_mb = kzalloc(sizeof(*doe_mb), GFP_KERNEL);
> +	if (!doe_mb)
> +		return ERR_PTR(-ENOMEM);
> +
> +	doe_mb->pdev = pdev;
> +	init_completion(&doe_mb->abort_c);
> +	doe_mb->irq = -1;
> +	doe_mb->cap_offset = cap_offset;
> +
> +	init_waitqueue_head(&doe_mb->wq);
> +	mutex_init(&doe_mb->task_lock);
> +	INIT_DELAYED_WORK(&doe_mb->statemachine, doe_statemachine_work);
> +	doe_mb->state = DOE_IDLE;
> +
> +	if (use_irq) {
> +		rc = pci_doe_request_irq(doe_mb);
> +		if (rc)
> +			pci_err(pdev, "DOE request irq failed for mailbox @ %u : %d\n",
> +				cap_offset, rc);
> +	}
> +
> +	/* Reset the mailbox by issuing an abort */
> +	rc = pci_doe_abort(doe_mb);
> +	if (rc) {
> +		pci_err(pdev, "DOE failed to reset the mailbox @ %u : %d\n",
> +			cap_offset, rc);
> +		pci_doe_free_mb(doe_mb);
> +		return ERR_PTR(rc);
> +	}
> +
> +	rc = pci_doe_cache_protocols(doe_mb);
> +	if (rc) {
> +		pci_err(pdev, "DOE failed to cache protocols for mailbox @ %u : %d\n",
> +			cap_offset, rc);
> +		pci_doe_free_mb(doe_mb);
> +		return ERR_PTR(rc);
> +	}
> +
> +	return doe_mb;
> +}
> +EXPORT_SYMBOL_GPL(pci_doe_create_mb);
> +
> +/**
> + * pci_doe_supports_prot() - Return if the DOE instance supports the given
> + *			     protocol
> + * @doe_mb: DOE mailbox capability to query
> + * @vid: Protocol Vendor ID
> + * @type: protocol type
             Protocol
> + *
> + * RETURNS: True if the DOE device supports the protocol specified

I think you typically use "DOE mailbox", not "DOE device".

> + */
> +bool pci_doe_supports_prot(struct pci_doe_mb *doe_mb, u16 vid, u8 type)
> +{
> +	int i;
> +
> +	/* The discovery protocol must always be supported */
> +	if (vid == PCI_VENDOR_ID_PCI_SIG && type == PCI_DOE_PROTOCOL_DISCOVERY)
> +		return true;
> +
> +	for (i = 0; i < doe_mb->num_prots; i++)
> +		if ((doe_mb->prots[i].vid == vid) &&
> +		    (doe_mb->prots[i].type == type))
> +			return true;
> +
> +	return false;
> +}
> +EXPORT_SYMBOL_GPL(pci_doe_supports_prot);
> +
> +/**
> + * pci_doe_submit_task() - Submit a task to be processed by the state machine
> + *
> + * @doe_mb: DOE mailbox capability to submit to
> + * @task: task to be queued
> + *
> + * Submit a DOE task (request/response) to the DOE mailbox to be processed.
> + * Returns upon queueing the task object.  If the queue is full this function
> + * will sleep until there is room in the queue.
> + *
> + * task->complete will be called when the state machine is done processing this
> + * task.
> + *
> + * Excess data will be discarded.
> + *
> + * RETURNS: 0 when task has been successfully queued, -ERRNO on error
> + */
> +int pci_doe_submit_task(struct pci_doe_mb *doe_mb, struct pci_doe_task *task)
> +{
> +	if (!pci_doe_supports_prot(doe_mb, task->prot.vid, task->prot.type))
> +		return -EINVAL;
> +
> +	/* DOE requests must be a whole number of DW */
> +	if (task->request_pl_sz % sizeof(u32))
> +		return -EINVAL;
> +
> +again:
> +	mutex_lock(&doe_mb->task_lock);
> +	if (doe_mb->cur_task) {
> +		mutex_unlock(&doe_mb->task_lock);
> +		wait_event_interruptible(doe_mb->wq, doe_mb->cur_task == NULL);
> +		goto again;
> +	}
> +
> +	if (test_bit(PCI_DOE_FLAG_DEAD, &doe_mb->flags)) {
> +		mutex_unlock(&doe_mb->task_lock);
> +		return -EIO;
> +	}
> +	doe_mb->cur_task = task;
> +	mutex_unlock(&doe_mb->task_lock);
> +	schedule_delayed_work(&doe_mb->statemachine, 0);
> +
> +	return 0;
> +}
> +EXPORT_SYMBOL_GPL(pci_doe_submit_task);
> +
> +/**
> + * pci_doe_destroy_mb() - Destroy a DOE mailbox object created with
> + * pci_doe_create_mb()
> + *
> + * @doe_mb: DOE mailbox capability structure to destroy
> + *
> + * The mailbox becomes invalid and should not be used after this call.
> + */
> +void pci_doe_destroy_mb(struct pci_doe_mb *doe_mb)
> +{
> +	/* abort any work in progress */
> +	pci_doe_abort(doe_mb);
> +
> +	/* halt the state machine */
> +	cancel_delayed_work_sync(&doe_mb->statemachine);
> +
> +	pci_doe_free_mb(doe_mb);
> +}
> +EXPORT_SYMBOL_GPL(pci_doe_destroy_mb);
> diff --git a/include/linux/pci-doe.h b/include/linux/pci-doe.h
> new file mode 100644
> index 000000000000..7e6ebaf9930a
> --- /dev/null
> +++ b/include/linux/pci-doe.h
> @@ -0,0 +1,119 @@
> +/* SPDX-License-Identifier: GPL-2.0 */
> +/*
> + * Data Object Exchange
> + * https://members.pcisig.com/wg/PCI-SIG/document/16609

Ditto.

> + *
> + * Copyright (C) 2021 Huawei
> + *     Jonathan Cameron <Jonathan.Cameron@huawei.com>
> + *
> + * Copyright (C) 2022 Intel Corporation
> + *	Ira Weiny <ira.weiny@intel.com>
> + */
> +
> +#ifndef LINUX_PCI_DOE_H
> +#define LINUX_PCI_DOE_H
> +
> +#include <linux/completion.h>
> +
> +enum pci_doe_state {
> +	DOE_IDLE,
> +	DOE_WAIT_RESP,
> +	DOE_WAIT_ABORT,
> +	DOE_WAIT_ABORT_ON_ERR,
> +};
> +
> +#define PCI_DOE_FLAG_ABORT	0
> +#define PCI_DOE_FLAG_DEAD	1
> +
> +struct pci_doe_protocol {
> +	u16 vid;
> +	u8 type;
> +};
> +
> +/**
> + * struct pci_doe_task - represents a single query/response
> + *
> + * @prot: DOE Protocol
> + * @request_pl: The request payload
> + * @request_pl_sz: Size of the request payload
> + * @response_pl: The response payload
> + * @response_pl_sz: Size of the response payload
> + * @rv: Return value.  Length of received response or error
> + * @complete: Called when task is complete
> + * @private: Private data for the consumer
> + */
> +struct pci_doe_task {
> +	struct pci_doe_protocol prot;
> +	u32 *request_pl;
> +	size_t request_pl_sz;
> +	u32 *response_pl;
> +	size_t response_pl_sz;
> +	int rv;
> +	void (*complete)(struct pci_doe_task *task);
> +	void *private;
> +};
> +
> +/**
> + * struct pci_doe_mb - State for a single DOE mailbox
> + *
> + * This state is used to manage a single DOE mailbox capability.  All fields
> + * should be considered opaque to the consumers and the structure passed into
> + * the helpers below after being created by pci_doe_create_mb()
> + *
> + * @pdev: PCI device this mailbox belongs to
> + * @abort_c: Completion used for initial abort handling
> + * @irq: Interrupt used for signaling DOE ready or abort
> + * @prots: Array of protocols supported on this DOE
> + * @num_prots: Size of prots array
> + * @cap_offset: Capability offset
> + * @wq: Wait queue to wait on if a query is in progress
> + * @cur_task: Current task the state machine is working on
> + * @task_lock: Protect cur_task
> + * @statemachine: Work item for the DOE state machine
> + * @state: Current state of this DOE
> + * @timeout_jiffies: 1 second after GO set
> + * @busy_retries: Count of retry attempts
> + * @flags: Bit array of PCI_DOE_FLAG_* flags
> + *
> + * Note: prots can't be allocated with struct size because the number of
> + * protocols is not known until after this structure is in use.  However, the
> + * single discovery protocol is always required to query for the number of
> + * protocols.
> + */
> +struct pci_doe_mb {
> +	struct pci_dev *pdev;
> +	struct completion abort_c;
> +	int irq;
> +	struct pci_doe_protocol *prots;
> +	int num_prots;
> +	u16 cap_offset;
> +
> +	wait_queue_head_t wq;
> +	struct pci_doe_task *cur_task;
> +	struct mutex task_lock;
> +	struct delayed_work statemachine;
> +	enum pci_doe_state state;
> +	unsigned long timeout_jiffies;
> +	unsigned int busy_retries;
> +	unsigned long flags;
> +};
> +
> +/**
> + * pci_doe_for_each_off - Iterate each DOE capability
> + * @pdev: struct pci_dev to iterate
> + * @off: u16 of config space offset of each mailbox capability found
> + */
> +#define pci_doe_for_each_off(pdev, off) \
> +	for (off = pci_find_next_ext_capability(pdev, off, \
> +					PCI_EXT_CAP_ID_DOE); \
> +		off > 0; \
> +		off = pci_find_next_ext_capability(pdev, off, \
> +					PCI_EXT_CAP_ID_DOE))
> +
> +struct pci_doe_mb *pci_doe_create_mb(struct pci_dev *pdev, u16 cap_offset,
> +				     bool use_irq);
> +void pci_doe_destroy_mb(struct pci_doe_mb *doe_mb);
> +bool pci_doe_supports_prot(struct pci_doe_mb *doe_mb, u16 vid, u8 type);
> +int pci_doe_submit_task(struct pci_doe_mb *doe_mb, struct pci_doe_task *task);
> +
> +#endif
> diff --git a/include/uapi/linux/pci_regs.h b/include/uapi/linux/pci_regs.h
> index bee1a9ed6e66..4e96b45ee36d 100644
> --- a/include/uapi/linux/pci_regs.h
> +++ b/include/uapi/linux/pci_regs.h
> @@ -736,7 +736,8 @@
>  #define PCI_EXT_CAP_ID_DVSEC	0x23	/* Designated Vendor-Specific */
>  #define PCI_EXT_CAP_ID_DLF	0x25	/* Data Link Feature */
>  #define PCI_EXT_CAP_ID_PL_16GT	0x26	/* Physical Layer 16.0 GT/s */
> -#define PCI_EXT_CAP_ID_MAX	PCI_EXT_CAP_ID_PL_16GT
> +#define PCI_EXT_CAP_ID_DOE	0x2E	/* Data Object Exchange */
> +#define PCI_EXT_CAP_ID_MAX	PCI_EXT_CAP_ID_DOE
>  
>  #define PCI_EXT_CAP_DSN_SIZEOF	12
>  #define PCI_EXT_CAP_MCAST_ENDPOINT_SIZEOF 40
> @@ -1102,4 +1103,30 @@
>  #define  PCI_PL_16GT_LE_CTRL_USP_TX_PRESET_MASK		0x000000F0
>  #define  PCI_PL_16GT_LE_CTRL_USP_TX_PRESET_SHIFT	4
>  
> +/* Data Object Exchange */
> +#define PCI_DOE_CAP		0x04    /* DOE Capabilities Register */
> +#define  PCI_DOE_CAP_INT			0x00000001  /* Interrupt Support */
> +#define  PCI_DOE_CAP_IRQ			0x00000ffe  /* Interrupt Message Number */
> +#define PCI_DOE_CTRL		0x08    /* DOE Control Register */
> +#define  PCI_DOE_CTRL_ABORT			0x00000001  /* DOE Abort */
> +#define  PCI_DOE_CTRL_INT_EN			0x00000002  /* DOE Interrupt Enable */
> +#define  PCI_DOE_CTRL_GO			0x80000000  /* DOE Go */
> +#define PCI_DOE_STATUS		0x0c    /* DOE Status Register */
> +#define  PCI_DOE_STATUS_BUSY			0x00000001  /* DOE Busy */
> +#define  PCI_DOE_STATUS_INT_STATUS		0x00000002  /* DOE Interrupt Status */
> +#define  PCI_DOE_STATUS_ERROR			0x00000004  /* DOE Error */
> +#define  PCI_DOE_STATUS_DATA_OBJECT_READY	0x80000000  /* Data Object Ready */
> +#define PCI_DOE_WRITE		0x10    /* DOE Write Data Mailbox Register */
> +#define PCI_DOE_READ		0x14    /* DOE Read Data Mailbox Register */
> +
> +/* DOE Data Object - note not actually registers */
> +#define PCI_DOE_DATA_OBJECT_HEADER_1_VID		0x0000ffff
> +#define PCI_DOE_DATA_OBJECT_HEADER_1_TYPE		0x00ff0000
> +#define PCI_DOE_DATA_OBJECT_HEADER_2_LENGTH		0x0003ffff
> +
> +#define PCI_DOE_DATA_OBJECT_DISC_REQ_3_INDEX		0x000000ff
> +#define PCI_DOE_DATA_OBJECT_DISC_RSP_3_VID		0x0000ffff
> +#define PCI_DOE_DATA_OBJECT_DISC_RSP_3_PROTOCOL		0x00ff0000
> +#define PCI_DOE_DATA_OBJECT_DISC_RSP_3_NEXT_INDEX	0xff000000
> +
>  #endif /* LINUX_PCI_REGS_H */
> -- 
> 2.35.1
>
Ira Weiny May 2, 2022, 5:36 a.m. UTC | #2
On Thu, Apr 28, 2022 at 04:27:23PM -0500, Bjorn Helgaas wrote:
> On Thu, Apr 14, 2022 at 01:32:30PM -0700, ira.weiny@intel.com wrote:
> > From: Jonathan Cameron <Jonathan.Cameron@huawei.com>
> > 
> > Introduced in a PCI v6.0[1], DOE provides a config space based mailbox
>  
>   Introduced in PCIe r6.0, sec 6.30, DOE ...

Fixed.  Sorry I missed the fact that PCI uses 'r'.

> 
> > with standard protocol discovery.  Each mailbox is accessed through a
> > DOE Extended Capability.
> > 
> > Each DOE mailbox must support the DOE discovery protocol in addition to
> > any number of additional protocols.
> > 
> > Define core PCI functionality to manage a single PCI DOE mailbox at a
> > defined config space offset.  Functionality includes creating, supported
> > protocol queries, submit tasks to, and destroying the new state objects.
> > 
> > If interrupts are desired, interrupts vectors should be allocated prior
> > to asking for irq's when creating a mailbox object.
>                 IRQs

Fixed.

> 
> > [1] https://members.pcisig.com/wg/PCI-SIG/document/16609
> 
> The link is only useful for PCI-SIG members, and only as long as the
> SIG maintains this structure.  I think "PCIe r6.0" is sufficient.

Removed.  Thanks.

> 
> > Cc: Christoph Hellwig <hch@infradead.org>
> > Signed-off-by: Jonathan Cameron <Jonathan.Cameron@huawei.com>
> > Co-developed-by: Ira Weiny <ira.weiny@intel.com>
> > Signed-off-by: Ira Weiny <ira.weiny@intel.com>
> 
> Acked-by: Bjorn Helgaas <bhelgaas@google.com>

thanks!

> 
> Minor comments below.
> 

[snip]

> > ---
> >  drivers/pci/Kconfig           |   3 +
> >  drivers/pci/Makefile          |   1 +
> >  drivers/pci/pci-doe.c         | 607 ++++++++++++++++++++++++++++++++++
> >  include/linux/pci-doe.h       | 119 +++++++
> >  include/uapi/linux/pci_regs.h |  29 +-
> >  5 files changed, 758 insertions(+), 1 deletion(-)
> >  create mode 100644 drivers/pci/pci-doe.c
> >  create mode 100644 include/linux/pci-doe.h
> > 
> > diff --git a/drivers/pci/Kconfig b/drivers/pci/Kconfig
> > index 133c73207782..b2f2e588a817 100644
> > --- a/drivers/pci/Kconfig
> > +++ b/drivers/pci/Kconfig
> > @@ -121,6 +121,9 @@ config XEN_PCIDEV_FRONTEND
> >  config PCI_ATS
> >  	bool
> >  
> > +config PCI_DOE
> > +	bool
> > +
> >  config PCI_ECAM
> >  	bool
> >  
> > diff --git a/drivers/pci/Makefile b/drivers/pci/Makefile
> > index 0da6b1ebc694..b609d8ad813d 100644
> > --- a/drivers/pci/Makefile
> > +++ b/drivers/pci/Makefile
> > @@ -31,6 +31,7 @@ obj-$(CONFIG_PCI_ECAM)		+= ecam.o
> >  obj-$(CONFIG_PCI_P2PDMA)	+= p2pdma.o
> >  obj-$(CONFIG_XEN_PCIDEV_FRONTEND) += xen-pcifront.o
> >  obj-$(CONFIG_VGA_ARB)		+= vgaarb.o
> > +obj-$(CONFIG_PCI_DOE)		+= pci-doe.o
> 
> Is there value in the "pci-" prefix?

Not really.

> There are a few other files with
> the prefix, but I can't remember why.  Most things are of the form
> "doe.c".

Because of the mixed bag I was not sure which was preferred.  I chose poorly.
I'll remove the prefix.

> 
> >  # Endpoint library must be initialized before its users
> >  obj-$(CONFIG_PCI_ENDPOINT)	+= endpoint/
> > diff --git a/drivers/pci/pci-doe.c b/drivers/pci/pci-doe.c
> > new file mode 100644
> > index 000000000000..ccf936421d2a
> > --- /dev/null
> > +++ b/drivers/pci/pci-doe.c
> > @@ -0,0 +1,607 @@
> > +// SPDX-License-Identifier: GPL-2.0
> > +/*
> > + * Data Object Exchange
> > + * https://members.pcisig.com/wg/PCI-SIG/document/16609
> 
> I think "PCIe r6.0, sec 6.30" is more generally useful.

Done.

> 
> > + * Copyright (C) 2021 Huawei
> > + *	Jonathan Cameron <Jonathan.Cameron@huawei.com>
> > + *
> > + * Copyright (C) 2022 Intel Corporation
> > + *	Ira Weiny <ira.weiny@intel.com>
> > + */
> > +
> > +#include <linux/bitfield.h>
> > +#include <linux/delay.h>
> > +#include <linux/jiffies.h>
> > +#include <linux/mutex.h>
> > +#include <linux/pci.h>
> > +#include <linux/pci-doe.h>
> > +#include <linux/workqueue.h>
> > +
> > +#define PCI_DOE_PROTOCOL_DISCOVERY 0
> > +
> > +#define PCI_DOE_BUSY_MAX_RETRIES 16
> > +#define PCI_DOE_POLL_INTERVAL (HZ / 128)
> > +
> > +/* Timeout of 1 second from 6.30.2 Operation, PCI Spec v6.0 */
> 
> Pedantic: the PCI specs use "r6.0", not "v6.0".  Reverse for ACPI, of
> course.  Even worse, PCI uses "Revision" for the major, "Version" for
> the minor.

Thanks for letting me know.

> 
> > +#define PCI_DOE_TIMEOUT HZ
> > +
> > +static irqreturn_t pci_doe_irq_handler(int irq, void *data)
> > +{
> > +	struct pci_doe_mb *doe_mb = data;
> > +	struct pci_dev *pdev = doe_mb->pdev;
> > +	int offset = doe_mb->cap_offset;
> > +	u32 val;
> > +
> > +	pci_read_config_dword(pdev, offset + PCI_DOE_STATUS, &val);
> > +
> > +	/* Leave the error case to be handled outside IRQ */
> 
> s/irq/IRQ/ for similar comment and log message uses below.

Done.

> 
> > +	if (FIELD_GET(PCI_DOE_STATUS_ERROR, val)) {
> > +		mod_delayed_work(system_wq, &doe_mb->statemachine, 0);
> > +		return IRQ_HANDLED;
> > +	}
> > +
> > +	if (FIELD_GET(PCI_DOE_STATUS_INT_STATUS, val)) {
> > +		pci_write_config_dword(pdev, offset + PCI_DOE_STATUS,
> > +					PCI_DOE_STATUS_INT_STATUS);
> > +		mod_delayed_work(system_wq, &doe_mb->statemachine, 0);
> > +		return IRQ_HANDLED;
> > +	}
> > +
> > +	return IRQ_NONE;
> > +}
> > +
> > +/*
> > + * Only called when safe to directly access the DOE from
> > + * doe_statemachine_work().  Outside access is not protected.  Users who
> > + * perform such access are left with the pieces.
> > + */
> > +static void pci_doe_abort_start(struct pci_doe_mb *doe_mb)
> > +{
> > +	struct pci_dev *pdev = doe_mb->pdev;
> > +	int offset = doe_mb->cap_offset;
> > +	u32 val;
> > +
> > +	val = PCI_DOE_CTRL_ABORT;
> > +	if (doe_mb->irq >= 0)
> > +		val |= PCI_DOE_CTRL_INT_EN;
> > +	pci_write_config_dword(pdev, offset + PCI_DOE_CTRL, val);
> > +
> > +	doe_mb->timeout_jiffies = jiffies + HZ;
> > +	schedule_delayed_work(&doe_mb->statemachine, HZ);
> > +}
> > +
> > +static int pci_doe_send_req(struct pci_doe_mb *doe_mb, struct pci_doe_task *task)
> 
> Wrap to fit in 80 columns, like the rest of drivers/pci/*

Done.

> 
> > +{
> > +	struct pci_dev *pdev = doe_mb->pdev;
> > +	int offset = doe_mb->cap_offset;
> > +	u32 val;
> > +	int i;
> > +
> > +	/*
> > +	 * Check the DOE busy bit is not set. If it is set, this could indicate
> > +	 * someone other than Linux (e.g. firmware) is using the mailbox. Note
> > +	 * it is expected that firmware and OS will negotiate access rights via
> > +	 * an, as yet to be defined method.
> > +	 */
> > +	pci_read_config_dword(pdev, offset + PCI_DOE_STATUS, &val);
> > +	if (FIELD_GET(PCI_DOE_STATUS_BUSY, val))
> > +		return -EBUSY;
> > +
> > +	if (FIELD_GET(PCI_DOE_STATUS_ERROR, val))
> > +		return -EIO;
> > +
> > +	/* Write DOE Header */
> > +	val = FIELD_PREP(PCI_DOE_DATA_OBJECT_HEADER_1_VID, task->prot.vid) |
> > +		FIELD_PREP(PCI_DOE_DATA_OBJECT_HEADER_1_TYPE, task->prot.type);
> > +	pci_write_config_dword(pdev, offset + PCI_DOE_WRITE, val);
> > +	/* Length is 2 DW of header + length of payload in DW */
> > +	pci_write_config_dword(pdev, offset + PCI_DOE_WRITE,
> > +			       FIELD_PREP(PCI_DOE_DATA_OBJECT_HEADER_2_LENGTH,
> > +					  2 + task->request_pl_sz /
> > +						sizeof(u32)));
> > +	for (i = 0; i < task->request_pl_sz / sizeof(u32); i++)
> > +		pci_write_config_dword(pdev, offset + PCI_DOE_WRITE,
> > +				       task->request_pl[i]);
> > +
> > +	val = PCI_DOE_CTRL_GO;
> > +	if (doe_mb->irq >= 0)
> > +		val |= PCI_DOE_CTRL_INT_EN;
> > +
> > +	pci_write_config_dword(pdev, offset + PCI_DOE_CTRL, val);
> > +	/* Request is sent - now wait for poll or IRQ */
> > +	return 0;
> > +}
> > +
> > +static int pci_doe_recv_resp(struct pci_doe_mb *doe_mb, struct pci_doe_task *task)
> > +{
> > +	struct pci_dev *pdev = doe_mb->pdev;
> > +	int offset = doe_mb->cap_offset;
> > +	size_t length;
> > +	u32 val;
> > +	int i;
> > +
> > +	/* Read the first dword to get the protocol */
> > +	pci_read_config_dword(pdev, offset + PCI_DOE_READ, &val);
> > +	if ((FIELD_GET(PCI_DOE_DATA_OBJECT_HEADER_1_VID, val) != task->prot.vid) ||
> > +	    (FIELD_GET(PCI_DOE_DATA_OBJECT_HEADER_1_TYPE, val) != task->prot.type)) {
> > +		pci_err(pdev,
> > +			"Expected [VID, Protocol] = [%#x, %#x], got [%#x, %#x]\n",
> 
> Needs a "DOE" or similar hint for where to look when you see this
> error.
> 
> "%04x" is the typical way we print a PCI vendor ID.  I guess a leading
> "0x" isn't a disaster, but I do think it should be 4 hex digits so it
> doesn't look like a length or something.

I'll match it up.  Using '%04x' for vendor ID and '%02x' for type.

> 
> It looks like you anticipate multiple DOE capabilities.  Maybe
> messages should include the capability offset to be specific?

Great idea, I've added 'DOE [%x]' as a prefix to all messages.

> I see you print with %u in pci_doe_create_mb().  I'm pretty sure
> "lspci -vv" prints them in hex, and it's probably useful to match.

Yes matching is good.  lspci does print in hex and without the 0x.

> 
> > +			task->prot.vid, task->prot.type,
> > +			FIELD_GET(PCI_DOE_DATA_OBJECT_HEADER_1_VID, val),
> > +			FIELD_GET(PCI_DOE_DATA_OBJECT_HEADER_1_TYPE, val));
> > +		return -EIO;
> > +	}
> > +
> > +	pci_write_config_dword(pdev, offset + PCI_DOE_READ, 0);
> > +	/* Read the second dword to get the length */
> > +	pci_read_config_dword(pdev, offset + PCI_DOE_READ, &val);
> > +	pci_write_config_dword(pdev, offset + PCI_DOE_READ, 0);
> > +
> > +	length = FIELD_GET(PCI_DOE_DATA_OBJECT_HEADER_2_LENGTH, val);
> > +	if (length > SZ_1M || length < 2)
> > +		return -EIO;
> > +
> > +	/* First 2 dwords have already been read */
> > +	length -= 2;
> > +	/* Read the rest of the response payload */
> > +	for (i = 0; i < min(length, task->response_pl_sz / sizeof(u32)); i++) {
> > +		pci_read_config_dword(pdev, offset + PCI_DOE_READ,
> > +				      &task->response_pl[i]);
> > +		pci_write_config_dword(pdev, offset + PCI_DOE_READ, 0);
> > +	}
> > +
> > +	/* Flush excess length */
> > +	for (; i < length; i++) {
> > +		pci_read_config_dword(pdev, offset + PCI_DOE_READ, &val);
> > +		pci_write_config_dword(pdev, offset + PCI_DOE_READ, 0);
> > +	}
> > +	/* Final error check to pick up on any since Data Object Ready */
> > +	pci_read_config_dword(pdev, offset + PCI_DOE_STATUS, &val);
> > +	if (FIELD_GET(PCI_DOE_STATUS_ERROR, val))
> > +		return -EIO;
> > +
> > +	return min(length, task->response_pl_sz / sizeof(u32)) * sizeof(u32);
> > +}
> > +
> > +static void signal_task_complete(struct pci_doe_task *task, int rv)
> > +{
> > +	task->rv = rv;
> > +	task->complete(task);
> > +}
> > +
> > +static void retire_cur_task(struct pci_doe_mb *doe_mb)
> > +{
> > +	mutex_lock(&doe_mb->task_lock);
> > +	doe_mb->cur_task = NULL;
> > +	mutex_unlock(&doe_mb->task_lock);
> > +	wake_up_interruptible(&doe_mb->wq);
> > +}
> > +
> > +static void doe_statemachine_work(struct work_struct *work)
> > +{
> > +	struct delayed_work *w = to_delayed_work(work);
> > +	struct pci_doe_mb *doe_mb = container_of(w, struct pci_doe_mb,
> > +						 statemachine);
> > +	struct pci_dev *pdev = doe_mb->pdev;
> > +	int offset = doe_mb->cap_offset;
> > +	struct pci_doe_task *task;
> > +	u32 val;
> > +	int rc;
> > +
> > +	mutex_lock(&doe_mb->task_lock);
> > +	task = doe_mb->cur_task;
> > +	mutex_unlock(&doe_mb->task_lock);
> > +
> > +	if (test_and_clear_bit(PCI_DOE_FLAG_ABORT, &doe_mb->flags)) {
> > +		/*
> > +		 * Currently only used during init - care needed if
> > +		 * pci_doe_abort() is generally exposed as it would impact
> > +		 * queries in flight.
> > +		 */
> > +		if (task)
> > +			pr_err("Aborting with active task!\n");
> 
> This should be a dev_err() and probably include "DOE" or some kind of
> hint about what this applies to.

I think you mean pci_err()?

I've updated all the pci_*() messages with a 'DOE @ [offset]:' prefix.

> 
> > +		doe_mb->state = DOE_WAIT_ABORT;
> > +		pci_doe_abort_start(doe_mb);
> > +		return;
> > +	}
> > +
> > +	switch (doe_mb->state) {
> > +	case DOE_IDLE:
> > +		if (task == NULL)
> > +			return;
> > +
> > +		rc = pci_doe_send_req(doe_mb, task);
> 
> Add blank line before block comment.

done

> 
> > +		/*
> > +		 * The specification does not provide any guidance on how long
> > +		 * some other entity could keep the DOE busy, so try for 1
> > +		 * second then fail. Busy handling is best effort only, because
> > +		 * there is no way of avoiding racing against another user of
> > +		 * the DOE.
> > +		 */
> > +		if (rc == -EBUSY) {
> > +			doe_mb->busy_retries++;
> > +			if (doe_mb->busy_retries == PCI_DOE_BUSY_MAX_RETRIES) {
> > +				/* Long enough, fail this request */
> > +				pci_warn(pdev,
> > +					"DOE busy for too long (> 1 sec)\n");
> > +				doe_mb->busy_retries = 0;
> > +				goto err_busy;
> > +			}
> > +			schedule_delayed_work(w, HZ / PCI_DOE_BUSY_MAX_RETRIES);
> > +			return;
> > +		}
> > +		if (rc)
> > +			goto err_abort;
> > +		doe_mb->busy_retries = 0;
> > +
> > +		doe_mb->state = DOE_WAIT_RESP;
> > +		doe_mb->timeout_jiffies = jiffies + HZ;
> > +		/* Now poll or wait for IRQ with timeout */
> > +		if (doe_mb->irq >= 0)
> > +			schedule_delayed_work(w, PCI_DOE_TIMEOUT);
> > +		else
> > +			schedule_delayed_work(w, PCI_DOE_POLL_INTERVAL);
> > +		return;
> > +
> > +	case DOE_WAIT_RESP:
> > +		/* Not possible to get here with NULL task */
> > +		pci_read_config_dword(pdev, offset + PCI_DOE_STATUS, &val);
> > +		if (FIELD_GET(PCI_DOE_STATUS_ERROR, val)) {
> > +			rc = -EIO;
> > +			goto err_abort;
> > +		}
> > +
> > +		if (!FIELD_GET(PCI_DOE_STATUS_DATA_OBJECT_READY, val)) {
> > +			/* If not yet at timeout reschedule otherwise abort */
> > +			if (time_after(jiffies, doe_mb->timeout_jiffies)) {
> > +				rc = -ETIMEDOUT;
> > +				goto err_abort;
> > +			}
> > +			schedule_delayed_work(w, PCI_DOE_POLL_INTERVAL);
> > +			return;
> > +		}
> > +
> > +		rc  = pci_doe_recv_resp(doe_mb, task);
> > +		if (rc < 0)
> > +			goto err_abort;
> > +
> > +		doe_mb->state = DOE_IDLE;
> > +
> > +		retire_cur_task(doe_mb);
> > +		/* Set the return value to the length of received payload */
> > +		signal_task_complete(task, rc);
> > +
> > +		return;
> > +
> > +	case DOE_WAIT_ABORT:
> > +	case DOE_WAIT_ABORT_ON_ERR:
> > +		pci_read_config_dword(pdev, offset + PCI_DOE_STATUS, &val);
> > +
> > +		if (!FIELD_GET(PCI_DOE_STATUS_ERROR, val) &&
> > +		    !FIELD_GET(PCI_DOE_STATUS_BUSY, val)) {
> > +			/* Back to normal state - carry on */
> > +			retire_cur_task(doe_mb);
> > +
> > +			/*
> > +			 * For deliberately triggered abort, someone is
> > +			 * waiting.
> > +			 */
> > +			if (doe_mb->state == DOE_WAIT_ABORT) {
> > +				if (task)
> > +					signal_task_complete(task, -EFAULT);
> > +				complete(&doe_mb->abort_c);
> > +			}
> > +
> > +			doe_mb->state = DOE_IDLE;
> > +			return;
> > +		}
> > +		if (time_after(jiffies, doe_mb->timeout_jiffies)) {
> > +			/* Task has timed out and is dead - abort */
> > +			pci_err(pdev, "DOE ABORT timed out\n");
> > +			set_bit(PCI_DOE_FLAG_DEAD, &doe_mb->flags);
> > +			retire_cur_task(doe_mb);
> > +
> > +			if (doe_mb->state == DOE_WAIT_ABORT) {
> > +				if (task)
> > +					signal_task_complete(task, -EFAULT);
> > +				complete(&doe_mb->abort_c);
> > +			}
> > +		}
> > +		return;
> > +	}
> > +
> > +err_abort:
> > +	doe_mb->state = DOE_WAIT_ABORT_ON_ERR;
> > +	pci_doe_abort_start(doe_mb);
> > +err_busy:
> > +	signal_task_complete(task, rc);
> > +	if (doe_mb->state == DOE_IDLE)
> > +		retire_cur_task(doe_mb);
> > +}
> > +
> > +static void pci_doe_task_complete(struct pci_doe_task *task)
> > +{
> > +	complete(task->private);
> > +}
> > +
> > +static int pci_doe_discovery(struct pci_doe_mb *doe_mb, u8 *index, u16 *vid,
> > +			     u8 *protocol)
> > +{
> > +	u32 request_pl = FIELD_PREP(PCI_DOE_DATA_OBJECT_DISC_REQ_3_INDEX,
> > +				    *index);
> > +	u32 response_pl;
> > +	DECLARE_COMPLETION_ONSTACK(c);
> > +	struct pci_doe_task task = {
> > +		.prot.vid = PCI_VENDOR_ID_PCI_SIG,
> > +		.prot.type = PCI_DOE_PROTOCOL_DISCOVERY,
> > +		.request_pl = &request_pl,
> > +		.request_pl_sz = sizeof(request_pl),
> > +		.response_pl = &response_pl,
> > +		.response_pl_sz = sizeof(response_pl),
> > +		.complete = pci_doe_task_complete,
> > +		.private = &c,
> > +	};
> > +	int ret;
> > +
> > +	ret = pci_doe_submit_task(doe_mb, &task);
> > +	if (ret < 0)
> > +		return ret;
> > +
> > +	wait_for_completion(&c);
> > +
> > +	if (task.rv != sizeof(response_pl))
> > +		return -EIO;
> > +
> > +	*vid = FIELD_GET(PCI_DOE_DATA_OBJECT_DISC_RSP_3_VID, response_pl);
> > +	*protocol = FIELD_GET(PCI_DOE_DATA_OBJECT_DISC_RSP_3_PROTOCOL,
> > +			      response_pl);
> > +	*index = FIELD_GET(PCI_DOE_DATA_OBJECT_DISC_RSP_3_NEXT_INDEX,
> > +			   response_pl);
> > +
> > +	return 0;
> > +}
> > +
> > +static int pci_doe_cache_protocols(struct pci_doe_mb *doe_mb)
> > +{
> > +	u8 index = 0;
> > +	int num_prots;
> > +	int rc;
> > +
> > +	/* Discovery protocol must always be supported and must report itself */
> > +	num_prots = 1;
> > +
> > +	doe_mb->prots = kcalloc(num_prots, sizeof(*doe_mb->prots), GFP_KERNEL);
> > +	if (!doe_mb->prots)
> > +		return -ENOMEM;
> > +
> > +	/*
> > +	 * NOTE: doe_mb_prots is freed by pci_doe_free_mb automatically on
> 
> "pci_doe_free_mb()" to match other function name references.

oops...  done.

> 
> > +	 * error if pci_doe_cache_protocols() fails past this point.
> > +	 */
> > +	do {
> > +		struct pci_doe_protocol *prot;
> > +
> > +		prot = &doe_mb->prots[num_prots - 1];
> > +		rc = pci_doe_discovery(doe_mb, &index, &prot->vid, &prot->type);
> > +		if (rc)
> > +			return rc;
> > +
> > +		if (index) {
> > +			struct pci_doe_protocol *prot_new;
> > +
> > +			num_prots++;
> > +			prot_new = krealloc(doe_mb->prots,
> > +					    sizeof(*doe_mb->prots) * num_prots,
> > +					    GFP_KERNEL);
> > +			if (!prot_new)
> > +				return -ENOMEM;
> > +
> > +			doe_mb->prots = prot_new;
> > +		}
> > +	} while (index);
> > +
> > +	doe_mb->num_prots = num_prots;
> > +	return 0;
> > +}
> > +
> > +static int pci_doe_abort(struct pci_doe_mb *doe_mb)
> > +{
> > +	reinit_completion(&doe_mb->abort_c);
> > +	set_bit(PCI_DOE_FLAG_ABORT, &doe_mb->flags);
> > +	schedule_delayed_work(&doe_mb->statemachine, 0);
> > +	wait_for_completion(&doe_mb->abort_c);
> > +
> > +	if (test_bit(PCI_DOE_FLAG_DEAD, &doe_mb->flags))
> > +		return -EIO;
> > +
> > +	return 0;
> > +}
> > +
> > +static int pci_doe_request_irq(struct pci_doe_mb *doe_mb)
> > +{
> > +	struct pci_dev *pdev = doe_mb->pdev;
> > +	int offset = doe_mb->cap_offset;
> > +	int doe_irq, rc;
> > +	u32 val;
> > +
> > +	pci_read_config_dword(pdev, offset + PCI_DOE_CAP, &val);
> > +
> > +	if (!FIELD_GET(PCI_DOE_CAP_INT, val))
> > +		return -EOPNOTSUPP;
> > +
> > +	doe_irq = FIELD_GET(PCI_DOE_CAP_IRQ, val);
> > +	rc = pci_request_irq(pdev, doe_irq, pci_doe_irq_handler,
> > +			     NULL, doe_mb,
> > +			     "DOE[%d:%s]", doe_irq, pci_name(pdev));
> > +	if (rc)
> > +		return rc;
> > +
> > +	doe_mb->irq = doe_irq;
> > +	pci_write_config_dword(pdev, offset + PCI_DOE_CTRL,
> > +			       PCI_DOE_CTRL_INT_EN);
> > +	return 0;
> > +}
> > +
> > +static void pci_doe_free_mb(struct pci_doe_mb *doe_mb)
> > +{
> > +	if (doe_mb->irq >= 0)
> > +		pci_free_irq(doe_mb->pdev, doe_mb->irq, doe_mb);
> > +	kfree(doe_mb->prots);
> > +	kfree(doe_mb);
> > +}
> > +
> > +/**
> > + * pci_doe_create_mb() - Create a DOE mailbox object
> > + *
> > + * @pdev: PCI device to create the DOE mailbox for
> > + * @cap_offset: Offset of the DOE mailbox
> > + * @use_irq: Should the state machine use an irq
> > + *
> > + * Create a single mailbox object to manage the mailbox protocol at the
> > + * cap_offset specified.
> > + *
> > + * Caller should allocate PCI irq vectors before setting use_irq.
> > + *
> > + * RETURNS: created mailbox object on success
> > + *	    ERR_PTR(-errno) on failure
> > + */
> > +struct pci_doe_mb *pci_doe_create_mb(struct pci_dev *pdev, u16 cap_offset,
> > +				     bool use_irq)
> > +{
> > +	struct pci_doe_mb *doe_mb;
> > +	int rc;
> > +
> > +	doe_mb = kzalloc(sizeof(*doe_mb), GFP_KERNEL);
> > +	if (!doe_mb)
> > +		return ERR_PTR(-ENOMEM);
> > +
> > +	doe_mb->pdev = pdev;
> > +	init_completion(&doe_mb->abort_c);
> > +	doe_mb->irq = -1;
> > +	doe_mb->cap_offset = cap_offset;
> > +
> > +	init_waitqueue_head(&doe_mb->wq);
> > +	mutex_init(&doe_mb->task_lock);
> > +	INIT_DELAYED_WORK(&doe_mb->statemachine, doe_statemachine_work);
> > +	doe_mb->state = DOE_IDLE;
> > +
> > +	if (use_irq) {
> > +		rc = pci_doe_request_irq(doe_mb);
> > +		if (rc)
> > +			pci_err(pdev, "DOE request irq failed for mailbox @ %u : %d\n",

This and below are changed to print:

"DOE [%x] ..." rather than printing the offset at the end.

This makes these consistent with the other errors per your feedback.

It also made these messages more concise.  :-D

> > +				cap_offset, rc);
> > +	}
> > +
> > +	/* Reset the mailbox by issuing an abort */
> > +	rc = pci_doe_abort(doe_mb);
> > +	if (rc) {
> > +		pci_err(pdev, "DOE failed to reset the mailbox @ %u : %d\n",
> > +			cap_offset, rc);
> > +		pci_doe_free_mb(doe_mb);
> > +		return ERR_PTR(rc);
> > +	}
> > +
> > +	rc = pci_doe_cache_protocols(doe_mb);
> > +	if (rc) {
> > +		pci_err(pdev, "DOE failed to cache protocols for mailbox @ %u : %d\n",
> > +			cap_offset, rc);
> > +		pci_doe_free_mb(doe_mb);
> > +		return ERR_PTR(rc);
> > +	}
> > +
> > +	return doe_mb;
> > +}
> > +EXPORT_SYMBOL_GPL(pci_doe_create_mb);
> > +
> > +/**
> > + * pci_doe_supports_prot() - Return if the DOE instance supports the given
> > + *			     protocol
> > + * @doe_mb: DOE mailbox capability to query
> > + * @vid: Protocol Vendor ID
> > + * @type: protocol type
>              Protocol

Done.

> > + *
> > + * RETURNS: True if the DOE device supports the protocol specified
> 
> I think you typically use "DOE mailbox", not "DOE device".

Yep.  That was left over from the Auxiliary device stuff.  Fixed.

> 
> > + */
> > +bool pci_doe_supports_prot(struct pci_doe_mb *doe_mb, u16 vid, u8 type)
> > +{
> > +	int i;
> > +
> > +	/* The discovery protocol must always be supported */
> > +	if (vid == PCI_VENDOR_ID_PCI_SIG && type == PCI_DOE_PROTOCOL_DISCOVERY)
> > +		return true;
> > +
> > +	for (i = 0; i < doe_mb->num_prots; i++)
> > +		if ((doe_mb->prots[i].vid == vid) &&
> > +		    (doe_mb->prots[i].type == type))
> > +			return true;
> > +
> > +	return false;
> > +}
> > +EXPORT_SYMBOL_GPL(pci_doe_supports_prot);
> > +
> > +/**
> > + * pci_doe_submit_task() - Submit a task to be processed by the state machine
> > + *
> > + * @doe_mb: DOE mailbox capability to submit to
> > + * @task: task to be queued
> > + *
> > + * Submit a DOE task (request/response) to the DOE mailbox to be processed.
> > + * Returns upon queueing the task object.  If the queue is full this function
> > + * will sleep until there is room in the queue.
> > + *
> > + * task->complete will be called when the state machine is done processing this
> > + * task.
> > + *
> > + * Excess data will be discarded.
> > + *
> > + * RETURNS: 0 when task has been successfully queued, -ERRNO on error
> > + */
> > +int pci_doe_submit_task(struct pci_doe_mb *doe_mb, struct pci_doe_task *task)
> > +{
> > +	if (!pci_doe_supports_prot(doe_mb, task->prot.vid, task->prot.type))
> > +		return -EINVAL;
> > +
> > +	/* DOE requests must be a whole number of DW */
> > +	if (task->request_pl_sz % sizeof(u32))
> > +		return -EINVAL;
> > +
> > +again:
> > +	mutex_lock(&doe_mb->task_lock);
> > +	if (doe_mb->cur_task) {
> > +		mutex_unlock(&doe_mb->task_lock);
> > +		wait_event_interruptible(doe_mb->wq, doe_mb->cur_task == NULL);
> > +		goto again;
> > +	}
> > +
> > +	if (test_bit(PCI_DOE_FLAG_DEAD, &doe_mb->flags)) {
> > +		mutex_unlock(&doe_mb->task_lock);
> > +		return -EIO;
> > +	}
> > +	doe_mb->cur_task = task;
> > +	mutex_unlock(&doe_mb->task_lock);
> > +	schedule_delayed_work(&doe_mb->statemachine, 0);
> > +
> > +	return 0;
> > +}
> > +EXPORT_SYMBOL_GPL(pci_doe_submit_task);
> > +
> > +/**
> > + * pci_doe_destroy_mb() - Destroy a DOE mailbox object created with
> > + * pci_doe_create_mb()
> > + *
> > + * @doe_mb: DOE mailbox capability structure to destroy
> > + *
> > + * The mailbox becomes invalid and should not be used after this call.
> > + */
> > +void pci_doe_destroy_mb(struct pci_doe_mb *doe_mb)
> > +{
> > +	/* abort any work in progress */
> > +	pci_doe_abort(doe_mb);
> > +
> > +	/* halt the state machine */
> > +	cancel_delayed_work_sync(&doe_mb->statemachine);
> > +
> > +	pci_doe_free_mb(doe_mb);
> > +}
> > +EXPORT_SYMBOL_GPL(pci_doe_destroy_mb);
> > diff --git a/include/linux/pci-doe.h b/include/linux/pci-doe.h
> > new file mode 100644
> > index 000000000000..7e6ebaf9930a
> > --- /dev/null
> > +++ b/include/linux/pci-doe.h
> > @@ -0,0 +1,119 @@
> > +/* SPDX-License-Identifier: GPL-2.0 */
> > +/*
> > + * Data Object Exchange
> > + * https://members.pcisig.com/wg/PCI-SIG/document/16609
> 
> Ditto.

Done.

Thanks for the in depth review,
Ira

> 
> > + *
> > + * Copyright (C) 2021 Huawei
> > + *     Jonathan Cameron <Jonathan.Cameron@huawei.com>
> > + *
> > + * Copyright (C) 2022 Intel Corporation
> > + *	Ira Weiny <ira.weiny@intel.com>
> > + */
> > +
> > +#ifndef LINUX_PCI_DOE_H
> > +#define LINUX_PCI_DOE_H
> > +
> > +#include <linux/completion.h>
> > +
> > +enum pci_doe_state {
> > +	DOE_IDLE,
> > +	DOE_WAIT_RESP,
> > +	DOE_WAIT_ABORT,
> > +	DOE_WAIT_ABORT_ON_ERR,
> > +};
> > +
> > +#define PCI_DOE_FLAG_ABORT	0
> > +#define PCI_DOE_FLAG_DEAD	1
> > +
> > +struct pci_doe_protocol {
> > +	u16 vid;
> > +	u8 type;
> > +};
> > +
> > +/**
> > + * struct pci_doe_task - represents a single query/response
> > + *
> > + * @prot: DOE Protocol
> > + * @request_pl: The request payload
> > + * @request_pl_sz: Size of the request payload
> > + * @response_pl: The response payload
> > + * @response_pl_sz: Size of the response payload
> > + * @rv: Return value.  Length of received response or error
> > + * @complete: Called when task is complete
> > + * @private: Private data for the consumer
> > + */
> > +struct pci_doe_task {
> > +	struct pci_doe_protocol prot;
> > +	u32 *request_pl;
> > +	size_t request_pl_sz;
> > +	u32 *response_pl;
> > +	size_t response_pl_sz;
> > +	int rv;
> > +	void (*complete)(struct pci_doe_task *task);
> > +	void *private;
> > +};
> > +
> > +/**
> > + * struct pci_doe_mb - State for a single DOE mailbox
> > + *
> > + * This state is used to manage a single DOE mailbox capability.  All fields
> > + * should be considered opaque to the consumers and the structure passed into
> > + * the helpers below after being created by pci_doe_create_mb()
> > + *
> > + * @pdev: PCI device this mailbox belongs to
> > + * @abort_c: Completion used for initial abort handling
> > + * @irq: Interrupt used for signaling DOE ready or abort
> > + * @prots: Array of protocols supported on this DOE
> > + * @num_prots: Size of prots array
> > + * @cap_offset: Capability offset
> > + * @wq: Wait queue to wait on if a query is in progress
> > + * @cur_task: Current task the state machine is working on
> > + * @task_lock: Protect cur_task
> > + * @statemachine: Work item for the DOE state machine
> > + * @state: Current state of this DOE
> > + * @timeout_jiffies: 1 second after GO set
> > + * @busy_retries: Count of retry attempts
> > + * @flags: Bit array of PCI_DOE_FLAG_* flags
> > + *
> > + * Note: prots can't be allocated with struct size because the number of
> > + * protocols is not known until after this structure is in use.  However, the
> > + * single discovery protocol is always required to query for the number of
> > + * protocols.
> > + */
> > +struct pci_doe_mb {
> > +	struct pci_dev *pdev;
> > +	struct completion abort_c;
> > +	int irq;
> > +	struct pci_doe_protocol *prots;
> > +	int num_prots;
> > +	u16 cap_offset;
> > +
> > +	wait_queue_head_t wq;
> > +	struct pci_doe_task *cur_task;
> > +	struct mutex task_lock;
> > +	struct delayed_work statemachine;
> > +	enum pci_doe_state state;
> > +	unsigned long timeout_jiffies;
> > +	unsigned int busy_retries;
> > +	unsigned long flags;
> > +};
> > +
> > +/**
> > + * pci_doe_for_each_off - Iterate each DOE capability
> > + * @pdev: struct pci_dev to iterate
> > + * @off: u16 of config space offset of each mailbox capability found
> > + */
> > +#define pci_doe_for_each_off(pdev, off) \
> > +	for (off = pci_find_next_ext_capability(pdev, off, \
> > +					PCI_EXT_CAP_ID_DOE); \
> > +		off > 0; \
> > +		off = pci_find_next_ext_capability(pdev, off, \
> > +					PCI_EXT_CAP_ID_DOE))
> > +
> > +struct pci_doe_mb *pci_doe_create_mb(struct pci_dev *pdev, u16 cap_offset,
> > +				     bool use_irq);
> > +void pci_doe_destroy_mb(struct pci_doe_mb *doe_mb);
> > +bool pci_doe_supports_prot(struct pci_doe_mb *doe_mb, u16 vid, u8 type);
> > +int pci_doe_submit_task(struct pci_doe_mb *doe_mb, struct pci_doe_task *task);
> > +
> > +#endif
> > diff --git a/include/uapi/linux/pci_regs.h b/include/uapi/linux/pci_regs.h
> > index bee1a9ed6e66..4e96b45ee36d 100644
> > --- a/include/uapi/linux/pci_regs.h
> > +++ b/include/uapi/linux/pci_regs.h
> > @@ -736,7 +736,8 @@
> >  #define PCI_EXT_CAP_ID_DVSEC	0x23	/* Designated Vendor-Specific */
> >  #define PCI_EXT_CAP_ID_DLF	0x25	/* Data Link Feature */
> >  #define PCI_EXT_CAP_ID_PL_16GT	0x26	/* Physical Layer 16.0 GT/s */
> > -#define PCI_EXT_CAP_ID_MAX	PCI_EXT_CAP_ID_PL_16GT
> > +#define PCI_EXT_CAP_ID_DOE	0x2E	/* Data Object Exchange */
> > +#define PCI_EXT_CAP_ID_MAX	PCI_EXT_CAP_ID_DOE
> >  
> >  #define PCI_EXT_CAP_DSN_SIZEOF	12
> >  #define PCI_EXT_CAP_MCAST_ENDPOINT_SIZEOF 40
> > @@ -1102,4 +1103,30 @@
> >  #define  PCI_PL_16GT_LE_CTRL_USP_TX_PRESET_MASK		0x000000F0
> >  #define  PCI_PL_16GT_LE_CTRL_USP_TX_PRESET_SHIFT	4
> >  
> > +/* Data Object Exchange */
> > +#define PCI_DOE_CAP		0x04    /* DOE Capabilities Register */
> > +#define  PCI_DOE_CAP_INT			0x00000001  /* Interrupt Support */
> > +#define  PCI_DOE_CAP_IRQ			0x00000ffe  /* Interrupt Message Number */
> > +#define PCI_DOE_CTRL		0x08    /* DOE Control Register */
> > +#define  PCI_DOE_CTRL_ABORT			0x00000001  /* DOE Abort */
> > +#define  PCI_DOE_CTRL_INT_EN			0x00000002  /* DOE Interrupt Enable */
> > +#define  PCI_DOE_CTRL_GO			0x80000000  /* DOE Go */
> > +#define PCI_DOE_STATUS		0x0c    /* DOE Status Register */
> > +#define  PCI_DOE_STATUS_BUSY			0x00000001  /* DOE Busy */
> > +#define  PCI_DOE_STATUS_INT_STATUS		0x00000002  /* DOE Interrupt Status */
> > +#define  PCI_DOE_STATUS_ERROR			0x00000004  /* DOE Error */
> > +#define  PCI_DOE_STATUS_DATA_OBJECT_READY	0x80000000  /* Data Object Ready */
> > +#define PCI_DOE_WRITE		0x10    /* DOE Write Data Mailbox Register */
> > +#define PCI_DOE_READ		0x14    /* DOE Read Data Mailbox Register */
> > +
> > +/* DOE Data Object - note not actually registers */
> > +#define PCI_DOE_DATA_OBJECT_HEADER_1_VID		0x0000ffff
> > +#define PCI_DOE_DATA_OBJECT_HEADER_1_TYPE		0x00ff0000
> > +#define PCI_DOE_DATA_OBJECT_HEADER_2_LENGTH		0x0003ffff
> > +
> > +#define PCI_DOE_DATA_OBJECT_DISC_REQ_3_INDEX		0x000000ff
> > +#define PCI_DOE_DATA_OBJECT_DISC_RSP_3_VID		0x0000ffff
> > +#define PCI_DOE_DATA_OBJECT_DISC_RSP_3_PROTOCOL		0x00ff0000
> > +#define PCI_DOE_DATA_OBJECT_DISC_RSP_3_NEXT_INDEX	0xff000000
> > +
> >  #endif /* LINUX_PCI_REGS_H */
> > -- 
> > 2.35.1
> >
Lukas Wunner May 30, 2022, 7:06 p.m. UTC | #3
On Thu, Apr 14, 2022 at 01:32:30PM -0700, ira.weiny@intel.com wrote:
> +static int pci_doe_recv_resp(struct pci_doe_mb *doe_mb, struct pci_doe_task *task)
[...]
> +	pci_write_config_dword(pdev, offset + PCI_DOE_READ, 0);
> +	/* Read the second dword to get the length */
> +	pci_read_config_dword(pdev, offset + PCI_DOE_READ, &val);
> +	pci_write_config_dword(pdev, offset + PCI_DOE_READ, 0);
> +
> +	length = FIELD_GET(PCI_DOE_DATA_OBJECT_HEADER_2_LENGTH, val);
> +	if (length > SZ_1M || length < 2)
> +		return -EIO;
> +

This API requires consumers to know the size of the response in advance.
That's not always the case, as exemplified by SPDM VERSION responses.
Jonathan uses a kludge in his SPDM patch which submits a first request
with a fixed size of 2 versions and, if that turns out to be too small,
submits a second request.

It would be nice if consumers are allowed to set request_pl to NULL.
Then a payload can be allocated here in pci_doe_recv_resp() with the
size retrieved above.

A flag may be necessary to indicate that the response is heap-allocated
and needs to be freed upon destruction of the pci_doe_task.
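For illustration, a user-space sketch of what that allocation path could look like (all names are hypothetical; the mask mirrors the 18-bit length field in the second header dword):

```c
#include <assert.h>
#include <stdint.h>
#include <stdlib.h>

#define HDR2_LENGTH_MASK 0x0003ffffu	/* length field, in dwords */

/*
 * Hypothetical helper: when the caller supplied no response buffer,
 * size one from the length field of the second header dword instead.
 * Returns NULL if the advertised length is nonsensical.
 */
static uint32_t *doe_alloc_response(uint32_t hdr2, size_t *ndw)
{
	size_t length = hdr2 & HDR2_LENGTH_MASK;

	if (length < 2)		/* the header alone is already two dwords */
		return NULL;
	*ndw = length;
	return calloc(length, sizeof(uint32_t));
}
```

The caller that passed request_pl == NULL would then know the response buffer is heap-allocated and free it when destroying the task.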


> +	/* First 2 dwords have already been read */
> +	length -= 2;
> +	/* Read the rest of the response payload */
> +	for (i = 0; i < min(length, task->response_pl_sz / sizeof(u32)); i++) {
> +		pci_read_config_dword(pdev, offset + PCI_DOE_READ,
> +				      &task->response_pl[i]);
> +		pci_write_config_dword(pdev, offset + PCI_DOE_READ, 0);
> +	}

You need to check the Data Object Ready bit.  The device may clear the
bit prematurely (e.g. as a result of a concurrent FLR or Conventional
Reset).  You'll continue reading zero dwords from the mailbox and
pretend success to the caller even though the response is truncated.

If you're concerned about performance when checking the bit on every
loop iteration, checking it only on the last but one iteration should
be sufficient to detect truncation.
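For illustration, a user-space sketch of the read loop with the suggested check folded in (mocked mailbox, hypothetical names; the real code would re-read the status register via pci_read_config_dword()):

```c
#include <assert.h>
#include <stddef.h>
#include <stdint.h>

#define DOE_STATUS_READY 0x80000000u

/* Mocked mailbox: a status word plus a fifo of queued response dwords. */
struct mock_doe {
	uint32_t status;
	const uint32_t *fifo;
	size_t remaining;
};

static uint32_t mock_read_mbox(struct mock_doe *mb)
{
	if (!mb->remaining)
		return 0;	/* past the object, the device returns zeros */
	if (!--mb->remaining)
		mb->status &= ~DOE_STATUS_READY;	/* object consumed */
	return *mb->fifo++;
}

/*
 * Read up to n payload dwords, checking Data Object Ready before each
 * read; return the count read, or -1 if READY dropped early (i.e. the
 * response was truncated, e.g. by a concurrent FLR).
 */
static int doe_read_payload(struct mock_doe *mb, uint32_t *buf, size_t n)
{
	for (size_t i = 0; i < n; i++) {
		if (!(mb->status & DOE_STATUS_READY))
			return -1;
		buf[i] = mock_read_mbox(mb);
	}
	return (int)n;
}
```

Checking only on the last-but-one iteration, as suggested, would avoid the extra config read per dword while still catching truncation.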


> +	*vid = FIELD_GET(PCI_DOE_DATA_OBJECT_DISC_RSP_3_VID, response_pl);
> +	*protocol = FIELD_GET(PCI_DOE_DATA_OBJECT_DISC_RSP_3_PROTOCOL,
> +			      response_pl);
> +	*index = FIELD_GET(PCI_DOE_DATA_OBJECT_DISC_RSP_3_NEXT_INDEX,
> +			   response_pl);

The fact that you need line breaks here is an indication that the
macros are too long.


> +/* DOE Data Object - note not actually registers */
> +#define PCI_DOE_DATA_OBJECT_HEADER_1_VID		0x0000ffff
> +#define PCI_DOE_DATA_OBJECT_HEADER_1_TYPE		0x00ff0000
> +#define PCI_DOE_DATA_OBJECT_HEADER_2_LENGTH		0x0003ffff
> +
> +#define PCI_DOE_DATA_OBJECT_DISC_REQ_3_INDEX		0x000000ff
> +#define PCI_DOE_DATA_OBJECT_DISC_RSP_3_VID		0x0000ffff
> +#define PCI_DOE_DATA_OBJECT_DISC_RSP_3_PROTOCOL		0x00ff0000
> +#define PCI_DOE_DATA_OBJECT_DISC_RSP_3_NEXT_INDEX	0xff000000

I'd get rid of "DATA_OBJECT_" everywhere, it's redundant with the
"Data Object" in "DOE".


> +#define  PCI_DOE_STATUS_INT_STATUS		0x00000002  /* DOE Interrupt Status */

Another redundancy, I would get rid of the second "_STATUS".


> +#define  PCI_DOE_STATUS_DATA_OBJECT_READY	0x80000000  /* Data Object Ready */

I would shorten to PCI_DOE_STATUS_READY.


> 		Simplify submitting work to the mailbox
> 			Replace pci_doe_exchange_sync() with
> 			pci_doe_submit_task() Consumers of the mailbox
> 			are now responsible for setting up callbacks
> 			within a task object and submitting them to the
> 			mailbox to be processed.

I honestly think that's a mistake.  All consumers both in the CDAT
as well as in the SPDM series just want to wait for completion of
the task.  They have no need for an arbitrary callback and shouldn't
be burdened with providing one.  It just unnecessarily complicates
the API.

So only providing pci_doe_exchange_sync() and doing away with
pci_doe_submit_task() would seem like a more appropriate approach.
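A minimal sketch of how such a wrapper could sit on top of the task API (hypothetical names; in the kernel the busy-wait would be a struct completion and wait_for_completion()):

```c
#include <assert.h>
#include <stddef.h>

/* Hypothetical task: shape loosely follows pci_doe_task. */
struct doe_task {
	int rv;
	void (*complete)(struct doe_task *task);
	void *private;
};

static void doe_sync_complete(struct doe_task *task)
{
	*(int *)task->private = 1;	/* complete(&c) in the kernel */
}

/* Stand-in for the mailbox state machine; here it completes inline. */
static int doe_submit_task(struct doe_task *task)
{
	task->rv = 0;		/* pretend the exchange succeeded */
	task->complete(task);
	return 0;
}

/*
 * The convenience wrapper being argued for: hide the callback entirely
 * and just block until the task finishes.
 */
static int doe_exchange_sync(struct doe_task *task)
{
	int done = 0;

	task->complete = doe_sync_complete;
	task->private = &done;
	if (doe_submit_task(task))
		return -1;
	while (!done)
		;		/* wait_for_completion() in the kernel */
	return task->rv;
}
```

Providing only the synchronous form keeps the callback machinery private to the library; whether the async form is also exported is exactly the API question under discussion here.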


> +/**
> + * pci_doe_for_each_off - Iterate each DOE capability
> + * @pdev: struct pci_dev to iterate
> + * @off: u16 of config space offset of each mailbox capability found
> + */
> +#define pci_doe_for_each_off(pdev, off) \
> +	for (off = pci_find_next_ext_capability(pdev, off, \
> +					PCI_EXT_CAP_ID_DOE); \
> +		off > 0; \
> +		off = pci_find_next_ext_capability(pdev, off, \
> +					PCI_EXT_CAP_ID_DOE))
> +
> +struct pci_doe_mb *pci_doe_create_mb(struct pci_dev *pdev, u16 cap_offset,
> +				     bool use_irq);
> +void pci_doe_destroy_mb(struct pci_doe_mb *doe_mb);
> +bool pci_doe_supports_prot(struct pci_doe_mb *doe_mb, u16 vid, u8 type);

Drivers should not be concerned with the intricacies of DOE
capabilities and mailboxes.

Moreover, the above API doesn't allow different drivers to access
the same DOE mailbox concurrently, e.g. if that mailbox supports
multiple protocols.  There's no locking to serialize access to the
mailbox by the drivers.

This should be moved to the PCI core instead:  In pci_init_capabilities(),
add a new call pci_doe_init() which enumerates all DOE capabilities.
Add a list_head to struct pci_dev and add each DOE instance found
to that list.  Destroy the list elements in pci_destroy_dev().
No locking needed for the list_head, you only ever modify the list
on device enumeration and destruction.

Then provide a pci_doe_find_mailbox() library function which drivers
call to retrieve the pci_doe_mb for a given pci_dev/vid/type tuple.
That avoids the need to traverse the list time and again.
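A user-space sketch of the proposed lookup (hypothetical names and values; the CXL vendor ID / CDAT type pair is just an example consumer):

```c
#include <assert.h>
#include <stddef.h>
#include <stdint.h>

/* Hypothetical per-device list; names only loosely follow the patch. */
struct doe_mb {
	uint16_t vid;
	uint8_t type;
	struct doe_mb *next;
};

struct pci_dev_sketch {
	struct doe_mb *doe_list;	/* filled in by pci_doe_init() */
};

/* The proposed lookup: pci_dev/vid/type tuple -> mailbox, or NULL. */
static struct doe_mb *doe_find_mailbox(struct pci_dev_sketch *pdev,
				       uint16_t vid, uint8_t type)
{
	struct doe_mb *mb;

	for (mb = pdev->doe_list; mb; mb = mb->next)
		if (mb->vid == vid && mb->type == type)
			return mb;
	return NULL;
}
```

Serialization of concurrent users of one mailbox would then live inside the library behind this lookup, not in each driver.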


> +/**
> + * struct pci_doe_mb - State for a single DOE mailbox

We generally use the same terms as the spec to make it easier for
readers to connect the language in the spec to the implementation.

The spec uniformly refers to "DOE instance".  I guess "mailbox"
is slightly more concise, so keep that, but please at least mention
the term "instance" in the struct's kernel-doc.

This implementation uses the term "task" for one request/response.
That term is not mentioned in the spec at all.  The spec refers to
"exchange" and "transfer" on several occasions, so I would have chosen
either one of those instead of the somewhat unfortunate "task".


> + * This state is used to manage a single DOE mailbox capability.  All fields
> + * should be considered opaque to the consumers and the structure passed into
> + * the helpers below after being created by devm_pci_doe_create()

If the struct is considered opaque, why is it exposed in a public
header file?  Just use a forward declaration in the header
so that consumers can pass around pointers to the struct,
and hide the declaration proper in doe.c.
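To make the suggestion concrete, a compressed sketch of the split (both halves shown in one snippet here; names hypothetical):

```c
#include <assert.h>
#include <stdlib.h>

/* --- what would live in the public header: forward declaration only --- */
struct doe_mb;				/* opaque to consumers */
struct doe_mb *doe_create(void);
int doe_num_prots(struct doe_mb *mb);

/* --- what would live in doe.c: the real, private definition --- */
struct doe_mb {
	int num_prots;
};

struct doe_mb *doe_create(void)
{
	struct doe_mb *mb = calloc(1, sizeof(*mb));

	if (mb)
		mb->num_prots = 1;	/* discovery is always supported */
	return mb;
}

int doe_num_prots(struct doe_mb *mb)
{
	return mb->num_prots;
}
```

Consumers can then only pass the pointer around and call accessors; poking at the fields directly becomes a compile error rather than a documentation request.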


> + * @pdev: PCI device this belongs to mailbox belongs to
                             ^^^^^^^^^^
Typo.

> + * @prots: Array of protocols supported on this DOE
> + * @num_prots: Size of prots array

Use @prots instead of prots everywhere in the kernel-doc.


> +	/*
> +	 * NOTE: doe_mb_prots is freed by pci_doe_free_mb automatically on
> +	 * error if pci_doe_cache_protocols() fails past this point.
> +	 */

s/doe_mb_prots/doe_mb->prots/
s/pci_doe_free_mb/pci_doe_free_mb()/


> +	/* DOE requests must be a whole number of DW */
> +	if (task->request_pl_sz % sizeof(u32))
> +		return -EINVAL;

It would be nice if this restriction could be lifted.  SPDM uses
requests which are not padded to dword-length.  It can run over other
transports which may not impose such restrictions.  The SPDM layer
should not need to worry about quirks of the transport layer.


> +static irqreturn_t pci_doe_irq_handler(int irq, void *data)
> +{
> +	struct pci_doe_mb *doe_mb = data;
> +	struct pci_dev *pdev = doe_mb->pdev;
> +	int offset = doe_mb->cap_offset;
> +	u32 val;
> +
> +	pci_read_config_dword(pdev, offset + PCI_DOE_STATUS, &val);
> +
> +	/* Leave the error case to be handled outside IRQ */
> +	if (FIELD_GET(PCI_DOE_STATUS_ERROR, val)) {
> +		mod_delayed_work(system_wq, &doe_mb->statemachine, 0);
> +		return IRQ_HANDLED;
> +	}
> +
> +	if (FIELD_GET(PCI_DOE_STATUS_INT_STATUS, val)) {
> +		pci_write_config_dword(pdev, offset + PCI_DOE_STATUS,
> +					PCI_DOE_STATUS_INT_STATUS);
> +		mod_delayed_work(system_wq, &doe_mb->statemachine, 0);
> +		return IRQ_HANDLED;
> +	}
> +
> +	return IRQ_NONE;
> +}

PCIe 6.0, table 7-316 says that an interrupt is also raised when
"the DOE Busy bit has been Cleared", yet such an interrupt is
not handled here.  It is incorrectly treated as a spurious
interrupt by returning IRQ_NONE.  The right thing to do
is probably to wake the state machine in case it's polling
for the Busy flag to clear.
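A sketch of the handler decision with that fix folded in (pure logic, mocked status bits, hypothetical names):

```c
#include <assert.h>
#include <stdint.h>

#define ST_BUSY	 0x00000001u
#define ST_INT	 0x00000002u
#define ST_ERROR 0x00000004u

enum irq_ret { IRQ_NONE, IRQ_HANDLED };

/*
 * Hypothetical handler logic: error, latched interrupt status, or a
 * cleared Busy bit all wake the state machine; only "Busy still set
 * and nothing latched" is treated as spurious.  (The real handler
 * would also write ST_INT back to ack it.)
 */
static enum irq_ret doe_irq(uint32_t status, int *wake_statemachine)
{
	if ((status & (ST_ERROR | ST_INT)) || !(status & ST_BUSY)) {
		*wake_statemachine = 1;	/* mod_delayed_work() in the kernel */
		return IRQ_HANDLED;
	}
	return IRQ_NONE;
}
```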


> +enum pci_doe_state {
> +	DOE_IDLE,
> +	DOE_WAIT_RESP,
> +	DOE_WAIT_ABORT,
> +	DOE_WAIT_ABORT_ON_ERR,
> +};
> +
> +#define PCI_DOE_FLAG_ABORT	0
> +#define PCI_DOE_FLAG_DEAD	1

That's all internal and should live in doe.c, not the header file.

Thanks,

Lukas
Jonathan Cameron May 31, 2022, 10:33 a.m. UTC | #4
On Mon, 30 May 2022 21:06:57 +0200
Lukas Wunner <lukas@wunner.de> wrote:

> On Thu, Apr 14, 2022 at 01:32:30PM -0700, ira.weiny@intel.com wrote:
> > +static int pci_doe_recv_resp(struct pci_doe_mb *doe_mb, struct pci_doe_task *task)  
> [...]
> > +	pci_write_config_dword(pdev, offset + PCI_DOE_READ, 0);
> > +	/* Read the second dword to get the length */
> > +	pci_read_config_dword(pdev, offset + PCI_DOE_READ, &val);
> > +	pci_write_config_dword(pdev, offset + PCI_DOE_READ, 0);
> > +
> > +	length = FIELD_GET(PCI_DOE_DATA_OBJECT_HEADER_2_LENGTH, val);
> > +	if (length > SZ_1M || length < 2)
> > +		return -EIO;
> > +  
> 
> This API requires consumers to know the size of the response in advance.
> That's not always the case, as exemplified by SPDM VERSION responses.
> Jonathan uses a kludge in his SPDM patch which submits a first request
> with a fixed size of 2 versions and, if that turns out to be too small,
> submits a second request.
> 
> It would be nice if consumers are allowed to set request_pl to NULL.
> Then a payload can be allocated here in pci_doe_recv_resp() with the
> size retrieved above.
> 
> A flag may be necessary to indicate that the response is heap-allocated
> and needs to be freed upon destruction of the pci_doe_task.

If possible I'd make that the caller's problem. It should know it provided
NULL and expected a response.

I'd suggest making this a 'future' feature just to keep this initial
version simple.  Won't be hard to add later.

> 
> 
> > +	/* First 2 dwords have already been read */
> > +	length -= 2;
> > +	/* Read the rest of the response payload */
> > +	for (i = 0; i < min(length, task->response_pl_sz / sizeof(u32)); i++) {
> > +		pci_read_config_dword(pdev, offset + PCI_DOE_READ,
> > +				      &task->response_pl[i]);
> > +		pci_write_config_dword(pdev, offset + PCI_DOE_READ, 0);
> > +	}  
> 
> You need to check the Data Object Ready bit.  The device may clear the
> bit prematurely (e.g. as a result of a concurrent FLR or Conventional
> Reset).  You'll continue reading zero dwords from the mailbox and
> pretend success to the caller even though the response is truncated.
> 
> If you're concerned about performance when checking the bit on every
> loop iteration, checking it only on the last but one iteration should
> be sufficient to detect truncation.

Good catch - I hate corner cases.  Thankfully this one is trivial to
check for.

> 
> 
> > +	*vid = FIELD_GET(PCI_DOE_DATA_OBJECT_DISC_RSP_3_VID, response_pl);
> > +	*protocol = FIELD_GET(PCI_DOE_DATA_OBJECT_DISC_RSP_3_PROTOCOL,
> > +			      response_pl);
> > +	*index = FIELD_GET(PCI_DOE_DATA_OBJECT_DISC_RSP_3_NEXT_INDEX,
> > +			   response_pl);  
> 
> The fact that you need line breaks here is an indication that the
> macros are too long.
> 
> 
> > +/* DOE Data Object - note not actually registers */
> > +#define PCI_DOE_DATA_OBJECT_HEADER_1_VID		0x0000ffff
> > +#define PCI_DOE_DATA_OBJECT_HEADER_1_TYPE		0x00ff0000
> > +#define PCI_DOE_DATA_OBJECT_HEADER_2_LENGTH		0x0003ffff
> > +
> > +#define PCI_DOE_DATA_OBJECT_DISC_REQ_3_INDEX		0x000000ff
> > +#define PCI_DOE_DATA_OBJECT_DISC_RSP_3_VID		0x0000ffff
> > +#define PCI_DOE_DATA_OBJECT_DISC_RSP_3_PROTOCOL		0x00ff0000
> > +#define PCI_DOE_DATA_OBJECT_DISC_RSP_3_NEXT_INDEX	0xff000000  
> 
> I'd get rid of "DATA_OBJECT_" everywhere, it's redundant with the
> "Data Object" in "DOE".

We need something to make it clear these aren't DVSEC headers for
example, but definitely could be shorter.

PCI_DOE_OBJ_... 

perhaps?

> 
> 
> > +#define  PCI_DOE_STATUS_INT_STATUS		0x00000002  /* DOE Interrupt Status */  
> 
> Another redundancy, I would get rid of the second "_STATUS".

Hmm. I'll leave this one to Ira's discretion, but I'm not keen on cropping
the naming too much, given the loss of alignment with the spec.

> 
> 
> > +#define  PCI_DOE_STATUS_DATA_OBJECT_READY	0x80000000  /* Data Object Ready */  
> 
> I would shorten to PCI_DOE_STATUS_READY.

Seems reasonable, though there is always some cost in moving
away from the full spec names.

> 
> 
> > 		Simplify submitting work to the mailbox
> > 			Replace pci_doe_exchange_sync() with
> > 			pci_doe_submit_task() Consumers of the mailbox
> > 			are now responsible for setting up callbacks
> > 			within a task object and submitting them to the
> > 			mailbox to be processed.  
> 
> I honestly think that's a mistake.  All consumers both in the CDAT
> as well as in the SPDM series just want to wait for completion of
> the task.  They have no need for an arbitrary callback and shouldn't
> be burdened with providing one.  It just unnecessarily complicates
> the API.
> 
> So only providing pci_doe_exchange_sync() and doing away with
> pci_doe_submit_task() would seem like a more appropriate approach.

We've gone around the houses with this.  At this stage I don't
care strongly either way and it's all in kernel code so we
can refine it later once we have lots of examples.

> 
> 
> > +/**
> > + * pci_doe_for_each_off - Iterate each DOE capability
> > + * @pdev: struct pci_dev to iterate
> > + * @off: u16 of config space offset of each mailbox capability found
> > + */
> > +#define pci_doe_for_each_off(pdev, off) \
> > +	for (off = pci_find_next_ext_capability(pdev, off, \
> > +					PCI_EXT_CAP_ID_DOE); \
> > +		off > 0; \
> > +		off = pci_find_next_ext_capability(pdev, off, \
> > +					PCI_EXT_CAP_ID_DOE))
> > +
> > +struct pci_doe_mb *pci_doe_create_mb(struct pci_dev *pdev, u16 cap_offset,
> > +				     bool use_irq);
> > +void pci_doe_destroy_mb(struct pci_doe_mb *doe_mb);
> > +bool pci_doe_supports_prot(struct pci_doe_mb *doe_mb, u16 vid, u8 type);  
> 
> Drivers should not be concerned with the intricacies of DOE
> capabilities and mailboxes.
> 
> Moreover, the above API doesn't allow different drivers to access
> the same DOE mailbox concurrently, e.g. if that mailbox supports
> multiple protocols.  There's no locking to serialize access to the
> mailbox by the drivers.
> 
> This should be moved to the PCI core instead:  In pci_init_capabilities(),
> add a new call pci_doe_init() which enumerates all DOE capabilities.
> Add a list_head to struct pci_dev and add each DOE instance found
> to that list.  Destroy the list elements in pci_destroy_dev().
> No locking needed for the list_head, you only ever modify the list
> on device enumeration and destruction.
> 
> Then provide a pci_doe_find_mailbox() library function which drivers
> call to retrieve the pci_doe_mb for a given pci_dev/vid/type tuple.
> That avoids the need to traverse the list time and again.

I've lost track a bit as this thread has been running a long time, but
I think we are moving back to something that looks more like this
(and the much earlier versions of the set, which did more or less what you
describe, though they had their own issues, resolved in the meantime).

There is an exciting corner around interrupts, though, that complicates
things.  Currently I think I'm right in saying the PCI core never
sets up interrupts; that is always a job for the drivers.

My first thought is to do something similar to the hack I did for
the switch ports, and use polled mode to query the supported protocols.
Then a later call from the driver would attempt to enable interrupts,
if desired, for a given DOE instance.


> 
> 
> > +/**
> > + * struct pci_doe_mb - State for a single DOE mailbox  
> 
> We generally use the same terms as the spec to make it easier for
> readers to connect the language in the spec to the implementation.
> 
> The spec uniformly refers to "DOE instance".  I guess "mailbox"
> is slightly more concise, so keep that, but please at least mention
> the term "instance" in the struct's kernel-doc.
> 
> This implementation uses the term "task" for one request/response.
> That term is not mentioned in the spec at all.  The spec refers to
> "exchange" and "transfer" on several occasions, so I would have chosen
> either one of those instead of the somewhat unfortunate "task".
> 
> 
> > + * This state is used to manage a single DOE mailbox capability.  All fields
> > + * should be considered opaque to the consumers and the structure passed into
> > + * the helpers below after being created by devm_pci_doe_create()  
> 
> If the struct is considered opaque, why is it exposed in a public
> header file?  Just use a forward declaration in the header
> so that consumers can pass around pointers to the struct,
> and hide the declaration proper in doe.c.
> 
> 
> > + * @pdev: PCI device this belongs to mailbox belongs to  
>                              ^^^^^^^^^^
> Typo.
> 
> > + * @prots: Array of protocols supported on this DOE
> > + * @num_prots: Size of prots array  
> 
> Use @prots instead of prots everywhere in the kernel-doc.
> 
> 
> > +	/*
> > +	 * NOTE: doe_mb_prots is freed by pci_doe_free_mb automatically on
> > +	 * error if pci_doe_cache_protocols() fails past this point.
> > +	 */  
> 
> s/doe_mb_prots/doe_mb->prots/
> s/pci_doe_free_mb/pci_doe_free_mb()/
> 
> 
> > +	/* DOE requests must be a whole number of DW */
> > +	if (task->request_pl_sz % sizeof(u32))
> > +		return -EINVAL;  
> 
> It would be nice if this restriction could be lifted.  SPDM uses
> requests which are not padded to dword-length.  It can run over other
> transports which may not impose such restrictions.  The SPDM layer
> should not need to worry about quirks of the transport layer.

This is a pain.  DOE absolutely requires 32-bit padding, and
reads need to be 32 bit.  We can obviously do something nasty
to deal with the tail via a bounce buffer, though.

I'd pencil that in as a future feature though rather than worry
about it before we have any support at all in place.
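For the record, the bounce-buffer approach is small; a user-space sketch (hypothetical name, caller frees the buffer):

```c
#include <assert.h>
#include <stdint.h>
#include <stdlib.h>
#include <string.h>

/*
 * Hypothetical bounce-buffer helper: copy an arbitrary-length request
 * into a dword-sized buffer, zero-padding the tail, so the mailbox
 * itself only ever deals in whole dwords.  Returns the dword count and
 * fills *out with a buffer the caller must free, or 0 on failure.
 */
static size_t doe_bounce_request(const void *req, size_t len, uint32_t **out)
{
	size_t ndw = (len + sizeof(uint32_t) - 1) / sizeof(uint32_t);
	uint32_t *buf = calloc(ndw, sizeof(uint32_t));

	if (!buf)
		return 0;
	memcpy(buf, req, len);	/* bytes beyond len stay zero */
	*out = buf;
	return ndw;
}
```

The open question would be the response direction, where the last partial dword has to be copied out without overrunning the caller's buffer.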

> 
> 
> > +static irqreturn_t pci_doe_irq_handler(int irq, void *data)
> > +{
> > +	struct pci_doe_mb *doe_mb = data;
> > +	struct pci_dev *pdev = doe_mb->pdev;
> > +	int offset = doe_mb->cap_offset;
> > +	u32 val;
> > +
> > +	pci_read_config_dword(pdev, offset + PCI_DOE_STATUS, &val);
> > +
> > +	/* Leave the error case to be handled outside IRQ */
> > +	if (FIELD_GET(PCI_DOE_STATUS_ERROR, val)) {
> > +		mod_delayed_work(system_wq, &doe_mb->statemachine, 0);
> > +		return IRQ_HANDLED;
> > +	}
> > +
> > +	if (FIELD_GET(PCI_DOE_STATUS_INT_STATUS, val)) {
> > +		pci_write_config_dword(pdev, offset + PCI_DOE_STATUS,
> > +					PCI_DOE_STATUS_INT_STATUS);
> > +		mod_delayed_work(system_wq, &doe_mb->statemachine, 0);
> > +		return IRQ_HANDLED;
> > +	}
> > +
> > +	return IRQ_NONE;
> > +}  
> 
> PCIe 6.0, table 7-316 says that an interrupt is also raised when
> "the DOE Busy bit has been Cleared", yet such an interrupt is
> not handled here.  It is incorrectly treated as a spurious
> interrupt by returning IRQ_NONE.  The right thing to do
> is probably to wake the state machine in case it's polling
> for the Busy flag to clear.

Ah. I remember testing this via a lot of hacking on the QEMU code
to inject the various races that can occur (it was really ugly to do).

Guess we lost the handling at some point.  I think your fix
is the right one.

> 
> 
> > +enum pci_doe_state {
> > +	DOE_IDLE,
> > +	DOE_WAIT_RESP,
> > +	DOE_WAIT_ABORT,
> > +	DOE_WAIT_ABORT_ON_ERR,
> > +};
> > +
> > +#define PCI_DOE_FLAG_ABORT	0
> > +#define PCI_DOE_FLAG_DEAD	1  
> 
> That's all internal and should live in doe.c, not the header file.
> 
> Thanks,
> 
> Lukas
> 

Thanks for the review and thanks again to Ira for taking this forwards.

Jonathan
Ira Weiny May 31, 2022, 11:43 p.m. UTC | #5
On Mon, May 30, 2022 at 09:06:57PM +0200, Lukas Wunner wrote:
> On Thu, Apr 14, 2022 at 01:32:30PM -0700, ira.weiny@intel.com wrote:
> > +static int pci_doe_recv_resp(struct pci_doe_mb *doe_mb, struct pci_doe_task *task)
> [...]
> > +	pci_write_config_dword(pdev, offset + PCI_DOE_READ, 0);
> > +	/* Read the second dword to get the length */
> > +	pci_read_config_dword(pdev, offset + PCI_DOE_READ, &val);
> > +	pci_write_config_dword(pdev, offset + PCI_DOE_READ, 0);
> > +
> > +	length = FIELD_GET(PCI_DOE_DATA_OBJECT_HEADER_2_LENGTH, val);
> > +	if (length > SZ_1M || length < 2)
> > +		return -EIO;
> > +
> 
> This API requires consumers to know the size of the response in advance.
> That's not always the case, as exemplified by SPDM VERSION responses.
> Jonathan uses a kludge in his SPDM patch which submits a first request
> with a fixed size of 2 versions and, if that turns out to be too small,
> submits a second request.
> 
> It would be nice if consumers are allowed to set request_pl to NULL.
> Then a payload can be allocated here in pci_doe_recv_resp() with the
> size retrieved above.
> 
> A flag may be necessary to indicate that the response is heap-allocated
> and needs to be freed upon destruction of the pci_doe_task.

I'm not comfortable with the PCI layer allocating memory, especially
optionally allocating memory.  And I'm not sure this really saves the extra
queries; the PCI layer would still have to query first to determine the
response length.

If the community does want PCI to allocate memory then I think the interface
should be to always allocate it.

> 
> 
> > +	/* First 2 dwords have already been read */
> > +	length -= 2;
> > +	/* Read the rest of the response payload */
> > +	for (i = 0; i < min(length, task->response_pl_sz / sizeof(u32)); i++) {
> > +		pci_read_config_dword(pdev, offset + PCI_DOE_READ,
> > +				      &task->response_pl[i]);
> > +		pci_write_config_dword(pdev, offset + PCI_DOE_READ, 0);
> > +	}
> 
> You need to check the Data Object Ready bit.  The device may clear the
> bit prematurely (e.g. as a result of a concurrent FLR or Conventional
> Reset).  You'll continue reading zero dwords from the mailbox and
> pretend success to the caller even though the response is truncated.

If an FLR happens then truncation does not seem like a problem.

> 
> If you're concerned about performance when checking the bit on every
> loop iteration, checking it only on the last but one iteration should
> be sufficient to detect truncation.

I'll review.  I don't think this is a fast path.

> 
> 
> > +	*vid = FIELD_GET(PCI_DOE_DATA_OBJECT_DISC_RSP_3_VID, response_pl);
> > +	*protocol = FIELD_GET(PCI_DOE_DATA_OBJECT_DISC_RSP_3_PROTOCOL,
> > +			      response_pl);
> > +	*index = FIELD_GET(PCI_DOE_DATA_OBJECT_DISC_RSP_3_NEXT_INDEX,
> > +			   response_pl);
> 
> The fact that you need line breaks here is an indication that the
> macros are too long.
> 
> 
> > +/* DOE Data Object - note not actually registers */
> > +#define PCI_DOE_DATA_OBJECT_HEADER_1_VID		0x0000ffff
> > +#define PCI_DOE_DATA_OBJECT_HEADER_1_TYPE		0x00ff0000
> > +#define PCI_DOE_DATA_OBJECT_HEADER_2_LENGTH		0x0003ffff
> > +
> > +#define PCI_DOE_DATA_OBJECT_DISC_REQ_3_INDEX		0x000000ff
> > +#define PCI_DOE_DATA_OBJECT_DISC_RSP_3_VID		0x0000ffff
> > +#define PCI_DOE_DATA_OBJECT_DISC_RSP_3_PROTOCOL		0x00ff0000
> > +#define PCI_DOE_DATA_OBJECT_DISC_RSP_3_NEXT_INDEX	0xff000000
> 
> I'd get rid of "DATA_OBJECT_" everywhere, it's redundant with the
> "Data Object" in "DOE".

I'm ok with renaming.  Jonathan chose the names and no one has complained yet.

> 
> 
> > +#define  PCI_DOE_STATUS_INT_STATUS		0x00000002  /* DOE Interrupt Status */
> 
> Another redundancy, I would get rid of the second "_STATUS".
> 
> 
> > +#define  PCI_DOE_STATUS_DATA_OBJECT_READY	0x80000000  /* Data Object Ready */
> 
> I would shorten to PCI_DOE_STATUS_READY.
> 
> 
> > 		Simplify submitting work to the mailbox
> > 			Replace pci_doe_exchange_sync() with
> > 			pci_doe_submit_task() Consumers of the mailbox
> > 			are now responsible for setting up callbacks
> > 			within a task object and submitting them to the
> > 			mailbox to be processed.
> 
> I honestly think that's a mistake.  All consumers both in the CDAT
> as well as in the SPDM series just want to wait for completion of
> the task.  They have no need for an arbitrary callback and shouldn't
> be burdended with providing one.  It just unnecessarily complicates
> the API.
> 
> So only providing pci_doe_exchange_sync() and doing away with
> pci_doe_submit_task() would seem like a more appropriate approach.

This was discussed a couple of times, and it was decided to keep the
synchronization out of the library functions and allow users to decide how to
handle it.

> 
> 
> > +/**
> > + * pci_doe_for_each_off - Iterate each DOE capability
> > + * @pdev: struct pci_dev to iterate
> > + * @off: u16 of config space offset of each mailbox capability found
> > + */
> > +#define pci_doe_for_each_off(pdev, off) \
> > +	for (off = pci_find_next_ext_capability(pdev, off, \
> > +					PCI_EXT_CAP_ID_DOE); \
> > +		off > 0; \
> > +		off = pci_find_next_ext_capability(pdev, off, \
> > +					PCI_EXT_CAP_ID_DOE))
> > +
> > +struct pci_doe_mb *pci_doe_create_mb(struct pci_dev *pdev, u16 cap_offset,
> > +				     bool use_irq);
> > +void pci_doe_destroy_mb(struct pci_doe_mb *doe_mb);
> > +bool pci_doe_supports_prot(struct pci_doe_mb *doe_mb, u16 vid, u8 type);
> 
> Drivers should not be concerned with the intricacies of DOE
> capabilities and mailboxes.

I think it depends on which driver one is considering.

> 
> Moreover, the above API doesn't allow different drivers to access
> the same DOE mailbox concurrently, e.g. if that mailbox supports
> multiple protocols.  There's no locking to serialize access to the
> mailbox by the drivers.

Correct, it is up to the higher-level code to do this.

> 
> This should be moved to the PCI core instead:  In pci_init_capabilities(),
> add a new call pci_doe_init() which enumerates all DOE capabilities.
> Add a list_head to struct pci_dev and add each DOE instance found
> to that list.  Destroy the list elements in pci_destroy_dev().
> No locking needed for the list_head, you only ever modify the list
> on device enumeration and destruction.
> 
> Then provide a pci_doe_find_mailbox() library function which drivers
> call to retrieve the pci_doe_mb for a given pci_dev/vid/type tuple.
> That avoids the need to traverse the list time and again.

All that could be done above these calls.  I've abandoned the aux bus stuff
after discussions with Dan.  So perhaps burying this away from the CXL drivers
is a good idea.  But I still think this can serve as a base within the PCIe
code.

> 
> 
> > +/**
> > + * struct pci_doe_mb - State for a single DOE mailbox
> 
> We generally use the same terms as the spec to make it easier for
> readers to connect the language in the spec to the implementation.
> 
> The spec uniformly refers to "DOE instance".  I guess "mailbox"
> is slightly more concise, so keep that, but please at least mention
> the term "instance" in the struct's kernel-doc.
> 
> This implementation uses the term "task" for one request/response.
> That term is not mentioned in the spec at all.  The spec refers to
> "exchange" and "transfer" on several occasions, so I would have chosen
> either one of those instead of the somewhat unfortunate "task".

Again this was discussed and we settled on task.  I'm not married to the term
and s/task/exchange/ is fine with me.  But IIRC Dan was not happy with
exchange.

> 
> 
> > + * This state is used to manage a single DOE mailbox capability.  All fields
> > + * should be considered opaque to the consumers and the structure passed into
> > + * the helpers below after being created by devm_pci_doe_create()
> 
> If the struct is considered opaque, why is it exposed in a public
> header file?  Just use a forward declaration in the header
> so that consumers can pass around pointers to the struct,
> and hide the declaration proper in doe.c.

I could do that.  It just did not seem important enough to make it that strict.
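For reference, the forward-declaration pattern being suggested could look
roughly like this.  The names (doe_mb_create(), doe_mb_cap_offset()) are
made up for illustration; in the kernel the definition and accessors would
live in doe.c behind the opaque type:

```c
#include <assert.h>
#include <stdint.h>
#include <stdlib.h>

/* --- public header: forward declaration only, struct stays opaque --- */
struct doe_mb;
struct doe_mb *doe_mb_create(uint16_t cap_offset);
uint16_t doe_mb_cap_offset(const struct doe_mb *mb);
void doe_mb_destroy(struct doe_mb *mb);

/* --- doe.c: full definition, hidden from consumers --- */
struct doe_mb {
	uint16_t cap_offset;	/* config space offset of the DOE capability */
};

struct doe_mb *doe_mb_create(uint16_t cap_offset)
{
	struct doe_mb *mb = malloc(sizeof(*mb));

	if (mb)
		mb->cap_offset = cap_offset;
	return mb;
}

uint16_t doe_mb_cap_offset(const struct doe_mb *mb)
{
	return mb->cap_offset;
}

void doe_mb_destroy(struct doe_mb *mb)
{
	free(mb);
}
```

Consumers can then only pass pointers around; any attempt to dereference the
struct outside doe.c fails to compile, which enforces the "opaque" claim in
the kernel-doc.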

> 
> 
> > + * @pdev: PCI device this belongs to mailbox belongs to
>                              ^^^^^^^^^^
> Typo.

oops... yeah, it slipped by in v9 as well.

>
> 
> > + * @prots: Array of protocols supported on this DOE
> > + * @num_prots: Size of prots array
> 
> Use @prots instead of prots everywhere in the kernel-doc.

Ok.

> 
> 
> > +	/*
> > +	 * NOTE: doe_mb_prots is freed by pci_doe_free_mb automatically on
> > +	 * error if pci_doe_cache_protocols() fails past this point.
> > +	 */
> 
> s/doe_mb_prots/doe_mb->prots/
> s/pci_doe_free_mb/pci_doe_free_mb()/

This note has changed.

> 
> 
> > +	/* DOE requests must be a whole number of DW */
> > +	if (task->request_pl_sz % sizeof(u32))
> > +		return -EINVAL;
> 
> It would be nice if this restriction could be lifted.  SPDM uses
> requests which are not padded to dword-length.  It can run over other
> transports which may not impose such restrictions.  The SPDM layer
> should not need to worry about quirks of the transport layer.

Are you suggesting that this layer PAD the request?  I hear your concern, but
users of this layer will need to account for this when allocating the request
payload anyway.
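For what it's worth, if this layer ever did pad on behalf of the caller, the
helper would be small.  This is only a sketch under that assumption;
doe_pad_to_dw() and doe_bounce() are hypothetical names, and the real code
would operate on the task's request payload:

```c
#include <assert.h>
#include <stddef.h>
#include <stdint.h>
#include <string.h>

/* Round a request payload size up to a whole number of dwords,
 * as the DOE transport requires. */
static size_t doe_pad_to_dw(size_t request_pl_sz)
{
	return (request_pl_sz + sizeof(uint32_t) - 1) &
	       ~(sizeof(uint32_t) - 1);
}

/*
 * Copy an unaligned payload into a zero-padded dword bounce buffer.
 * Returns the number of dwords to write to the mailbox, or 0 if the
 * buffer (buf_sz in bytes) is too small.
 */
static size_t doe_bounce(const uint8_t *pl, size_t pl_sz,
			 uint32_t *buf, size_t buf_sz)
{
	size_t padded = doe_pad_to_dw(pl_sz);

	if (padded > buf_sz)
		return 0;
	memset(buf, 0, padded);
	memcpy(buf, pl, pl_sz);
	return padded / sizeof(uint32_t);
}
```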

> 
> 
> > +static irqreturn_t pci_doe_irq_handler(int irq, void *data)
> > +{
> > +	struct pci_doe_mb *doe_mb = data;
> > +	struct pci_dev *pdev = doe_mb->pdev;
> > +	int offset = doe_mb->cap_offset;
> > +	u32 val;
> > +
> > +	pci_read_config_dword(pdev, offset + PCI_DOE_STATUS, &val);
> > +
> > +	/* Leave the error case to be handled outside IRQ */
> > +	if (FIELD_GET(PCI_DOE_STATUS_ERROR, val)) {
> > +		mod_delayed_work(system_wq, &doe_mb->statemachine, 0);
> > +		return IRQ_HANDLED;
> > +	}
> > +
> > +	if (FIELD_GET(PCI_DOE_STATUS_INT_STATUS, val)) {
> > +		pci_write_config_dword(pdev, offset + PCI_DOE_STATUS,
> > +					PCI_DOE_STATUS_INT_STATUS);
> > +		mod_delayed_work(system_wq, &doe_mb->statemachine, 0);
> > +		return IRQ_HANDLED;
> > +	}
> > +
> > +	return IRQ_NONE;
> > +}
> 
> PCIe 6.0, table 7-316 says that an interrupt is also raised when
> "the DOE Busy bit has been Cleared", yet such an interrupt is
> not handled here.  It is incorrectly treated as a spurious
> interrupt by returning IRQ_NONE.  The right thing to do
> is probably to wake the state machine in case it's polling
> for the Busy flag to clear.

I thought Jonathan and I had worked through how this was correct.  I'll have to
review those threads.

> 
> 
> > +enum pci_doe_state {
> > +	DOE_IDLE,
> > +	DOE_WAIT_RESP,
> > +	DOE_WAIT_ABORT,
> > +	DOE_WAIT_ABORT_ON_ERR,
> > +};
> > +
> > +#define PCI_DOE_FLAG_ABORT	0
> > +#define PCI_DOE_FLAG_DEAD	1
> 
> That's all internal and should live in doe.c, not the header file.

Sure if struct pci_doe_mb is moved.

Let's see how V9 looks from other perspectives.  I've incorporated some of this
feedback for V10.

Ira

> Thanks,
> 
> Lukas
Ira Weiny June 1, 2022, 2:59 a.m. UTC | #6
On Tue, May 31, 2022 at 11:33:50AM +0100, Jonathan Cameron wrote:
> On Mon, 30 May 2022 21:06:57 +0200
> Lukas Wunner <lukas@wunner.de> wrote:
> 
> > On Thu, Apr 14, 2022 at 01:32:30PM -0700, ira.weiny@intel.com wrote:
> > > +static int pci_doe_recv_resp(struct pci_doe_mb *doe_mb, struct pci_doe_task *task)  
> > [...]
> > > +	pci_write_config_dword(pdev, offset + PCI_DOE_READ, 0);
> > > +	/* Read the second dword to get the length */
> > > +	pci_read_config_dword(pdev, offset + PCI_DOE_READ, &val);
> > > +	pci_write_config_dword(pdev, offset + PCI_DOE_READ, 0);
> > > +
> > > +	length = FIELD_GET(PCI_DOE_DATA_OBJECT_HEADER_2_LENGTH, val);
> > > +	if (length > SZ_1M || length < 2)
> > > +		return -EIO;
> > > +  
> > 
> > This API requires consumers to know the size of the response in advance.
> > That's not always the case, as exemplified by SPDM VERSION responses.
> > Jonathan uses a kludge in his SPDM patch which submits a first request
> > with a fixed size of 2 versions and, if that turns out to be too small,
> > submits a second request.
> > 
> > It would be nice if consumers are allowed to set request_pl to NULL.
> > Then a payload can be allocated here in pci_doe_recv_resp() with the
> > size retrieved above.
> > 
> > A flag may be necessary to indicate that the response is heap-allocated
> > and needs to be freed upon destruction of the pci_doe_task.
> 
> If possible I'd make that the callers problem. It should know it provided
> NULL and expected a response.
> 
> I'd suggest making this a 'future' feature just to keep this initial
> version simple.  Won't be hard to add later.

Agreed.

> 
> > 
> > 
> > > +	/* First 2 dwords have already been read */
> > > +	length -= 2;
> > > +	/* Read the rest of the response payload */
> > > +	for (i = 0; i < min(length, task->response_pl_sz / sizeof(u32)); i++) {
> > > +		pci_read_config_dword(pdev, offset + PCI_DOE_READ,
> > > +				      &task->response_pl[i]);
> > > +		pci_write_config_dword(pdev, offset + PCI_DOE_READ, 0);
> > > +	}  
> > 
> > You need to check the Data Object Ready bit.  The device may clear the
> > bit prematurely (e.g. as a result of a concurrent FLR or Conventional
> > Reset).  You'll continue reading zero dwords from the mailbox and
> > pretend success to the caller even though the response is truncated.
> > 
> > If you're concerned about performance when checking the bit on every
> > loop iteration, checking it only on the last but one iteration should
> > be sufficient to detect truncation.
> 
> Good catch - I hate corner cases.  Thankfully this one is trivial to
> check for.

Ok, looking at the spec: strictly speaking this check needs to happen multiple
times, both in doe_statemachine_work() and inside pci_doe_recv_resp(); not just
in this loop.  :-(

This is because the check in doe_statemachine_work() only covers the 1st dword
read, IIUC.  Therefore, we should put this into the state machine properly,
unless it is agreed to busy-wait for the Data Object Ready bit once the first
one is seen?

I'm not sure I really like that, because I'm not sure how hardware is going to
behave.  Will it potentially delay in the middle of a response payload?

I think complete correctness requires pci_doe_recv_resp() to be split up such
that the state machine could be scheduled on any dword read...  :-(  And I
suspect that anything else is a hack.  That is going to take a bit of rework.
But I think it may be the best thing to do.

Before I embark on this rework, tell me I'm not crazy?

> 
> > 
> > 
> > > +	*vid = FIELD_GET(PCI_DOE_DATA_OBJECT_DISC_RSP_3_VID, response_pl);
> > > +	*protocol = FIELD_GET(PCI_DOE_DATA_OBJECT_DISC_RSP_3_PROTOCOL,
> > > +			      response_pl);
> > > +	*index = FIELD_GET(PCI_DOE_DATA_OBJECT_DISC_RSP_3_NEXT_INDEX,
> > > +			   response_pl);  
> > 
> > The fact that you need line breaks here is an indication that the
> > macros are too long.
> > 
> > 
> > > +/* DOE Data Object - note not actually registers */
> > > +#define PCI_DOE_DATA_OBJECT_HEADER_1_VID		0x0000ffff
> > > +#define PCI_DOE_DATA_OBJECT_HEADER_1_TYPE		0x00ff0000
> > > +#define PCI_DOE_DATA_OBJECT_HEADER_2_LENGTH		0x0003ffff
> > > +
> > > +#define PCI_DOE_DATA_OBJECT_DISC_REQ_3_INDEX		0x000000ff
> > > +#define PCI_DOE_DATA_OBJECT_DISC_RSP_3_VID		0x0000ffff
> > > +#define PCI_DOE_DATA_OBJECT_DISC_RSP_3_PROTOCOL		0x00ff0000
> > > +#define PCI_DOE_DATA_OBJECT_DISC_RSP_3_NEXT_INDEX	0xff000000  
> > 
> > I'd get rid of "DATA_OBJECT_" everywhere, it's redundant with the
> > "Data Object" in "DOE".
> 
> We need something to make it clear these aren't DVSEC headers for
> example, but definitely could be shorter.
> 
> PCI_DOE_OBJ_... 
> 
> perhaps?

FWIW I'm inclined to leave them for now.

> 
> > 
> > 
> > > +#define  PCI_DOE_STATUS_INT_STATUS		0x00000002  /* DOE Interrupt Status */  
> > 
> > Another redundancy, I would get rid of the second "_STATUS".
> 
> Hmm. I'll leave this one to Ira's discretion but I'm not keen on cropping
> too much of naming given loss of alignment with the spec.

Again I'm inclined to leave it.

> 
> > 
> > 
> > > +#define  PCI_DOE_STATUS_DATA_OBJECT_READY	0x80000000  /* Data Object Ready */  
> > 
> > I would shorten to PCI_DOE_STATUS_READY.
> 
> Seems reasonable though always some burden in moving
> away from full spec names.

I think this is fine.

> 
> > 
> > 
> > > 		Simplify submitting work to the mailbox
> > > 			Replace pci_doe_exchange_sync() with
> > > 			pci_doe_submit_task() Consumers of the mailbox
> > > 			are now responsible for setting up callbacks
> > > 			within a task object and submitting them to the
> > > 			mailbox to be processed.  
> > 
> > I honestly think that's a mistake.  All consumers both in the CDAT
> > as well as in the SPDM series just want to wait for completion of
> > the task.  They have no need for an arbitrary callback and shouldn't
> > be burdended with providing one.  It just unnecessarily complicates
> > the API.
> > 
> > So only providing pci_doe_exchange_sync() and doing away with
> > pci_doe_submit_task() would seem like a more appropriate approach.
> 
> We've gone around the houses with this.  At this stage I don't
> care strongly either way and it's all in kernel code so we
> can refine it later once we have lots of examples.

Agreed.

> 
> > 
> > 
> > > +/**
> > > + * pci_doe_for_each_off - Iterate each DOE capability
> > > + * @pdev: struct pci_dev to iterate
> > > + * @off: u16 of config space offset of each mailbox capability found
> > > + */
> > > +#define pci_doe_for_each_off(pdev, off) \
> > > +	for (off = pci_find_next_ext_capability(pdev, off, \
> > > +					PCI_EXT_CAP_ID_DOE); \
> > > +		off > 0; \
> > > +		off = pci_find_next_ext_capability(pdev, off, \
> > > +					PCI_EXT_CAP_ID_DOE))
> > > +
> > > +struct pci_doe_mb *pci_doe_create_mb(struct pci_dev *pdev, u16 cap_offset,
> > > +				     bool use_irq);
> > > +void pci_doe_destroy_mb(struct pci_doe_mb *doe_mb);
> > > +bool pci_doe_supports_prot(struct pci_doe_mb *doe_mb, u16 vid, u8 type);  
> > 
> > Drivers should not be concerned with the intricacies of DOE
> > capabilities and mailboxes.
> > 
> > Moreover, the above API doesn't allow different drivers to access
> > the same DOE mailbox concurrently, e.g. if that mailbox supports
> > multiple protocols.  There's no locking to serialize access to the
> > mailbox by the drivers.
> > 
> > This should be moved to the PCI core instead:  In pci_init_capabilities(),
> > add a new call pci_doe_init() which enumerates all DOE capabilities.
> > Add a list_head to struct pci_dev and add each DOE instance found
> > to that list.  Destroy the list elements in pci_destroy_dev().
> > No locking needed for the list_head, you only ever modify the list
> > on device enumeration and destruction.
> > 
> > Then provide a pci_doe_find_mailbox() library function which drivers
> > call to retrieve the pci_doe_mb for a given pci_dev/vid/type tuple.
> > That avoids the need to traverse the list time and again.
> 
> I've lost track a bit as this thread has been running a long time,  but
> I think we are moving back to something that looks more like this
> (and the much earlier versions of the set which did more or less what you
> describe - though they had their own issues resolved in the meantime)
> 
> There is an exciting corner around interrupts though that complicates
> things.  Currently I think I'm right in saying the pci core never
> sets up interrupts, that is always a job for the drivers.
> 
> My first thought is to do something similar to the hack I did for
> the switch ports, and use polled mode to query support protocols.
> Then later call from driver that will attempt to enable interrupts
> if desired for a given DOE instance.
> 

V9 is already getting respun, but let's look there for the move toward this.  I
don't think the CXL code should get any of the DOE mailbox code directly.  But
I'm unclear how to tie it together with the PCI side ATM.

For this series CDAT is only useful for CXL, and as the only user I think
having the CXL PCI code instantiate the mailboxes for now is fine.  But I think
a more general home for that is needed for other protocols and users.

> 
> > 
> > 
> > > +/**
> > > + * struct pci_doe_mb - State for a single DOE mailbox  
> > 
> > We generally use the same terms as the spec to make it easier for
> > readers to connect the language in the spec to the implementation.
> > 
> > The spec uniformly refers to "DOE instance".  I guess "mailbox"
> > is slightly more concise, so keep that, but please at least mention
> > the term "instance" in the struct's kernel-doc.
> > 
> > This implementation uses the term "task" for one request/response.
> > That term is not mentioned in the spec at all.  The spec refers to
> > "exchange" and "transfer" on several occasions, so I would have chosen
> > either one of those instead of the somewhat unfortunate "task".
> > 
> > 
> > > + * This state is used to manage a single DOE mailbox capability.  All fields
> > > + * should be considered opaque to the consumers and the structure passed into
> > > + * the helpers below after being created by devm_pci_doe_create()  
> > 
> > If the struct is considered opaque, why is it exposed in a public
> > header file?  Just use a forward declaration in the header
> > so that consumers can pass around pointers to the struct,
> > and hide the declaration proper in doe.c.
> > 
> > 
> > > + * @pdev: PCI device this belongs to mailbox belongs to  
> >                              ^^^^^^^^^^
> > Typo.
> > 
> > > + * @prots: Array of protocols supported on this DOE
> > > + * @num_prots: Size of prots array  
> > 
> > Use @prots instead of prots everywhere in the kernel-doc.
> > 
> > 
> > > +	/*
> > > +	 * NOTE: doe_mb_prots is freed by pci_doe_free_mb automatically on
> > > +	 * error if pci_doe_cache_protocols() fails past this point.
> > > +	 */  
> > 
> > s/doe_mb_prots/doe_mb->prots/
> > s/pci_doe_free_mb/pci_doe_free_mb()/
> > 
> > 
> > > +	/* DOE requests must be a whole number of DW */
> > > +	if (task->request_pl_sz % sizeof(u32))
> > > +		return -EINVAL;  
> > 
> > It would be nice if this restriction could be lifted.  SPDM uses
> > requests which are not padded to dword-length.  It can run over other
> > transports which may not impose such restrictions.  The SPDM layer
> > should not need to worry about quirks of the transport layer.
> 
> This is a pain.  DOE absolutely requires 32 bit padding and
> reads need to be 32 bit.   We can obviously do something nasty
> with dealing with the tail via a bounce buffer though.
> 
> I'd pencil that in as a future feature though rather than worry
> about it before we have any support at all in place.

I think any mismatch between something like SPDM and DOE will need to be
handled by the SPDM-to-DOE code, not the DOE layer.

> 
> > 
> > 
> > > +static irqreturn_t pci_doe_irq_handler(int irq, void *data)
> > > +{
> > > +	struct pci_doe_mb *doe_mb = data;
> > > +	struct pci_dev *pdev = doe_mb->pdev;
> > > +	int offset = doe_mb->cap_offset;
> > > +	u32 val;
> > > +
> > > +	pci_read_config_dword(pdev, offset + PCI_DOE_STATUS, &val);
> > > +
> > > +	/* Leave the error case to be handled outside IRQ */
> > > +	if (FIELD_GET(PCI_DOE_STATUS_ERROR, val)) {
> > > +		mod_delayed_work(system_wq, &doe_mb->statemachine, 0);
> > > +		return IRQ_HANDLED;
> > > +	}
> > > +
> > > +	if (FIELD_GET(PCI_DOE_STATUS_INT_STATUS, val)) {
> > > +		pci_write_config_dword(pdev, offset + PCI_DOE_STATUS,
> > > +					PCI_DOE_STATUS_INT_STATUS);
> > > +		mod_delayed_work(system_wq, &doe_mb->statemachine, 0);
> > > +		return IRQ_HANDLED;
> > > +	}
> > > +
> > > +	return IRQ_NONE;
> > > +}  
> > 
> > PCIe 6.0, table 7-316 says that an interrupt is also raised when
> > "the DOE Busy bit has been Cleared", yet such an interrupt is
> > not handled here.  It is incorrectly treated as a spurious
> > interrupt by returning IRQ_NONE.  The right thing to do
> > is probably to wake the state machine in case it's polling
> > for the Busy flag to clear.
> 
> Ah. I remember testing this via a lot of hacking on the QEMU code
> to inject the various races that can occur (it was really ugly to do).
> 
> Guess we lost the handling at some point.  I think your fix
> is the right one.

Perhaps I am missing something, but digging into this more, I disagree that the
handler fails to handle this case.  If I read the spec correctly, DOE Interrupt
Status must be set when an interrupt is generated.  The handler wakes the state
machine in that case.  The state machine then checks for busy if there is work
to be done.

Normally we would not even need to check for status error.  But that is special
cased because clearing that status is left to the state machine.

> 
> > 
> > 
> > > +enum pci_doe_state {
> > > +	DOE_IDLE,
> > > +	DOE_WAIT_RESP,
> > > +	DOE_WAIT_ABORT,
> > > +	DOE_WAIT_ABORT_ON_ERR,
> > > +};
> > > +
> > > +#define PCI_DOE_FLAG_ABORT	0
> > > +#define PCI_DOE_FLAG_DEAD	1  
> > 
> > That's all internal and should live in doe.c, not the header file.
> > 
> > Thanks,
> > 
> > Lukas
> > 
> 
> Thanks for the review and thanks again to Ira for taking this forwards.

Thanks!
Ira
Lukas Wunner June 1, 2022, 7:18 a.m. UTC | #7
On Tue, May 31, 2022 at 07:59:21PM -0700, Ira Weiny wrote:
> On Tue, May 31, 2022 at 11:33:50AM +0100, Jonathan Cameron wrote:
> > On Mon, 30 May 2022 21:06:57 +0200 Lukas Wunner <lukas@wunner.de> wrote:
> > > On Thu, Apr 14, 2022 at 01:32:30PM -0700, ira.weiny@intel.com wrote:
> > > > +	/* First 2 dwords have already been read */
> > > > +	length -= 2;
> > > > +	/* Read the rest of the response payload */
> > > > +	for (i = 0; i < min(length, task->response_pl_sz / sizeof(u32)); i++) {
> > > > +		pci_read_config_dword(pdev, offset + PCI_DOE_READ,
> > > > +				      &task->response_pl[i]);
> > > > +		pci_write_config_dword(pdev, offset + PCI_DOE_READ, 0);
> > > > +	}  
> > > 
> > > You need to check the Data Object Ready bit.  The device may clear the
> > > bit prematurely (e.g. as a result of a concurrent FLR or Conventional
> > > Reset).  You'll continue reading zero dwords from the mailbox and
> > > pretend success to the caller even though the response is truncated.
> > > 
> > > If you're concerned about performance when checking the bit on every
> > > loop iteration, checking it only on the last but one iteration should
> > > be sufficient to detect truncation.
> > 
> > Good catch - I hate corner cases.  Thankfully this one is trivial to
> > check for.
> 
> Ok looking at the spec:  Strictly speaking this needs to happen multiple
> times both in doe_statemachine_work() and inside pci_doe_recv_resp();
> not just in this loop.  :-(
> 
> This is because, the check in doe_statemachine_work() only covers the
> 1st dword read IIUC.

The spec says "this bit indicates the DOE instance has a *data object*
available to be read by system firmware/software".

So, the entire object is available for reading, not just one dword.

You've already got checks in place for the first two dwords which
cover reading an "all zeroes" response.  No need to amend them.

You only need to re-check the Data Object Ready bit on the last-but-one
dword in case the function was reset concurrently.  Per sec. 6.30.2,
"An FLR to a Function must result in the aborting of any DOE transfer
in progress."


> > > > +static irqreturn_t pci_doe_irq_handler(int irq, void *data)
> > > > +{
> > > > +	struct pci_doe_mb *doe_mb = data;
> > > > +	struct pci_dev *pdev = doe_mb->pdev;
> > > > +	int offset = doe_mb->cap_offset;
> > > > +	u32 val;
> > > > +
> > > > +	pci_read_config_dword(pdev, offset + PCI_DOE_STATUS, &val);
> > > > +
> > > > +	/* Leave the error case to be handled outside IRQ */
> > > > +	if (FIELD_GET(PCI_DOE_STATUS_ERROR, val)) {
> > > > +		mod_delayed_work(system_wq, &doe_mb->statemachine, 0);
> > > > +		return IRQ_HANDLED;
> > > > +	}
> > > > +
> > > > +	if (FIELD_GET(PCI_DOE_STATUS_INT_STATUS, val)) {
> > > > +		pci_write_config_dword(pdev, offset + PCI_DOE_STATUS,
> > > > +					PCI_DOE_STATUS_INT_STATUS);
> > > > +		mod_delayed_work(system_wq, &doe_mb->statemachine, 0);
> > > > +		return IRQ_HANDLED;
> > > > +	}
> > > > +
> > > > +	return IRQ_NONE;
> > > > +}  
> > > 
> > > PCIe 6.0, table 7-316 says that an interrupt is also raised when
> > > "the DOE Busy bit has been Cleared", yet such an interrupt is
> > > not handled here.  It is incorrectly treated as a spurious
> > > interrupt by returning IRQ_NONE.  The right thing to do
> > > is probably to wake the state machine in case it's polling
> > > for the Busy flag to clear.
> > 
> > Ah. I remember testing this via a lot of hacking on the QEMU code
> > to inject the various races that can occur (it was really ugly to do).
> > 
> > Guess we lost the handling at some point.  I think your fix
> > is the right one.
> 
> Perhaps I am missing something but digging into this more.  I disagree
> that the handler fails to handle this case.  If I read the spec correctly
> DOE Interrupt Status must be set when an interrupt is generated.
> The handler wakes the state machine in that case.  The state machine
> then checks for busy if there is work to be done.

Right, I was mistaken, sorry for the noise.


> Normally we would not even need to check for status error.  But that is
> special cased because clearing that status is left to the state machine.

That however looks wrong because the DOE Interrupt Status bit is never
cleared after a DOE Error is signaled.  The state machine performs an
explicit abort upon an error by setting the DOE Abort bit, but that
doesn't seem to clear DOE Interrupt Status:

Per section 6.30.2, "At any time, the system firmware/software is
permitted to set the DOE Abort bit in the DOE Control Register,
and the DOE instance must Clear the Data Object Ready bit,
if not already Clear, and Clear the DOE Error bit, if already Set,
in the DOE Status Register, within 1 second."

No mention of the DOE Interrupt Status bit, so we cannot assume that
it's cleared by a DOE Abort and we must clear it explicitly.

Thanks,

Lukas
Jonathan Cameron June 1, 2022, 2:23 p.m. UTC | #8
On Wed, 1 Jun 2022 09:18:08 +0200
Lukas Wunner <lukas@wunner.de> wrote:

> On Tue, May 31, 2022 at 07:59:21PM -0700, Ira Weiny wrote:
> > On Tue, May 31, 2022 at 11:33:50AM +0100, Jonathan Cameron wrote:  
> > > On Mon, 30 May 2022 21:06:57 +0200 Lukas Wunner <lukas@wunner.de> wrote:  
> > > > On Thu, Apr 14, 2022 at 01:32:30PM -0700, ira.weiny@intel.com wrote:  
> > > > > +	/* First 2 dwords have already been read */
> > > > > +	length -= 2;
> > > > > +	/* Read the rest of the response payload */
> > > > > +	for (i = 0; i < min(length, task->response_pl_sz / sizeof(u32)); i++) {
> > > > > +		pci_read_config_dword(pdev, offset + PCI_DOE_READ,
> > > > > +				      &task->response_pl[i]);
> > > > > +		pci_write_config_dword(pdev, offset + PCI_DOE_READ, 0);
> > > > > +	}    
> > > > 
> > > > You need to check the Data Object Ready bit.  The device may clear the
> > > > bit prematurely (e.g. as a result of a concurrent FLR or Conventional
> > > > Reset).  You'll continue reading zero dwords from the mailbox and
> > > > pretend success to the caller even though the response is truncated.
> > > > 
> > > > If you're concerned about performance when checking the bit on every
> > > > loop iteration, checking it only on the last but one iteration should
> > > > be sufficient to detect truncation.  
> > > 
> > > Good catch - I hate corner cases.  Thankfully this one is trivial to
> > > check for.  
> > 
> > Ok looking at the spec:  Strictly speaking this needs to happen multiple
> > times both in doe_statemachine_work() and inside pci_doe_recv_resp();
> > not just in this loop.  :-(
> > 
> > This is because, the check in doe_statemachine_work() only covers the
> > 1st dword read IIUC.  
> 
> The spec says "this bit indicates the DOE instance has a *data object*
> available to be read by system firmware/software".
> 
> So, the entire object is available for reading, not just one dword.

Agreed

> 
> You've already got checks in place for the first two dwords which
> cover reading an "all zeroes" response.  No need to amend them.
> 
> You only need to re-check the Data Object Ready bit on the last-but-one
> dword in case the function was reset concurrently.  Per sec. 6.30.2,
> "An FLR to a Function must result in the aborting of any DOE transfer
> in progress."

Ouch, isn't that racy as you can only check it slightly before reading the
last dword and a reset could occur in between?

> 
> 
> > > > > +static irqreturn_t pci_doe_irq_handler(int irq, void *data)
> > > > > +{
> > > > > +	struct pci_doe_mb *doe_mb = data;
> > > > > +	struct pci_dev *pdev = doe_mb->pdev;
> > > > > +	int offset = doe_mb->cap_offset;
> > > > > +	u32 val;
> > > > > +
> > > > > +	pci_read_config_dword(pdev, offset + PCI_DOE_STATUS, &val);
> > > > > +
> > > > > +	/* Leave the error case to be handled outside IRQ */
> > > > > +	if (FIELD_GET(PCI_DOE_STATUS_ERROR, val)) {
> > > > > +		mod_delayed_work(system_wq, &doe_mb->statemachine, 0);
> > > > > +		return IRQ_HANDLED;
> > > > > +	}
> > > > > +
> > > > > +	if (FIELD_GET(PCI_DOE_STATUS_INT_STATUS, val)) {
> > > > > +		pci_write_config_dword(pdev, offset + PCI_DOE_STATUS,
> > > > > +					PCI_DOE_STATUS_INT_STATUS);
> > > > > +		mod_delayed_work(system_wq, &doe_mb->statemachine, 0);
> > > > > +		return IRQ_HANDLED;
> > > > > +	}
> > > > > +
> > > > > +	return IRQ_NONE;
> > > > > +}    
> > > > 
> > > > PCIe 6.0, table 7-316 says that an interrupt is also raised when
> > > > "the DOE Busy bit has been Cleared", yet such an interrupt is
> > > > not handled here.  It is incorrectly treated as a spurious
> > > > interrupt by returning IRQ_NONE.  The right thing to do
> > > > is probably to wake the state machine in case it's polling
> > > > for the Busy flag to clear.  
> > > 
> > > Ah. I remember testing this via a lot of hacking on the QEMU code
> > > to inject the various races that can occur (it was really ugly to do).
> > > 
> > > Guess we lost the handling at some point.  I think your fix
> > > is the right one.  
> > 
> > Perhaps I am missing something but digging into this more.  I disagree
> > that the handler fails to handle this case.  If I read the spec correctly
> > DOE Interrupt Status must be set when an interrupt is generated.
> > The handler wakes the state machine in that case.  The state machine
> > then checks for busy if there is work to be done.  
> 
> Right, I was mistaken, sorry for the noise.

Ah. Makes sense - managed to confuse me too ;)
I really don't like the absence of a status bit for the "DOE Busy bit has been
Cleared" interrupt source, but glad we did handle it.

> 
> 
> > Normally we would not even need to check for status error.  But that is
> > special cased because clearing that status is left to the state machine.  
> 
> That however looks wrong because the DOE Interrupt Status bit is never
> cleared after a DOE Error is signaled.  The state machine performs an
> explicit abort upon an error by setting the DOE Abort bit, but that
> doesn't seem to clear DOE Interrupt Status:
> 
> Per section 6.30.2, "At any time, the system firmware/software is
> permitted to set the DOE Abort bit in the DOE Control Register,
> and the DOE instance must Clear the Data Object Ready bit,
> if not already Clear, and Clear the DOE Error bit, if already Set,
> in the DOE Status Register, within 1 second."
> 
> No mention of the DOE Interrupt Status bit, so we cannot assume that
> it's cleared by a DOE Abort and we must clear it explicitly.

Gah.

Jonathan

> 
> Thanks,
> 
> Lukas
>
Ira Weiny June 1, 2022, 5:16 p.m. UTC | #9
On Wed, Jun 01, 2022 at 09:18:08AM +0200, Lukas Wunner wrote:
> On Tue, May 31, 2022 at 07:59:21PM -0700, Ira Weiny wrote:
> > On Tue, May 31, 2022 at 11:33:50AM +0100, Jonathan Cameron wrote:
> > > On Mon, 30 May 2022 21:06:57 +0200 Lukas Wunner <lukas@wunner.de> wrote:
> > > > On Thu, Apr 14, 2022 at 01:32:30PM -0700, ira.weiny@intel.com wrote:
> > > > > +	/* First 2 dwords have already been read */
> > > > > +	length -= 2;
> > > > > +	/* Read the rest of the response payload */
> > > > > +	for (i = 0; i < min(length, task->response_pl_sz / sizeof(u32)); i++) {
> > > > > +		pci_read_config_dword(pdev, offset + PCI_DOE_READ,
> > > > > +				      &task->response_pl[i]);
> > > > > +		pci_write_config_dword(pdev, offset + PCI_DOE_READ, 0);
> > > > > +	}  
> > > > 
> > > > You need to check the Data Object Ready bit.  The device may clear the
> > > > bit prematurely (e.g. as a result of a concurrent FLR or Conventional
> > > > Reset).  You'll continue reading zero dwords from the mailbox and
> > > > pretend success to the caller even though the response is truncated.
> > > > 
> > > > If you're concerned about performance when checking the bit on every
> > > > loop iteration, checking it only on the last but one iteration should
> > > > be sufficient to detect truncation.
> > > 
> > > Good catch - I hate corner cases.  Thankfully this one is trivial to
> > > check for.
> > 
> > Ok looking at the spec:  Strictly speaking this needs to happen multiple
> > times both in doe_statemachine_work() and inside pci_doe_recv_resp();
> > not just in this loop.  :-(
> > 
> > This is because, the check in doe_statemachine_work() only covers the
> > 1st dword read IIUC.
> 
> The spec says "this bit indicates the DOE instance has a *data object*
> available to be read by system firmware/software".

Ok, yeah.  I got confused by step 6 in sec 6.30.2.

> 
> So, the entire object is available for reading, not just one dword.

Yes cool!

> 
> You've already got checks in place for the first two dwords which
> cover reading an "all zeroes" response.  No need to amend them.
> 
> You only need to re-check the Data Object Ready bit on the last-but-one
> dword in case the function was reset concurrently.  Per sec. 6.30.2,
> "An FLR to a Function must result in the aborting of any DOE transfer
> in progress."

I think I disagree.  Even if we do that and an FLR comes before the last read,
the last read could be 0's.

I think the interpretation of the data needs to happen above this, and if an
FLR happens during this read we are going to have other issues.

But I can add the check if you feel strongly about it.
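To make the disagreement concrete, here is a toy model of the read loop with
the check Lukas describes, using an array-backed mock instead of the config
space accessors.  The mock_* helpers and doe_read_payload() are invented for
illustration only:

```c
#include <assert.h>
#include <stdint.h>

#define DOE_STATUS_READY 0x80000000u

/*
 * Mock of a mailbox: a status register plus a FIFO of response dwords.
 * mock_read()/mock_ack() stand in for the PCI_DOE_READ config space
 * read and the write-to-acknowledge that advances the FIFO.
 */
struct mock_doe {
	uint32_t status;
	uint32_t fifo[16];
	unsigned int pos, len;
};

static uint32_t mock_read(struct mock_doe *d)
{
	return d->pos < d->len ? d->fifo[d->pos] : 0;
}

static void mock_ack(struct mock_doe *d)
{
	if (d->pos < d->len)
		d->pos++;
	else
		d->status &= ~DOE_STATUS_READY;	/* object exhausted */
}

/*
 * Read @n payload dwords.  Re-check Data Object Ready before reading the
 * final dword so a response truncated mid-object (e.g. by a concurrent
 * FLR) is reported as an error instead of silently padded with zeroes.
 */
static int doe_read_payload(struct mock_doe *d, uint32_t *pl, unsigned int n)
{
	unsigned int i;

	for (i = 0; i < n; i++) {
		if (i == n - 1 && !(d->status & DOE_STATUS_READY))
			return -1;	/* truncated response */
		pl[i] = mock_read(d);
		mock_ack(d);
	}
	return 0;
}
```

As noted above, this still races if the reset lands between the final check
and the final read, so it narrows the window rather than closing it.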

> 
> 
> > > > > +static irqreturn_t pci_doe_irq_handler(int irq, void *data)
> > > > > +{
> > > > > +	struct pci_doe_mb *doe_mb = data;
> > > > > +	struct pci_dev *pdev = doe_mb->pdev;
> > > > > +	int offset = doe_mb->cap_offset;
> > > > > +	u32 val;
> > > > > +
> > > > > +	pci_read_config_dword(pdev, offset + PCI_DOE_STATUS, &val);
> > > > > +
> > > > > +	/* Leave the error case to be handled outside IRQ */
> > > > > +	if (FIELD_GET(PCI_DOE_STATUS_ERROR, val)) {
> > > > > +		mod_delayed_work(system_wq, &doe_mb->statemachine, 0);
> > > > > +		return IRQ_HANDLED;
> > > > > +	}
> > > > > +
> > > > > +	if (FIELD_GET(PCI_DOE_STATUS_INT_STATUS, val)) {
> > > > > +		pci_write_config_dword(pdev, offset + PCI_DOE_STATUS,
> > > > > +					PCI_DOE_STATUS_INT_STATUS);
> > > > > +		mod_delayed_work(system_wq, &doe_mb->statemachine, 0);
> > > > > +		return IRQ_HANDLED;
> > > > > +	}
> > > > > +
> > > > > +	return IRQ_NONE;
> > > > > +}  
> > > > 
> > > > PCIe 6.0, table 7-316 says that an interrupt is also raised when
> > > > "the DOE Busy bit has been Cleared", yet such an interrupt is
> > > > not handled here.  It is incorrectly treated as a spurious
> > > > interrupt by returning IRQ_NONE.  The right thing to do
> > > > is probably to wake the state machine in case it's polling
> > > > for the Busy flag to clear.
> > > 
> > > Ah. I remember testing this via a lot of hacking on the QEMU code
> > > to inject the various races that can occur (it was really ugly to do).
> > > 
> > > Guess we lost the handling at some point.  I think your fix
> > > is the right one.
> > 
> > Perhaps I am missing something but digging into this more.  I disagree
> > that the handler fails to handle this case.  If I read the spec correctly
> > DOE Interrupt Status must be set when an interrupt is generated.
> > The handler wakes the state machine in that case.  The state machine
> > then checks for busy if there is work to be done.
> 
> Right, I was mistaken, sorry for the noise.

NP I'm not always following this either.

> 
> 
> > Normally we would not even need to check for status error.  But that is
> > special cased because clearing that status is left to the state machine.
> 
> That however looks wrong because the DOE Interrupt Status bit is never
> cleared after a DOE Error is signaled.  The state machine performs an
> explicit abort upon an error by setting the DOE Abort bit, but that
> doesn't seem to clear DOE Interrupt Status:
> 
> Per section 6.30.2, "At any time, the system firmware/software is
> permitted to set the DOE Abort bit in the DOE Control Register,
> and the DOE instance must Clear the Data Object Ready bit,
> if not already Clear, and Clear the DOE Error bit, if already Set,
> in the DOE Status Register, within 1 second."

I thought that meant the hardware (the DOE instance) must clear those bits
within 1 second?

> 
> No mention of the DOE Interrupt Status bit, so we cannot assume that
> it's cleared by a DOE Abort and we must clear it explicitly.

Oh...  yea.  Jonathan?  We discussed this before and I was convinced it worked
but I think Lukas is correct here.

Should we drop the special case in pci_doe_irq_handler() and just clear the
status always?  Or should we wait and clear it in pci_doe_abort_start?

Ira

> 
> Thanks,
> 
> Lukas
Lukas Wunner June 1, 2022, 5:56 p.m. UTC | #10
On Wed, Jun 01, 2022 at 10:16:15AM -0700, Ira Weiny wrote:
> On Wed, Jun 01, 2022 at 09:18:08AM +0200, Lukas Wunner wrote:
> > You only need to re-check the Data Object Ready bit on the last-but-one
> > dword in case the function was reset concurrently.  Per sec. 6.30.2,
> > "An FLR to a Function must result in the aborting of any DOE transfer
> > in progress."
> 
> I think I disagree.  Even if we do that and an FLR comes before the last read
> the last read could be 0's.

PCIe r6.0, Table 7-316 says:

  "If there is no additional data object ready for transfer, the
   DOE instance must clear this bit after the entire data object has been
   transferred, as indicated by software writing to the DOE Read Data
   Mailbox Register after reading the final DW of the data object."

Remember that you *read* a dword from the mailbox and then acknowledge
reception to the mailbox by *writing* a dword to the mailbox.

So you check that the Data Object Ready bit is set before acknowledging
the final dword with a register write.  That's race-free.

(I realize me talking about the "last-but-one dword" above was quite
unclear, sorry about that.)

Thanks,

Lukas
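[Editor's sketch] Lukas's race-free ordering — re-check Data Object Ready *before* writing the acknowledgement for the final dword — can be illustrated with a small user-space mock. Everything here (the mock structure, names, the FLR trigger) is hypothetical; only the check ordering mirrors the suggestion above:

```c
#include <assert.h>
#include <stdbool.h>
#include <stdint.h>

/* Hypothetical user-space mock of a DOE mailbox; not kernel code. */
struct mock_doe {
	uint32_t payload[8];
	int next_dw;        /* next dword the device returns */
	int total_dw;
	int flr_after;      /* simulate FLR after N acks (-1 = never) */
	bool object_ready;  /* Data Object Ready status bit */
};

static uint32_t mock_read(struct mock_doe *mb)
{
	/* an aborted transfer reads back as all zeroes */
	return mb->object_ready ? mb->payload[mb->next_dw] : 0;
}

/* Software "acks" a dword by writing the Read Data Mailbox register. */
static void mock_ack(struct mock_doe *mb)
{
	if (!mb->object_ready)
		return;
	if (++mb->next_dw == mb->total_dw || mb->next_dw == mb->flr_after)
		mb->object_ready = false;
}

/*
 * Read an object of n dwords.  Re-check Data Object Ready before
 * acknowledging the final dword: if a concurrent reset cleared it,
 * report truncation instead of returning zero-filled data.
 */
static int read_object(struct mock_doe *mb, uint32_t *buf, int n)
{
	for (int i = 0; i < n; i++) {
		buf[i] = mock_read(mb);
		if (i == n - 1 && !mb->object_ready)
			return -1;  /* truncated by a concurrent reset */
		mock_ack(mb);
	}
	return 0;
}
```

In the normal case Ready stays set until software writes the ack for the final dword, so the check passes; if an FLR clears it mid-transfer, the check before the final ack catches the truncation.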
Ira Weiny June 1, 2022, 8:17 p.m. UTC | #11
On Wed, Jun 01, 2022 at 07:56:47PM +0200, Lukas Wunner wrote:
> On Wed, Jun 01, 2022 at 10:16:15AM -0700, Ira Weiny wrote:
> > On Wed, Jun 01, 2022 at 09:18:08AM +0200, Lukas Wunner wrote:
> > > You only need to re-check the Data Object Ready bit on the last-but-one
> > > dword in case the function was reset concurrently.  Per sec. 6.30.2,
> > > "An FLR to a Function must result in the aborting of any DOE transfer
> > > in progress."
> > 
> > I think I disagree.  Even if we do that and an FLR comes before the last read
> > the last read could be 0's.
> 
> PCIe r6.0, Table 7-316 says:
> 
>   "If there is no additional data object ready for transfer, the
>    DOE instance must clear this bit after the entire data object has been
>    transferred, as indicated by software writing to the DOE Read Data
>    Mailbox Register after reading the final DW of the data object."
> 
> Remember that you *read* a dword from the mailbox and then acknowledge
> reception to the mailbox by *writing* a dword to the mailbox.
> 
> So you check that the Data Object Ready bit is set before acknowledging
> the final dword with a register write.  That's race-free.

Ok.

> 
> (I realize me talking about the "last-but-one dword" above was quite
> unclear, sorry about that.)
> 

Ah yes.  Ok, I'll put in a check before the final write.

Ira
Jonathan Cameron June 6, 2022, 2:46 p.m. UTC | #12
On Wed, 1 Jun 2022 10:16:15 -0700
Ira Weiny <ira.weiny@intel.com> wrote:

> On Wed, Jun 01, 2022 at 09:18:08AM +0200, Lukas Wunner wrote:
> > On Tue, May 31, 2022 at 07:59:21PM -0700, Ira Weiny wrote:  
> > > On Tue, May 31, 2022 at 11:33:50AM +0100, Jonathan Cameron wrote:  
> > > > On Mon, 30 May 2022 21:06:57 +0200 Lukas Wunner <lukas@wunner.de> wrote:  
> > > > > On Thu, Apr 14, 2022 at 01:32:30PM -0700, ira.weiny@intel.com wrote:  
> > > > > > +	/* First 2 dwords have already been read */
> > > > > > +	length -= 2;
> > > > > > +	/* Read the rest of the response payload */
> > > > > > +	for (i = 0; i < min(length, task->response_pl_sz / sizeof(u32)); i++) {
> > > > > > +		pci_read_config_dword(pdev, offset + PCI_DOE_READ,
> > > > > > +				      &task->response_pl[i]);
> > > > > > +		pci_write_config_dword(pdev, offset + PCI_DOE_READ, 0);
> > > > > > +	}    
> > > > > 
> > > > > You need to check the Data Object Ready bit.  The device may clear the
> > > > > bit prematurely (e.g. as a result of a concurrent FLR or Conventional
> > > > > Reset).  You'll continue reading zero dwords from the mailbox and
> > > > > pretend success to the caller even though the response is truncated.
> > > > > 
> > > > > If you're concerned about performance when checking the bit on every
> > > > > loop iteration, checking it only on the last but one iteration should
> > > > > be sufficient to detect truncation.  
> > > > 
> > > > Good catch - I hate corner cases.  Thankfully this one is trivial to
> > > > check for.  
> > > 
> > > Ok looking at the spec:  Strictly speaking this needs to happen multiple
> > > times both in doe_statemachine_work() and inside pci_doe_recv_resp();
> > > not just in this loop.  :-(
> > > 
> > > This is because, the check in doe_statemachine_work() only covers the
> > > 1st dword read IIUC.  
> > 
> > The spec says "this bit indicates the DOE instance has a *data object*
> > available to be read by system firmware/software".  
> 
> Ok yea.  I got confused by step 6 in sec 6.30.2.
> 
> > 
> > So, the entire object is available for reading, not just one dword.  
> 
> Yes cool!
> 
> > 
> > You've already got checks in place for the first two dwords which
> > cover reading an "all zeroes" response.  No need to amend them.
> > 
> > You only need to re-check the Data Object Ready bit on the last-but-one
> > dword in case the function was reset concurrently.  Per sec. 6.30.2,
> > "An FLR to a Function must result in the aborting of any DOE transfer
> > in progress."  
> 
> I think I disagree.  Even if we do that and an FLR comes before the last read
> the last read could be 0's.
> 
> I think the interpretation of the data needs to happen above this and if an FLR
> happens during this read we are going to have other issues.
> 
> But I can add the check if you feel strongly about it.
> 
> > 
> >   
> > > > > > +static irqreturn_t pci_doe_irq_handler(int irq, void *data)
> > > > > > +{
> > > > > > +	struct pci_doe_mb *doe_mb = data;
> > > > > > +	struct pci_dev *pdev = doe_mb->pdev;
> > > > > > +	int offset = doe_mb->cap_offset;
> > > > > > +	u32 val;
> > > > > > +
> > > > > > +	pci_read_config_dword(pdev, offset + PCI_DOE_STATUS, &val);
> > > > > > +
> > > > > > +	/* Leave the error case to be handled outside IRQ */
> > > > > > +	if (FIELD_GET(PCI_DOE_STATUS_ERROR, val)) {
> > > > > > +		mod_delayed_work(system_wq, &doe_mb->statemachine, 0);
> > > > > > +		return IRQ_HANDLED;
> > > > > > +	}
> > > > > > +
> > > > > > +	if (FIELD_GET(PCI_DOE_STATUS_INT_STATUS, val)) {
> > > > > > +		pci_write_config_dword(pdev, offset + PCI_DOE_STATUS,
> > > > > > +					PCI_DOE_STATUS_INT_STATUS);
> > > > > > +		mod_delayed_work(system_wq, &doe_mb->statemachine, 0);
> > > > > > +		return IRQ_HANDLED;
> > > > > > +	}
> > > > > > +
> > > > > > +	return IRQ_NONE;
> > > > > > +}    
> > > > > 
> > > > > PCIe 6.0, table 7-316 says that an interrupt is also raised when
> > > > > "the DOE Busy bit has been Cleared", yet such an interrupt is
> > > > > not handled here.  It is incorrectly treated as a spurious
> > > > > interrupt by returning IRQ_NONE.  The right thing to do
> > > > > is probably to wake the state machine in case it's polling
> > > > > for the Busy flag to clear.  
> > > > 
> > > > Ah. I remember testing this via a lot of hacking on the QEMU code
> > > > to inject the various races that can occur (it was really ugly to do).
> > > > 
> > > > Guess we lost the handling at some point.  I think your fix
> > > > is the right one.  
> > > 
> > > Perhaps I am missing something but digging into this more.  I disagree
> > > that the handler fails to handle this case.  If I read the spec correctly
> > > DOE Interrupt Status must be set when an interrupt is generated.
> > > The handler wakes the state machine in that case.  The state machine
> > > then checks for busy if there is work to be done.  
> > 
> > Right, I was mistaken, sorry for the noise.  
> 
> NP I'm not always following this either.
> 
> > 
> >   
> > > Normally we would not even need to check for status error.  But that is
> > > special cased because clearing that status is left to the state machine.  
> > 
> > That however looks wrong because the DOE Interrupt Status bit is never
> > cleared after a DOE Error is signaled.  The state machine performs an
> > explicit abort upon an error by setting the DOE Abort bit, but that
> > doesn't seem to clear DOE Interrupt Status:
> > 
> > Per section 6.30.2, "At any time, the system firmware/software is
> > permitted to set the DOE Abort bit in the DOE Control Register,
> > and the DOE instance must Clear the Data Object Ready bit,
> > if not already Clear, and Clear the DOE Error bit, if already Set,
> > in the DOE Status Register, within 1 second."  
> 
> I thought that meant the hardware (the DOE instance) must clear those bits
> within 1 second?
> 
> > 
> > No mention of the DOE Interrupt Status bit, so we cannot assume that
> > it's cleared by a DOE Abort and we must clear it explicitly.  
> 
> Oh...  yea.  Jonathan?  We discussed this before and I was convinced it worked
> but I think Lukas is correct here.

Hmm. I thought we were good as well, but Lukas is correct in saying
the interrupt status bit isn't cleared (which is 'novel' given the associated
bit to tell you what the interrupt means will be cleared).

I'm not sure I want to think around the race conditions that result...

> 
> Should we drop the special case in pci_doe_irq_handler() and just clear the
> status always?  Or should we wait and clear it in pci_doe_abort_start?

I don't think it matters.  pci_doe_irq_handler() seems a little cleaner.

I've not figured out completely if there are races however...

It is set when not already set and we get transitions of any of the following:
- DOE error bit set (this can't happen until abort so no race here)

- Data Object Ready bit is set: Can this happen with the DOE error set? I don't
  immediately see language saying it can't. However, I don't think it can
  for any of the challenge response protocols yet defined (and there are other
  problems if anyone wants to implement unsolicited messages)

- DOE busy bit has cleared - can definitely happen after an abort (which is
  fine as nothing to do anyway, so we'll handle a pointless interrupt).
  Could it in theory happen when error is set? I think not but only because
  of the statement  "Clear this bit when it is able to receive a new data
  object."

So I think we are fine doing it preabort, but wouldn't put it past a hardware
designer to find some path through that which results in a bonus interrupt
and potentially us resetting twice.

If we clear it at the end of abort instead, what happens?
Definitely no interrupts until we clear it. As we are doing query response
protocols only, no new data until state machine moves on, so fine there.

So what about just doing it unconditionally..

+	case DOE_WAIT_ABORT:
+	case DOE_WAIT_ABORT_ON_ERR:
+		pci_read_config_dword(pdev, offset + PCI_DOE_STATUS, &val);
+
+		if (!FIELD_GET(PCI_DOE_STATUS_ERROR, val) &&
+		    !FIELD_GET(PCI_DOE_STATUS_BUSY, val)) {

	here...

+			/* Back to normal state - carry on */
+			retire_cur_task(doe_mb);

This feels a little bit more 'standard' as we are allowing new interrupts
only after everything is back to a nice state.

> 
> Ira
> 
> > 
> > Thanks,
> > 
> > Lukas
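[Editor's sketch] The first option discussed above — clearing DOE Interrupt Status unconditionally in the interrupt handler, even on the error path, since a DOE Abort is not guaranteed to clear it — can be modelled outside the kernel. The bit values and names below are illustrative stand-ins, not the posted patch:

```c
#include <assert.h>
#include <stdint.h>

/* Illustrative bit layout; loosely mirrors PCI_DOE_STATUS_*. */
#define DOE_STATUS_INT_STATUS  (1u << 1)
#define DOE_STATUS_ERROR       (1u << 2)

struct mock_mb {
	uint32_t status;     /* mock DOE Status register */
	int work_kicks;      /* stand-in for mod_delayed_work() calls */
};

/*
 * Reworked handler sketch: always clear DOE Interrupt Status before
 * waking the state machine, because a later DOE Abort is not
 * guaranteed to clear it.  Returns 1 for IRQ_HANDLED, 0 for IRQ_NONE.
 */
static int doe_irq(struct mock_mb *mb)
{
	uint32_t val = mb->status;

	if (val & (DOE_STATUS_ERROR | DOE_STATUS_INT_STATUS)) {
		/* RW1C in real hardware; a plain clear in this mock */
		mb->status &= ~DOE_STATUS_INT_STATUS;
		mb->work_kicks++;
		return 1;
	}
	return 0;
}
```

The error bit itself is deliberately left set for the state machine to handle; only the interrupt-status latch is cleared in the handler.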
Ira Weiny June 6, 2022, 7:56 p.m. UTC | #13
On Mon, Jun 06, 2022 at 03:46:46PM +0100, Jonathan Cameron wrote:
> On Wed, 1 Jun 2022 10:16:15 -0700
> Ira Weiny <ira.weiny@intel.com> wrote:
> 
> > On Wed, Jun 01, 2022 at 09:18:08AM +0200, Lukas Wunner wrote:
> > > On Tue, May 31, 2022 at 07:59:21PM -0700, Ira Weiny wrote:  
> > > > On Tue, May 31, 2022 at 11:33:50AM +0100, Jonathan Cameron wrote:  
> > > > > On Mon, 30 May 2022 21:06:57 +0200 Lukas Wunner <lukas@wunner.de> wrote:  
> > > > > > On Thu, Apr 14, 2022 at 01:32:30PM -0700, ira.weiny@intel.com wrote:  
> > 
> > > 
> > >   
> > > > > > > +static irqreturn_t pci_doe_irq_handler(int irq, void *data)
> > > > > > > +{
> > > > > > > +	struct pci_doe_mb *doe_mb = data;
> > > > > > > +	struct pci_dev *pdev = doe_mb->pdev;
> > > > > > > +	int offset = doe_mb->cap_offset;
> > > > > > > +	u32 val;
> > > > > > > +
> > > > > > > +	pci_read_config_dword(pdev, offset + PCI_DOE_STATUS, &val);
> > > > > > > +
> > > > > > > +	/* Leave the error case to be handled outside IRQ */
> > > > > > > +	if (FIELD_GET(PCI_DOE_STATUS_ERROR, val)) {
> > > > > > > +		mod_delayed_work(system_wq, &doe_mb->statemachine, 0);
> > > > > > > +		return IRQ_HANDLED;
> > > > > > > +	}
> > > > > > > +
> > > > > > > +	if (FIELD_GET(PCI_DOE_STATUS_INT_STATUS, val)) {
> > > > > > > +		pci_write_config_dword(pdev, offset + PCI_DOE_STATUS,
> > > > > > > +					PCI_DOE_STATUS_INT_STATUS);
> > > > > > > +		mod_delayed_work(system_wq, &doe_mb->statemachine, 0);
> > > > > > > +		return IRQ_HANDLED;
> > > > > > > +	}
> > > > > > > +
> > > > > > > +	return IRQ_NONE;
> > > > > > > +}    
> > > > > > 
> > > > > > PCIe 6.0, table 7-316 says that an interrupt is also raised when
> > > > > > "the DOE Busy bit has been Cleared", yet such an interrupt is
> > > > > > not handled here.  It is incorrectly treated as a spurious
> > > > > > interrupt by returning IRQ_NONE.  The right thing to do
> > > > > > is probably to wake the state machine in case it's polling
> > > > > > for the Busy flag to clear.  
> > > > > 
> > > > > Ah. I remember testing this via a lot of hacking on the QEMU code
> > > > > to inject the various races that can occur (it was really ugly to do).
> > > > > 
> > > > > Guess we lost the handling at some point.  I think your fix
> > > > > is the right one.  
> > > > 
> > > > Perhaps I am missing something but digging into this more.  I disagree
> > > > that the handler fails to handle this case.  If I read the spec correctly
> > > > DOE Interrupt Status must be set when an interrupt is generated.
> > > > The handler wakes the state machine in that case.  The state machine
> > > > then checks for busy if there is work to be done.  
> > > 
> > > Right, I was mistaken, sorry for the noise.  
> > 
> > NP I'm not always following this either.
> > 
> > > 
> > >   
> > > > Normally we would not even need to check for status error.  But that is
> > > > special cased because clearing that status is left to the state machine.  
> > > 
> > > That however looks wrong because the DOE Interrupt Status bit is never
> > > cleared after a DOE Error is signaled.  The state machine performs an
> > > explicit abort upon an error by setting the DOE Abort bit, but that
> > > doesn't seem to clear DOE Interrupt Status:
> > > 
> > > Per section 6.30.2, "At any time, the system firmware/software is
> > > permitted to set the DOE Abort bit in the DOE Control Register,
> > > and the DOE instance must Clear the Data Object Ready bit,
> > > if not already Clear, and Clear the DOE Error bit, if already Set,
> > > in the DOE Status Register, within 1 second."  
> > 
> > I thought that meant the hardware (the DOE instance) must clear those bits
> > within 1 second?
> > 
> > > 
> > > No mention of the DOE Interrupt Status bit, so we cannot assume that
> > > it's cleared by a DOE Abort and we must clear it explicitly.  
> > 
> > Oh...  yea.  Jonathan?  We discussed this before and I was convinced it worked
> > but I think Lukas is correct here.
> 
> Hmm. I thought we were good as well, but Lukas is correct in saying
> the interrupt status bit isn't cleared (which is 'novel' given the associated
> bit to tell you what the interrupt means will be cleared).
> 
> I'm not sure I want to think around the race conditions that result...
> 
> > 
> > Should we drop the special case in pci_doe_irq_handler() and just clear the
> > status always?  Or should we wait and clear it in pci_doe_abort_start?
> 
> I don't think it matters.  pci_doe_irq_handler() seems a little cleaner.

I agree and that is what V10 does.

> 
> I've not figured out completely if there are races however...

This is why I reworked the handling of cur_task in those error cases.

> 
> It is set when not already set and we get transitions of any of the following:
> - DOE error bit set (this can't happen until abort so no race here)
> 
> - Data Object Ready bit is set: Can this happen with the DOE error set? I don't
>   immediately see language saying it can't. However, I don't think it can
>   for any of the challenge response protocols yet defined (and there are other
>   problems if anyone wants to implement unsolicited messages)
> 
> - DOE busy bit has cleared - can definitely happen after an abort (which is
>   fine as nothing to do anyway, so we'll handle a pointless interrupt).
>   Could it in theory happen when error is set? I think not but only because
>   of the statement  "Clear this bit when it is able to receive a new data
>   object."
> 
> So I think we are fine doing it preabort,

That is what I thought for V10, especially after reworking the cur_task locking.
An extra interrupt would either start processing the next task or return with
nothing to do.

> but wouldn't put it past a hardware
> designer to find some path through that which results in a bonus interrupt
> and potentially us resetting twice.
> 
> If we clear it at the end of abort instead, what happens?
> Definitely no interrupts until we clear it. As we are doing query response
> protocols only, no new data until state machine moves on, so fine there.
> 
> So what about just doing it unconditionally..
> 
> +	case DOE_WAIT_ABORT:
> +	case DOE_WAIT_ABORT_ON_ERR:
> +		pci_read_config_dword(pdev, offset + PCI_DOE_STATUS, &val);
> +
> +		if (!FIELD_GET(PCI_DOE_STATUS_ERROR, val) &&
> +		    !FIELD_GET(PCI_DOE_STATUS_BUSY, val)) {
> 
> 	here...
> 
> +			/* Back to normal state - carry on */
> +			retire_cur_task(doe_mb);
> 
> This feels a little bit more 'standard' as we are allowing new interrupts
> only after everything is back to a nice state.

As I reworked the cur_task locking I really thought about locking cur_task
throughout doe_statemachine_work().  It seems a lot safer for a lot of reasons.
Doing so would make the extra work item no big deal.

So I looked at this again because you got me worried.  If mod_delayed_work()
can cause doe_statemachine_work() to run while another thread is in the middle
of processing the interrupt, there is a chance that signal_task_complete() is
called a second time on a given task pointer.

However, I _don't_ _think_ that can happen.  Because I don't think
mod_delayed_work() can cause the work item to run while it is already running.

So unless I misunderstand how mod_delayed_work() works we are guaranteed that
the extra interrupt will see the correct mailbox state and do the right thing.

Ira
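[Editor's sketch] The mod_delayed_work() property being relied on here — a work item never runs concurrently with itself, and modifying it mid-run just schedules one more pass after the current one completes — can be illustrated with a toy single-worker model. This is entirely hypothetical and not the real workqueue implementation:

```c
#include <assert.h>
#include <stdbool.h>

/* Toy model of the non-reentrancy guarantee; not the real workqueue. */
struct toy_work {
	bool running;
	bool pending;
	int runs;
};

static void toy_run(struct toy_work *w, bool irq_mid_run);

/* mod_delayed_work() stand-in: re-queue if running, else run now. */
static void toy_mod_delayed_work(struct toy_work *w)
{
	if (w->running)
		w->pending = true;
	else
		toy_run(w, false);
}

static void toy_run(struct toy_work *w, bool irq_mid_run)
{
	assert(!w->running);          /* the guarantee: never re-entered */
	w->running = true;
	w->runs++;
	if (irq_mid_run)
		toy_mod_delayed_work(w);  /* only marks the work pending */
	w->running = false;
	if (w->pending) {             /* extra pass runs strictly after this one */
		w->pending = false;
		toy_run(w, false);
	}
}
```

An interrupt arriving while the work runs produces one extra, strictly serialized pass — so the extra pass sees the final mailbox state, matching the conclusion above.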
Jonathan Cameron June 7, 2022, 9:58 a.m. UTC | #14
On Mon, 6 Jun 2022 12:56:05 -0700
Ira Weiny <ira.weiny@intel.com> wrote:

> On Mon, Jun 06, 2022 at 03:46:46PM +0100, Jonathan Cameron wrote:
> > On Wed, 1 Jun 2022 10:16:15 -0700
> > Ira Weiny <ira.weiny@intel.com> wrote:
> >   
> > > On Wed, Jun 01, 2022 at 09:18:08AM +0200, Lukas Wunner wrote:  
> > > > On Tue, May 31, 2022 at 07:59:21PM -0700, Ira Weiny wrote:    
> > > > > On Tue, May 31, 2022 at 11:33:50AM +0100, Jonathan Cameron wrote:    
> > > > > > On Mon, 30 May 2022 21:06:57 +0200 Lukas Wunner <lukas@wunner.de> wrote:    
> > > > > > > On Thu, Apr 14, 2022 at 01:32:30PM -0700, ira.weiny@intel.com wrote:    
> > >   
> > > > 
> > > >     
> > > > > > > > +static irqreturn_t pci_doe_irq_handler(int irq, void *data)
> > > > > > > > +{
> > > > > > > > +	struct pci_doe_mb *doe_mb = data;
> > > > > > > > +	struct pci_dev *pdev = doe_mb->pdev;
> > > > > > > > +	int offset = doe_mb->cap_offset;
> > > > > > > > +	u32 val;
> > > > > > > > +
> > > > > > > > +	pci_read_config_dword(pdev, offset + PCI_DOE_STATUS, &val);
> > > > > > > > +
> > > > > > > > +	/* Leave the error case to be handled outside IRQ */
> > > > > > > > +	if (FIELD_GET(PCI_DOE_STATUS_ERROR, val)) {
> > > > > > > > +		mod_delayed_work(system_wq, &doe_mb->statemachine, 0);
> > > > > > > > +		return IRQ_HANDLED;
> > > > > > > > +	}
> > > > > > > > +
> > > > > > > > +	if (FIELD_GET(PCI_DOE_STATUS_INT_STATUS, val)) {
> > > > > > > > +		pci_write_config_dword(pdev, offset + PCI_DOE_STATUS,
> > > > > > > > +					PCI_DOE_STATUS_INT_STATUS);
> > > > > > > > +		mod_delayed_work(system_wq, &doe_mb->statemachine, 0);
> > > > > > > > +		return IRQ_HANDLED;
> > > > > > > > +	}
> > > > > > > > +
> > > > > > > > +	return IRQ_NONE;
> > > > > > > > +}      
> > > > > > > 
> > > > > > > PCIe 6.0, table 7-316 says that an interrupt is also raised when
> > > > > > > "the DOE Busy bit has been Cleared", yet such an interrupt is
> > > > > > > not handled here.  It is incorrectly treated as a spurious
> > > > > > > interrupt by returning IRQ_NONE.  The right thing to do
> > > > > > > is probably to wake the state machine in case it's polling
> > > > > > > for the Busy flag to clear.    
> > > > > > 
> > > > > > Ah. I remember testing this via a lot of hacking on the QEMU code
> > > > > > to inject the various races that can occur (it was really ugly to do).
> > > > > > 
> > > > > > Guess we lost the handling at some point.  I think your fix
> > > > > > is the right one.    
> > > > > 
> > > > > Perhaps I am missing something but digging into this more.  I disagree
> > > > > that the handler fails to handle this case.  If I read the spec correctly
> > > > > DOE Interrupt Status must be set when an interrupt is generated.
> > > > > The handler wakes the state machine in that case.  The state machine
> > > > > then checks for busy if there is work to be done.    
> > > > 
> > > > Right, I was mistaken, sorry for the noise.    
> > > 
> > > NP I'm not always following this either.
> > >   
> > > > 
> > > >     
> > > > > Normally we would not even need to check for status error.  But that is
> > > > > special cased because clearing that status is left to the state machine.    
> > > > 
> > > > That however looks wrong because the DOE Interrupt Status bit is never
> > > > cleared after a DOE Error is signaled.  The state machine performs an
> > > > explicit abort upon an error by setting the DOE Abort bit, but that
> > > > doesn't seem to clear DOE Interrupt Status:
> > > > 
> > > > Per section 6.30.2, "At any time, the system firmware/software is
> > > > permitted to set the DOE Abort bit in the DOE Control Register,
> > > > and the DOE instance must Clear the Data Object Ready bit,
> > > > if not already Clear, and Clear the DOE Error bit, if already Set,
> > > > in the DOE Status Register, within 1 second."    
> > > 
> > > I thought that meant the hardware (the DOE instance) must clear those bits
> > > within 1 second?
> > >   
> > > > 
> > > > No mention of the DOE Interrupt Status bit, so we cannot assume that
> > > > it's cleared by a DOE Abort and we must clear it explicitly.    
> > > 
> > > Oh...  yea.  Jonathan?  We discussed this before and I was convinced it worked
> > > but I think Lukas is correct here.  
> > 
> > Hmm. I thought we were good as well, but Lukas is correct in saying
> > the interrupt status bit isn't cleared (which is 'novel' given the associated
> > bit to tell you what the interrupt means will be cleared).
> > 
> > I'm not sure I want to think around the race conditions that result...
> >   
> > > 
> > > Should we drop the special case in pci_doe_irq_handler() and just clear the
> > > status always?  Or should we wait and clear it in pci_doe_abort_start?
> > 
> > I don't think it matters.  pci_doe_irq_handler() seems a little cleaner.  
> 
> I agree and that is what V10 does.
> 
> > 
> > I've not figured out completely if there are races however...  
> 
> This is why I reworked the handling of cur_task in those error cases.
> 
> > 
> > It is set when not already set and we get transitions of any of the following:
> > - DOE error bit set (this can't happen until abort so no race here)
> > 
> > - Data Object Ready bit is set: Can this happen with the DOE error set? I don't
> >   immediately see language saying it can't. However, I don't think it can
> >   for any of the challenge response protocols yet defined (and there are other
> >   problems if anyone wants to implement unsolicited messages)
> > 
> > - DOE busy bit has cleared - can definitely happen after an abort (which is
> >   fine as nothing to do anyway, so we'll handle a pointless interrupt).
> >   Could it in theory happen when error is set? I think not but only because
> >   of the statement  "Clear this bit when it is able to receive a new data
> >   object."
> > 
> > So I think we are fine doing it preabort,  
> 
> That is what I thought for V10, especially after reworking the cur_task locking.
> An extra interrupt would either start processing the next task or return with
> nothing to do.
> 
> > but wouldn't put it past a hardware
> > designer to find some path through that which results in a bonus interrupt
> > and potentially us resetting twice.
> > 
> > If we clear it at the end of abort instead, what happens?
> > Definitely no interrupts until we clear it. As we are doing query response
> > protocols only, no new data until state machine moves on, so fine there.
> > 
> > So what about just doing it unconditionally..
> > 
> > +	case DOE_WAIT_ABORT:
> > +	case DOE_WAIT_ABORT_ON_ERR:
> > +		pci_read_config_dword(pdev, offset + PCI_DOE_STATUS, &val);
> > +
> > +		if (!FIELD_GET(PCI_DOE_STATUS_ERROR, val) &&
> > +		    !FIELD_GET(PCI_DOE_STATUS_BUSY, val)) {
> > 
> > 	here...
> > 
> > +			/* Back to normal state - carry on */
> > +			retire_cur_task(doe_mb);
> > 
> > This feels a little bit more 'standard' as we are allowing new interrupts
> > only after everything is back to a nice state.  
> 
> As I reworked the cur_task locking I really thought about locking cur_task
> throughout doe_statemachine_work().  It seems a lot safer for a lot of reasons.
> Doing so would make the extra work item no big deal.
> 
> So I looked at this again because you got me worried.  If mod_delayed_work()
> can cause doe_statemachine_work() to run while another thread is in the middle
> of processing the interrupt, there is a chance that signal_task_complete() is
> called a second time on a given task pointer.
> 
> However, I _don't_ _think_ that can happen.  Because I don't think
> mod_delayed_work() can cause the work item to run while it is already running.

You are correct. I remember looking into that exact question for
a different project a while ago.

> 
> So unless I misunderstand how mod_delayed_work() works we are guaranteed that
> the extra interrupt will see the correct mailbox state and do the right thing.

Agreed.  Far as I can tell we are fine.  More eyes always good though if anyone
else wants to take a look!

Jonathan

p.s. I liked the original heavyweight approach of queuing the whole thing on a
mutex as it was a lot easier to reason about :)  Was ugly though!


> 
> Ira
diff mbox series

Patch

diff --git a/drivers/pci/Kconfig b/drivers/pci/Kconfig
index 133c73207782..b2f2e588a817 100644
--- a/drivers/pci/Kconfig
+++ b/drivers/pci/Kconfig
@@ -121,6 +121,9 @@  config XEN_PCIDEV_FRONTEND
 config PCI_ATS
 	bool
 
+config PCI_DOE
+	bool
+
 config PCI_ECAM
 	bool
 
diff --git a/drivers/pci/Makefile b/drivers/pci/Makefile
index 0da6b1ebc694..b609d8ad813d 100644
--- a/drivers/pci/Makefile
+++ b/drivers/pci/Makefile
@@ -31,6 +31,7 @@  obj-$(CONFIG_PCI_ECAM)		+= ecam.o
 obj-$(CONFIG_PCI_P2PDMA)	+= p2pdma.o
 obj-$(CONFIG_XEN_PCIDEV_FRONTEND) += xen-pcifront.o
 obj-$(CONFIG_VGA_ARB)		+= vgaarb.o
+obj-$(CONFIG_PCI_DOE)		+= pci-doe.o
 
 # Endpoint library must be initialized before its users
 obj-$(CONFIG_PCI_ENDPOINT)	+= endpoint/
diff --git a/drivers/pci/pci-doe.c b/drivers/pci/pci-doe.c
new file mode 100644
index 000000000000..ccf936421d2a
--- /dev/null
+++ b/drivers/pci/pci-doe.c
@@ -0,0 +1,607 @@ 
+// SPDX-License-Identifier: GPL-2.0
+/*
+ * Data Object Exchange
+ * https://members.pcisig.com/wg/PCI-SIG/document/16609
+ *
+ * Copyright (C) 2021 Huawei
+ *	Jonathan Cameron <Jonathan.Cameron@huawei.com>
+ *
+ * Copyright (C) 2022 Intel Corporation
+ *	Ira Weiny <ira.weiny@intel.com>
+ */
+
+#include <linux/bitfield.h>
+#include <linux/delay.h>
+#include <linux/jiffies.h>
+#include <linux/mutex.h>
+#include <linux/pci.h>
+#include <linux/pci-doe.h>
+#include <linux/workqueue.h>
+
+#define PCI_DOE_PROTOCOL_DISCOVERY 0
+
+#define PCI_DOE_BUSY_MAX_RETRIES 16
+#define PCI_DOE_POLL_INTERVAL (HZ / 128)
+
+/* Timeout of 1 second from 6.30.2 Operation, PCI Spec v6.0 */
+#define PCI_DOE_TIMEOUT HZ
+
+static irqreturn_t pci_doe_irq_handler(int irq, void *data)
+{
+	struct pci_doe_mb *doe_mb = data;
+	struct pci_dev *pdev = doe_mb->pdev;
+	int offset = doe_mb->cap_offset;
+	u32 val;
+
+	pci_read_config_dword(pdev, offset + PCI_DOE_STATUS, &val);
+
+	/* Leave the error case to be handled outside IRQ */
+	if (FIELD_GET(PCI_DOE_STATUS_ERROR, val)) {
+		mod_delayed_work(system_wq, &doe_mb->statemachine, 0);
+		return IRQ_HANDLED;
+	}
+
+	if (FIELD_GET(PCI_DOE_STATUS_INT_STATUS, val)) {
+		pci_write_config_dword(pdev, offset + PCI_DOE_STATUS,
+					PCI_DOE_STATUS_INT_STATUS);
+		mod_delayed_work(system_wq, &doe_mb->statemachine, 0);
+		return IRQ_HANDLED;
+	}
+
+	return IRQ_NONE;
+}
+
+/*
+ * Only called when safe to directly access the DOE from
+ * doe_statemachine_work().  Outside access is not protected.  Users who
+ * perform such access are left with the pieces.
+ */
+static void pci_doe_abort_start(struct pci_doe_mb *doe_mb)
+{
+	struct pci_dev *pdev = doe_mb->pdev;
+	int offset = doe_mb->cap_offset;
+	u32 val;
+
+	val = PCI_DOE_CTRL_ABORT;
+	if (doe_mb->irq >= 0)
+		val |= PCI_DOE_CTRL_INT_EN;
+	pci_write_config_dword(pdev, offset + PCI_DOE_CTRL, val);
+
+	doe_mb->timeout_jiffies = jiffies + HZ;
+	schedule_delayed_work(&doe_mb->statemachine, HZ);
+}
+
+static int pci_doe_send_req(struct pci_doe_mb *doe_mb, struct pci_doe_task *task)
+{
+	struct pci_dev *pdev = doe_mb->pdev;
+	int offset = doe_mb->cap_offset;
+	u32 val;
+	int i;
+
+	/*
+	 * Check the DOE busy bit is not set. If it is set, this could indicate
+	 * someone other than Linux (e.g. firmware) is using the mailbox. Note
+	 * it is expected that firmware and OS will negotiate access rights via
+	 * an, as yet to be defined method.
+	 */
+	pci_read_config_dword(pdev, offset + PCI_DOE_STATUS, &val);
+	if (FIELD_GET(PCI_DOE_STATUS_BUSY, val))
+		return -EBUSY;
+
+	if (FIELD_GET(PCI_DOE_STATUS_ERROR, val))
+		return -EIO;
+
+	/* Write DOE Header */
+	val = FIELD_PREP(PCI_DOE_DATA_OBJECT_HEADER_1_VID, task->prot.vid) |
+		FIELD_PREP(PCI_DOE_DATA_OBJECT_HEADER_1_TYPE, task->prot.type);
+	pci_write_config_dword(pdev, offset + PCI_DOE_WRITE, val);
+	/* Length is 2 DW of header + length of payload in DW */
+	pci_write_config_dword(pdev, offset + PCI_DOE_WRITE,
+			       FIELD_PREP(PCI_DOE_DATA_OBJECT_HEADER_2_LENGTH,
+					  2 + task->request_pl_sz /
+						sizeof(u32)));
+	for (i = 0; i < task->request_pl_sz / sizeof(u32); i++)
+		pci_write_config_dword(pdev, offset + PCI_DOE_WRITE,
+				       task->request_pl[i]);
+
+	val = PCI_DOE_CTRL_GO;
+	if (doe_mb->irq >= 0)
+		val |= PCI_DOE_CTRL_INT_EN;
+
+	pci_write_config_dword(pdev, offset + PCI_DOE_CTRL, val);
+	/* Request is sent - now wait for poll or IRQ */
+	return 0;
+}
+
+static int pci_doe_recv_resp(struct pci_doe_mb *doe_mb, struct pci_doe_task *task)
+{
+	struct pci_dev *pdev = doe_mb->pdev;
+	int offset = doe_mb->cap_offset;
+	size_t length;
+	u32 val;
+	int i;
+
+	/* Read the first dword to get the protocol */
+	pci_read_config_dword(pdev, offset + PCI_DOE_READ, &val);
+	if ((FIELD_GET(PCI_DOE_DATA_OBJECT_HEADER_1_VID, val) != task->prot.vid) ||
+	    (FIELD_GET(PCI_DOE_DATA_OBJECT_HEADER_1_TYPE, val) != task->prot.type)) {
+		pci_err(pdev,
+			"Expected [VID, Protocol] = [%#x, %#x], got [%#x, %#x]\n",
+			task->prot.vid, task->prot.type,
+			FIELD_GET(PCI_DOE_DATA_OBJECT_HEADER_1_VID, val),
+			FIELD_GET(PCI_DOE_DATA_OBJECT_HEADER_1_TYPE, val));
+		return -EIO;
+	}
+
+	pci_write_config_dword(pdev, offset + PCI_DOE_READ, 0);
+	/* Read the second dword to get the length */
+	pci_read_config_dword(pdev, offset + PCI_DOE_READ, &val);
+	pci_write_config_dword(pdev, offset + PCI_DOE_READ, 0);
+
+	length = FIELD_GET(PCI_DOE_DATA_OBJECT_HEADER_2_LENGTH, val);
+	if (length > SZ_1M || length < 2)
+		return -EIO;
+
+	/* First 2 dwords have already been read */
+	length -= 2;
+	/* Read the rest of the response payload */
+	for (i = 0; i < min(length, task->response_pl_sz / sizeof(u32)); i++) {
+		pci_read_config_dword(pdev, offset + PCI_DOE_READ,
+				      &task->response_pl[i]);
+		pci_write_config_dword(pdev, offset + PCI_DOE_READ, 0);
+	}
+
+	/* Flush excess length */
+	for (; i < length; i++) {
+		pci_read_config_dword(pdev, offset + PCI_DOE_READ, &val);
+		pci_write_config_dword(pdev, offset + PCI_DOE_READ, 0);
+	}
+	/* Final check to pick up any errors raised since Data Object Ready */
+	pci_read_config_dword(pdev, offset + PCI_DOE_STATUS, &val);
+	if (FIELD_GET(PCI_DOE_STATUS_ERROR, val))
+		return -EIO;
+
+	return min(length, task->response_pl_sz / sizeof(u32)) * sizeof(u32);
+}
+
+static void signal_task_complete(struct pci_doe_task *task, int rv)
+{
+	task->rv = rv;
+	task->complete(task);
+}
+
+static void retire_cur_task(struct pci_doe_mb *doe_mb)
+{
+	mutex_lock(&doe_mb->task_lock);
+	doe_mb->cur_task = NULL;
+	mutex_unlock(&doe_mb->task_lock);
+	wake_up_interruptible(&doe_mb->wq);
+}
+
+static void doe_statemachine_work(struct work_struct *work)
+{
+	struct delayed_work *w = to_delayed_work(work);
+	struct pci_doe_mb *doe_mb = container_of(w, struct pci_doe_mb,
+						 statemachine);
+	struct pci_dev *pdev = doe_mb->pdev;
+	int offset = doe_mb->cap_offset;
+	struct pci_doe_task *task;
+	u32 val;
+	int rc;
+
+	mutex_lock(&doe_mb->task_lock);
+	task = doe_mb->cur_task;
+	mutex_unlock(&doe_mb->task_lock);
+
+	if (test_and_clear_bit(PCI_DOE_FLAG_ABORT, &doe_mb->flags)) {
+		/*
+		 * Currently only used during init - care needed if
+		 * pci_doe_abort() is generally exposed as it would impact
+		 * queries in flight.
+		 */
+		if (task)
+			pci_err(pdev, "Aborting with active task!\n");
+		doe_mb->state = DOE_WAIT_ABORT;
+		pci_doe_abort_start(doe_mb);
+		return;
+	}
+
+	switch (doe_mb->state) {
+	case DOE_IDLE:
+		if (task == NULL)
+			return;
+
+		rc = pci_doe_send_req(doe_mb, task);
+		/*
+		 * The specification does not provide any guidance on how long
+		 * some other entity could keep the DOE busy, so try for 1
+		 * second then fail. Busy handling is best effort only, because
+		 * there is no way of avoiding racing against another user of
+		 * the DOE.
+		 */
+		if (rc == -EBUSY) {
+			doe_mb->busy_retries++;
+			if (doe_mb->busy_retries == PCI_DOE_BUSY_MAX_RETRIES) {
+				/* Long enough, fail this request */
+				pci_warn(pdev,
+					"DOE busy for too long (> 1 sec)\n");
+				doe_mb->busy_retries = 0;
+				goto err_busy;
+			}
+			schedule_delayed_work(w, HZ / PCI_DOE_BUSY_MAX_RETRIES);
+			return;
+		}
+		if (rc)
+			goto err_abort;
+		doe_mb->busy_retries = 0;
+
+		doe_mb->state = DOE_WAIT_RESP;
+		doe_mb->timeout_jiffies = jiffies + PCI_DOE_TIMEOUT;
+		/* Now poll or wait for IRQ with timeout */
+		if (doe_mb->irq >= 0)
+			schedule_delayed_work(w, PCI_DOE_TIMEOUT);
+		else
+			schedule_delayed_work(w, PCI_DOE_POLL_INTERVAL);
+		return;
+
+	case DOE_WAIT_RESP:
+		/* Not possible to get here with NULL task */
+		pci_read_config_dword(pdev, offset + PCI_DOE_STATUS, &val);
+		if (FIELD_GET(PCI_DOE_STATUS_ERROR, val)) {
+			rc = -EIO;
+			goto err_abort;
+		}
+
+		if (!FIELD_GET(PCI_DOE_STATUS_DATA_OBJECT_READY, val)) {
+			/* If not yet at timeout reschedule otherwise abort */
+			if (time_after(jiffies, doe_mb->timeout_jiffies)) {
+				rc = -ETIMEDOUT;
+				goto err_abort;
+			}
+			schedule_delayed_work(w, PCI_DOE_POLL_INTERVAL);
+			return;
+		}
+
+		rc = pci_doe_recv_resp(doe_mb, task);
+		if (rc < 0)
+			goto err_abort;
+
+		doe_mb->state = DOE_IDLE;
+
+		retire_cur_task(doe_mb);
+		/* Set the return value to the length of received payload */
+		signal_task_complete(task, rc);
+
+		return;
+
+	case DOE_WAIT_ABORT:
+	case DOE_WAIT_ABORT_ON_ERR:
+		pci_read_config_dword(pdev, offset + PCI_DOE_STATUS, &val);
+
+		if (!FIELD_GET(PCI_DOE_STATUS_ERROR, val) &&
+		    !FIELD_GET(PCI_DOE_STATUS_BUSY, val)) {
+			/* Back to normal state - carry on */
+			retire_cur_task(doe_mb);
+
+			/*
+			 * For deliberately triggered abort, someone is
+			 * waiting.
+			 */
+			if (doe_mb->state == DOE_WAIT_ABORT) {
+				if (task)
+					signal_task_complete(task, -EFAULT);
+				complete(&doe_mb->abort_c);
+			}
+
+			doe_mb->state = DOE_IDLE;
+			return;
+		}
+		if (time_after(jiffies, doe_mb->timeout_jiffies)) {
+			/* Task has timed out and is dead - abort */
+			pci_err(pdev, "DOE ABORT timed out\n");
+			set_bit(PCI_DOE_FLAG_DEAD, &doe_mb->flags);
+			retire_cur_task(doe_mb);
+
+			if (doe_mb->state == DOE_WAIT_ABORT) {
+				if (task)
+					signal_task_complete(task, -EFAULT);
+				complete(&doe_mb->abort_c);
+			}
+		}
+		return;
+	}
+
+err_abort:
+	doe_mb->state = DOE_WAIT_ABORT_ON_ERR;
+	pci_doe_abort_start(doe_mb);
+err_busy:
+	signal_task_complete(task, rc);
+	if (doe_mb->state == DOE_IDLE)
+		retire_cur_task(doe_mb);
+}
+
+static void pci_doe_task_complete(struct pci_doe_task *task)
+{
+	complete(task->private);
+}
+
+static int pci_doe_discovery(struct pci_doe_mb *doe_mb, u8 *index, u16 *vid,
+			     u8 *protocol)
+{
+	u32 request_pl = FIELD_PREP(PCI_DOE_DATA_OBJECT_DISC_REQ_3_INDEX,
+				    *index);
+	u32 response_pl;
+	DECLARE_COMPLETION_ONSTACK(c);
+	struct pci_doe_task task = {
+		.prot.vid = PCI_VENDOR_ID_PCI_SIG,
+		.prot.type = PCI_DOE_PROTOCOL_DISCOVERY,
+		.request_pl = &request_pl,
+		.request_pl_sz = sizeof(request_pl),
+		.response_pl = &response_pl,
+		.response_pl_sz = sizeof(response_pl),
+		.complete = pci_doe_task_complete,
+		.private = &c,
+	};
+	int ret;
+
+	ret = pci_doe_submit_task(doe_mb, &task);
+	if (ret < 0)
+		return ret;
+
+	wait_for_completion(&c);
+
+	if (task.rv != sizeof(response_pl))
+		return -EIO;
+
+	*vid = FIELD_GET(PCI_DOE_DATA_OBJECT_DISC_RSP_3_VID, response_pl);
+	*protocol = FIELD_GET(PCI_DOE_DATA_OBJECT_DISC_RSP_3_PROTOCOL,
+			      response_pl);
+	*index = FIELD_GET(PCI_DOE_DATA_OBJECT_DISC_RSP_3_NEXT_INDEX,
+			   response_pl);
+
+	return 0;
+}
+
+static int pci_doe_cache_protocols(struct pci_doe_mb *doe_mb)
+{
+	u8 index = 0;
+	int num_prots;
+	int rc;
+
+	/* Discovery protocol must always be supported and must report itself */
+	num_prots = 1;
+
+	doe_mb->prots = kcalloc(num_prots, sizeof(*doe_mb->prots), GFP_KERNEL);
+	if (!doe_mb->prots)
+		return -ENOMEM;
+
+	/*
+	 * NOTE: doe_mb->prots is freed automatically by pci_doe_free_mb() if
+	 * pci_doe_cache_protocols() fails past this point.
+	 */
+	do {
+		struct pci_doe_protocol *prot;
+
+		prot = &doe_mb->prots[num_prots - 1];
+		rc = pci_doe_discovery(doe_mb, &index, &prot->vid, &prot->type);
+		if (rc)
+			return rc;
+
+		if (index) {
+			struct pci_doe_protocol *prot_new;
+
+			num_prots++;
+			prot_new = krealloc(doe_mb->prots,
+					    sizeof(*doe_mb->prots) * num_prots,
+					    GFP_KERNEL);
+			if (!prot_new)
+				return -ENOMEM;
+
+			doe_mb->prots = prot_new;
+		}
+	} while (index);
+
+	doe_mb->num_prots = num_prots;
+	return 0;
+}
+
+static int pci_doe_abort(struct pci_doe_mb *doe_mb)
+{
+	reinit_completion(&doe_mb->abort_c);
+	set_bit(PCI_DOE_FLAG_ABORT, &doe_mb->flags);
+	schedule_delayed_work(&doe_mb->statemachine, 0);
+	wait_for_completion(&doe_mb->abort_c);
+
+	if (test_bit(PCI_DOE_FLAG_DEAD, &doe_mb->flags))
+		return -EIO;
+
+	return 0;
+}
+
+static int pci_doe_request_irq(struct pci_doe_mb *doe_mb)
+{
+	struct pci_dev *pdev = doe_mb->pdev;
+	int offset = doe_mb->cap_offset;
+	int doe_irq, rc;
+	u32 val;
+
+	pci_read_config_dword(pdev, offset + PCI_DOE_CAP, &val);
+
+	if (!FIELD_GET(PCI_DOE_CAP_INT, val))
+		return -EOPNOTSUPP;
+
+	doe_irq = FIELD_GET(PCI_DOE_CAP_IRQ, val);
+	rc = pci_request_irq(pdev, doe_irq, pci_doe_irq_handler,
+			     NULL, doe_mb,
+			     "DOE[%d:%s]", doe_irq, pci_name(pdev));
+	if (rc)
+		return rc;
+
+	doe_mb->irq = doe_irq;
+	pci_write_config_dword(pdev, offset + PCI_DOE_CTRL,
+			       PCI_DOE_CTRL_INT_EN);
+	return 0;
+}
+
+static void pci_doe_free_mb(struct pci_doe_mb *doe_mb)
+{
+	if (doe_mb->irq >= 0)
+		pci_free_irq(doe_mb->pdev, doe_mb->irq, doe_mb);
+	kfree(doe_mb->prots);
+	kfree(doe_mb);
+}
+
+/**
+ * pci_doe_create_mb() - Create a DOE mailbox object
+ *
+ * @pdev: PCI device to create the DOE mailbox for
+ * @cap_offset: Offset of the DOE mailbox
+ * @use_irq: Should the state machine use an irq
+ *
+ * Create a single mailbox object to manage the mailbox protocol at the
+ * cap_offset specified.
+ *
+ * Caller should allocate PCI irq vectors before setting use_irq.
+ *
+ * RETURNS: created mailbox object on success
+ *	    ERR_PTR(-errno) on failure
+ */
+struct pci_doe_mb *pci_doe_create_mb(struct pci_dev *pdev, u16 cap_offset,
+				     bool use_irq)
+{
+	struct pci_doe_mb *doe_mb;
+	int rc;
+
+	doe_mb = kzalloc(sizeof(*doe_mb), GFP_KERNEL);
+	if (!doe_mb)
+		return ERR_PTR(-ENOMEM);
+
+	doe_mb->pdev = pdev;
+	init_completion(&doe_mb->abort_c);
+	doe_mb->irq = -1;
+	doe_mb->cap_offset = cap_offset;
+
+	init_waitqueue_head(&doe_mb->wq);
+	mutex_init(&doe_mb->task_lock);
+	INIT_DELAYED_WORK(&doe_mb->statemachine, doe_statemachine_work);
+	doe_mb->state = DOE_IDLE;
+
+	if (use_irq) {
+		rc = pci_doe_request_irq(doe_mb);
+		if (rc) {
+			/* Not fatal; fall back to polling the mailbox */
+			pci_err(pdev, "DOE request irq failed for mailbox @ %u : %d\n",
+				cap_offset, rc);
+		}
+	}
+
+	/* Reset the mailbox by issuing an abort */
+	rc = pci_doe_abort(doe_mb);
+	if (rc) {
+		pci_err(pdev, "DOE failed to reset the mailbox @ %u : %d\n",
+			cap_offset, rc);
+		pci_doe_free_mb(doe_mb);
+		return ERR_PTR(rc);
+	}
+
+	rc = pci_doe_cache_protocols(doe_mb);
+	if (rc) {
+		pci_err(pdev, "DOE failed to cache protocols for mailbox @ %u : %d\n",
+			cap_offset, rc);
+		pci_doe_free_mb(doe_mb);
+		return ERR_PTR(rc);
+	}
+
+	return doe_mb;
+}
+EXPORT_SYMBOL_GPL(pci_doe_create_mb);
+
+/**
+ * pci_doe_supports_prot() - Return whether the DOE instance supports the
+ *			     given protocol
+ * @doe_mb: DOE mailbox capability to query
+ * @vid: Protocol Vendor ID
+ * @type: protocol type
+ *
+ * RETURNS: True if the DOE device supports the protocol specified
+ */
+bool pci_doe_supports_prot(struct pci_doe_mb *doe_mb, u16 vid, u8 type)
+{
+	int i;
+
+	/* The discovery protocol must always be supported */
+	if (vid == PCI_VENDOR_ID_PCI_SIG && type == PCI_DOE_PROTOCOL_DISCOVERY)
+		return true;
+
+	for (i = 0; i < doe_mb->num_prots; i++)
+		if ((doe_mb->prots[i].vid == vid) &&
+		    (doe_mb->prots[i].type == type))
+			return true;
+
+	return false;
+}
+EXPORT_SYMBOL_GPL(pci_doe_supports_prot);
+
+/**
+ * pci_doe_submit_task() - Submit a task to be processed by the state machine
+ *
+ * @doe_mb: DOE mailbox capability to submit to
+ * @task: task to be queued
+ *
+ * Submit a DOE task (request/response) to the DOE mailbox to be processed.
+ * Returns upon queueing the task object.  If another task is already in
+ * flight, this function sleeps until the mailbox is free.
+ *
+ * task->complete will be called when the state machine is done processing this
+ * task.
+ *
+ * Excess response data beyond @task->response_pl_sz will be discarded.
+ *
+ * RETURNS: 0 when task has been successfully queued, -ERRNO on error
+ */
+int pci_doe_submit_task(struct pci_doe_mb *doe_mb, struct pci_doe_task *task)
+{
+	if (!pci_doe_supports_prot(doe_mb, task->prot.vid, task->prot.type))
+		return -EINVAL;
+
+	/* DOE requests must be a whole number of DW */
+	if (task->request_pl_sz % sizeof(u32))
+		return -EINVAL;
+
+again:
+	mutex_lock(&doe_mb->task_lock);
+	if (doe_mb->cur_task) {
+		mutex_unlock(&doe_mb->task_lock);
+		if (wait_event_interruptible(doe_mb->wq,
+					     doe_mb->cur_task == NULL))
+			return -ERESTARTSYS;
+		goto again;
+	}
+
+	if (test_bit(PCI_DOE_FLAG_DEAD, &doe_mb->flags)) {
+		mutex_unlock(&doe_mb->task_lock);
+		return -EIO;
+	}
+	doe_mb->cur_task = task;
+	mutex_unlock(&doe_mb->task_lock);
+	schedule_delayed_work(&doe_mb->statemachine, 0);
+
+	return 0;
+}
+EXPORT_SYMBOL_GPL(pci_doe_submit_task);
+
+/**
+ * pci_doe_destroy_mb() - Destroy a DOE mailbox object created with
+ * pci_doe_create_mb()
+ *
+ * @doe_mb: DOE mailbox capability structure to destroy
+ *
+ * The mailbox becomes invalid and should not be used after this call.
+ */
+void pci_doe_destroy_mb(struct pci_doe_mb *doe_mb)
+{
+	/* abort any work in progress */
+	pci_doe_abort(doe_mb);
+
+	/* halt the state machine */
+	cancel_delayed_work_sync(&doe_mb->statemachine);
+
+	pci_doe_free_mb(doe_mb);
+}
+EXPORT_SYMBOL_GPL(pci_doe_destroy_mb);
diff --git a/include/linux/pci-doe.h b/include/linux/pci-doe.h
new file mode 100644
index 000000000000..7e6ebaf9930a
--- /dev/null
+++ b/include/linux/pci-doe.h
@@ -0,0 +1,119 @@ 
+/* SPDX-License-Identifier: GPL-2.0 */
+/*
+ * Data Object Exchange
+ * https://members.pcisig.com/wg/PCI-SIG/document/16609
+ *
+ * Copyright (C) 2021 Huawei
+ *     Jonathan Cameron <Jonathan.Cameron@huawei.com>
+ *
+ * Copyright (C) 2022 Intel Corporation
+ *	Ira Weiny <ira.weiny@intel.com>
+ */
+
+#ifndef LINUX_PCI_DOE_H
+#define LINUX_PCI_DOE_H
+
+#include <linux/completion.h>
+
+enum pci_doe_state {
+	DOE_IDLE,
+	DOE_WAIT_RESP,
+	DOE_WAIT_ABORT,
+	DOE_WAIT_ABORT_ON_ERR,
+};
+
+#define PCI_DOE_FLAG_ABORT	0
+#define PCI_DOE_FLAG_DEAD	1
+
+struct pci_doe_protocol {
+	u16 vid;
+	u8 type;
+};
+
+/**
+ * struct pci_doe_task - represents a single query/response
+ *
+ * @prot: DOE Protocol
+ * @request_pl: The request payload
+ * @request_pl_sz: Size of the request payload
+ * @response_pl: The response payload
+ * @response_pl_sz: Size of the response payload
+ * @rv: Return value.  Length of received response or error
+ * @complete: Called when task is complete
+ * @private: Private data for the consumer
+ */
+struct pci_doe_task {
+	struct pci_doe_protocol prot;
+	u32 *request_pl;
+	size_t request_pl_sz;
+	u32 *response_pl;
+	size_t response_pl_sz;
+	int rv;
+	void (*complete)(struct pci_doe_task *task);
+	void *private;
+};
+
+/**
+ * struct pci_doe_mb - State for a single DOE mailbox
+ *
+ * This state is used to manage a single DOE mailbox capability.  All fields
+ * should be considered opaque to the consumers and the structure passed into
+ * the helpers below after being created by pci_doe_create_mb().
+ *
+ * @pdev: PCI device this mailbox belongs to
+ * @abort_c: Completion used for initial abort handling
+ * @irq: Interrupt used for signaling DOE ready or abort
+ * @prots: Array of protocols supported on this DOE
+ * @num_prots: Size of prots array
+ * @cap_offset: Capability offset
+ * @wq: Wait queue to wait on if a query is in progress
+ * @cur_task: Current task the state machine is working on
+ * @task_lock: Protect cur_task
+ * @statemachine: Work item for the DOE state machine
+ * @state: Current state of this DOE
+ * @timeout_jiffies: Deadline for the current operation, 1 second after GO is set
+ * @busy_retries: Count of retry attempts
+ * @flags: Bit array of PCI_DOE_FLAG_* flags
+ *
+ * Note: prots can't be allocated with struct size because the number of
+ * protocols is not known until after this structure is in use.  However, the
+ * single discovery protocol is always required to query for the number of
+ * protocols.
+ */
+struct pci_doe_mb {
+	struct pci_dev *pdev;
+	struct completion abort_c;
+	int irq;
+	struct pci_doe_protocol *prots;
+	int num_prots;
+	u16 cap_offset;
+
+	wait_queue_head_t wq;
+	struct pci_doe_task *cur_task;
+	struct mutex task_lock;
+	struct delayed_work statemachine;
+	enum pci_doe_state state;
+	unsigned long timeout_jiffies;
+	unsigned int busy_retries;
+	unsigned long flags;
+};
+
+/**
+ * pci_doe_for_each_off - Iterate each DOE capability
+ * @pdev: struct pci_dev to iterate
+ * @off: u16 of config space offset of each mailbox capability found
+ */
+#define pci_doe_for_each_off(pdev, off) \
+	for (off = pci_find_next_ext_capability(pdev, off, \
+					PCI_EXT_CAP_ID_DOE); \
+		off > 0; \
+		off = pci_find_next_ext_capability(pdev, off, \
+					PCI_EXT_CAP_ID_DOE))
+
+struct pci_doe_mb *pci_doe_create_mb(struct pci_dev *pdev, u16 cap_offset,
+				     bool use_irq);
+void pci_doe_destroy_mb(struct pci_doe_mb *doe_mb);
+bool pci_doe_supports_prot(struct pci_doe_mb *doe_mb, u16 vid, u8 type);
+int pci_doe_submit_task(struct pci_doe_mb *doe_mb, struct pci_doe_task *task);
+
+#endif
diff --git a/include/uapi/linux/pci_regs.h b/include/uapi/linux/pci_regs.h
index bee1a9ed6e66..4e96b45ee36d 100644
--- a/include/uapi/linux/pci_regs.h
+++ b/include/uapi/linux/pci_regs.h
@@ -736,7 +736,8 @@ 
 #define PCI_EXT_CAP_ID_DVSEC	0x23	/* Designated Vendor-Specific */
 #define PCI_EXT_CAP_ID_DLF	0x25	/* Data Link Feature */
 #define PCI_EXT_CAP_ID_PL_16GT	0x26	/* Physical Layer 16.0 GT/s */
-#define PCI_EXT_CAP_ID_MAX	PCI_EXT_CAP_ID_PL_16GT
+#define PCI_EXT_CAP_ID_DOE	0x2E	/* Data Object Exchange */
+#define PCI_EXT_CAP_ID_MAX	PCI_EXT_CAP_ID_DOE
 
 #define PCI_EXT_CAP_DSN_SIZEOF	12
 #define PCI_EXT_CAP_MCAST_ENDPOINT_SIZEOF 40
@@ -1102,4 +1103,30 @@ 
 #define  PCI_PL_16GT_LE_CTRL_USP_TX_PRESET_MASK		0x000000F0
 #define  PCI_PL_16GT_LE_CTRL_USP_TX_PRESET_SHIFT	4
 
+/* Data Object Exchange */
+#define PCI_DOE_CAP		0x04    /* DOE Capabilities Register */
+#define  PCI_DOE_CAP_INT			0x00000001  /* Interrupt Support */
+#define  PCI_DOE_CAP_IRQ			0x00000ffe  /* Interrupt Message Number */
+#define PCI_DOE_CTRL		0x08    /* DOE Control Register */
+#define  PCI_DOE_CTRL_ABORT			0x00000001  /* DOE Abort */
+#define  PCI_DOE_CTRL_INT_EN			0x00000002  /* DOE Interrupt Enable */
+#define  PCI_DOE_CTRL_GO			0x80000000  /* DOE Go */
+#define PCI_DOE_STATUS		0x0c    /* DOE Status Register */
+#define  PCI_DOE_STATUS_BUSY			0x00000001  /* DOE Busy */
+#define  PCI_DOE_STATUS_INT_STATUS		0x00000002  /* DOE Interrupt Status */
+#define  PCI_DOE_STATUS_ERROR			0x00000004  /* DOE Error */
+#define  PCI_DOE_STATUS_DATA_OBJECT_READY	0x80000000  /* Data Object Ready */
+#define PCI_DOE_WRITE		0x10    /* DOE Write Data Mailbox Register */
+#define PCI_DOE_READ		0x14    /* DOE Read Data Mailbox Register */
+
+/* DOE Data Object - note not actually registers */
+#define PCI_DOE_DATA_OBJECT_HEADER_1_VID		0x0000ffff
+#define PCI_DOE_DATA_OBJECT_HEADER_1_TYPE		0x00ff0000
+#define PCI_DOE_DATA_OBJECT_HEADER_2_LENGTH		0x0003ffff
+
+#define PCI_DOE_DATA_OBJECT_DISC_REQ_3_INDEX		0x000000ff
+#define PCI_DOE_DATA_OBJECT_DISC_RSP_3_VID		0x0000ffff
+#define PCI_DOE_DATA_OBJECT_DISC_RSP_3_PROTOCOL		0x00ff0000
+#define PCI_DOE_DATA_OBJECT_DISC_RSP_3_NEXT_INDEX	0xff000000
+
 #endif /* LINUX_PCI_REGS_H */