Message ID | 20220628041527.742333-4-ira.weiny@intel.com (mailing list archive) |
---|---|
State | Superseded |
Headers | show |
Series | CXL: Read CDAT and DSMAS data | expand |
On Mon, 27 Jun 2022 21:15:21 -0700 ira.weiny@intel.com wrote: > From: Jonathan Cameron <Jonathan.Cameron@huawei.com> > > Introduced in a PCIe r6.0, sec 6.30, DOE provides a config space based > mailbox with standard protocol discovery. Each mailbox is accessed > through a DOE Extended Capability. > > Each DOE mailbox must support the DOE discovery protocol in addition to > any number of additional protocols. > > Define core PCIe functionality to manage a single PCIe DOE mailbox at a > defined config space offset. Functionality includes iterating, > creating, query of supported protocol, and task submission. Destruction > of the mailboxes is device managed. > > If interrupts are desired, the interrupt number can be queried and > passed to the create function. Passing a negative value disables > interrupts for that mailbox. It is the caller's responsibility to ensure > enough interrupt vectors are allocated. > > Cc: "Li, Ming" <ming4.li@intel.com> > Cc: Bjorn Helgaas <helgaas@kernel.org> > Signed-off-by: Jonathan Cameron <Jonathan.Cameron@huawei.com> > Co-developed-by: Ira Weiny <ira.weiny@intel.com> > Signed-off-by: Ira Weiny <ira.weiny@intel.com> > Hi Ira, Thanks for keeping at this! I think this has reintroduced some of the races around that annoying interrupt source form BUSY transitioning to low that has no explicit 'cause' flag. I think we'd hammered all those out in the previous version but maybe there were still some there... I 'think' it will work as is, but depending on the timing a given DOE implementation has, the interrupt may be completely pointless as it will be signaling the wrong event. Jonathan > diff --git a/drivers/pci/Makefile b/drivers/pci/Makefile > index 0da6b1ebc694..2680e4c92f0a 100644 > --- a/drivers/pci/Makefile > +++ b/drivers/pci/Makefile > @@ -31,6 +31,7 @@ obj-$(CONFIG_PCI_ECAM) += ecam.o > obj-$(CONFIG_PCI_P2PDMA) += p2pdma.o > obj-$(CONFIG_XEN_PCIDEV_FRONTEND) += xen-pcifront.o > obj-$(CONFIG_VGA_ARB) += vgaarb.o > +obj-$(CONFIG_PCI_DOE) += doe.o > > # Endpoint library must be initialized before its users > obj-$(CONFIG_PCI_ENDPOINT) += endpoint/ > diff --git a/drivers/pci/doe.c b/drivers/pci/doe.c > new file mode 100644 > index 000000000000..4a7a1e988124 > --- /dev/null > +++ b/drivers/pci/doe.c > +/** > + * struct pci_doe_mb - State for a single DOE mailbox > + * > + * This state is used to manage a single DOE mailbox capability. All fields > + * should be considered opaque to the consumers and the structure passed into > + * the helpers below after being created by devm_pci_doe_create() > + * > + * @pdev: PCI device this mailbox belongs to > + * @cap_offset: Capability offset > + * @int_msg_num: DOE Interrupt Message Number; negative if irqs are not used > + * @prots: Array of protocols supported (encoded as long values) > + * @wq: Wait queue for work items awaiting irq/abort > + * @work_queue: Queue of pci_doe_work items > + * @flags: Bit array of PCI_DOE_FLAG_* flags > + * > + * Note: @prots can't be allocated with struct size because the number of > + * protocols is not known until after this structure is in use. However, the > + * single discovery protocol is always required to query for the number of > + * protocols. > + */ > +struct pci_doe_mb { > + struct pci_dev *pdev; > + u16 cap_offset; > + int int_msg_num; > + struct xarray prots; > + > + wait_queue_head_t wq; > + struct workqueue_struct *work_queue; > + unsigned long flags; > +}; > + > +static irqreturn_t pci_doe_irq_handler(int irq, void *data) > +{ > + struct pci_doe_mb *doe_mb = data; > + struct pci_dev *pdev = doe_mb->pdev; > + int offset = doe_mb->cap_offset; > + u32 val; > + > + pci_read_config_dword(pdev, offset + PCI_DOE_STATUS, &val); > + if (FIELD_GET(PCI_DOE_STATUS_INT_STATUS, val)) { > + pci_write_config_dword(pdev, offset + PCI_DOE_STATUS, > + PCI_DOE_STATUS_INT_STATUS); > + set_bit(PCI_DOE_FLAG_IRQ, &doe_mb->flags); > + wake_up(&doe_mb->wq); > + return IRQ_HANDLED; > + } > + > + return IRQ_NONE; > +} > + > +/* > + * Returned from the wait functions to indicate that an abort has been issued > + */ > +#define DOE_WAIT_ABORT 1 > + > +static int pci_doe_arm_wait(struct pci_doe_mb *doe_mb) Feels like there should be a naming to convey the return value as a boolean rather than pushing through a flag value. > +{ > + if (test_and_clear_bit(PCI_DOE_FLAG_ABORT, &doe_mb->flags)) > + return DOE_WAIT_ABORT; > + clear_bit(PCI_DOE_FLAG_IRQ, &doe_mb->flags); > + return 0; > +} > + > +static void pci_doe_write_ctrl(struct pci_doe_mb *doe_mb, u32 val) > +{ > + struct pci_dev *pdev = doe_mb->pdev; > + int offset = doe_mb->cap_offset; > + > + if (pci_doe_irq_enabled(doe_mb)) > + val |= PCI_DOE_CTRL_INT_EN; > + pci_write_config_dword(pdev, offset + PCI_DOE_CTRL, val); > +} > + > +static int pci_doe_issue_abort(struct pci_doe_mb *doe_mb) Can we rename this as it does more than simply issue the abort, it waits for it to finish > +{ > + struct pci_dev *pdev = doe_mb->pdev; > + int offset = doe_mb->cap_offset; > + unsigned long timeout_jiffies; > + > + pci_dbg(pdev, "[%x] Issuing Abort\n", offset); > + > + /* > + * Abort detected while aborting; something is really broken or the > + * mailbox is being destroyed. > + */ > + if (pci_doe_arm_wait(doe_mb)) > + return -EIO; > + > + timeout_jiffies = jiffies + PCI_DOE_TIMEOUT; > + pci_doe_write_ctrl(doe_mb, PCI_DOE_CTRL_ABORT); > + > + do { > + u32 val; > + > + /* > + * Abort detected while aborting; something is really broken or > + * the mailbox is being destroyed. > + */ > + if (pci_doe_wait_irq_or_poll(doe_mb)) > + return -EIO; > + pci_read_config_dword(pdev, offset + PCI_DOE_STATUS, &val); > + > + /* Abort success! */ > + if (!FIELD_GET(PCI_DOE_STATUS_ERROR, val) && > + !FIELD_GET(PCI_DOE_STATUS_BUSY, val)) > + return 0; > + > + } while (!time_after(jiffies, timeout_jiffies)); > + > + /* Abort has timed out and the MB is dead */ > + pci_err(pdev, "[%x] ABORT timed out\n", offset); Does this print mention it's a DOE somewhere? > + return -EIO; > +} > + ... > +static int pci_doe_recv_resp(struct pci_doe_mb *doe_mb, struct pci_doe_task *task) > +{ > + struct pci_dev *pdev = doe_mb->pdev; > + int offset = doe_mb->cap_offset; > + size_t length, payload_length; > + u32 val; > + int i; > + > + /* Read the first dword to get the protocol */ > + pci_read_config_dword(pdev, offset + PCI_DOE_READ, &val); > + if ((FIELD_GET(PCI_DOE_DATA_OBJECT_HEADER_1_VID, val) != task->prot.vid) || > + (FIELD_GET(PCI_DOE_DATA_OBJECT_HEADER_1_TYPE, val) != task->prot.type)) { > + pci_err(pdev, > + "[%x] expected [VID, Protocol] = [%04x, %02x], got [%04x, %02x]\n", > + doe_mb->cap_offset, > + task->prot.vid, task->prot.type, > + FIELD_GET(PCI_DOE_DATA_OBJECT_HEADER_1_VID, val), > + FIELD_GET(PCI_DOE_DATA_OBJECT_HEADER_1_TYPE, val)); > + return -EIO; > + } > + > + pci_write_config_dword(pdev, offset + PCI_DOE_READ, 0); > + /* Read the second dword to get the length */ > + pci_read_config_dword(pdev, offset + PCI_DOE_READ, &val); > + pci_write_config_dword(pdev, offset + PCI_DOE_READ, 0); > + > + length = FIELD_GET(PCI_DOE_DATA_OBJECT_HEADER_2_LENGTH, val); > + if (length > SZ_1M || length < 2) > + return -EIO; > + > + /* First 2 dwords have already been read */ > + length -= 2; > + payload_length = min(length, task->response_pl_sz / sizeof(u32)); > + /* Read the rest of the response payload */ > + for (i = 0; i < payload_length; i++) { > + pci_read_config_dword(pdev, offset + PCI_DOE_READ, > + &task->response_pl[i]); > + /* Prior to the last ack, ensure Data Object Ready */ > + if (i == (payload_length-1) && !pci_doe_data_obj_ready(doe_mb)) spaces around - > + return -EIO; > + pci_write_config_dword(pdev, offset + PCI_DOE_READ, 0); > + } > + > + /* Flush excess length */ > + for (; i < length; i++) { > + pci_read_config_dword(pdev, offset + PCI_DOE_READ, &val); > + pci_write_config_dword(pdev, offset + PCI_DOE_READ, 0); > + } > + > + /* Final error check to pick up on any since Data Object Ready */ > + pci_read_config_dword(pdev, offset + PCI_DOE_STATUS, &val); > + if (FIELD_GET(PCI_DOE_STATUS_ERROR, val)) > + return -EIO; > + > + return min(length, task->response_pl_sz / sizeof(u32)) * sizeof(u32); > +} > + > + > +static void doe_statemachine_work(struct work_struct *work) > +{ > + struct pci_doe_task *task = container_of(work, struct pci_doe_task, > + work); > + struct pci_doe_mb *doe_mb = task->doe_mb; > + struct pci_dev *pdev = doe_mb->pdev; > + int offset = doe_mb->cap_offset; > + unsigned int busy_retries = 0; > + unsigned long timeout_jiffies; > + u32 val; > + int rc; > + > + if (test_bit(PCI_DOE_FLAG_DEAD, &doe_mb->flags)) { > + signal_task_complete(task, -EIO); > + return; > + } > + > + /* Send request */ > +retry_req: > + if (pci_doe_arm_wait(doe_mb)) { > + signal_task_abort(task, -EIO); > + return; > + } Is there a race here? If Busy drops at this point we queue up a message, but IRQ bit is already set. Hence when we hit wait_event_timeout() the flag is already set and IIRC we'll drop straight through. It'll probably be fine because it will end up polling below but doesn't look ideal. Upshot is that you sort of have to handle "spurious interrupts" cleanly and rewait on the interrupt if you get one whilst also handling race conditions around RW1C of the interrupt status flag. > + > + rc = pci_doe_send_req(doe_mb, task); > + > + /* > + * The specification does not provide any guidance on how long > + * some other entity could keep the DOE busy, so try for 1 > + * second then fail. Busy handling is best effort only, because > + * there is no way of avoiding racing against another user of > + * the DOE. > + */ > + if (rc == -EBUSY) { > + busy_retries++; > + if (busy_retries == PCI_DOE_BUSY_MAX_RETRIES) { > + pci_warn(pdev, > + "[%x] busy for too long (> 1 sec)\n", > + offset); > + signal_task_complete(task, rc); > + return; > + } > + if (pci_doe_wait_poll(doe_mb, HZ / PCI_DOE_BUSY_MAX_RETRIES)) { > + signal_task_abort(task, rc); > + return; > + } > + goto retry_req; > + } else if (rc) { > + signal_task_abort(task, rc); > + return; > + } > + > + timeout_jiffies = jiffies + HZ; > + if (pci_doe_wait_irq_or_poll(doe_mb)) { So this may well be passed as a result of a BUSY transition to 0 very soon after the doe_send_req but well before the data is ready.... > + signal_task_abort(task, -EIO); > + return; > + } > + > + /* Poll for response */ > +retry_resp: > + if (pci_doe_arm_wait(doe_mb)) { I think we can get here between Busy drop and Object Ready which means this can get another IRQ_FLAG setting just after it. Does it matter? Don't think so, as we don't use that bit again in this run through and it will be cleared at beginning of next one, but if so why is this call here? I think it's only useful for detecting an abort, if so do that explicitly. > + signal_task_abort(task, -EIO); > + return; > + } > + > + pci_read_config_dword(pdev, offset + PCI_DOE_STATUS, &val); > + if (FIELD_GET(PCI_DOE_STATUS_ERROR, val)) { > + signal_task_abort(task, -EIO); > + return; > + } > + > + if (!FIELD_GET(PCI_DOE_STATUS_DATA_OBJECT_READY, val)) { > + if (time_after(jiffies, timeout_jiffies)) { > + signal_task_abort(task, -ETIMEDOUT); > + return; > + } > + if (pci_doe_wait_poll(doe_mb, PCI_DOE_POLL_INTERVAL)) { Following on from above.... As a result of the interrupt having fired on the BUSY off transition, I think we will almost always end up spinning here until Object Ready is set. Fine, but seems a shame we don't let an interrupt do this for us in most cases. (note in QEMU response is instantaneous so when the interrupt for Busy drop is set, object ready also true so by the time we get here data ready will already be sent). > + signal_task_abort(task, -EIO); > + return; > + } > + goto retry_resp; > + } > + > + rc = pci_doe_recv_resp(doe_mb, task); > + if (rc < 0) { > + signal_task_abort(task, rc); > + return; > + } > + > + signal_task_complete(task, rc); > +} > + > +static void pci_doe_free_mb(struct pci_doe_mb *doe_mb) > +{ > + if (doe_mb->work_queue) I'm not a great fan of free functions that check a bunch of conditions because they may be called before things are set up. To my mind that generally means we should be calling individual cleanup in the appropriate error handlers. Either that or just use devm handling for each item. Sure it's a few more lines of code, but I find it a lot easier to go Oh look that thing we just set up is cleaned up by this. > + destroy_workqueue(doe_mb->work_queue); > + if (pci_doe_irq_enabled(doe_mb)) > + pci_free_irq(doe_mb->pdev, doe_mb->int_msg_num, doe_mb); > + xa_destroy(&doe_mb->prots); > + kfree(doe_mb); > +} > + ... > + > +static void pci_doe_destroy_mb(void *mb) > +{ > + struct pci_doe_mb *doe_mb = mb; > + > + /* Mark going down */ > + set_bit(PCI_DOE_FLAG_DEAD, &doe_mb->flags); > + > + /* Abort any in progress work items */ > + pci_doe_abort(doe_mb); Abort is getting used for two things in here. Perhaps rename this one to pci_doe_abort_tasks() or something like that? > + > + /* Flush remaining work items */ > + flush_workqueue(doe_mb->work_queue); > + > + pci_doe_free_mb(doe_mb); > +} > + > +/** > + * pcim_doe_create_mb() - Create a DOE mailbox object > + * > + * @pdev: PCI device to create the DOE mailbox for > + * @cap_offset: Offset of the DOE mailbox > + * @int_msg_num: Interrupt message number to use; a negative value means don't > + * use interrupts > + * > + * Create a single mailbox object to manage the mailbox protocol at the > + * cap_offset specified. > + * > + * Caller should allocate PCI IRQ vectors before passing a possitive value for positive > + * int_msg_num. > + * > + * RETURNS: created mailbox object on success > + * ERR_PTR(-errno) on failure > + */ > +struct pci_doe_mb *pcim_doe_create_mb(struct pci_dev *pdev, u16 cap_offset, > + int int_msg_num) > +{ > + struct pci_doe_mb *doe_mb; > + int rc; > + > + doe_mb = kzalloc(sizeof(*doe_mb), GFP_KERNEL); > + if (!doe_mb) > + return ERR_PTR(-ENOMEM); > + > + doe_mb->pdev = pdev; > + doe_mb->int_msg_num = -1; > + doe_mb->cap_offset = cap_offset; > + > + xa_init(&doe_mb->prots); > + init_waitqueue_head(&doe_mb->wq); > + > + if (int_msg_num >= 0) { > + rc = pci_doe_enable_irq(doe_mb, int_msg_num); > + if (rc) > + pci_err(pdev, > + "[%x] enable requested IRQ (%d) failed : %d\n", > + doe_mb->cap_offset, int_msg_num, rc); If we are printing an error, I'd argue we should not continue. Or at very least we should a comment here to say why we should do so... > + } > + > + doe_mb->work_queue = alloc_ordered_workqueue("DOE: [%x]", 0, > + doe_mb->cap_offset); > + if (!doe_mb->work_queue) { > + pci_err(pdev, "[%x] failed to allocate work queue\n", > + doe_mb->cap_offset); > + pci_doe_free_mb(doe_mb); As above, I'd rather this explicitly freed what has been set up and only that rather than calling a free function that does a bunch of stuff conditionally. > + return ERR_PTR(-ENOMEM); > + } > + > + /* Reset the mailbox by issuing an abort */ > + rc = pci_doe_issue_abort(doe_mb); > + if (rc) { > + pci_err(pdev, "[%x] failed to reset : %d\n", > + doe_mb->cap_offset, rc); > + pci_doe_free_mb(doe_mb); > + return ERR_PTR(rc); > + } > + > + if (devm_add_action_or_reset(&pdev->dev, pci_doe_destroy_mb, doe_mb)) > + return ERR_PTR(-EIO); > + > + rc = pci_doe_cache_protocols(doe_mb); > + if (rc) { > + pci_err(pdev, "[%x] failed to cache protocols : %d\n", > + doe_mb->cap_offset, rc); > + return ERR_PTR(rc); > + } > + > + return doe_mb; > +} > +EXPORT_SYMBOL_GPL(pcim_doe_create_mb); > +
On Mon, 27 Jun 2022 21:15:21 -0700 ira.weiny@intel.com wrote: > From: Jonathan Cameron <Jonathan.Cameron@huawei.com> > > Introduced in a PCIe r6.0, sec 6.30, DOE provides a config space based > mailbox with standard protocol discovery. Each mailbox is accessed > through a DOE Extended Capability. > > Each DOE mailbox must support the DOE discovery protocol in addition to > any number of additional protocols. > > Define core PCIe functionality to manage a single PCIe DOE mailbox at a > defined config space offset. Functionality includes iterating, > creating, query of supported protocol, and task submission. Destruction > of the mailboxes is device managed. > > If interrupts are desired, the interrupt number can be queried and > passed to the create function. Passing a negative value disables > interrupts for that mailbox. It is the caller's responsibility to ensure > enough interrupt vectors are allocated. > > Cc: "Li, Ming" <ming4.li@intel.com> > Cc: Bjorn Helgaas <helgaas@kernel.org> > Signed-off-by: Jonathan Cameron <Jonathan.Cameron@huawei.com> > Co-developed-by: Ira Weiny <ira.weiny@intel.com> > Signed-off-by: Ira Weiny <ira.weiny@intel.com> +static void *pci_doe_xa_entry(u16 vid, u8 prot) +{ + return (void *)(((unsigned long)vid << 16) | prot); +} ... > +static int pci_doe_cache_protocols(struct pci_doe_mb *doe_mb) > +{ > + u8 index = 0; > + u8 xa_idx = 0; > + > + do { > + int rc; > + u16 vid; > + u8 prot; > + > + rc = pci_doe_discovery(doe_mb, &index, &vid, &prot); > + if (rc) > + return rc; > + > + pci_dbg(doe_mb->pdev, > + "[%x] Found protocol %d vid: %x prot: %x\n", > + doe_mb->cap_offset, xa_idx, vid, prot); > + > + rc = xa_insert(&doe_mb->prots, xa_idx++, > + pci_doe_xa_entry(vid, prot), GFP_KERNEL); I'm not that familiar with xarray, but the docs suggest that you have to use xa_mk_value() to store an integer directly into it. > + if (rc) > + return -ENOMEM; > + } while (index); > + > + return 0; > +} > +
On Tue, Jun 28, 2022 at 03:16:26PM +0100, Jonathan Cameron wrote: > On Mon, 27 Jun 2022 21:15:21 -0700 > ira.weiny@intel.com wrote: > > From: Jonathan Cameron <Jonathan.Cameron@huawei.com> > > + /* Abort has timed out and the MB is dead */ > > + pci_err(pdev, "[%x] ABORT timed out\n", offset); > > Does this print mention it's a DOE somewhere? It does because of this earlier mention: > > +#define dev_fmt(fmt) "DOE: " fmt
On Tue, Jun 28, 2022 at 03:38:48PM +0100, Jonathan Cameron wrote: > On Mon, 27 Jun 2022 21:15:21 -0700 > ira.weiny@intel.com wrote: > > > From: Jonathan Cameron <Jonathan.Cameron@huawei.com> > > > > Introduced in a PCIe r6.0, sec 6.30, DOE provides a config space based > > mailbox with standard protocol discovery. Each mailbox is accessed > > through a DOE Extended Capability. > > > > Each DOE mailbox must support the DOE discovery protocol in addition to > > any number of additional protocols. > > > > Define core PCIe functionality to manage a single PCIe DOE mailbox at a > > defined config space offset. Functionality includes iterating, > > creating, query of supported protocol, and task submission. Destruction > > of the mailboxes is device managed. > > > > If interrupts are desired, the interrupt number can be queried and > > passed to the create function. Passing a negative value disables > > interrupts for that mailbox. It is the caller's responsibility to ensure > > enough interrupt vectors are allocated. > > > > Cc: "Li, Ming" <ming4.li@intel.com> > > Cc: Bjorn Helgaas <helgaas@kernel.org> > > Signed-off-by: Jonathan Cameron <Jonathan.Cameron@huawei.com> > > Co-developed-by: Ira Weiny <ira.weiny@intel.com> > > Signed-off-by: Ira Weiny <ira.weiny@intel.com> > > +static void *pci_doe_xa_entry(u16 vid, u8 prot) > +{ > + return (void *)(((unsigned long)vid << 16) | prot); > +} > ... > > > +static int pci_doe_cache_protocols(struct pci_doe_mb *doe_mb) > > +{ > > + u8 index = 0; > > + u8 xa_idx = 0; > > + > > + do { > > + int rc; > > + u16 vid; > > + u8 prot; > > + > > + rc = pci_doe_discovery(doe_mb, &index, &vid, &prot); > > + if (rc) > > + return rc; > > + > > + pci_dbg(doe_mb->pdev, > > + "[%x] Found protocol %d vid: %x prot: %x\n", > > + doe_mb->cap_offset, xa_idx, vid, prot); > > + > > + rc = xa_insert(&doe_mb->prots, xa_idx++, > > + pci_doe_xa_entry(vid, prot), GFP_KERNEL); > > I'm not that familiar with xarray, but the docs suggest that you have > to use xa_mk_value() to store an integer directly into it. Indeed I missed that. Thanks. Ira > > > + if (rc) > > + return -ENOMEM; > > + } while (index); > > + > > + return 0; > > +} > > +
On Tue, Jun 28, 2022 at 03:16:26PM +0100, Jonathan Cameron wrote: > On Mon, 27 Jun 2022 21:15:21 -0700 > ira.weiny@intel.com wrote: > > > From: Jonathan Cameron <Jonathan.Cameron@huawei.com> > > > > Introduced in a PCIe r6.0, sec 6.30, DOE provides a config space based > > mailbox with standard protocol discovery. Each mailbox is accessed > > through a DOE Extended Capability. > > > > Each DOE mailbox must support the DOE discovery protocol in addition to > > any number of additional protocols. > > > > Define core PCIe functionality to manage a single PCIe DOE mailbox at a > > defined config space offset. Functionality includes iterating, > > creating, query of supported protocol, and task submission. Destruction > > of the mailboxes is device managed. > > > > If interrupts are desired, the interrupt number can be queried and > > passed to the create function. Passing a negative value disables > > interrupts for that mailbox. It is the caller's responsibility to ensure > > enough interrupt vectors are allocated. > > > > Cc: "Li, Ming" <ming4.li@intel.com> > > Cc: Bjorn Helgaas <helgaas@kernel.org> > > Signed-off-by: Jonathan Cameron <Jonathan.Cameron@huawei.com> > > Co-developed-by: Ira Weiny <ira.weiny@intel.com> > > Signed-off-by: Ira Weiny <ira.weiny@intel.com> > > > Hi Ira, > > Thanks for keeping at this! > > I think this has reintroduced some of the races around that annoying > interrupt source form BUSY transitioning to low that has > no explicit 'cause' flag. I think we'd hammered all those out in the > previous version but maybe there were still some there... Well I really tried hard not to introduce races which would be a problem. But I would not be surprised. > > I 'think' it will work as is, but depending on the timing a given DOE > implementation has, the interrupt may be completely pointless as it > will be signaling the wrong event. :-/ There is a chance that an IRQ comes in just after we timeout waiting for it. I think that has always been the case and the IRQ will effectively be missed I _think_. > > Jonathan > [snip] > > > +/* > > + * Returned from the wait functions to indicate that an abort has been issued > > + */ > > +#define DOE_WAIT_ABORT 1 > > + > > +static int pci_doe_arm_wait(struct pci_doe_mb *doe_mb) > > Feels like there should be a naming to convey the return value as > a boolean rather than pushing through a flag value. Something like? static bool pci_doe_arm_abort_seen(struct pci_doe_mb *doe_mb) { if (test_and_clear_bit(PCI_DOE_FLAG_ABORT, &doe_mb->flags)) return true; clear_bit(PCI_DOE_FLAG_IRQ, &doe_mb->flags); return false; } ... if (pci_doe_arm_abort_seen(mb)) ... Process abort ... ... > > > +{ > > + if (test_and_clear_bit(PCI_DOE_FLAG_ABORT, &doe_mb->flags)) > > + return DOE_WAIT_ABORT; > > + clear_bit(PCI_DOE_FLAG_IRQ, &doe_mb->flags); > > + return 0; > > +} > > + > > > +static void pci_doe_write_ctrl(struct pci_doe_mb *doe_mb, u32 val) > > +{ > > + struct pci_dev *pdev = doe_mb->pdev; > > + int offset = doe_mb->cap_offset; > > + > > + if (pci_doe_irq_enabled(doe_mb)) > > + val |= PCI_DOE_CTRL_INT_EN; > > + pci_write_config_dword(pdev, offset + PCI_DOE_CTRL, val); > > +} > > + > > +static int pci_doe_issue_abort(struct pci_doe_mb *doe_mb) > Can we rename this as it does more than simply issue the abort, > it waits for it to finish Sure. How about just pci_doe_abort()? I'm probably going to open code that call now. > > > +{ > > + struct pci_dev *pdev = doe_mb->pdev; > > + int offset = doe_mb->cap_offset; > > + unsigned long timeout_jiffies; > > + > > + pci_dbg(pdev, "[%x] Issuing Abort\n", offset); > > + > > + /* > > + * Abort detected while aborting; something is really broken or the > > + * mailbox is being destroyed. > > + */ > > + if (pci_doe_arm_wait(doe_mb)) > > + return -EIO; > > + > > + timeout_jiffies = jiffies + PCI_DOE_TIMEOUT; > > + pci_doe_write_ctrl(doe_mb, PCI_DOE_CTRL_ABORT); > > + > > + do { > > + u32 val; > > + > > + /* > > + * Abort detected while aborting; something is really broken or > > + * the mailbox is being destroyed. > > + */ > > + if (pci_doe_wait_irq_or_poll(doe_mb)) > > + return -EIO; > > + pci_read_config_dword(pdev, offset + PCI_DOE_STATUS, &val); > > + > > + /* Abort success! */ > > + if (!FIELD_GET(PCI_DOE_STATUS_ERROR, val) && > > + !FIELD_GET(PCI_DOE_STATUS_BUSY, val)) > > + return 0; > > + > > + } while (!time_after(jiffies, timeout_jiffies)); > > + > > + /* Abort has timed out and the MB is dead */ > > + pci_err(pdev, "[%x] ABORT timed out\n", offset); > > Does this print mention it's a DOE somewhere? Yep, per Bjorn's suggestion I removed all the 'DOE ' strings and use the dev_fmt specifier. [snip] > > + > > + /* First 2 dwords have already been read */ > > + length -= 2; > > + payload_length = min(length, task->response_pl_sz / sizeof(u32)); > > + /* Read the rest of the response payload */ > > + for (i = 0; i < payload_length; i++) { > > + pci_read_config_dword(pdev, offset + PCI_DOE_READ, > > + &task->response_pl[i]); > > + /* Prior to the last ack, ensure Data Object Ready */ > > + if (i == (payload_length-1) && !pci_doe_data_obj_ready(doe_mb)) > > spaces around - Done. > > > + return -EIO; > > + pci_write_config_dword(pdev, offset + PCI_DOE_READ, 0); > > + } > > + > > + /* Flush excess length */ > > + for (; i < length; i++) { > > + pci_read_config_dword(pdev, offset + PCI_DOE_READ, &val); > > + pci_write_config_dword(pdev, offset + PCI_DOE_READ, 0); > > + } > > + > > + /* Final error check to pick up on any since Data Object Ready */ > > + pci_read_config_dword(pdev, offset + PCI_DOE_STATUS, &val); > > + if (FIELD_GET(PCI_DOE_STATUS_ERROR, val)) > > + return -EIO; > > + > > + return min(length, task->response_pl_sz / sizeof(u32)) * sizeof(u32); > > +} > > + > > > + > > +static void doe_statemachine_work(struct work_struct *work) > > +{ > > + struct pci_doe_task *task = container_of(work, struct pci_doe_task, > > + work); > > + struct pci_doe_mb *doe_mb = task->doe_mb; > > + struct pci_dev *pdev = doe_mb->pdev; > > + int offset = doe_mb->cap_offset; > > + unsigned int busy_retries = 0; > > + unsigned long timeout_jiffies; > > + u32 val; > > + int rc; > > + > > + if (test_bit(PCI_DOE_FLAG_DEAD, &doe_mb->flags)) { > > + signal_task_complete(task, -EIO); > > + return; > > + } > > + > > + /* Send request */ > > +retry_req: > > + if (pci_doe_arm_wait(doe_mb)) { > > + signal_task_abort(task, -EIO); > > + return; > > + } > > Is there a race here? If Busy drops at this point we queue up > a message, but IRQ bit is already set. Hence when we hit > wait_event_timeout() the flag is already set and IIRC we'll > drop straight through. > I did not realize that the device would interrupt when Busy dropped? I was thinking that V11 did not respond to IRQ but indeed it did via setting the work item to run immediately... However, I only see this in the spec: 1) System firmware/software checks that the DOE Busy bit is Clear to ensure that the DOE instance is ready to receive a DOE request. > > It'll probably be fine because it will end up polling below > but doesn't look ideal. I agree it would not be ideal but I think if we are waiting for Busy to be cleared then the pci_doe_arm_wait() should be benign. > > Upshot is that you sort of have to handle "spurious interrupts" > cleanly and rewait on the interrupt if you get one whilst also handling > race conditions around RW1C of the interrupt status flag. Sorry I'm not sure what 'RW1C' means here? Anyway, spurious interrupts was something I was concerned about but I don't think there is anything we can do about an interrupt coming in when we are expecting one but the device did not really send one. AFAIK that is virtually impossible anyway. If we actually 'miss' one because we timed out before the device sent it then I think we are going to ignore the PCI_DOE_FLAG_IRQ flag on the next go around. Actually timeout is handled by the abort call and that IRQ will, depending on timing, cause a full PCI_DOE_TIMEOUT to expire. :-( That is indeed not ideal. However, by that time the error and busy flags should be clear and we can safely continue. Otherwise we are going to take the mailbox down. It may seem better to arm wait on each iteration through the abort loop. But this is not logically correct because the abort operation should trigger an IRQ. So there is always a race if we missed an IRQ because we timed out early. > > > > + > > + rc = pci_doe_send_req(doe_mb, task); > > + > > + /* > > + * The specification does not provide any guidance on how long > > + * some other entity could keep the DOE busy, so try for 1 > > + * second then fail. Busy handling is best effort only, because > > + * there is no way of avoiding racing against another user of > > + * the DOE. > > + */ > > + if (rc == -EBUSY) { > > + busy_retries++; > > + if (busy_retries == PCI_DOE_BUSY_MAX_RETRIES) { > > + pci_warn(pdev, > > + "[%x] busy for too long (> 1 sec)\n", > > + offset); > > + signal_task_complete(task, rc); > > + return; > > + } > > + if (pci_doe_wait_poll(doe_mb, HZ / PCI_DOE_BUSY_MAX_RETRIES)) { > > + signal_task_abort(task, rc); > > + return; > > + } > > + goto retry_req; > > + } else if (rc) { > > + signal_task_abort(task, rc); > > + return; > > + } > > + > > + timeout_jiffies = jiffies + HZ; > > + if (pci_doe_wait_irq_or_poll(doe_mb)) { > > So this may well be passed as a result of a BUSY transition to 0 very soon > after the doe_send_req but well before the data is ready.... I think the simple fix is to make the BUSY wait on an IRQ. Something like: 21:13:53 > git di diff --git a/drivers/pci/doe.c b/drivers/pci/doe.c index 12f9f8045eb7..afd326320798 100644 --- a/drivers/pci/doe.c +++ b/drivers/pci/doe.c @@ -352,7 +352,7 @@ static void doe_statemachine_work(struct work_struct *work) signal_task_complete(task, rc); return; } - if (pci_doe_wait_poll(doe_mb, HZ / PCI_DOE_BUSY_MAX_RETRIES)) { + if (pci_doe_wait_irq_or_poll(...)) { signal_task_abort(task, rc); return; } > > > > + signal_task_abort(task, -EIO); > > + return; > > + } > > + > > + /* Poll for response */ > > +retry_resp: > > + if (pci_doe_arm_wait(doe_mb)) { > I think we can get here between Busy drop and Object Ready which means > this can get another IRQ_FLAG setting just after it. Does it matter? > Don't think so, as we don't use that bit again in this run through > and it will be cleared at beginning of next one, Yea basically I agree. > but if so why is > this call here? Seemed like the right thing to do at the time... ;-) j/k > I think it's only useful for detecting an abort, if > so do that explicitly. Actually it is the right thing to do... However, the wait poll below also needs to be an IRQ or poll. I'm not sure how I missed that logic. > > > + signal_task_abort(task, -EIO); > > + return; > > + } > > + > > + pci_read_config_dword(pdev, offset + PCI_DOE_STATUS, &val); > > + if (FIELD_GET(PCI_DOE_STATUS_ERROR, val)) { > > + signal_task_abort(task, -EIO); > > + return; > > + } > > + > > + if (!FIELD_GET(PCI_DOE_STATUS_DATA_OBJECT_READY, val)) { > > + if (time_after(jiffies, timeout_jiffies)) { > > + signal_task_abort(task, -ETIMEDOUT); > > + return; > > + } > > + if (pci_doe_wait_poll(doe_mb, PCI_DOE_POLL_INTERVAL)) { > > Following on from above.... > As a result of the interrupt having fired on the BUSY off transition, > I think we will almost always end up spinning here until Object Ready > is set. Fine, but seems a shame we don't let an interrupt do this > for us in most cases. (note in QEMU response is instantaneous so > when the interrupt for Busy drop is set, object ready also true so > by the time we get here data ready will already be sent). This needs to be pci_doe_wait_irq_or_poll() as well and the arm wait above ensures we continue to look for that interrupt. I'm starting to see how I got confused. The timeout values all vary and mod_delayed_work() made it very easy for you to interrupt any of those. I tried to bundle the poll vs interrupt modes in pci_doe_wait_irq_or_poll() and got confused. :-( > > > + signal_task_abort(task, -EIO); > > + return; > > + } > > + goto retry_resp; > > + } > > + > > + rc = pci_doe_recv_resp(doe_mb, task); > > + if (rc < 0) { > > + signal_task_abort(task, rc); > > + return; > > + } > > + > > + signal_task_complete(task, rc); > > +} > > + > > > > +static void pci_doe_free_mb(struct pci_doe_mb *doe_mb) > > +{ > > + if (doe_mb->work_queue) > > I'm not a great fan of free functions that check a bunch of conditions > because they may be called before things are set up. I'll see what I can do. I do kind of like this but I think it gets muddled and I'm not dead set on either way. > To my > mind that generally means we should be calling individual cleanup > in the appropriate error handlers. > > Either that or just use devm handling for each item. Sure > it's a few more lines of code, but I find it a lot easier to go > > Oh look that thing we just set up is cleaned up by this. > > > > + destroy_workqueue(doe_mb->work_queue); > > + if (pci_doe_irq_enabled(doe_mb)) > > + pci_free_irq(doe_mb->pdev, doe_mb->int_msg_num, doe_mb); > > + xa_destroy(&doe_mb->prots); > > + kfree(doe_mb); > > +} > > + > > ... > > > + > > +static void pci_doe_destroy_mb(void *mb) > > +{ > > + struct pci_doe_mb *doe_mb = mb; > > + > > + /* Mark going down */ > > + set_bit(PCI_DOE_FLAG_DEAD, &doe_mb->flags); > > + > > + /* Abort any in progress work items */ > > + pci_doe_abort(doe_mb); > > Abort is getting used for two things in here. Perhaps > rename this one to > pci_doe_abort_tasks() or something like that? What do you mean two things? Oh I think I see. You mean abort the work item vs abort sent to the hardware? This no longer aborts all the tasks just the one which may be in progress. Because the work queue is ordered only one task may be in progress. I'll clean up the comment too. This sets the abort flag and wakes it up if it is sleeping. If not then the abort flag will be detected in the next arm. FWIW I think I may just remove this call and open code it here. > > > + > > + /* Flush remaining work items */ > > + flush_workqueue(doe_mb->work_queue); > > + > > + pci_doe_free_mb(doe_mb); > > +} > > + > > +/** > > + * pcim_doe_create_mb() - Create a DOE mailbox object > > + * > > + * @pdev: PCI device to create the DOE mailbox for > > + * @cap_offset: Offset of the DOE mailbox > > + * @int_msg_num: Interrupt message number to use; a negative value means don't > > + * use interrupts > > + * > > + * Create a single mailbox object to manage the mailbox protocol at the > > + * cap_offset specified. > > + * > > + * Caller should allocate PCI IRQ vectors before passing a possitive value for > > positive Thanks fixed. > > > + * int_msg_num. > > + * > > + * RETURNS: created mailbox object on success > > + * ERR_PTR(-errno) on failure > > + */ > > +struct pci_doe_mb *pcim_doe_create_mb(struct pci_dev *pdev, u16 cap_offset, > > + int int_msg_num) > > +{ > > + struct pci_doe_mb *doe_mb; > > + int rc; > > + > > + doe_mb = kzalloc(sizeof(*doe_mb), GFP_KERNEL); > > + if (!doe_mb) > > + return ERR_PTR(-ENOMEM); > > + > > + doe_mb->pdev = pdev; > > + doe_mb->int_msg_num = -1; > > + doe_mb->cap_offset = cap_offset; > > + > > + xa_init(&doe_mb->prots); > > + init_waitqueue_head(&doe_mb->wq); > > + > > + if (int_msg_num >= 0) { > > + rc = pci_doe_enable_irq(doe_mb, int_msg_num); > > + if (rc) > > + pci_err(pdev, > > + "[%x] enable requested IRQ (%d) failed : %d\n", > > + doe_mb->cap_offset, int_msg_num, rc); > > If we are printing an error, I'd argue we should not continue. > Or at very least we should a comment here to say why we should do so... > Not continue seems reasonable. > > > + } > > + > > + doe_mb->work_queue = alloc_ordered_workqueue("DOE: [%x]", 0, > > + doe_mb->cap_offset); > > + if (!doe_mb->work_queue) { > > + pci_err(pdev, "[%x] failed to allocate work queue\n", > > + doe_mb->cap_offset); > > + pci_doe_free_mb(doe_mb); > > As above, I'd rather this explicitly freed what has been set up > and only that rather than calling a free function that does a bunch of > stuff conditionally. I think I can make that work. This is the only conditional in free however, because the other conditional is the IRQ support which may not be set up. Thanks again for the in depth review! Ira > > > > + return ERR_PTR(-ENOMEM); > > + } > > + > > + /* Reset the mailbox by issuing an abort */ > > + rc = pci_doe_issue_abort(doe_mb); > > + if (rc) { > > + pci_err(pdev, "[%x] failed to reset : %d\n", > > + doe_mb->cap_offset, rc); > > + pci_doe_free_mb(doe_mb); > > + return ERR_PTR(rc); > > + } > > + > > + if (devm_add_action_or_reset(&pdev->dev, pci_doe_destroy_mb, doe_mb)) > > + return ERR_PTR(-EIO); > > + > > + rc = pci_doe_cache_protocols(doe_mb); > > + if (rc) { > > + pci_err(pdev, "[%x] failed to cache protocols : %d\n", > > + doe_mb->cap_offset, rc); > > + return ERR_PTR(rc); > > + } > > + > > + return doe_mb; > > +} > > +EXPORT_SYMBOL_GPL(pcim_doe_create_mb); > > + > > >
On Tue, 28 Jun 2022 11:20:32 -0700 Ira Weiny <ira.weiny@intel.com> wrote: > On Tue, Jun 28, 2022 at 03:16:26PM +0100, Jonathan Cameron wrote: > > On Mon, 27 Jun 2022 21:15:21 -0700 > > ira.weiny@intel.com wrote: > > > > > From: Jonathan Cameron <Jonathan.Cameron@huawei.com> > > > > > > Introduced in a PCIe r6.0, sec 6.30, DOE provides a config space based > > > mailbox with standard protocol discovery. Each mailbox is accessed > > > through a DOE Extended Capability. > > > > > > Each DOE mailbox must support the DOE discovery protocol in addition to > > > any number of additional protocols. > > > > > > Define core PCIe functionality to manage a single PCIe DOE mailbox at a > > > defined config space offset. Functionality includes iterating, > > > creating, query of supported protocol, and task submission. Destruction > > > of the mailboxes is device managed. > > > > > > If interrupts are desired, the interrupt number can be queried and > > > passed to the create function. Passing a negative value disables > > > interrupts for that mailbox. It is the caller's responsibility to ensure > > > enough interrupt vectors are allocated. > > > > > > Cc: "Li, Ming" <ming4.li@intel.com> > > > Cc: Bjorn Helgaas <helgaas@kernel.org> > > > Signed-off-by: Jonathan Cameron <Jonathan.Cameron@huawei.com> > > > Co-developed-by: Ira Weiny <ira.weiny@intel.com> > > > Signed-off-by: Ira Weiny <ira.weiny@intel.com> > > > > > Hi Ira, > > > > Thanks for keeping at this! > > > > I think this has reintroduced some of the races around that annoying > > interrupt source form BUSY transitioning to low that has > > no explicit 'cause' flag. I think we'd hammered all those out in the > > previous version but maybe there were still some there... > > Well I really tried hard not to introduce races which would be a problem. But > I would not be surprised. > > > > > I 'think' it will work as is, but depending on the timing a given DOE > > implementation has, the interrupt may be completely pointless as it > > will be signaling the wrong event. > > :-/ > > There is a chance that an IRQ comes in just after we timeout waiting for it. I > think that has always been the case and the IRQ will effectively be missed I > _think_. The timeout case I'm not that worried about as it means the device is out of spec, so whilst it might happen and we don't want to break in that case it should be uncommon enough that a perf disadvantage doesn't matter. ... > > > + > > > +static void doe_statemachine_work(struct work_struct *work) > > > +{ > > > + struct pci_doe_task *task = container_of(work, struct pci_doe_task, > > > + work); > > > + struct pci_doe_mb *doe_mb = task->doe_mb; > > > + struct pci_dev *pdev = doe_mb->pdev; > > > + int offset = doe_mb->cap_offset; > > > + unsigned int busy_retries = 0; > > > + unsigned long timeout_jiffies; > > > + u32 val; > > > + int rc; > > > + > > > + if (test_bit(PCI_DOE_FLAG_DEAD, &doe_mb->flags)) { > > > + signal_task_complete(task, -EIO); > > > + return; > > > + } > > > + > > > + /* Send request */ > > > +retry_req: > > > + if (pci_doe_arm_wait(doe_mb)) { > > > + signal_task_abort(task, -EIO); > > > + return; > > > + } > > > > Is there a race here? If Busy drops at this point we queue up > > a message, but IRQ bit is already set. Hence when we hit > > wait_event_timeout() the flag is already set and IIRC we'll > > drop straight through. > > > > I did not realize that the device would interrupt when Busy dropped? I was > thinking that V11 did not respond to IRQ but indeed it did via setting the work > item to run immediately... > > However, I only see this in the spec: > > 1) System firmware/software checks that the DOE Busy bit is Clear to ensure > that the DOE instance is ready to receive a DOE request. I missed this particular one originally and someone else pointed it out in review (can't remember who though). The busy drop is mentioned in the bit definition. It's in the capability definition. "DOE Interrupt Status - This bit must be Set when an interrupt is generated to indicate that the Data Object Ready bit or the DOE Error bit has been Set. or that the DOE Busy bit has been Cleared." (the formatting is in the release spec.. hohum) Anyhow, upshot is that the status can be set as a result of Busy Bit clearing. 6.30.3 Interrupt Geneeration: then says that interrupt is generate on a transition of the logical AND of 1. Vector unmasked 2. DOE interrupt Enable bit is 1 3. Value of the DOE interrupt Status bit is 1. So if interrupt status bit is set to 1 due to a Busy drop and we then clear it before Data Object Ready, we'll get 2 interrupts. There is another vague bit of language that sort of allows other uses of this interrupt for protocol specific stuff. Hopefully no one falls for that, but we should safely handle that case (perf drop as a result is fine though!) I can't remember where the exact language is, but I've had a few 'polite discussions' to persuade people using it that way would be a very bad idea... > > > > > It'll probably be fine because it will end up polling below > > but doesn't look ideal. > > I agree it would not be ideal but I think if we are waiting for Busy to be > cleared then the pci_doe_arm_wait() should be benign. I think in some of these paths we are waiting for Data Object Ready to be set, the busy drop is effectively acting as a spurious interrupt if we clear the status before the data object ready event which could be much later because of Busy can clear really quickly. > > > > > Upshot is that you sort of have to handle "spurious interrupts" > > cleanly and rewait on the interrupt if you get one whilst also handling > > race conditions around RW1C of the interrupt status flag. > > Sorry I'm not sure what 'RW1C' means here? Read / Write 1 to clear. In this case I meant reading it and then clearing it without looking at the other status bits. > > Anyway, spurious interrupts was something I was concerned about but I don't > think there is anything we can do about an interrupt coming in when we are > expecting one but the device did not really send one. AFAIK that is virtually > impossible anyway. In this case seeing 2 interrupts is highly likely. We see the Busy drop one and the interrupt handler clears the Interrupt Status Bit, then data object becomes ready and we go around again. > > If we actually 'miss' one because we timed out before the device sent it then I > think we are going to ignore the PCI_DOE_FLAG_IRQ flag on the next go around. > > Actually timeout is handled by the abort call and that IRQ will, depending on > timing, cause a full PCI_DOE_TIMEOUT to expire. :-( That is indeed not > ideal. However, by that time the error and busy flags should be clear and we > can safely continue. Otherwise we are going to take the mailbox down. > > It may seem better to arm wait on each iteration through the abort loop. But > this is not logically correct because the abort operation should trigger an > IRQ. So there is always a race if we missed an IRQ because we timed out early. I probably stuck that comment in the wrong place. The initial call to clear the flag before this should be fine (short of the 'spurious' case of people using the interrupt for protocol specific usage). > > > > > > > > + > > > + rc = pci_doe_send_req(doe_mb, task); > > > + > > > + /* > > > + * The specification does not provide any guidance on how long > > > + * some other entity could keep the DOE busy, so try for 1 > > > + * second then fail. Busy handling is best effort only, because > > > + * there is no way of avoiding racing against another user of > > > + * the DOE. > > > + */ > > > + if (rc == -EBUSY) { > > > + busy_retries++; > > > + if (busy_retries == PCI_DOE_BUSY_MAX_RETRIES) { > > > + pci_warn(pdev, > > > + "[%x] busy for too long (> 1 sec)\n", > > > + offset); > > > + signal_task_complete(task, rc); > > > + return; > > > + } > > > + if (pci_doe_wait_poll(doe_mb, HZ / PCI_DOE_BUSY_MAX_RETRIES)) { > > > + signal_task_abort(task, rc); > > > + return; > > > + } > > > + goto retry_req; > > > + } else if (rc) { > > > + signal_task_abort(task, rc); > > > + return; > > > + } > > > + > > > + timeout_jiffies = jiffies + HZ; > > > + if (pci_doe_wait_irq_or_poll(doe_mb)) { > > > > So this may well be passed as a result of a BUSY transition to 0 very soon > > after the doe_send_req but well before the data is ready.... > > I think the simple fix is to make the BUSY wait on an IRQ. Something like: > > 21:13:53 > git di > diff --git a/drivers/pci/doe.c b/drivers/pci/doe.c > index 12f9f8045eb7..afd326320798 100644 > --- a/drivers/pci/doe.c > +++ b/drivers/pci/doe.c > @@ -352,7 +352,7 @@ static void doe_statemachine_work(struct work_struct *work) > signal_task_complete(task, rc); > return; > } > - if (pci_doe_wait_poll(doe_mb, HZ / PCI_DOE_BUSY_MAX_RETRIES)) { > + if (pci_doe_wait_irq_or_poll(...)) { > signal_task_abort(task, rc); > return; This case (which I think is the -EBUSY from pci_doe_send_req() handling) isn't important because it's trying to paper over a weird condition. We don't normally expect to get here. I was concerned with the line just above my comment which may not act as a gate at all because it's tripped by the the Busy Drop, which may be well before the data object ready that we are actually waiting for. > } > > > > > > > > + signal_task_abort(task, -EIO); > > > + return; > > > + } > > > + > > > + /* Poll for response */ > > > +retry_resp: > > > + if (pci_doe_arm_wait(doe_mb)) { > > I think we can get here between Busy drop and Object Ready which means > > this can get another IRQ_FLAG setting just after it. Does it matter? > > Don't think so, as we don't use that bit again in this run through > > and it will be cleared at beginning of next one, > > Yea basically I agree. > > > but if so why is > > this call here? > > Seemed like the right thing to do at the time... ;-) j/k > > > I think it's only useful for detecting an abort, if > > so do that explicitly. > > Actually it is the right thing to do... However, the wait poll below also > needs to be an IRQ or poll. I'm not sure how I missed that logic. Sounds write though without whole code laid out to follow through I'm not 100% sure yet! > > > > > > + signal_task_abort(task, -EIO); > > > + return; > > > + } > > > + > > > + pci_read_config_dword(pdev, offset + PCI_DOE_STATUS, &val); > > > + if (FIELD_GET(PCI_DOE_STATUS_ERROR, val)) { > > > + signal_task_abort(task, -EIO); > > > + return; > > > + } > > > + > > > + if (!FIELD_GET(PCI_DOE_STATUS_DATA_OBJECT_READY, val)) { > > > + if (time_after(jiffies, timeout_jiffies)) { > > > + signal_task_abort(task, -ETIMEDOUT); > > > + return; > > > + } > > > + if (pci_doe_wait_poll(doe_mb, PCI_DOE_POLL_INTERVAL)) { > > > > Following on from above.... > > As a result of the interrupt having fired on the BUSY off transition, > > I think we will almost always end up spinning here until Object Ready > > is set. Fine, but seems a shame we don't let an interrupt do this > > for us in most cases. (note in QEMU response is instantaneous so > > when the interrupt for Busy drop is set, object ready also true so > > by the time we get here data ready will already be sent). > > This needs to be pci_doe_wait_irq_or_poll() as well and the arm wait above > ensures we continue to look for that interrupt. > > I'm starting to see how I got confused. The timeout values all vary and > mod_delayed_work() made it very easy for you to interrupt any of those. Yeah. That was a nice suggestion Dan made long ago but it doesn't play well with the single workqueue. > > I tried to bundle the poll vs interrupt modes in pci_doe_wait_irq_or_poll() and > got confused. :-( > > > > > > + signal_task_abort(task, -EIO); > > > + return; > > > + } > > > + goto retry_resp; > > > + } > > > + > > > + rc = pci_doe_recv_resp(doe_mb, task); > > > + if (rc < 0) { > > > + signal_task_abort(task, rc); > > > + return; > > > + } > > > + > > > + signal_task_complete(task, rc); > > > +} > > > + > > > > > > > +static void pci_doe_free_mb(struct pci_doe_mb *doe_mb) > > > +{ > > > + if (doe_mb->work_queue) > > > > I'm not a great fan of free functions that check a bunch of conditions > > because they may be called before things are set up. > > I'll see what I can do. I do kind of like this but I think it gets muddled and > I'm not dead set on either way. > > > To my > > mind that generally means we should be calling individual cleanup > > in the appropriate error handlers. > > > > Either that or just use devm handling for each item. Sure > > it's a few more lines of code, but I find it a lot easier to go > > > > Oh look that thing we just set up is cleaned up by this. > > > > > > > + destroy_workqueue(doe_mb->work_queue); > > > + if (pci_doe_irq_enabled(doe_mb)) > > > + pci_free_irq(doe_mb->pdev, doe_mb->int_msg_num, doe_mb); > > > + xa_destroy(&doe_mb->prots); > > > + kfree(doe_mb); > > > +} > > > + > > > > ... > > > > > + > > > +static void pci_doe_destroy_mb(void *mb) > > > +{ > > > + struct pci_doe_mb *doe_mb = mb; > > > + > > > + /* Mark going down */ > > > + set_bit(PCI_DOE_FLAG_DEAD, &doe_mb->flags); > > > + > > > + /* Abort any in progress work items */ > > > + pci_doe_abort(doe_mb); > > > > Abort is getting used for two things in here. Perhaps > > rename this one to > > pci_doe_abort_tasks() or something like that? > > What do you mean two things? Oh I think I see. You mean abort the work item > vs abort sent to the hardware? yup. > > This no longer aborts all the tasks just the one which may be in progress. > Because the work queue is ordered only one task may be in progress. I'll clean > up the comment too. Hmm. It puts a requirement on the caller to not queue multiple requests that might require ordering. One advantage of flushing the lot was ordering was unaffected (though any caller that queued multiple items would have to then requeue multiple items so would have to maintain their own retry buffer). > > This sets the abort flag and wakes it up if it is sleeping. If not then the > abort flag will be detected in the next arm. > > FWIW I think I may just remove this call and open code it here. Sounds good, avoid naming confusion by getting rid of the name :) > > > + > > > + doe_mb->work_queue = alloc_ordered_workqueue("DOE: [%x]", 0, > > > + doe_mb->cap_offset); > > > + if (!doe_mb->work_queue) { > > > + pci_err(pdev, "[%x] failed to allocate work queue\n", > > > + doe_mb->cap_offset); > > > + pci_doe_free_mb(doe_mb); > > > > As above, I'd rather this explicitly freed what has been set up > > and only that rather than calling a free function that does a bunch of > > stuff conditionally. > > I think I can make that work. This is the only conditional in free however, > because the other conditional is the IRQ support which may not be set up. If you split to multiple devm_ calls you can not setup a tear down for the irq if we don't have one. Or, don't use pci_request_irq() but call devm_request_threaded_irq() directly and let that clean up for you. > > Thanks again for the in depth review! No problem. I know how nasty this seemingly simple little bit of code is, so you have my sympathies :) Jonathan
On Tue, Jun 28, 2022 at 11:20:32AM -0700, Ira wrote: > On Tue, Jun 28, 2022 at 03:16:26PM +0100, Jonathan Cameron wrote: > > On Mon, 27 Jun 2022 21:15:21 -0700 > > ira.weiny@intel.com wrote: > > [snip] > > > > > +static void pci_doe_free_mb(struct pci_doe_mb *doe_mb) > > > +{ > > > + if (doe_mb->work_queue) > > > > I'm not a great fan of free functions that check a bunch of conditions > > because they may be called before things are set up. > > I'll see what I can do. I do kind of like this but I think it gets muddled and > I'm not dead set on either way. I've completely reworked pci_doe_create_mb() to use devm throughout. It is much cleaner. Thanks Jonathan! Ira
On Wed, Jun 29, 2022 at 03:09:47PM +0100, Jonathan Cameron wrote: > On Tue, 28 Jun 2022 11:20:32 -0700 > Ira Weiny <ira.weiny@intel.com> wrote: > [snip] > > > > > > > Hi Ira, > > > > > > Thanks for keeping at this! > > > > > > I think this has reintroduced some of the races around that annoying > > > interrupt source form BUSY transitioning to low that has > > > no explicit 'cause' flag. I think we'd hammered all those out in the > > > previous version but maybe there were still some there... > > > > Well I really tried hard not to introduce races which would be a problem. But > > I would not be surprised. > > > > > > > > I 'think' it will work as is, but depending on the timing a given DOE > > > implementation has, the interrupt may be completely pointless as it > > > will be signaling the wrong event. > > > > :-/ > > > > There is a chance that an IRQ comes in just after we timeout waiting for it. I > > think that has always been the case and the IRQ will effectively be missed I > > _think_. > > The timeout case I'm not that worried about as it means the device > is out of spec, so whilst it might happen and we don't want to break in that > case it should be uncommon enough that a perf disadvantage doesn't matter. Ok I think we are agreed here. I've removed the irq stuff for now per the call yesterday. But I'm still interested in how to solve the problem so see below. > > > ... > > > > > + > > > > +static void doe_statemachine_work(struct work_struct *work) > > > > +{ > > > > + struct pci_doe_task *task = container_of(work, struct pci_doe_task, > > > > + work); > > > > + struct pci_doe_mb *doe_mb = task->doe_mb; > > > > + struct pci_dev *pdev = doe_mb->pdev; > > > > + int offset = doe_mb->cap_offset; > > > > + unsigned int busy_retries = 0; > > > > + unsigned long timeout_jiffies; > > > > + u32 val; > > > > + int rc; > > > > + > > > > + if (test_bit(PCI_DOE_FLAG_DEAD, &doe_mb->flags)) { > > > > + signal_task_complete(task, -EIO); > > > > + return; > > > > + } > > > > + > > > > + /* Send request */ > > > > +retry_req: > > > > + if (pci_doe_arm_wait(doe_mb)) { > > > > + signal_task_abort(task, -EIO); > > > > + return; > > > > + } > > > > > > Is there a race here? If Busy drops at this point we queue up > > > a message, but IRQ bit is already set. Hence when we hit > > > wait_event_timeout() the flag is already set and IIRC we'll > > > drop straight through. > > > > > > > I did not realize that the device would interrupt when Busy dropped? I was > > thinking that V11 did not respond to IRQ but indeed it did via setting the work > > item to run immediately... > > > > However, I only see this in the spec: > > > > 1) System firmware/software checks that the DOE Busy bit is Clear to ensure > > that the DOE instance is ready to receive a DOE request. > > I missed this particular one originally and someone else pointed it out in > review (can't remember who though). The busy drop is mentioned in the > bit definition. It's in the capability definition. > > "DOE Interrupt Status - This bit must be Set when an interrupt is generated > to indicate that the Data Object Ready bit or the DOE Error bit has been Set. or > that the DOE Busy bit has been Cleared." > (the formatting is in the release spec.. hohum) > > Anyhow, upshot is that the status can be set as a result of Busy Bit clearing. > 6.30.3 Interrupt Geneeration: then says that interrupt is generate on a > transition of the logical AND of > > 1. Vector unmasked > 2. DOE interrupt Enable bit is 1 > 3. Value of the DOE interrupt Status bit is 1. > > So if interrupt status bit is set to 1 due to a Busy drop and we then > clear it before Data Object Ready, we'll get 2 interrupts. I don't understand this. I think the issue is we _don't_ clear it between pci_doe_arm_wait() and pci_doe_send_req(). Then we interpret the Busy drop interrupt incorrectly as the Data Object Ready interrupt and start polling for a response immediately rather than waiting for the Data Object Ready IRQ. I _think_ this will be ok because the response will not be read until Data Object Ready is actually set. So we check DOR wait again see the DOR IRQ there and goto retry_resp to check Data Object Ready again. Effectively I'm not sure I agree with you about _when_ we get the interrupts but I do agree that we get an extra one which I'm not checking for _why_. More important I think that getting more IRQs is better than missing an interrupt and incorrectly thinking we timed out. > > There is another vague bit of language that sort of allows other > uses of this interrupt for protocol specific stuff. Hopefully > no one falls for that, but we should safely handle that case (perf > drop as a result is fine though!) I can't remember where the exact > language is, but I've had a few 'polite discussions' to persuade > people using it that way would be a very bad idea... > > > > > > > > > > > It'll probably be fine because it will end up polling below > > > but doesn't look ideal. > > > > I agree it would not be ideal but I think if we are waiting for Busy to be > > cleared then the pci_doe_arm_wait() should be benign. > > I think in some of these paths we are waiting for Data Object Ready to be > set, the busy drop is effectively acting as a spurious interrupt if we > clear the status before the data object ready event which could be much later > because of Busy can clear really quickly. Ok yea I think this is what I am seeing. > > > > > > > > > Upshot is that you sort of have to handle "spurious interrupts" > > > cleanly and rewait on the interrupt if you get one whilst also handling > > > race conditions around RW1C of the interrupt status flag. > > > > Sorry I'm not sure what 'RW1C' means here? > > Read / Write 1 to clear. In this case I meant reading it and then clearing it > without looking at the other status bits. Ah. Perhaps the handler should be more involved in this setting different flags and having the *_wait() functions be more specific about what exactly we are waiting for. I'll have to think about that. > > > > > Anyway, spurious interrupts was something I was concerned about but I don't > > think there is anything we can do about an interrupt coming in when we are > > expecting one but the device did not really send one. AFAIK that is virtually > > impossible anyway. > > In this case seeing 2 interrupts is highly likely. > We see the Busy drop one and the interrupt handler clears the Interrupt Status > Bit, then data object becomes ready and we go around again. But we are only going to see this if some other entity is using the mailbox right? And I don't think that is going to be common, is it? Is this the sequence you are speaking of? If so I think this is how it would flow given the fix I suggested below. Device Other Entity Linux CPU Sets Busy pci_doe_arm_wait() <- clears FLAG_IRQ Clears Busy pci_doe_irq_handler() <set FLAG_IRQ> pci_doe_send_req() <- Sees !BUSY sends query pci_doe_wait_irq() <- No waiting because of 'spurious' Busy Drop!!! pci_doe_arm_wait() <- clears FLAG_IRQ <DOR not set> pci_doe_wait_irq() <- NOW waits!!! Set DOR pci_doe_irq_handler() <set FLAG_IRQ> <goto retry_resp> <DOR set> pci_doe_recv_resp() What am I missing? Ira > > > > > If we actually 'miss' one because we timed out before the device sent it then I > > think we are going to ignore the PCI_DOE_FLAG_IRQ flag on the next go around. > > > > Actually timeout is handled by the abort call and that IRQ will, depending on > > timing, cause a full PCI_DOE_TIMEOUT to expire. :-( That is indeed not > > ideal. However, by that time the error and busy flags should be clear and we > > can safely continue. Otherwise we are going to take the mailbox down. > > > > It may seem better to arm wait on each iteration through the abort loop. But > > this is not logically correct because the abort operation should trigger an > > IRQ. So there is always a race if we missed an IRQ because we timed out early. > > I probably stuck that comment in the wrong place. The initial call to clear > the flag before this should be fine (short of the 'spurious' case of people > using the interrupt for protocol specific usage). > > > > > > > > > > > > > + > > > > + rc = pci_doe_send_req(doe_mb, task); > > > > + > > > > + /* > > > > + * The specification does not provide any guidance on how long > > > > + * some other entity could keep the DOE busy, so try for 1 > > > > + * second then fail. Busy handling is best effort only, because > > > > + * there is no way of avoiding racing against another user of > > > > + * the DOE. > > > > + */ > > > > + if (rc == -EBUSY) { > > > > + busy_retries++; > > > > + if (busy_retries == PCI_DOE_BUSY_MAX_RETRIES) { > > > > + pci_warn(pdev, > > > > + "[%x] busy for too long (> 1 sec)\n", > > > > + offset); > > > > + signal_task_complete(task, rc); > > > > + return; > > > > + } > > > > + if (pci_doe_wait_poll(doe_mb, HZ / PCI_DOE_BUSY_MAX_RETRIES)) { > > > > + signal_task_abort(task, rc); > > > > + return; > > > > + } > > > > + goto retry_req; > > > > + } else if (rc) { > > > > + signal_task_abort(task, rc); > > > > + return; > > > > + } > > > > + > > > > + timeout_jiffies = jiffies + HZ; > > > > + if (pci_doe_wait_irq_or_poll(doe_mb)) { > > > > > > So this may well be passed as a result of a BUSY transition to 0 very soon > > > after the doe_send_req but well before the data is ready.... > > > > I think the simple fix is to make the BUSY wait on an IRQ. Something like: > > > > > > 21:13:53 > git di > > diff --git a/drivers/pci/doe.c b/drivers/pci/doe.c > > index 12f9f8045eb7..afd326320798 100644 > > --- a/drivers/pci/doe.c > > +++ b/drivers/pci/doe.c > > @@ -352,7 +352,7 @@ static void doe_statemachine_work(struct work_struct *work) > > signal_task_complete(task, rc); > > return; > > } > > - if (pci_doe_wait_poll(doe_mb, HZ / PCI_DOE_BUSY_MAX_RETRIES)) { > > + if (pci_doe_wait_irq_or_poll(...)) { > > signal_task_abort(task, rc); > > return; > > This case (which I think is the -EBUSY from pci_doe_send_req() handling) > isn't important because it's trying to paper over a weird condition. We > don't normally expect to get here. > > I was concerned with the line just above my comment which may not act as > a gate at all because it's tripped by the the Busy Drop, which may be > well before the data object ready that we are actually waiting for. > > > > } > > > > > > > > > > > > + signal_task_abort(task, -EIO); > > > > + return; > > > > + } > > > > + > > > > + /* Poll for response */ > > > > +retry_resp: > > > > + if (pci_doe_arm_wait(doe_mb)) { > > > I think we can get here between Busy drop and Object Ready which means > > > this can get another IRQ_FLAG setting just after it. Does it matter? > > > Don't think so, as we don't use that bit again in this run through > > > and it will be cleared at beginning of next one, > > > > Yea basically I agree. > > > > > but if so why is > > > this call here? > > > > Seemed like the right thing to do at the time... ;-) j/k > > > > > I think it's only useful for detecting an abort, if > > > so do that explicitly. > > > > Actually it is the right thing to do... However, the wait poll below also > > needs to be an IRQ or poll. I'm not sure how I missed that logic. > > Sounds write though without whole code laid out to follow through I'm > not 100% sure yet! > > > > > > > > > > + signal_task_abort(task, -EIO); > > > > + return; > > > > + } > > > > + > > > > + pci_read_config_dword(pdev, offset + PCI_DOE_STATUS, &val); > > > > + if (FIELD_GET(PCI_DOE_STATUS_ERROR, val)) { > > > > + signal_task_abort(task, -EIO); > > > > + return; > > > > + } > > > > + > > > > + if (!FIELD_GET(PCI_DOE_STATUS_DATA_OBJECT_READY, val)) { > > > > + if (time_after(jiffies, timeout_jiffies)) { > > > > + signal_task_abort(task, -ETIMEDOUT); > > > > + return; > > > > + } > > > > + if (pci_doe_wait_poll(doe_mb, PCI_DOE_POLL_INTERVAL)) { > > > > > > Following on from above.... > > > As a result of the interrupt having fired on the BUSY off transition, > > > I think we will almost always end up spinning here until Object Ready > > > is set. Fine, but seems a shame we don't let an interrupt do this > > > for us in most cases. (note in QEMU response is instantaneous so > > > when the interrupt for Busy drop is set, object ready also true so > > > by the time we get here data ready will already be sent). > > > > This needs to be pci_doe_wait_irq_or_poll() as well and the arm wait above > > ensures we continue to look for that interrupt. > > > > I'm starting to see how I got confused. The timeout values all vary and > > mod_delayed_work() made it very easy for you to interrupt any of those. > > Yeah. That was a nice suggestion Dan made long ago but it doesn't play well > with the single workqueue. > > > > > I tried to bundle the poll vs interrupt modes in pci_doe_wait_irq_or_poll() and > > got confused. :-( > > > > > > > > > + signal_task_abort(task, -EIO); > > > > + return; > > > > + } > > > > + goto retry_resp; > > > > + } > > > > + > > > > + rc = pci_doe_recv_resp(doe_mb, task); > > > > + if (rc < 0) { > > > > + signal_task_abort(task, rc); > > > > + return; > > > > + } > > > > + > > > > + signal_task_complete(task, rc); > > > > +} > > > > + > > > > > > > > > > +static void pci_doe_free_mb(struct pci_doe_mb *doe_mb) > > > > +{ > > > > + if (doe_mb->work_queue) > > > > > > I'm not a great fan of free functions that check a bunch of conditions > > > because they may be called before things are set up. > > > > I'll see what I can do. I do kind of like this but I think it gets muddled and > > I'm not dead set on either way. > > > > > To my > > > mind that generally means we should be calling individual cleanup > > > in the appropriate error handlers. > > > > > > Either that or just use devm handling for each item. Sure > > > it's a few more lines of code, but I find it a lot easier to go > > > > > > Oh look that thing we just set up is cleaned up by this. > > > > > > > > > > + destroy_workqueue(doe_mb->work_queue); > > > > + if (pci_doe_irq_enabled(doe_mb)) > > > > + pci_free_irq(doe_mb->pdev, doe_mb->int_msg_num, doe_mb); > > > > + xa_destroy(&doe_mb->prots); > > > > + kfree(doe_mb); > > > > +} > > > > + > > > > > > ... > > > > > > > + > > > > +static void pci_doe_destroy_mb(void *mb) > > > > +{ > > > > + struct pci_doe_mb *doe_mb = mb; > > > > + > > > > + /* Mark going down */ > > > > + set_bit(PCI_DOE_FLAG_DEAD, &doe_mb->flags); > > > > + > > > > + /* Abort any in progress work items */ > > > > + pci_doe_abort(doe_mb); > > > > > > Abort is getting used for two things in here. Perhaps > > > rename this one to > > > pci_doe_abort_tasks() or something like that? > > > > What do you mean two things? Oh I think I see. You mean abort the work item > > vs abort sent to the hardware? > > yup. > > > > > This no longer aborts all the tasks just the one which may be in progress. > > Because the work queue is ordered only one task may be in progress. I'll clean > > up the comment too. > > Hmm. It puts a requirement on the caller to not queue multiple requests that > might require ordering. One advantage of flushing the lot was ordering was > unaffected (though any caller that queued multiple items would have to then > requeue multiple items so would have to maintain their own retry buffer). > > > > > This sets the abort flag and wakes it up if it is sleeping. If not then the > > abort flag will be detected in the next arm. > > > > FWIW I think I may just remove this call and open code it here. > > Sounds good, avoid naming confusion by getting rid of the name :) > > > > > > > + > > > > + doe_mb->work_queue = alloc_ordered_workqueue("DOE: [%x]", 0, > > > > + doe_mb->cap_offset); > > > > + if (!doe_mb->work_queue) { > > > > + pci_err(pdev, "[%x] failed to allocate work queue\n", > > > > + doe_mb->cap_offset); > > > > + pci_doe_free_mb(doe_mb); > > > > > > As above, I'd rather this explicitly freed what has been set up > > > and only that rather than calling a free function that does a bunch of > > > stuff conditionally. > > > > I think I can make that work. This is the only conditional in free however, > > because the other conditional is the IRQ support which may not be set up. > > If you split to multiple devm_ calls you can not setup a tear down for the > irq if we don't have one. Or, don't use pci_request_irq() but call > devm_request_threaded_irq() directly and let that clean up for you. > > > > > > Thanks again for the in depth review! > > No problem. I know how nasty this seemingly simple little bit of code > is, so you have my sympathies :) > > > Jonathan
On Wed, 29 Jun 2022 21:34:18 -0700 Ira Weiny <ira.weiny@intel.com> wrote: > On Wed, Jun 29, 2022 at 03:09:47PM +0100, Jonathan Cameron wrote: > > On Tue, 28 Jun 2022 11:20:32 -0700 > > Ira Weiny <ira.weiny@intel.com> wrote: > > > > [snip] > > > > > > > > > > Hi Ira, > > > > > > > > Thanks for keeping at this! > > > > > > > > I think this has reintroduced some of the races around that annoying > > > > interrupt source form BUSY transitioning to low that has > > > > no explicit 'cause' flag. I think we'd hammered all those out in the > > > > previous version but maybe there were still some there... > > > > > > Well I really tried hard not to introduce races which would be a problem. But > > > I would not be surprised. > > > > > > > > > > > I 'think' it will work as is, but depending on the timing a given DOE > > > > implementation has, the interrupt may be completely pointless as it > > > > will be signaling the wrong event. > > > > > > :-/ > > > > > > There is a chance that an IRQ comes in just after we timeout waiting for it. I > > > think that has always been the case and the IRQ will effectively be missed I > > > _think_. > > > > The timeout case I'm not that worried about as it means the device > > is out of spec, so whilst it might happen and we don't want to break in that > > case it should be uncommon enough that a perf disadvantage doesn't matter. > > Ok I think we are agreed here. > > I've removed the irq stuff for now per the call yesterday. But I'm still > interested in how to solve the problem so see below. > > > > > > > ... > > > > > > > + > > > > > +static void doe_statemachine_work(struct work_struct *work) > > > > > +{ > > > > > + struct pci_doe_task *task = container_of(work, struct pci_doe_task, > > > > > + work); > > > > > + struct pci_doe_mb *doe_mb = task->doe_mb; > > > > > + struct pci_dev *pdev = doe_mb->pdev; > > > > > + int offset = doe_mb->cap_offset; > > > > > + unsigned int busy_retries = 0; > > > > > + unsigned long timeout_jiffies; > > > > > + u32 val; > > > > > + int rc; > > > > > + > > > > > + if (test_bit(PCI_DOE_FLAG_DEAD, &doe_mb->flags)) { > > > > > + signal_task_complete(task, -EIO); > > > > > + return; > > > > > + } > > > > > + > > > > > + /* Send request */ > > > > > +retry_req: > > > > > + if (pci_doe_arm_wait(doe_mb)) { > > > > > + signal_task_abort(task, -EIO); > > > > > + return; > > > > > + } > > > > > > > > Is there a race here? If Busy drops at this point we queue up > > > > a message, but IRQ bit is already set. Hence when we hit > > > > wait_event_timeout() the flag is already set and IIRC we'll > > > > drop straight through. > > > > > > > > > > I did not realize that the device would interrupt when Busy dropped? I was > > > thinking that V11 did not respond to IRQ but indeed it did via setting the work > > > item to run immediately... > > > > > > However, I only see this in the spec: > > > > > > 1) System firmware/software checks that the DOE Busy bit is Clear to ensure > > > that the DOE instance is ready to receive a DOE request. > > > > I missed this particular one originally and someone else pointed it out in > > review (can't remember who though). The busy drop is mentioned in the > > bit definition. It's in the capability definition. > > > > "DOE Interrupt Status - This bit must be Set when an interrupt is generated > > to indicate that the Data Object Ready bit or the DOE Error bit has been Set. or > > that the DOE Busy bit has been Cleared." > > (the formatting is in the release spec.. hohum) > > > > Anyhow, upshot is that the status can be set as a result of Busy Bit clearing. > > 6.30.3 Interrupt Geneeration: then says that interrupt is generate on a > > transition of the logical AND of > > > > 1. Vector unmasked > > 2. DOE interrupt Enable bit is 1 > > 3. Value of the DOE interrupt Status bit is 1. > > > > So if interrupt status bit is set to 1 due to a Busy drop and we then > > clear it before Data Object Ready, we'll get 2 interrupts. > > I don't understand this. I think the issue is we _don't_ clear it between > pci_doe_arm_wait() and pci_doe_send_req(). > > Then we interpret the Busy drop interrupt incorrectly as the Data Object Ready > interrupt and start polling for a response immediately rather than waiting for > the Data Object Ready IRQ. > > I _think_ this will be ok because the response will not be read until Data > Object Ready is actually set. So we check DOR wait again see the DOR IRQ there > and goto retry_resp to check Data Object Ready again. > > Effectively I'm not sure I agree with you about _when_ we get the interrupts > but I do agree that we get an extra one which I'm not checking for _why_. More > important I think that getting more IRQs is better than missing an interrupt > and incorrectly thinking we timed out. > I think the confusion here is down to where I put the comment that there was an issue. There are two paths... See below for the waveform for how BUSY is allowed to change on a normal send_req. May be easier to read that before this bit... I think you are referring to the retry_req path - I didn't focus on that because that shouldn't ever happen in reality. It's papering over a system doing something stupid. We never expect to enter the state machine handling with BUSY set but try to carry on if it is. In that path we could have had an interrupt occur as BUSY stops being true. If that happens between pci_doe_arm_wait() and send_req() then we have PCI_DOE_FLAG_IRQ set which means we drop straight through the pci_doe_wait_irq_or_poll() just before the /* Poll for response */ and as you as you say end up in a non irq escaped polling which is expensive for no reason. This one I think we solve by separating the busy check from sending data. So we wait for !BUSY, rearm the interrupt and move on to actually send data (which can result in BUSY -> !BUSY transition of it's own). That normal flow BUSY -> !BUSY transition just after send_req (every time we send anything the device may briefly set BUSY) results in an interrupt that means we again drop straight through the wait before /* Poll for response */ possibly well before Data Object ready is true. So there we need another dance to ensure the interrupt flag is cleared but we don't end up waiting on an interrupt that we missed. That's hard to do with a single flag, you may be better of clearing the interrupt status directly in the statemachine. As long as you clear it before reading data_object_ready the race should be closed. If data object ready isn't set, wait for another interrupt / timeout. I tried drawing this in asci art but it didn't work out. Best bet is probably to fix as many races as you can find with clear comments then we'll see if anyone can find others. > > > > There is another vague bit of language that sort of allows other > > uses of this interrupt for protocol specific stuff. Hopefully > > no one falls for that, but we should safely handle that case (perf > > drop as a result is fine though!) I can't remember where the exact > > language is, but I've had a few 'polite discussions' to persuade > > people using it that way would be a very bad idea... > > > > > > > > > > > > > > > > > It'll probably be fine because it will end up polling below > > > > but doesn't look ideal. > > > > > > I agree it would not be ideal but I think if we are waiting for Busy to be > > > cleared then the pci_doe_arm_wait() should be benign. > > > > I think in some of these paths we are waiting for Data Object Ready to be > > set, the busy drop is effectively acting as a spurious interrupt if we > > clear the status before the data object ready event which could be much later > > because of Busy can clear really quickly. > > Ok yea I think this is what I am seeing. > > > > > > > > > > > > > > Upshot is that you sort of have to handle "spurious interrupts" > > > > cleanly and rewait on the interrupt if you get one whilst also handling > > > > race conditions around RW1C of the interrupt status flag. > > > > > > Sorry I'm not sure what 'RW1C' means here? > > > > Read / Write 1 to clear. In this case I meant reading it and then clearing it > > without looking at the other status bits. > > Ah. Perhaps the handler should be more involved in this setting different > flags and having the *_wait() functions be more specific about what exactly we > are waiting for. I'll have to think about that. Either that or potentially don't clear the interrupt status in the handler, but do it in the state machine. The silliness with BUSY means you are going to get useless interrupts unforunately. > > > > > > > > > Anyway, spurious interrupts was something I was concerned about but I don't > > > think there is anything we can do about an interrupt coming in when we are > > > expecting one but the device did not really send one. AFAIK that is virtually > > > impossible anyway. > > > > In this case seeing 2 interrupts is highly likely. > > We see the Busy drop one and the interrupt handler clears the Interrupt Status > > Bit, then data object becomes ready and we go around again. > > But we are only going to see this if some other entity is using the mailbox > right? And I don't think that is going to be common, is it? BUSY on entry to doe_statemachine_work() is indeed only relevant if some other entity is trampling on us. It's best effort only. BUSY during normal flow is the one I care about. In most cases it will go like (assuming we clear the int status in the handler as now) Send Object BUSY ________|-----___________________ PROC ________|------------------______ OBJ RDY ___________________________------- Int Status______________-____________-_____ where I've added PROC to mean the device is processing the data. Once it clears the input buffer on the device and hence the device can accept another protocol request BUSY will drop. If device has some pipelining or runs multiple protocols in different threads, you can think of that busy period just being the time it needs to copy out the request to some protocol thread specific storage. You won't see this in QEMU without extra hacks because we shorten the flow so that whole thing is instantaneous. If those two interrupts per transfer occur well spread out they can result in your INT flag being set too often and some of the waits dropping through early. It will 'work' I think though because you ultimately spin on Data object ready which won't be set until after the second interrupt. > > Is this the sequence you are speaking of? If so I think this is how it would > flow given the fix I suggested below. > > Device Other Entity Linux CPU > Sets Busy > pci_doe_arm_wait() <- clears FLAG_IRQ > Clears Busy > pci_doe_irq_handler() <set FLAG_IRQ> > pci_doe_send_req() <- Sees !BUSY sends query > pci_doe_wait_irq() <- No waiting because of 'spurious' Busy Drop!!! > > pci_doe_arm_wait() <- clears FLAG_IRQ > <DOR not set> > pci_doe_wait_irq() <- NOW waits!!! > Set DOR > pci_doe_irq_handler() <set FLAG_IRQ> > <goto retry_resp> > <DOR set> > pci_doe_recv_resp() > > What am I missing? It's not some other entity causing BUSY to be set, it's us :) One simple route might be to have your interrupt handler not set the flag unless we have Data Object Ready or Abort, but you need to take care not to race in the handler, probably by reading that only after clearing the interrupt status (thus allowing a new interrupt). Jonathan > > Ira > > > > > > > > > If we actually 'miss' one because we timed out before the device sent it then I > > > think we are going to ignore the PCI_DOE_FLAG_IRQ flag on the next go around. > > > > > > Actually timeout is handled by the abort call and that IRQ will, depending on > > > timing, cause a full PCI_DOE_TIMEOUT to expire. :-( That is indeed not > > > ideal. However, by that time the error and busy flags should be clear and we > > > can safely continue. Otherwise we are going to take the mailbox down. > > > > > > It may seem better to arm wait on each iteration through the abort loop. But > > > this is not logically correct because the abort operation should trigger an > > > IRQ. So there is always a race if we missed an IRQ because we timed out early. > > > > I probably stuck that comment in the wrong place. The initial call to clear > > the flag before this should be fine (short of the 'spurious' case of people > > using the interrupt for protocol specific usage). > > > > > > > > > > > > > > > > > > + > > > > > + rc = pci_doe_send_req(doe_mb, task); > > > > > + > > > > > + /* > > > > > + * The specification does not provide any guidance on how long > > > > > + * some other entity could keep the DOE busy, so try for 1 > > > > > + * second then fail. Busy handling is best effort only, because > > > > > + * there is no way of avoiding racing against another user of > > > > > + * the DOE. > > > > > + */ > > > > > + if (rc == -EBUSY) { > > > > > + busy_retries++; > > > > > + if (busy_retries == PCI_DOE_BUSY_MAX_RETRIES) { > > > > > + pci_warn(pdev, > > > > > + "[%x] busy for too long (> 1 sec)\n", > > > > > + offset); > > > > > + signal_task_complete(task, rc); > > > > > + return; > > > > > + } > > > > > + if (pci_doe_wait_poll(doe_mb, HZ / PCI_DOE_BUSY_MAX_RETRIES)) { > > > > > + signal_task_abort(task, rc); > > > > > + return; > > > > > + } > > > > > + goto retry_req; > > > > > + } else if (rc) { > > > > > + signal_task_abort(task, rc); > > > > > + return; > > > > > + } > > > > > + > > > > > + timeout_jiffies = jiffies + HZ; > > > > > + if (pci_doe_wait_irq_or_poll(doe_mb)) { > > > > > > > > So this may well be passed as a result of a BUSY transition to 0 very soon > > > > after the doe_send_req but well before the data is ready.... > > > > > > I think the simple fix is to make the BUSY wait on an IRQ. Something like: > > > > > > > > > > 21:13:53 > git di > > > diff --git a/drivers/pci/doe.c b/drivers/pci/doe.c > > > index 12f9f8045eb7..afd326320798 100644 > > > --- a/drivers/pci/doe.c > > > +++ b/drivers/pci/doe.c > > > @@ -352,7 +352,7 @@ static void doe_statemachine_work(struct work_struct *work) > > > signal_task_complete(task, rc); > > > return; > > > } > > > - if (pci_doe_wait_poll(doe_mb, HZ / PCI_DOE_BUSY_MAX_RETRIES)) { > > > + if (pci_doe_wait_irq_or_poll(...)) { > > > signal_task_abort(task, rc); > > > return; > > > > This case (which I think is the -EBUSY from pci_doe_send_req() handling) > > isn't important because it's trying to paper over a weird condition. We > > don't normally expect to get here. > > > > I was concerned with the line just above my comment which may not act as > > a gate at all because it's tripped by the the Busy Drop, which may be > > well before the data object ready that we are actually waiting for. > > > > > > > } > > > > > > > > > > > > > > > > + signal_task_abort(task, -EIO); > > > > > + return; > > > > > + } > > > > > + > > > > > + /* Poll for response */ > > > > > +retry_resp: > > > > > + if (pci_doe_arm_wait(doe_mb)) { > > > > I think we can get here between Busy drop and Object Ready which means > > > > this can get another IRQ_FLAG setting just after it. Does it matter? > > > > Don't think so, as we don't use that bit again in this run through > > > > and it will be cleared at beginning of next one, > > > > > > Yea basically I agree. > > > > > > > but if so why is > > > > this call here? > > > > > > Seemed like the right thing to do at the time... ;-) j/k > > > > > > > I think it's only useful for detecting an abort, if > > > > so do that explicitly. > > > > > > Actually it is the right thing to do... However, the wait poll below also > > > needs to be an IRQ or poll. I'm not sure how I missed that logic. > > > > Sounds write though without whole code laid out to follow through I'm > > not 100% sure yet! > > > > > > > > > > > > > > + signal_task_abort(task, -EIO); > > > > > + return; > > > > > + } > > > > > + > > > > > + pci_read_config_dword(pdev, offset + PCI_DOE_STATUS, &val); > > > > > + if (FIELD_GET(PCI_DOE_STATUS_ERROR, val)) { > > > > > + signal_task_abort(task, -EIO); > > > > > + return; > > > > > + } > > > > > + > > > > > + if (!FIELD_GET(PCI_DOE_STATUS_DATA_OBJECT_READY, val)) { > > > > > + if (time_after(jiffies, timeout_jiffies)) { > > > > > + signal_task_abort(task, -ETIMEDOUT); > > > > > + return; > > > > > + } > > > > > + if (pci_doe_wait_poll(doe_mb, PCI_DOE_POLL_INTERVAL)) { > > > > > > > > Following on from above.... > > > > As a result of the interrupt having fired on the BUSY off transition, > > > > I think we will almost always end up spinning here until Object Ready > > > > is set. Fine, but seems a shame we don't let an interrupt do this > > > > for us in most cases. (note in QEMU response is instantaneous so > > > > when the interrupt for Busy drop is set, object ready also true so > > > > by the time we get here data ready will already be sent). > > > > > > This needs to be pci_doe_wait_irq_or_poll() as well and the arm wait above > > > ensures we continue to look for that interrupt. > > > > > > I'm starting to see how I got confused. The timeout values all vary and > > > mod_delayed_work() made it very easy for you to interrupt any of those. > > > > Yeah. That was a nice suggestion Dan made long ago but it doesn't play well > > with the single workqueue. > > > > > > > > I tried to bundle the poll vs interrupt modes in pci_doe_wait_irq_or_poll() and > > > got confused. :-( > > > > > > > > > > > > + signal_task_abort(task, -EIO); > > > > > + return; > > > > > + } > > > > > + goto retry_resp; > > > > > + } > > > > > + > > > > > + rc = pci_doe_recv_resp(doe_mb, task); > > > > > + if (rc < 0) { > > > > > + signal_task_abort(task, rc); > > > > > + return; > > > > > + } > > > > > + > > > > > + signal_task_complete(task, rc); > > > > > +} > > > > > + > > > > > > > > > > > > > +static void pci_doe_free_mb(struct pci_doe_mb *doe_mb) > > > > > +{ > > > > > + if (doe_mb->work_queue) > > > > > > > > I'm not a great fan of free functions that check a bunch of conditions > > > > because they may be called before things are set up. > > > > > > I'll see what I can do. I do kind of like this but I think it gets muddled and > > > I'm not dead set on either way. > > > > > > > To my > > > > mind that generally means we should be calling individual cleanup > > > > in the appropriate error handlers. > > > > > > > > Either that or just use devm handling for each item. Sure > > > > it's a few more lines of code, but I find it a lot easier to go > > > > > > > > Oh look that thing we just set up is cleaned up by this. > > > > > > > > > > > > > + destroy_workqueue(doe_mb->work_queue); > > > > > + if (pci_doe_irq_enabled(doe_mb)) > > > > > + pci_free_irq(doe_mb->pdev, doe_mb->int_msg_num, doe_mb); > > > > > + xa_destroy(&doe_mb->prots); > > > > > + kfree(doe_mb); > > > > > +} > > > > > + > > > > > > > > ... > > > > > > > > > + > > > > > +static void pci_doe_destroy_mb(void *mb) > > > > > +{ > > > > > + struct pci_doe_mb *doe_mb = mb; > > > > > + > > > > > + /* Mark going down */ > > > > > + set_bit(PCI_DOE_FLAG_DEAD, &doe_mb->flags); > > > > > + > > > > > + /* Abort any in progress work items */ > > > > > + pci_doe_abort(doe_mb); > > > > > > > > Abort is getting used for two things in here. Perhaps > > > > rename this one to > > > > pci_doe_abort_tasks() or something like that? > > > > > > What do you mean two things? Oh I think I see. You mean abort the work item > > > vs abort sent to the hardware? > > > > yup. > > > > > > > > This no longer aborts all the tasks just the one which may be in progress. > > > Because the work queue is ordered only one task may be in progress. I'll clean > > > up the comment too. > > > > Hmm. It puts a requirement on the caller to not queue multiple requests that > > might require ordering. One advantage of flushing the lot was ordering was > > unaffected (though any caller that queued multiple items would have to then > > requeue multiple items so would have to maintain their own retry buffer). > > > > > > > > This sets the abort flag and wakes it up if it is sleeping. If not then the > > > abort flag will be detected in the next arm. > > > > > > FWIW I think I may just remove this call and open code it here. > > > > Sounds good, avoid naming confusion by getting rid of the name :) > > > > > > > > > > > + > > > > > + doe_mb->work_queue = alloc_ordered_workqueue("DOE: [%x]", 0, > > > > > + doe_mb->cap_offset); > > > > > + if (!doe_mb->work_queue) { > > > > > + pci_err(pdev, "[%x] failed to allocate work queue\n", > > > > > + doe_mb->cap_offset); > > > > > + pci_doe_free_mb(doe_mb); > > > > > > > > As above, I'd rather this explicitly freed what has been set up > > > > and only that rather than calling a free function that does a bunch of > > > > stuff conditionally. > > > > > > I think I can make that work. This is the only conditional in free however, > > > because the other conditional is the IRQ support which may not be set up. > > > > If you split to multiple devm_ calls you can not setup a tear down for the > > irq if we don't have one. Or, don't use pci_request_irq() but call > > devm_request_threaded_irq() directly and let that clean up for you. > > > > > > > > > > Thanks again for the in depth review! > > > > No problem. I know how nasty this seemingly simple little bit of code > > is, so you have my sympathies :) > > > > > > Jonathan
On Thu, Jun 30, 2022 at 04:25:40PM +0100, Jonathan Cameron wrote: > On Wed, 29 Jun 2022 21:34:18 -0700 > Ira Weiny <ira.weiny@intel.com> wrote: > [snip] I've dropped the IRQ support and was polishing things up. Without the IRQ I don't think any 'arming' makes sense. However, in working through the sequence again I think I found another problem. I _think_... :-/ > > > > But we are only going to see this if some other entity is using the mailbox > > right? And I don't think that is going to be common, is it? > > BUSY on entry to doe_statemachine_work() is indeed only relevant if > some other entity is trampling on us. It's best effort only. > > BUSY during normal flow is the one I care about. > In most cases it will go like (assuming we clear the int status in the handler as now) > > Send Object > BUSY ________|-----___________________ > PROC ________|------------------______ > OBJ RDY ___________________________------- > Int Status______________-____________-_____ So I did not realize that BUSY could clear like this. I thought the point of BUSY was to indicate someone else had an exchange in flight. What happens if another entity jumps in during the PROC time? How does one know that OBJ RDY is _our_ object ready and not someone else's? For example 'entity' issues a send, we see busy clear and also start a send. But the device is still processing the send from 'entity': Send Object(entity) Send Object (Linux) BUSY ___|----_______________|---______________________________ PROC ___|-----------------------------___|-----------------___ OBJ RDY _________________________________-------______________--- Int Status________-__________________-_____-____________________-__ ^^^ This... ... is _not_ Linux's object!?!?!?! Can that happen? If so this is entirely broken. Even Polling OBJ RDY will break. And worse yet we will not even see BUSY being set in any 'abnormal' way. > > where I've added PROC to mean the device is processing the data. > Once it clears the input buffer on the device and hence the device can accept > another protocol request BUSY will drop. If device has some pipelining > or runs multiple protocols in different threads, you can think of that busy > period just being the time it needs to copy out the request to some protocol > thread specific storage. BUSY was not at all doing what I thought it did. I'm now concerned that it is completely broken WRT to other entities even without IRQs. Frankly I'm confused why pci_doe_send_req() even checks for busy because it is unlikely that we will ever see it set. For sure we won't from our side because the workqueue is going to process one task at a time. If Linux wanted to have multiple objects in flight I think we would need a much more complex state machine than we had. Maybe your original state machine handled this. If so, I apologize for missing this subtle point. At this point I'm debating removing the check for BUSY as well because I don't see the point. (Other than maybe flagging some error to say that 'entity' may be messing things up for us and bailing.) Thoughts? Ira > > You won't see this in QEMU without extra hacks because we shorten the > flow so that whole thing is instantaneous. > > If those two interrupts per transfer occur well spread out they can result in > your INT flag being set too often and some of the waits dropping through early. > > It will 'work' I think though because you ultimately spin on Data object > ready which won't be set until after the second interrupt. >
On Fri, 1 Jul 2022 15:22:38 -0700 Ira Weiny <ira.weiny@intel.com> wrote: > On Thu, Jun 30, 2022 at 04:25:40PM +0100, Jonathan Cameron wrote: > > On Wed, 29 Jun 2022 21:34:18 -0700 > > Ira Weiny <ira.weiny@intel.com> wrote: > > > > [snip] > > I've dropped the IRQ support and was polishing things up. Without the IRQ I > don't think any 'arming' makes sense. > > However, in working through the sequence again I think I found another problem. > I _think_... :-/ > > > > > > > But we are only going to see this if some other entity is using the mailbox > > > right? And I don't think that is going to be common, is it? > > > > BUSY on entry to doe_statemachine_work() is indeed only relevant if > > some other entity is trampling on us. It's best effort only. > > > > BUSY during normal flow is the one I care about. > > In most cases it will go like (assuming we clear the int status in the handler as now) > > > > Send Object > > BUSY ________|-----___________________ > > PROC ________|------------------______ > > OBJ RDY ___________________________------- > > Int Status______________-____________-_____ > > So I did not realize that BUSY could clear like this. I thought the point of > BUSY was to indicate someone else had an exchange in flight. Unfortunately the spec doesn't provide any way of indicating 'who' is using the DOE. All busy says is that right now the mailbox is not capable of receiving a new request. Way back in one of the early posting we considered just dropping the 'best effort' wait that is there, but I think we concluded it was harmless and might make things a tiny bit more stable if there was something stale from before OS load. > > What happens if another entity jumps in during the PROC time? How does one > know that OBJ RDY is _our_ object ready and not someone else's? Absolutely. The reality is that DOE isn't suitable for multi actor use. We need to put in some mediation. One thing being neglected on my todo list is that we need a _DSM in ACPI or similar to negotiate access plus potentially some firmware interfaces to allow the OS to make firmware mediated calls. Those firmware interfaces may be at the protocol level or even further up the stack. Not sure if we got to it, but this problem was in the slides for last years Plumbers uconf talk on DOE. > > For example 'entity' issues a send, we see busy clear and also start a > send. But the device is still processing the send from 'entity': > > Send Object(entity) Send Object (Linux) > BUSY ___|----_______________|---______________________________ > PROC ___|-----------------------------___|-----------------___ > OBJ RDY _________________________________-------______________--- > Int Status________-__________________-_____-____________________-__ > > ^^^ > This... > > ... is _not_ Linux's object!?!?!?! > > Can that happen? yup. > > If so this is entirely broken. Even Polling OBJ RDY will break. And worse yet > we will not even see BUSY being set in any 'abnormal' way. > > > > > where I've added PROC to mean the device is processing the data. > > Once it clears the input buffer on the device and hence the device can accept > > another protocol request BUSY will drop. If device has some pipelining > > or runs multiple protocols in different threads, you can think of that busy > > period just being the time it needs to copy out the request to some protocol > > thread specific storage. > > BUSY was not at all doing what I thought it did. I'm now concerned that it is > completely broken WRT to other entities even without IRQs. Frankly I'm > confused why pci_doe_send_req() even checks for busy because it is unlikely > that we will ever see it set. For sure we won't from our side because the > workqueue is going to process one task at a time. yup, we could drop it, but leave some comment in there that says the spec suggests checking it. > > If Linux wanted to have multiple objects in flight I think we would need a much > more complex state machine than we had. Maybe your original state machine > handled this. If so, I apologize for missing this subtle point. It didn't. I decided that it wasn't worth the effort :) > > At this point I'm debating removing the check for BUSY as well because I don't > see the point. (Other than maybe flagging some error to say that 'entity' may > be messing things up for us and bailing.) > > Thoughts? I'm fine with replacing it with comments, or an error print to say that IIRC the spec says we should wait for it, but reality is that it doesn't work. Guess I should get on with proposing a _DSM interface to deal with this. It's a bit messy though as relies on reliable matching of PCI devices against firmware. In theory, with the right 'no reenumeration' flags that has a high chance of working these days but requires some extra language to say that all bets are off if you reenumerate before figuring out the ACPI to PCI device mapping. I dropped the ball on getting that element in place. What fun ;) Jonathan > Ira > > > > > You won't see this in QEMU without extra hacks because we shorten the > > flow so that whole thing is instantaneous. > > > > If those two interrupts per transfer occur well spread out they can result in > > your INT flag being set too often and some of the waits dropping through early. > > > > It will 'work' I think though because you ultimately spin on Data object > > ready which won't be set until after the second interrupt. > >
diff --git a/.clang-format b/.clang-format index fa959436bcfd..7bebb066f2a2 100644 --- a/.clang-format +++ b/.clang-format @@ -420,6 +420,7 @@ ForEachMacros: - 'of_property_for_each_string' - 'of_property_for_each_u32' - 'pci_bus_for_each_resource' + - 'pci_doe_for_each_off' - 'pcl_for_each_chunk' - 'pcl_for_each_segment' - 'pcm_for_each_format' diff --git a/drivers/pci/Kconfig b/drivers/pci/Kconfig index 133c73207782..b2f2e588a817 100644 --- a/drivers/pci/Kconfig +++ b/drivers/pci/Kconfig @@ -121,6 +121,9 @@ config XEN_PCIDEV_FRONTEND config PCI_ATS bool +config PCI_DOE + bool + config PCI_ECAM bool diff --git a/drivers/pci/Makefile b/drivers/pci/Makefile index 0da6b1ebc694..2680e4c92f0a 100644 --- a/drivers/pci/Makefile +++ b/drivers/pci/Makefile @@ -31,6 +31,7 @@ obj-$(CONFIG_PCI_ECAM) += ecam.o obj-$(CONFIG_PCI_P2PDMA) += p2pdma.o obj-$(CONFIG_XEN_PCIDEV_FRONTEND) += xen-pcifront.o obj-$(CONFIG_VGA_ARB) += vgaarb.o +obj-$(CONFIG_PCI_DOE) += doe.o # Endpoint library must be initialized before its users obj-$(CONFIG_PCI_ENDPOINT) += endpoint/ diff --git a/drivers/pci/doe.c b/drivers/pci/doe.c new file mode 100644 index 000000000000..4a7a1e988124 --- /dev/null +++ b/drivers/pci/doe.c @@ -0,0 +1,689 @@ +// SPDX-License-Identifier: GPL-2.0 +/* + * Data Object Exchange + * PCIe r6.0, sec 6.30 DOE + * + * Copyright (C) 2021 Huawei + * Jonathan Cameron <Jonathan.Cameron@huawei.com> + * + * Copyright (C) 2022 Intel Corporation + * Ira Weiny <ira.weiny@intel.com> + */ + +#define dev_fmt(fmt) "DOE: " fmt + +#include <linux/bitfield.h> +#include <linux/delay.h> +#include <linux/jiffies.h> +#include <linux/mutex.h> +#include <linux/pci.h> +#include <linux/pci-doe.h> +#include <linux/workqueue.h> + +#define PCI_DOE_PROTOCOL_DISCOVERY 0 + +#define PCI_DOE_BUSY_MAX_RETRIES 16 +#define PCI_DOE_POLL_INTERVAL (HZ / 128) + +/* Timeout of 1 second from 6.30.2 Operation, PCI Spec r6.0 */ +#define PCI_DOE_TIMEOUT HZ + +#define PCI_DOE_FLAG_ABORT 0 +#define PCI_DOE_FLAG_IRQ 1 +#define PCI_DOE_FLAG_DEAD 2 + +/** + * struct pci_doe_mb - State for a single DOE mailbox + * + * This state is used to manage a single DOE mailbox capability. All fields + * should be considered opaque to the consumers and the structure passed into + * the helpers below after being created by devm_pci_doe_create() + * + * @pdev: PCI device this mailbox belongs to + * @cap_offset: Capability offset + * @int_msg_num: DOE Interrupt Message Number; negative if irqs are not used + * @prots: Array of protocols supported (encoded as long values) + * @wq: Wait queue for work items awaiting irq/abort + * @work_queue: Queue of pci_doe_work items + * @flags: Bit array of PCI_DOE_FLAG_* flags + * + * Note: @prots can't be allocated with struct size because the number of + * protocols is not known until after this structure is in use. However, the + * single discovery protocol is always required to query for the number of + * protocols. + */ +struct pci_doe_mb { + struct pci_dev *pdev; + u16 cap_offset; + int int_msg_num; + struct xarray prots; + + wait_queue_head_t wq; + struct workqueue_struct *work_queue; + unsigned long flags; +}; + +static irqreturn_t pci_doe_irq_handler(int irq, void *data) +{ + struct pci_doe_mb *doe_mb = data; + struct pci_dev *pdev = doe_mb->pdev; + int offset = doe_mb->cap_offset; + u32 val; + + pci_read_config_dword(pdev, offset + PCI_DOE_STATUS, &val); + if (FIELD_GET(PCI_DOE_STATUS_INT_STATUS, val)) { + pci_write_config_dword(pdev, offset + PCI_DOE_STATUS, + PCI_DOE_STATUS_INT_STATUS); + set_bit(PCI_DOE_FLAG_IRQ, &doe_mb->flags); + wake_up(&doe_mb->wq); + return IRQ_HANDLED; + } + + return IRQ_NONE; +} + +static bool pci_doe_irq_enabled(struct pci_doe_mb *doe_mb) +{ + return doe_mb->int_msg_num >= 0; +} + +static void pci_doe_abort(struct pci_doe_mb *doe_mb) +{ + set_bit(PCI_DOE_FLAG_ABORT, &doe_mb->flags); + wake_up(&doe_mb->wq); +} + +/* + * Returned from the wait functions to indicate that an abort has been issued + */ +#define DOE_WAIT_ABORT 1 + +static int pci_doe_arm_wait(struct pci_doe_mb *doe_mb) +{ + if (test_and_clear_bit(PCI_DOE_FLAG_ABORT, &doe_mb->flags)) + return DOE_WAIT_ABORT; + clear_bit(PCI_DOE_FLAG_IRQ, &doe_mb->flags); + return 0; +} + +static int pci_doe_wait_poll(struct pci_doe_mb *doe_mb, unsigned long timeout) +{ + if (wait_event_timeout(doe_mb->wq, + test_bit(PCI_DOE_FLAG_ABORT, &doe_mb->flags), + timeout)) + return DOE_WAIT_ABORT; + return 0; +} + +static int pci_doe_wait_irq_or_poll(struct pci_doe_mb *doe_mb) +{ + if (pci_doe_irq_enabled(doe_mb)) { + wait_event_timeout(doe_mb->wq, + test_bit(PCI_DOE_FLAG_IRQ, &doe_mb->flags) || + test_bit(PCI_DOE_FLAG_ABORT, &doe_mb->flags), + PCI_DOE_TIMEOUT); + if (test_bit(PCI_DOE_FLAG_ABORT, &doe_mb->flags)) + return DOE_WAIT_ABORT; + return 0; + } + + return pci_doe_wait_poll(doe_mb, PCI_DOE_POLL_INTERVAL); +} + +static void pci_doe_write_ctrl(struct pci_doe_mb *doe_mb, u32 val) +{ + struct pci_dev *pdev = doe_mb->pdev; + int offset = doe_mb->cap_offset; + + if (pci_doe_irq_enabled(doe_mb)) + val |= PCI_DOE_CTRL_INT_EN; + pci_write_config_dword(pdev, offset + PCI_DOE_CTRL, val); +} + +static int pci_doe_issue_abort(struct pci_doe_mb *doe_mb) +{ + struct pci_dev *pdev = doe_mb->pdev; + int offset = doe_mb->cap_offset; + unsigned long timeout_jiffies; + + pci_dbg(pdev, "[%x] Issuing Abort\n", offset); + + /* + * Abort detected while aborting; something is really broken or the + * mailbox is being destroyed. + */ + if (pci_doe_arm_wait(doe_mb)) + return -EIO; + + timeout_jiffies = jiffies + PCI_DOE_TIMEOUT; + pci_doe_write_ctrl(doe_mb, PCI_DOE_CTRL_ABORT); + + do { + u32 val; + + /* + * Abort detected while aborting; something is really broken or + * the mailbox is being destroyed. + */ + if (pci_doe_wait_irq_or_poll(doe_mb)) + return -EIO; + pci_read_config_dword(pdev, offset + PCI_DOE_STATUS, &val); + + /* Abort success! */ + if (!FIELD_GET(PCI_DOE_STATUS_ERROR, val) && + !FIELD_GET(PCI_DOE_STATUS_BUSY, val)) + return 0; + + } while (!time_after(jiffies, timeout_jiffies)); + + /* Abort has timed out and the MB is dead */ + pci_err(pdev, "[%x] ABORT timed out\n", offset); + return -EIO; +} + +static int pci_doe_send_req(struct pci_doe_mb *doe_mb, + struct pci_doe_task *task) +{ + struct pci_dev *pdev = doe_mb->pdev; + int offset = doe_mb->cap_offset; + u32 val; + int i; + + /* + * Check the DOE busy bit is not set. If it is set, this could indicate + * someone other than Linux (e.g. firmware) is using the mailbox. Note + * it is expected that firmware and OS will negotiate access rights via + * an, as yet to be defined method. + */ + pci_read_config_dword(pdev, offset + PCI_DOE_STATUS, &val); + if (FIELD_GET(PCI_DOE_STATUS_BUSY, val)) + return -EBUSY; + + if (FIELD_GET(PCI_DOE_STATUS_ERROR, val)) + return -EIO; + + /* Write DOE Header */ + val = FIELD_PREP(PCI_DOE_DATA_OBJECT_HEADER_1_VID, task->prot.vid) | + FIELD_PREP(PCI_DOE_DATA_OBJECT_HEADER_1_TYPE, task->prot.type); + pci_write_config_dword(pdev, offset + PCI_DOE_WRITE, val); + /* Length is 2 DW of header + length of payload in DW */ + pci_write_config_dword(pdev, offset + PCI_DOE_WRITE, + FIELD_PREP(PCI_DOE_DATA_OBJECT_HEADER_2_LENGTH, + 2 + task->request_pl_sz / + sizeof(u32))); + for (i = 0; i < task->request_pl_sz / sizeof(u32); i++) + pci_write_config_dword(pdev, offset + PCI_DOE_WRITE, + task->request_pl[i]); + + pci_doe_write_ctrl(doe_mb, PCI_DOE_CTRL_GO); + + /* Request is sent - now wait for poll or IRQ */ + return 0; +} + +static bool pci_doe_data_obj_ready(struct pci_doe_mb *doe_mb) +{ + struct pci_dev *pdev = doe_mb->pdev; + int offset = doe_mb->cap_offset; + u32 val; + + pci_read_config_dword(pdev, offset + PCI_DOE_STATUS, &val); + if (FIELD_GET(PCI_DOE_STATUS_DATA_OBJECT_READY, val)) + return true; + return false; +} + +static int pci_doe_recv_resp(struct pci_doe_mb *doe_mb, struct pci_doe_task *task) +{ + struct pci_dev *pdev = doe_mb->pdev; + int offset = doe_mb->cap_offset; + size_t length, payload_length; + u32 val; + int i; + + /* Read the first dword to get the protocol */ + pci_read_config_dword(pdev, offset + PCI_DOE_READ, &val); + if ((FIELD_GET(PCI_DOE_DATA_OBJECT_HEADER_1_VID, val) != task->prot.vid) || + (FIELD_GET(PCI_DOE_DATA_OBJECT_HEADER_1_TYPE, val) != task->prot.type)) { + pci_err(pdev, + "[%x] expected [VID, Protocol] = [%04x, %02x], got [%04x, %02x]\n", + doe_mb->cap_offset, + task->prot.vid, task->prot.type, + FIELD_GET(PCI_DOE_DATA_OBJECT_HEADER_1_VID, val), + FIELD_GET(PCI_DOE_DATA_OBJECT_HEADER_1_TYPE, val)); + return -EIO; + } + + pci_write_config_dword(pdev, offset + PCI_DOE_READ, 0); + /* Read the second dword to get the length */ + pci_read_config_dword(pdev, offset + PCI_DOE_READ, &val); + pci_write_config_dword(pdev, offset + PCI_DOE_READ, 0); + + length = FIELD_GET(PCI_DOE_DATA_OBJECT_HEADER_2_LENGTH, val); + if (length > SZ_1M || length < 2) + return -EIO; + + /* First 2 dwords have already been read */ + length -= 2; + payload_length = min(length, task->response_pl_sz / sizeof(u32)); + /* Read the rest of the response payload */ + for (i = 0; i < payload_length; i++) { + pci_read_config_dword(pdev, offset + PCI_DOE_READ, + &task->response_pl[i]); + /* Prior to the last ack, ensure Data Object Ready */ + if (i == (payload_length-1) && !pci_doe_data_obj_ready(doe_mb)) + return -EIO; + pci_write_config_dword(pdev, offset + PCI_DOE_READ, 0); + } + + /* Flush excess length */ + for (; i < length; i++) { + pci_read_config_dword(pdev, offset + PCI_DOE_READ, &val); + pci_write_config_dword(pdev, offset + PCI_DOE_READ, 0); + } + + /* Final error check to pick up on any since Data Object Ready */ + pci_read_config_dword(pdev, offset + PCI_DOE_STATUS, &val); + if (FIELD_GET(PCI_DOE_STATUS_ERROR, val)) + return -EIO; + + return min(length, task->response_pl_sz / sizeof(u32)) * sizeof(u32); +} + +static void signal_task_complete(struct pci_doe_task *task, int rv) +{ + task->rv = rv; + task->complete(task); +} + +static void signal_task_abort(struct pci_doe_task *task, int rv) +{ + struct pci_doe_mb *doe_mb = task->doe_mb; + struct pci_dev *pdev = doe_mb->pdev; + + if (pci_doe_issue_abort(doe_mb)) { + /* On failure set the MB dead - no more submissions */ + pci_err(pdev, "[%x] Abort failed marking mailbox dead\n", + doe_mb->cap_offset); + set_bit(PCI_DOE_FLAG_DEAD, &doe_mb->flags); + } + signal_task_complete(task, rv); +} + +static void doe_statemachine_work(struct work_struct *work) +{ + struct pci_doe_task *task = container_of(work, struct pci_doe_task, + work); + struct pci_doe_mb *doe_mb = task->doe_mb; + struct pci_dev *pdev = doe_mb->pdev; + int offset = doe_mb->cap_offset; + unsigned int busy_retries = 0; + unsigned long timeout_jiffies; + u32 val; + int rc; + + if (test_bit(PCI_DOE_FLAG_DEAD, &doe_mb->flags)) { + signal_task_complete(task, -EIO); + return; + } + + /* Send request */ +retry_req: + if (pci_doe_arm_wait(doe_mb)) { + signal_task_abort(task, -EIO); + return; + } + + rc = pci_doe_send_req(doe_mb, task); + + /* + * The specification does not provide any guidance on how long + * some other entity could keep the DOE busy, so try for 1 + * second then fail. Busy handling is best effort only, because + * there is no way of avoiding racing against another user of + * the DOE. + */ + if (rc == -EBUSY) { + busy_retries++; + if (busy_retries == PCI_DOE_BUSY_MAX_RETRIES) { + pci_warn(pdev, + "[%x] busy for too long (> 1 sec)\n", + offset); + signal_task_complete(task, rc); + return; + } + if (pci_doe_wait_poll(doe_mb, HZ / PCI_DOE_BUSY_MAX_RETRIES)) { + signal_task_abort(task, rc); + return; + } + goto retry_req; + } else if (rc) { + signal_task_abort(task, rc); + return; + } + + timeout_jiffies = jiffies + HZ; + if (pci_doe_wait_irq_or_poll(doe_mb)) { + signal_task_abort(task, -EIO); + return; + } + + /* Poll for response */ +retry_resp: + if (pci_doe_arm_wait(doe_mb)) { + signal_task_abort(task, -EIO); + return; + } + + pci_read_config_dword(pdev, offset + PCI_DOE_STATUS, &val); + if (FIELD_GET(PCI_DOE_STATUS_ERROR, val)) { + signal_task_abort(task, -EIO); + return; + } + + if (!FIELD_GET(PCI_DOE_STATUS_DATA_OBJECT_READY, val)) { + if (time_after(jiffies, timeout_jiffies)) { + signal_task_abort(task, -ETIMEDOUT); + return; + } + if (pci_doe_wait_poll(doe_mb, PCI_DOE_POLL_INTERVAL)) { + signal_task_abort(task, -EIO); + return; + } + goto retry_resp; + } + + rc = pci_doe_recv_resp(doe_mb, task); + if (rc < 0) { + signal_task_abort(task, rc); + return; + } + + signal_task_complete(task, rc); +} + +static void pci_doe_task_complete(struct pci_doe_task *task) +{ + complete(task->private); +} + +static int pci_doe_discovery(struct pci_doe_mb *doe_mb, u8 *index, u16 *vid, + u8 *protocol) +{ + u32 request_pl = FIELD_PREP(PCI_DOE_DATA_OBJECT_DISC_REQ_3_INDEX, + *index); + u32 response_pl; + DECLARE_COMPLETION_ONSTACK(c); + struct pci_doe_task task = { + .prot.vid = PCI_VENDOR_ID_PCI_SIG, + .prot.type = PCI_DOE_PROTOCOL_DISCOVERY, + .request_pl = &request_pl, + .request_pl_sz = sizeof(request_pl), + .response_pl = &response_pl, + .response_pl_sz = sizeof(response_pl), + .complete = pci_doe_task_complete, + .private = &c, + }; + int ret; + + ret = pci_doe_submit_task(doe_mb, &task); + if (ret < 0) + return ret; + + wait_for_completion(&c); + + if (task.rv != sizeof(response_pl)) + return -EIO; + + *vid = FIELD_GET(PCI_DOE_DATA_OBJECT_DISC_RSP_3_VID, response_pl); + *protocol = FIELD_GET(PCI_DOE_DATA_OBJECT_DISC_RSP_3_PROTOCOL, + response_pl); + *index = FIELD_GET(PCI_DOE_DATA_OBJECT_DISC_RSP_3_NEXT_INDEX, + response_pl); + + return 0; +} + +static void *pci_doe_xa_entry(u16 vid, u8 prot) +{ + return (void *)(((unsigned long)vid << 16) | prot); +} + +static int pci_doe_cache_protocols(struct pci_doe_mb *doe_mb) +{ + u8 index = 0; + u8 xa_idx = 0; + + do { + int rc; + u16 vid; + u8 prot; + + rc = pci_doe_discovery(doe_mb, &index, &vid, &prot); + if (rc) + return rc; + + pci_dbg(doe_mb->pdev, + "[%x] Found protocol %d vid: %x prot: %x\n", + doe_mb->cap_offset, xa_idx, vid, prot); + + rc = xa_insert(&doe_mb->prots, xa_idx++, + pci_doe_xa_entry(vid, prot), GFP_KERNEL); + if (rc) + return -ENOMEM; + } while (index); + + return 0; +} + +static int pci_doe_enable_irq(struct pci_doe_mb *doe_mb, + unsigned int int_msg_num) +{ + struct pci_dev *pdev = doe_mb->pdev; + int offset = doe_mb->cap_offset; + int rc; + + /* + * Enabling bus mastering is required for MSI/MSIx. It is safe to call + * this multiple times and thus is called here to ensure that mastering + * is enabled even if the driver has done so. + */ + pci_set_master(pdev); + rc = pci_request_irq(pdev, int_msg_num, pci_doe_irq_handler, NULL, + doe_mb, "DOE[%x %s]", offset, pci_name(pdev)); + if (rc) + return rc; + + doe_mb->int_msg_num = int_msg_num; + pci_doe_write_ctrl(doe_mb, PCI_DOE_CTRL_INT_EN); + + return 0; +} + +static void pci_doe_free_mb(struct pci_doe_mb *doe_mb) +{ + if (doe_mb->work_queue) + destroy_workqueue(doe_mb->work_queue); + if (pci_doe_irq_enabled(doe_mb)) + pci_free_irq(doe_mb->pdev, doe_mb->int_msg_num, doe_mb); + xa_destroy(&doe_mb->prots); + kfree(doe_mb); +} + +/** + * pci_doe_get_int_msg_num() - Return the interrupt message number for the + * mailbox at offset + * + * @pdev: The PCI device + * @offset: Offset of the DOE mailbox + * + * Returns: IRQ number on success + * -errno if IRQs are not supported on this mailbox + */ +int pci_doe_get_int_msg_num(struct pci_dev *pdev, int offset) +{ + u32 val; + + pci_read_config_dword(pdev, offset + PCI_DOE_CAP, &val); + if (!FIELD_GET(PCI_DOE_CAP_INT_SUP, val)) + return -ENXIO; + + return FIELD_GET(PCI_DOE_CAP_INT_MSG_NUM, val); +} +EXPORT_SYMBOL_GPL(pci_doe_get_int_msg_num); + +static void pci_doe_destroy_mb(void *mb) +{ + struct pci_doe_mb *doe_mb = mb; + + /* Mark going down */ + set_bit(PCI_DOE_FLAG_DEAD, &doe_mb->flags); + + /* Abort any in progress work items */ + pci_doe_abort(doe_mb); + + /* Flush remaining work items */ + flush_workqueue(doe_mb->work_queue); + + pci_doe_free_mb(doe_mb); +} + +/** + * pcim_doe_create_mb() - Create a DOE mailbox object + * + * @pdev: PCI device to create the DOE mailbox for + * @cap_offset: Offset of the DOE mailbox + * @int_msg_num: Interrupt message number to use; a negative value means don't + * use interrupts + * + * Create a single mailbox object to manage the mailbox protocol at the + * cap_offset specified. + * + * Caller should allocate PCI IRQ vectors before passing a possitive value for + * int_msg_num. + * + * RETURNS: created mailbox object on success + * ERR_PTR(-errno) on failure + */ +struct pci_doe_mb *pcim_doe_create_mb(struct pci_dev *pdev, u16 cap_offset, + int int_msg_num) +{ + struct pci_doe_mb *doe_mb; + int rc; + + doe_mb = kzalloc(sizeof(*doe_mb), GFP_KERNEL); + if (!doe_mb) + return ERR_PTR(-ENOMEM); + + doe_mb->pdev = pdev; + doe_mb->int_msg_num = -1; + doe_mb->cap_offset = cap_offset; + + xa_init(&doe_mb->prots); + init_waitqueue_head(&doe_mb->wq); + + if (int_msg_num >= 0) { + rc = pci_doe_enable_irq(doe_mb, int_msg_num); + if (rc) + pci_err(pdev, + "[%x] enable requested IRQ (%d) failed : %d\n", + doe_mb->cap_offset, int_msg_num, rc); + } + + doe_mb->work_queue = alloc_ordered_workqueue("DOE: [%x]", 0, + doe_mb->cap_offset); + if (!doe_mb->work_queue) { + pci_err(pdev, "[%x] failed to allocate work queue\n", + doe_mb->cap_offset); + pci_doe_free_mb(doe_mb); + return ERR_PTR(-ENOMEM); + } + + /* Reset the mailbox by issuing an abort */ + rc = pci_doe_issue_abort(doe_mb); + if (rc) { + pci_err(pdev, "[%x] failed to reset : %d\n", + doe_mb->cap_offset, rc); + pci_doe_free_mb(doe_mb); + return ERR_PTR(rc); + } + + if (devm_add_action_or_reset(&pdev->dev, pci_doe_destroy_mb, doe_mb)) + return ERR_PTR(-EIO); + + rc = pci_doe_cache_protocols(doe_mb); + if (rc) { + pci_err(pdev, "[%x] failed to cache protocols : %d\n", + doe_mb->cap_offset, rc); + return ERR_PTR(rc); + } + + return doe_mb; +} +EXPORT_SYMBOL_GPL(pcim_doe_create_mb); + +/** + * pci_doe_supports_prot() - Return if the DOE instance supports the given + * protocol + * @doe_mb: DOE mailbox capability to query + * @vid: Protocol Vendor ID + * @type: Protocol type + * + * RETURNS: True if the DOE mailbox supports the protocol specified + */ +bool pci_doe_supports_prot(struct pci_doe_mb *doe_mb, u16 vid, u8 type) +{ + unsigned long index; + void *entry; + + /* The discovery protocol must always be supported */ + if (vid == PCI_VENDOR_ID_PCI_SIG && type == PCI_DOE_PROTOCOL_DISCOVERY) + return true; + + xa_for_each(&doe_mb->prots, index, entry) + if (entry == pci_doe_xa_entry(vid, type)) + return true; + + return false; +} +EXPORT_SYMBOL_GPL(pci_doe_supports_prot); + +/** + * pci_doe_submit_task() - Submit a task to be processed by the state machine + * + * @doe_mb: DOE mailbox capability to submit to + * @task: task to be queued + * + * Submit a DOE task (request/response) to the DOE mailbox to be processed. + * Returns upon queueing the task object. If the queue is full this function + * will sleep until there is room in the queue. + * + * task->complete will be called when the state machine is done processing this + * task. + * + * Excess data will be discarded. + * + * RETURNS: 0 when task has been successful queued, -ERRNO on error + */ +int pci_doe_submit_task(struct pci_doe_mb *doe_mb, struct pci_doe_task *task) +{ + if (!pci_doe_supports_prot(doe_mb, task->prot.vid, task->prot.type)) + return -EINVAL; + + /* + * DOE requests must be a whole number of DW + * and the response needs to be big enough for at least 1 DW + */ + if (task->request_pl_sz % sizeof(u32) || + task->response_pl_sz < sizeof(u32)) + return -EINVAL; + + if (test_bit(PCI_DOE_FLAG_DEAD, &doe_mb->flags)) + return -EIO; + + task->doe_mb = doe_mb; + INIT_WORK(&task->work, doe_statemachine_work); + queue_work(doe_mb->work_queue, &task->work); + return 0; +} +EXPORT_SYMBOL_GPL(pci_doe_submit_task); diff --git a/include/linux/pci-doe.h b/include/linux/pci-doe.h new file mode 100644 index 000000000000..805b58ff4016 --- /dev/null +++ b/include/linux/pci-doe.h @@ -0,0 +1,81 @@ +/* SPDX-License-Identifier: GPL-2.0 */ +/* + * Data Object Exchange + * PCIe r6.0, sec 6.30 DOE + * + * Copyright (C) 2021 Huawei + * Jonathan Cameron <Jonathan.Cameron@huawei.com> + * + * Copyright (C) 2022 Intel Corporation + * Ira Weiny <ira.weiny@intel.com> + */ + +#ifndef LINUX_PCI_DOE_H +#define LINUX_PCI_DOE_H + +#include <linux/completion.h> + +struct pci_doe_protocol { + u16 vid; + u8 type; +}; + +struct pci_doe_mb; + +/** + * struct pci_doe_task - represents a single query/response + * + * @prot: DOE Protocol + * @request_pl: The request payload + * @request_pl_sz: Size of the request payload (bytes) + * @response_pl: The response payload + * @response_pl_sz: Size of the response payload (bytes) + * @rv: Return value. Length of received response or error (bytes) + * @complete: Called when task is complete + * @private: Private data for the consumer + * @work: Used internally by the mailbox + * @doe_mb: Used internally by the mailbox + * + * The payload sizes and rv are specified in bytes with the following + * restrictions concerning the protocol. + * + * 1) The request_pl_sz must be a multiple of double words (4 bytes) + * 2) The response_pl_sz must be >= a single double word (4 bytes) + * 3) rv is returned as bytes but it will be a multiple of double words + * + * NOTE there is no need for the caller to initialize work or doe_mb. + */ +struct pci_doe_task { + struct pci_doe_protocol prot; + u32 *request_pl; + size_t request_pl_sz; + u32 *response_pl; + size_t response_pl_sz; + int rv; + void (*complete)(struct pci_doe_task *task); + void *private; + + /* No need for the user to initialize these fields */ + struct work_struct work; + struct pci_doe_mb *doe_mb; +}; + +/** + * pci_doe_for_each_off - Iterate each DOE capability + * @pdev: struct pci_dev to iterate + * @off: u16 of config space offset of each mailbox capability found + */ +#define pci_doe_for_each_off(pdev, off) \ + for (off = pci_find_next_ext_capability(pdev, off, \ + PCI_EXT_CAP_ID_DOE); \ + off > 0; \ + off = pci_find_next_ext_capability(pdev, off, \ + PCI_EXT_CAP_ID_DOE)) + +int pci_doe_get_int_msg_num(struct pci_dev *pdev, int offset); +struct pci_doe_mb *pcim_doe_create_mb(struct pci_dev *pdev, u16 cap_offset, + int irq); +bool pci_doe_supports_prot(struct pci_doe_mb *doe_mb, u16 vid, u8 type); +int pci_doe_submit_task(struct pci_doe_mb *doe_mb, struct pci_doe_task *task); + +#endif diff --git a/include/uapi/linux/pci_regs.h b/include/uapi/linux/pci_regs.h index bee1a9ed6e66..9d50678f3f62 100644 --- a/include/uapi/linux/pci_regs.h +++ b/include/uapi/linux/pci_regs.h @@ -736,7 +736,8 @@ #define PCI_EXT_CAP_ID_DVSEC 0x23 /* Designated Vendor-Specific */ #define PCI_EXT_CAP_ID_DLF 0x25 /* Data Link Feature */ #define PCI_EXT_CAP_ID_PL_16GT 0x26 /* Physical Layer 16.0 GT/s */ -#define PCI_EXT_CAP_ID_MAX PCI_EXT_CAP_ID_PL_16GT +#define PCI_EXT_CAP_ID_DOE 0x2E /* Data Object Exchange */ +#define PCI_EXT_CAP_ID_MAX PCI_EXT_CAP_ID_DOE #define PCI_EXT_CAP_DSN_SIZEOF 12 #define PCI_EXT_CAP_MCAST_ENDPOINT_SIZEOF 40 @@ -1102,4 +1103,30 @@ #define PCI_PL_16GT_LE_CTRL_USP_TX_PRESET_MASK 0x000000F0 #define PCI_PL_16GT_LE_CTRL_USP_TX_PRESET_SHIFT 4 +/* Data Object Exchange */ +#define PCI_DOE_CAP 0x04 /* DOE Capabilities Register */ +#define PCI_DOE_CAP_INT_SUP 0x00000001 /* Interrupt Support */ +#define PCI_DOE_CAP_INT_MSG_NUM 0x00000ffe /* Interrupt Message Number */ +#define PCI_DOE_CTRL 0x08 /* DOE Control Register */ +#define PCI_DOE_CTRL_ABORT 0x00000001 /* DOE Abort */ +#define PCI_DOE_CTRL_INT_EN 0x00000002 /* DOE Interrupt Enable */ +#define PCI_DOE_CTRL_GO 0x80000000 /* DOE Go */ +#define PCI_DOE_STATUS 0x0c /* DOE Status Register */ +#define PCI_DOE_STATUS_BUSY 0x00000001 /* DOE Busy */ +#define PCI_DOE_STATUS_INT_STATUS 0x00000002 /* DOE Interrupt Status */ +#define PCI_DOE_STATUS_ERROR 0x00000004 /* DOE Error */ +#define PCI_DOE_STATUS_DATA_OBJECT_READY 0x80000000 /* Data Object Ready */ +#define PCI_DOE_WRITE 0x10 /* DOE Write Data Mailbox Register */ +#define PCI_DOE_READ 0x14 /* DOE Read Data Mailbox Register */ + +/* DOE Data Object - note not actually registers */ +#define PCI_DOE_DATA_OBJECT_HEADER_1_VID 0x0000ffff +#define PCI_DOE_DATA_OBJECT_HEADER_1_TYPE 0x00ff0000 +#define PCI_DOE_DATA_OBJECT_HEADER_2_LENGTH 0x0003ffff + +#define PCI_DOE_DATA_OBJECT_DISC_REQ_3_INDEX 0x000000ff +#define PCI_DOE_DATA_OBJECT_DISC_RSP_3_VID 0x0000ffff +#define PCI_DOE_DATA_OBJECT_DISC_RSP_3_PROTOCOL 0x00ff0000 +#define PCI_DOE_DATA_OBJECT_DISC_RSP_3_NEXT_INDEX 0xff000000 + #endif /* LINUX_PCI_REGS_H */