
[v2,09/12] NTB: Introduce MSI library

Message ID 20190213175454.7506-10-logang@deltatee.com (mailing list archive)
State New
Series Support using MSI interrupts in ntb_transport

Commit Message

Logan Gunthorpe Feb. 13, 2019, 5:54 p.m. UTC
The NTB MSI library allows passing MSI interrupts across a memory
window. This offers similar functionality to doorbells or messages,
except that it will often have much better latency and the client can
potentially use significantly more remote interrupts than typical hardware
provides for doorbells. (Which can be important in high-multiport
setups.)

The library utilizes one memory window per peer and uses the highest
index memory windows. Before any ntb_msi function may be used, the user
must call ntb_msi_init(). It may then set up and tear down the memory
windows when the link state changes using ntb_msi_setup_mws() and
ntb_msi_clear_mws().

The peer which receives the interrupt must call ntbm_msi_request_irq()
to assign the interrupt handler (this function is functionally
similar to devm_request_irq()) and the returned descriptor must be
transferred to the peer which can use it to trigger the interrupt.
The triggering peer, once having received the descriptor, can
trigger the interrupt by calling ntb_msi_peer_trigger().

Signed-off-by: Logan Gunthorpe <logang@deltatee.com>
Cc: Jon Mason <jdmason@kudzu.us>
Cc: Dave Jiang <dave.jiang@intel.com>
Cc: Allen Hubbe <allenbh@gmail.com>
---
 drivers/ntb/Kconfig  |  11 ++
 drivers/ntb/Makefile |   3 +-
 drivers/ntb/msi.c    | 415 +++++++++++++++++++++++++++++++++++++++++++
 include/linux/ntb.h  |  73 ++++++++
 4 files changed, 501 insertions(+), 1 deletion(-)
 create mode 100644 drivers/ntb/msi.c

Comments

Serge Semin March 6, 2019, 8:26 p.m. UTC | #1
On Wed, Feb 13, 2019 at 10:54:51AM -0700, Logan Gunthorpe wrote:

Hi

> The NTB MSI library allows passing MSI interrupts across a memory
> window. This offers similar functionality to doorbells or messages,
> except that it will often have much better latency and the client can
> potentially use significantly more remote interrupts than typical hardware
> provides for doorbells. (Which can be important in high-multiport
> setups.)
> 
> The library utilizes one memory window per peer and uses the highest
> index memory windows. Before any ntb_msi function may be used, the user
> must call ntb_msi_init(). It may then set up and tear down the memory
> windows when the link state changes using ntb_msi_setup_mws() and
> ntb_msi_clear_mws().
> 
> The peer which receives the interrupt must call ntbm_msi_request_irq()
> to assign the interrupt handler (this function is functionally
> similar to devm_request_irq()) and the returned descriptor must be
> transferred to the peer which can use it to trigger the interrupt.
> The triggering peer, once having received the descriptor, can
> trigger the interrupt by calling ntb_msi_peer_trigger().
> 

The library is very useful, thanks for sharing it with us.

Here are my two general concerns regarding the implementation.
(More specific comments are further in the letter.)

First of all, it might be unsafe to have some resources consumed by the NTB
MSI library (or some other library) without a simple way to warn NTB client
drivers about their attempts to access those resources, since it might lead
to random errors. When I thought about implementing a transport library based
on the Message/Spad+Doorbell registers, I had in mind creating an internal
bit-field array of resource busy-flags. If, for instance, some message or
scratchpad register is occupied by a library (MSI, transport or something else),
then it would be impossible to access that resource directly through the NTB
API methods. So an NTB client driver would get an error on an attempt to
write/read data to/from a busy message or scratchpad register, or on an attempt
to set an occupied doorbell bit. The same thing can be done for memory windows.
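
To illustrate the busy-flag idea, here is a minimal userspace sketch. It is
not a proposal for the actual NTB API: all names in it (res_tracker,
res_claim() and so on) are hypothetical, and a single 64-bit mask stands in
for the bit-field array, which limits it to 64 resources of one kind.

```c
#include <assert.h>
#include <errno.h>
#include <stdint.h>

/* Hypothetical tracker; none of these names exist in the NTB API. */
#define RES_MAX 64

struct res_tracker {
	uint64_t busy;		/* bit n set: resource n is owned by a library */
	unsigned int count;	/* number of valid resources, at most RES_MAX */
};

static void res_init(struct res_tracker *t, unsigned int count)
{
	assert(count <= RES_MAX);
	t->busy = 0;
	t->count = count;
}

/* A library claims resource idx; fails if out of range or already taken. */
static int res_claim(struct res_tracker *t, unsigned int idx)
{
	if (idx >= t->count)
		return -EINVAL;
	if (t->busy & (UINT64_C(1) << idx))
		return -EBUSY;
	t->busy |= UINT64_C(1) << idx;
	return 0;
}

/* The owning library gives the resource back. */
static void res_release(struct res_tracker *t, unsigned int idx)
{
	if (idx < t->count)
		t->busy &= ~(UINT64_C(1) << idx);
}

/* Check to run before a direct client access, e.g. a scratchpad write. */
static int res_client_access(const struct res_tracker *t, unsigned int idx)
{
	if (idx >= t->count)
		return -EINVAL;
	return (t->busy & (UINT64_C(1) << idx)) ? -EBUSY : 0;
}
```

With such a check in front of the register accessors, a client touching a
scratchpad that the MSI or transport library has claimed would get -EBUSY
instead of silently corrupting the library's state.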

The second, minor concern is about documentation. Since there is a dedicated
file for all NTB-related documentation, it would be good to have a description
of the NTB MSI library there as well:
Documentation/ntb.txt

> Signed-off-by: Logan Gunthorpe <logang@deltatee.com>
> Cc: Jon Mason <jdmason@kudzu.us>
> Cc: Dave Jiang <dave.jiang@intel.com>
> Cc: Allen Hubbe <allenbh@gmail.com>
> ---
>  drivers/ntb/Kconfig  |  11 ++
>  drivers/ntb/Makefile |   3 +-
>  drivers/ntb/msi.c    | 415 +++++++++++++++++++++++++++++++++++++++++++
>  include/linux/ntb.h  |  73 ++++++++
>  4 files changed, 501 insertions(+), 1 deletion(-)
>  create mode 100644 drivers/ntb/msi.c
> 
> diff --git a/drivers/ntb/Kconfig b/drivers/ntb/Kconfig
> index 95944e52fa36..5760764052be 100644
> --- a/drivers/ntb/Kconfig
> +++ b/drivers/ntb/Kconfig
> @@ -12,6 +12,17 @@ menuconfig NTB
>  
>  if NTB
>  
> +config NTB_MSI
> +	bool "MSI Interrupt Support"
> +	depends on PCI_MSI
> +	help
> +	 Support using MSI interrupt forwarding instead of (or in addition to)
> +	 hardware doorbells. MSI interrupts typically offer lower latency
> +	 than doorbells and more MSI interrupts can be made available to
> +	 clients. However this requires an extra memory window and support
> +	 in the hardware driver for creating the MSI interrupts.
> +
> +	 If unsure, say N.
>  source "drivers/ntb/hw/Kconfig"
>  
>  source "drivers/ntb/test/Kconfig"
> diff --git a/drivers/ntb/Makefile b/drivers/ntb/Makefile
> index 537226f8e78d..cc27ad2ef150 100644
> --- a/drivers/ntb/Makefile
> +++ b/drivers/ntb/Makefile
> @@ -1,4 +1,5 @@
>  obj-$(CONFIG_NTB) += ntb.o hw/ test/
>  obj-$(CONFIG_NTB_TRANSPORT) += ntb_transport.o
>  
> -ntb-y := core.o
> +ntb-y			:= core.o
> +ntb-$(CONFIG_NTB_MSI)	+= msi.o
> diff --git a/drivers/ntb/msi.c b/drivers/ntb/msi.c
> new file mode 100644
> index 000000000000..5d4bd7a63924
> --- /dev/null
> +++ b/drivers/ntb/msi.c
> @@ -0,0 +1,415 @@
> +// SPDX-License-Identifier: (GPL-2.0 OR BSD-3-Clause)
> +
> +#include <linux/irq.h>
> +#include <linux/module.h>
> +#include <linux/ntb.h>
> +#include <linux/msi.h>
> +#include <linux/pci.h>
> +
> +MODULE_LICENSE("Dual BSD/GPL");
> +MODULE_VERSION("0.1");
> +MODULE_AUTHOR("Logan Gunthorpe <logang@deltatee.com>");
> +MODULE_DESCRIPTION("NTB MSI Interrupt Library");
> +
> +struct ntb_msi {
> +	u64 base_addr;
> +	u64 end_addr;
> +
> +	void (*desc_changed)(void *ctx);
> +
> +	u32 *peer_mws[];

Shouldn't we use the __iomem attribute here since later the devm_ioremap() is
used to map MWs at these pointers?

> +};
> +
> +/**
> + * ntb_msi_init() - Initialize the MSI context
> + * @ntb:	NTB device context
> + *
> + * This function must be called before any other ntb_msi function.
> + * It initializes the context for MSI operations and maps
> + * the peer memory windows.
> + *
> + * This function reserves the last N outbound memory windows (where N
> + * is the number of peers).
> + *
> + * Return: Zero on success, otherwise a negative error number.
> + */
> +int ntb_msi_init(struct ntb_dev *ntb,
> +		 void (*desc_changed)(void *ctx))
> +{
> +	phys_addr_t mw_phys_addr;
> +	resource_size_t mw_size;
> +	size_t struct_size;
> +	int peer_widx;
> +	int peers;
> +	int ret;
> +	int i;
> +
> +	peers = ntb_peer_port_count(ntb);
> +	if (peers <= 0)
> +		return -EINVAL;
> +
> +	struct_size = sizeof(*ntb->msi) + sizeof(*ntb->msi->peer_mws) * peers;
> +
> +	ntb->msi = devm_kzalloc(&ntb->dev, struct_size, GFP_KERNEL);
> +	if (!ntb->msi)
> +		return -ENOMEM;
> +
> +	ntb->msi->desc_changed = desc_changed;
> +
> +	for (i = 0; i < peers; i++) {
> +		peer_widx = ntb_peer_mw_count(ntb) - 1 - i;
> +
> +		ret = ntb_peer_mw_get_addr(ntb, peer_widx, &mw_phys_addr,
> +					   &mw_size);
> +		if (ret)
> +			goto unroll;
> +
> +		ntb->msi->peer_mws[i] = devm_ioremap(&ntb->dev, mw_phys_addr,
> +						     mw_size);
> +		if (!ntb->msi->peer_mws[i]) {
> +			ret = -EFAULT;
> +			goto unroll;
> +		}
> +	}
> +
> +	return 0;
> +
> +unroll:
> +	for (i = 0; i < peers; i++)
> +		if (ntb->msi->peer_mws[i])
> +			devm_iounmap(&ntb->dev, ntb->msi->peer_mws[i]);

Simpler and faster cleanup-code would be:

+ unroll:
+ 	for (--i; i >= 0; --i)
+ 		devm_iounmap(&ntb->dev, ntb->msi->peer_mws[i]);
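
For reference, the reverse-unwind pattern can be tried out in userspace. This
is only a sketch: fake_map() is a made-up stand-in for devm_ioremap(), and the
"live" counter exists purely to show that every successful mapping is released
exactly once.

```c
#include <assert.h>
#include <stddef.h>
#include <stdlib.h>

#define NPEERS 4

/* Stand-in for devm_ioremap(); fail_at selects the index that fails. */
static void *fake_map(int idx, int fail_at)
{
	return idx == fail_at ? NULL : malloc(16);
}

/*
 * Map NPEERS windows. On failure, release only entries 0..i-1, newest
 * first: those are exactly the entries that were mapped successfully,
 * so no NULL check is needed in the unwind loop.
 */
static int map_all(void *maps[NPEERS], int fail_at, int *live)
{
	int i;

	assert(live != NULL);

	for (i = 0; i < NPEERS; i++) {
		maps[i] = fake_map(i, fail_at);
		if (!maps[i])
			goto unroll;
		(*live)++;
	}

	return 0;

unroll:
	for (--i; i >= 0; --i) {
		free(maps[i]);
		maps[i] = NULL;
		(*live)--;
	}

	return -1;
}
```

The loop relies on "i" still holding the index of the failed entry when the
unroll label is reached, which is the property that has to be preserved if the
setup loop is ever restructured.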

> +
> +	devm_kfree(&ntb->dev, ntb->msi);
> +	ntb->msi = NULL;
> +	return ret;
> +}
> +EXPORT_SYMBOL(ntb_msi_init);
> +
> +/**
> + * ntb_msi_setup_mws() - Initialize the MSI inbound memory windows
> + * @ntb:	NTB device context
> + *
> + * This function sets up the required inbound memory windows. It should be
> + * called from a work function after a link up event.
> + *
> + * Over the entire network, this function will reserve the last N
> + * inbound memory windows for each peer (where N is the number of peers).
> + *
> + * ntb_msi_init() must be called before this function.
> + *
> + * Return: Zero on success, otherwise a negative error number.
> + */
> +int ntb_msi_setup_mws(struct ntb_dev *ntb)
> +{
> +	struct msi_desc *desc;
> +	u64 addr;
> +	int peer, peer_widx;
> +	resource_size_t addr_align, size_align, size_max;
> +	resource_size_t mw_size = SZ_32K;
> +	resource_size_t mw_min_size = mw_size;
> +	int i;
> +	int ret;
> +
> +	if (!ntb->msi)
> +		return -EINVAL;
> +
> +	desc = first_msi_entry(&ntb->pdev->dev);
> +	addr = desc->msg.address_lo + ((uint64_t)desc->msg.address_hi << 32);
> +
> +	for (peer = 0; peer < ntb_peer_port_count(ntb); peer++) {
> +		peer_widx = ntb_peer_highest_mw_idx(ntb, peer);
> +		if (peer_widx < 0)
> +			return peer_widx;
> +
> +		ret = ntb_mw_get_align(ntb, peer, peer_widx, &addr_align,
> +				       NULL, NULL);
> +		if (ret)
> +			return ret;
> +
> +		addr &= ~(addr_align - 1);
> +	}
> +
> +	for (peer = 0; peer < ntb_peer_port_count(ntb); peer++) {
> +		peer_widx = ntb_peer_highest_mw_idx(ntb, peer);
> +		if (peer_widx < 0) {
> +			ret = peer_widx;
> +			goto error_out;
> +		}
> +
> +		ret = ntb_mw_get_align(ntb, peer, peer_widx, NULL,
> +				       &size_align, &size_max);
> +		if (ret)
> +			goto error_out;
> +
> +		mw_size = round_up(mw_size, size_align);
> +		mw_size = max(mw_size, size_max);
> +		if (mw_size < mw_min_size)
> +			mw_min_size = mw_size;
> +
> +		ret = ntb_mw_set_trans(ntb, peer, peer_widx,
> +				       addr, mw_size);
> +		if (ret)
> +			goto error_out;

Alas, calling the ntb_mw_set_trans() method isn't enough to fully initialize
NTB memory windows. Yes, the library will work for Intel/AMD/Switchtec
(two-port legacy configuration), but it will fail for IDT, which is based on
the outbound MW xlat interface. So the library at this stage isn't portable
across all NTB hardware. To make it work, the translation address is
supposed to be transferred to the peer side, where the peer code should call
the ntb_peer_mw_set_trans() method with the retrieved xlat address.
See the documentation for details:

https://git.kernel.org/pub/scm/linux/kernel/git/stable/linux.git/tree/Documentation/ntb.txt

The ntb_perf driver can also be used as a reference for a portable NTB MW setup.

So I'd suggest adding a method like ntb_msi_peer_setup_mws() or similar,
which is supposed to be called on the peer side with the translation address,
or some common descriptor containing the address, passed as a function
argument.

It seems to me the test driver should also be altered to support this case.

> +	}
> +
> +	ntb->msi->base_addr = addr;
> +	ntb->msi->end_addr = addr + mw_min_size;
> +
> +	return 0;
> +
> +error_out:
> +	for (i = 0; i < peer; i++) {
> +		peer_widx = ntb_peer_highest_mw_idx(ntb, peer);
> +		if (peer_widx < 0)
> +			continue;
> +
> +		ntb_mw_clear_trans(ntb, i, peer_widx);
> +	}

The same cleanup pattern can be used here:
+error_out:
+	for (--peer; peer >= 0; --peer) {
+		peer_widx = ntb_peer_highest_mw_idx(ntb, peer);
+		ntb_mw_clear_trans(ntb, peer, peer_widx);
+	}

So you won't need the "i" variable here anymore. You also don't need to check the
return value of ntb_peer_highest_mw_idx() in the cleanup loop because it
was already checked in the main loop.

> +
> +	return ret;
> +}
> +EXPORT_SYMBOL(ntb_msi_setup_mws);
> +
> +/**
> + * ntb_msi_clear_mws() - Clear all inbound memory windows
> + * @ntb:	NTB device context
> + *
> + * This function tears down the resources used by ntb_msi_setup_mws().
> + */
> +void ntb_msi_clear_mws(struct ntb_dev *ntb)
> +{
> +	int peer;
> +	int peer_widx;
> +
> +	for (peer = 0; peer < ntb_peer_port_count(ntb); peer++) {
> +		peer_widx = ntb_peer_highest_mw_idx(ntb, peer);
> +		if (peer_widx < 0)
> +			continue;
> +
> +		ntb_mw_clear_trans(ntb, peer, peer_widx);
> +	}
> +}
> +EXPORT_SYMBOL(ntb_msi_clear_mws);
> +

Similarly something like ntb_msi_peer_clear_mws() should be added to
unset a translation address on the peer side.

> +struct ntb_msi_devres {
> +	struct ntb_dev *ntb;
> +	struct msi_desc *entry;
> +	struct ntb_msi_desc *msi_desc;
> +};
> +
> +static int ntb_msi_set_desc(struct ntb_dev *ntb, struct msi_desc *entry,
> +			    struct ntb_msi_desc *msi_desc)
> +{
> +	u64 addr;
> +
> +	addr = entry->msg.address_lo +
> +		((uint64_t)entry->msg.address_hi << 32);
> +
> +	if (addr < ntb->msi->base_addr || addr >= ntb->msi->end_addr) {
> +		dev_warn_once(&ntb->dev,
> +			      "IRQ %d: MSI Address not within the memory window (%llx, [%llx %llx])\n",
> +			      entry->irq, addr, ntb->msi->base_addr,
> +			      ntb->msi->end_addr);
> +		return -EFAULT;
> +	}
> +
> +	msi_desc->addr_offset = addr - ntb->msi->base_addr;
> +	msi_desc->data = entry->msg.data;
> +
> +	return 0;
> +}
> +
> +static void ntb_msi_write_msg(struct msi_desc *entry, void *data)
> +{
> +	struct ntb_msi_devres *dr = data;
> +
> +	WARN_ON(ntb_msi_set_desc(dr->ntb, entry, dr->msi_desc));
> +
> +	if (dr->ntb->msi->desc_changed)
> +		dr->ntb->msi->desc_changed(dr->ntb->ctx);
> +}
> +
> +static void ntbm_msi_callback_release(struct device *dev, void *res)
> +{
> +	struct ntb_msi_devres *dr = res;
> +
> +	dr->entry->write_msi_msg = NULL;
> +	dr->entry->write_msi_msg_data = NULL;
> +}
> +
> +static int ntbm_msi_setup_callback(struct ntb_dev *ntb, struct msi_desc *entry,
> +				   struct ntb_msi_desc *msi_desc)
> +{
> +	struct ntb_msi_devres *dr;
> +
> +	dr = devres_alloc(ntbm_msi_callback_release,
> +			  sizeof(struct ntb_msi_devres), GFP_KERNEL);
> +	if (!dr)
> +		return -ENOMEM;
> +
> +	dr->ntb = ntb;
> +	dr->entry = entry;
> +	dr->msi_desc = msi_desc;
> +
> +	devres_add(&ntb->dev, dr);
> +
> +	dr->entry->write_msi_msg = ntb_msi_write_msg;
> +	dr->entry->write_msi_msg_data = dr;
> +
> +	return 0;
> +}
> +
> +/**
> + * ntbm_msi_request_threaded_irq() - allocate an MSI interrupt
> + * @ntb:	NTB device context
> + * @handler:	Function to be called when the IRQ occurs
> + * @thread_fn:  Function to be called in a threaded interrupt context. NULL
> + *              for clients which handle everything in @handler
> + * @name:       An ascii name for the claiming device, dev_name(dev) if NULL
> + * @dev_id:     A cookie passed back to the handler function
> + *
> + * This function assigns an interrupt handler to an unused
> + * MSI interrupt and returns the descriptor used to trigger
> + * it. The descriptor can then be sent to a peer to trigger
> + * the interrupt.
> + *
> + * The interrupt resource is managed with devres so it will
> + * be automatically freed when the NTB device is torn down.
> + *
> + * If an IRQ allocated with this function needs to be freed
> + * separately, ntbm_msi_free_irq() must be used.
> + *
> + * Return: IRQ number assigned on success, otherwise a negative error number.
> + */
> +int ntbm_msi_request_threaded_irq(struct ntb_dev *ntb, irq_handler_t handler,
> +				  irq_handler_t thread_fn,
> +				  const char *name, void *dev_id,
> +				  struct ntb_msi_desc *msi_desc)
> +{
> +	struct msi_desc *entry;
> +	struct irq_desc *desc;
> +	int ret;
> +
> +	if (!ntb->msi)
> +		return -EINVAL;
> +
> +	for_each_pci_msi_entry(entry, ntb->pdev) {
> +		desc = irq_to_desc(entry->irq);
> +		if (desc->action)
> +			continue;
> +
> +		ret = devm_request_threaded_irq(&ntb->dev, entry->irq, handler,
> +						thread_fn, 0, name, dev_id);
> +		if (ret)
> +			continue;
> +
> +		if (ntb_msi_set_desc(ntb, entry, msi_desc)) {
> +			devm_free_irq(&ntb->dev, entry->irq, dev_id);
> +			continue;
> +		}
> +
> +		ret = ntbm_msi_setup_callback(ntb, entry, msi_desc);
> +		if (ret) {
> +			devm_free_irq(&ntb->dev, entry->irq, dev_id);
> +			return ret;
> +		}
> +
> +
> +		return entry->irq;
> +	}
> +
> +	return -ENODEV;
> +}
> +EXPORT_SYMBOL(ntbm_msi_request_threaded_irq);
> +
> +static int ntbm_msi_callback_match(struct device *dev, void *res, void *data)
> +{
> +	struct ntb_dev *ntb = dev_ntb(dev);
> +	struct ntb_msi_devres *dr = res;
> +
> +	return dr->ntb == ntb && dr->entry == data;
> +}
> +
> +/**
> + * ntbm_msi_free_irq() - free an interrupt
> + * @ntb:	NTB device context
> + * @irq:	Interrupt line to free
> + * @dev_id:	Device identity to free
> + *
> + * This function should be used to manually free IRQs allocated with
> + * ntbm_msi_request_[threaded_]irq().
> + */
> +void ntbm_msi_free_irq(struct ntb_dev *ntb, unsigned int irq, void *dev_id)
> +{
> +	struct msi_desc *entry = irq_get_msi_desc(irq);
> +
> +	entry->write_msi_msg = NULL;
> +	entry->write_msi_msg_data = NULL;
> +
> +	WARN_ON(devres_destroy(&ntb->dev, ntbm_msi_callback_release,
> +			       ntbm_msi_callback_match, entry));
> +
> +	devm_free_irq(&ntb->dev, irq, dev_id);
> +}
> +EXPORT_SYMBOL(ntbm_msi_free_irq);
> +
> +/**
> + * ntb_msi_peer_trigger() - Trigger an interrupt handler on a peer
> + * @ntb:	NTB device context
> + * @peer:	Peer index
> + * @desc:	MSI descriptor data which triggers the interrupt
> + *
> + * This function triggers an interrupt on a peer. It requires
> + * the descriptor structure to have been passed from that peer
> + * by some other means.
> + *
> + * Return: Zero on success, otherwise a negative error number.
> + */
> +int ntb_msi_peer_trigger(struct ntb_dev *ntb, int peer,
> +			 struct ntb_msi_desc *desc)
> +{
> +	int idx;
> +
> +	if (!ntb->msi)
> +		return -EINVAL;
> +
> +	idx = desc->addr_offset / sizeof(*ntb->msi->peer_mws[peer]);
> +
> +	ntb->msi->peer_mws[peer][idx] = desc->data;
> +

Shouldn't we use iowrite32() here instead of direct access to the IO-memory?
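
As a side note, the descriptor arithmetic used by ntb_msi_set_desc() and
ntb_msi_peer_trigger() can be checked in isolation. The following is a
userspace sketch only: make_desc(), trigger() and struct msi_desc_sketch are
made-up names that mirror the patch, and a plain array stands in for the
mapped peer window.

```c
#include <assert.h>
#include <errno.h>
#include <stddef.h>
#include <stdint.h>

/* Mirrors the layout of struct ntb_msi_desc from the patch. */
struct msi_desc_sketch {
	uint32_t addr_offset;
	uint32_t data;
};

/*
 * Receiver side: turn an MSI address/data pair into a window-relative
 * descriptor. [base, end) is the address range covered by the window.
 */
static int make_desc(uint32_t addr_lo, uint32_t addr_hi, uint32_t data,
		     uint64_t base, uint64_t end,
		     struct msi_desc_sketch *out)
{
	uint64_t addr = addr_lo + ((uint64_t)addr_hi << 32);

	assert(out != NULL);

	if (addr < base || addr >= end)
		return -EFAULT;

	out->addr_offset = (uint32_t)(addr - base);
	out->data = data;
	return 0;
}

/*
 * Sender side: store the data word at the descriptor's offset. Kernel
 * code would use an __iomem pointer and iowrite32() here instead of a
 * plain store.
 */
static void trigger(volatile uint32_t *peer_mw,
		    const struct msi_desc_sketch *d)
{
	size_t idx = d->addr_offset / sizeof(*peer_mw);

	peer_mw[idx] = d->data;
}
```

An MSI address 0x10 bytes past the window base ends up as index 4 of the
32-bit array, and an address equal to the window end is rejected.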

> +	return 0;
> +}
> +EXPORT_SYMBOL(ntb_msi_peer_trigger);
> +
> +/**
> + * ntb_msi_peer_addr() - Get the DMA address to trigger a peer's MSI interrupt
> + * @ntb:	NTB device context
> + * @peer:	Peer index
> + * @desc:	MSI descriptor data which triggers the interrupt
> + * @msi_addr:   Physical address to trigger the interrupt
> + *
> + * This function allows using DMA engines to trigger an interrupt
> + * (for example, trigger an interrupt to process the data after
> + * sending it). To trigger the interrupt, write @desc.data to the address
> + * returned in @msi_addr
> + *
> + * Return: Zero on success, otherwise a negative error number.
> + */
> +int ntb_msi_peer_addr(struct ntb_dev *ntb, int peer,
> +		      struct ntb_msi_desc *desc,
> +		      phys_addr_t *msi_addr)
> +{
> +	int peer_widx = ntb_peer_mw_count(ntb) - 1 - peer;
> +	phys_addr_t mw_phys_addr;
> +	int ret;
> +
> +	ret = ntb_peer_mw_get_addr(ntb, peer_widx, &mw_phys_addr, NULL);
> +	if (ret)
> +		return ret;
> +
> +	if (msi_addr)
> +		*msi_addr = mw_phys_addr + desc->addr_offset;
> +
> +	return 0;
> +}
> +EXPORT_SYMBOL(ntb_msi_peer_addr);
> diff --git a/include/linux/ntb.h b/include/linux/ntb.h
> index f5c69d853489..b9c61ee3c734 100644
> --- a/include/linux/ntb.h
> +++ b/include/linux/ntb.h
> @@ -58,9 +58,11 @@
>  
>  #include <linux/completion.h>
>  #include <linux/device.h>
> +#include <linux/interrupt.h>
>  
>  struct ntb_client;
>  struct ntb_dev;
> +struct ntb_msi;
>  struct pci_dev;
>  
>  /**
> @@ -425,6 +427,10 @@ struct ntb_dev {
>  	spinlock_t			ctx_lock;
>  	/* block unregister until device is fully released */
>  	struct completion		released;
> +
> +	#ifdef CONFIG_NTB_MSI
> +	struct ntb_msi *msi;
> +	#endif

I'd align the preprocessor condition to the leftmost position:
+#ifdef CONFIG_NTB_MSI
+	struct ntb_msi *msi;
+#endif

>  };
>  #define dev_ntb(__dev) container_of((__dev), struct ntb_dev, dev)
>  
> @@ -1572,4 +1578,71 @@ static inline int ntb_peer_highest_mw_idx(struct ntb_dev *ntb, int pidx)
>  	return ntb_mw_count(ntb, pidx) - ret - 1;
>  }
>  
> +struct ntb_msi_desc {
> +	u32 addr_offset;
> +	u32 data;
> +};
> +
> +#ifdef CONFIG_NTB_MSI
> +
> +int ntb_msi_init(struct ntb_dev *ntb, void (*desc_changed)(void *ctx));
> +int ntb_msi_setup_mws(struct ntb_dev *ntb);
> +void ntb_msi_clear_mws(struct ntb_dev *ntb);
> +int ntbm_msi_request_threaded_irq(struct ntb_dev *ntb, irq_handler_t handler,
> +				  irq_handler_t thread_fn,
> +				  const char *name, void *dev_id,
> +				  struct ntb_msi_desc *msi_desc);
> +void ntbm_msi_free_irq(struct ntb_dev *ntb, unsigned int irq, void *dev_id);
> +int ntb_msi_peer_trigger(struct ntb_dev *ntb, int peer,
> +			 struct ntb_msi_desc *desc);
> +int ntb_msi_peer_addr(struct ntb_dev *ntb, int peer,
> +		      struct ntb_msi_desc *desc,
> +		      phys_addr_t *msi_addr);
> +
> +#else /* not CONFIG_NTB_MSI */
> +
> +static inline int ntb_msi_init(struct ntb_dev *ntb,
> +			       void (*desc_changed)(void *ctx))
> +{
> +	return -EOPNOTSUPP;
> +}
> +static inline int ntb_msi_setup_mws(struct ntb_dev *ntb)
> +{
> +	return -EOPNOTSUPP;
> +}
> +static inline void ntb_msi_clear_mws(struct ntb_dev *ntb) {}
> +static inline int ntbm_msi_request_threaded_irq(struct ntb_dev *ntb,
> +						irq_handler_t handler,
> +						irq_handler_t thread_fn,
> +						const char *name, void *dev_id,
> +						struct ntb_msi_desc *msi_desc)
> +{
> +	return -EOPNOTSUPP;
> +}
> +static inline void ntbm_msi_free_irq(struct ntb_dev *ntb, unsigned int irq,
> +				     void *dev_id) {}
> +static inline int ntb_msi_peer_trigger(struct ntb_dev *ntb, int peer,
> +				       struct ntb_msi_desc *desc)
> +{
> +	return -EOPNOTSUPP;
> +}
> +static inline int ntb_msi_peer_addr(struct ntb_dev *ntb, int peer,
> +				    struct ntb_msi_desc *desc,
> +				    phys_addr_t *msi_addr)
> +{
> +	return -EOPNOTSUPP;
> +
> +}
> +
> +#endif /* CONFIG_NTB_MSI */
> +
> +static inline int ntbm_msi_request_irq(struct ntb_dev *ntb,
> +				       irq_handler_t handler,
> +				       const char *name, void *dev_id,
> +				       struct ntb_msi_desc *msi_desc)
> +{
> +	return ntbm_msi_request_threaded_irq(ntb, handler, NULL, name,
> +					     dev_id, msi_desc);
> +}
> +
>  #endif
> -- 
> 2.19.0
> 
> -- 
> You received this message because you are subscribed to the Google Groups "linux-ntb" group.
> To unsubscribe from this group and stop receiving emails from it, send an email to linux-ntb+unsubscribe@googlegroups.com.
> To post to this group, send email to linux-ntb@googlegroups.com.
> To view this discussion on the web visit https://groups.google.com/d/msgid/linux-ntb/20190213175454.7506-10-logang%40deltatee.com.
> For more options, visit https://groups.google.com/d/optout.
Logan Gunthorpe March 6, 2019, 9:35 p.m. UTC | #2
On 2019-03-06 1:26 p.m., Serge Semin wrote:
> First of all, it might be unsafe to have some resources consumed by the NTB
> MSI library (or some other library) without a simple way to warn NTB client
> drivers about their attempts to access those resources, since it might lead
> to random errors. When I thought about implementing a transport library based
> on the Message/Spad+Doorbell registers, I had in mind creating an internal
> bit-field array of resource busy-flags. If, for instance, some message or
> scratchpad register is occupied by a library (MSI, transport or something else),
> then it would be impossible to access that resource directly through the NTB
> API methods. So an NTB client driver would get an error on an attempt to
> write/read data to/from a busy message or scratchpad register, or on an attempt
> to set an occupied doorbell bit. The same thing can be done for memory windows.

Yes, it would be nice to have a generic library to manage all the
resources, but right now we don't and it's unfair to expect us to take
on this work to get the features we care about merged. Right now, it's
not at all unsafe as the client is quite capable of ensuring it has the
resources for the MSI library. The changes for ntb_transport to ensure
this are quite reasonable.

> The second, minor concern is about documentation. Since there is a dedicated
> file for all NTB-related documentation, it would be good to have a description
> of the NTB MSI library there as well:
> Documentation/ntb.txt

Sure, I'll add a short blurb for v3. Though, I noticed it's quite out of
date since your changes. Especially in the ntb_tool section...

>> +	u32 *peer_mws[];
> 
> Shouldn't we use the __iomem attribute here since later the devm_ioremap() is
> used to map MWs at these pointers?

Yes, will change for v3.


> Simpler and faster cleanup-code would be:

> + unroll:
> + 	for (--i; i >= 0; --i)
> + 		devm_iounmap(&ntb->dev, ntb->msi->peer_mws[i]);

Faster, maybe, but I would not consider this simpler. It's much more
complicated to reason about and ensure it's correct. I prefer my way
because I don't care about speed, but I do care about readability.


> Alas, calling the ntb_mw_set_trans() method isn't enough to fully initialize
> NTB memory windows. Yes, the library will work for Intel/AMD/Switchtec
> (two-port legacy configuration), but it will fail for IDT, which is based on
> the outbound MW xlat interface. So the library at this stage isn't portable
> across all NTB hardware. To make it work, the translation address is
> supposed to be transferred to the peer side, where the peer code should call
> the ntb_peer_mw_set_trans() method with the retrieved xlat address.
> See the documentation for details:

> https://git.kernel.org/pub/scm/linux/kernel/git/stable/linux.git/tree/Documentation/ntb.txt
> 
> The ntb_perf driver can also be used as a reference for a portable NTB MW setup.

Gross. Well, given that ntb_transport doesn't even support this and we
don't really have a sensible library to transfer this information, I'm
going to leave it as is for now. Someone can update ntb_msi when they
update ntb_transport, preferably after we have a nice library to handle
the transfers for us, seeing as I absolutely do not want to replicate the
mess in ntb_perf.

Actually, if we had a generic spad/msg communication library, it would
probably be better to have a common ntb_mw_set_trans() function that
uses the communications library to send the data and automatically calls
ntb_peer_mw_set_trans() on the peer. That way we don't have to push this
mess into the clients.

> The same cleanup pattern can be used here:
> +error_out:
> +	for (--peer; peer >= 0; --peer) {
> +		peer_widx = ntb_peer_highest_mw_idx(ntb, peer);
> +		ntb_mw_clear_trans(ntb, peer, peer_widx);
> +	}
> 
> So you won't need the "i" variable here anymore. You also don't need to check the
> return value of ntb_peer_highest_mw_idx() in the cleanup loop because it
> was already checked in the main loop.

See above.

>> +EXPORT_SYMBOL(ntb_msi_clear_mws);
>> +
> 
> Similarly something like ntb_msi_peer_clear_mws() should be added to
> unset a translation address on the peer side.

Well, we can table that for when ntb_msi supports the peer MW setting
functions.
>> +int ntb_msi_peer_trigger(struct ntb_dev *ntb, int peer,
>> +			 struct ntb_msi_desc *desc)
>> +{
>> +	int idx;
>> +
>> +	if (!ntb->msi)
>> +		return -EINVAL;
>> +
>> +	idx = desc->addr_offset / sizeof(*ntb->msi->peer_mws[peer]);
>> +
>> +	ntb->msi->peer_mws[peer][idx] = desc->data;
>> +
> 
> Shouldn't we use iowrite32() here instead of direct access to the IO-memory?

Yes, as above I'll fix it for v3.

>> @@ -425,6 +427,10 @@ struct ntb_dev {
>>  	spinlock_t			ctx_lock;
>>  	/* block unregister until device is fully released */
>>  	struct completion		released;
>> +
>> +	#ifdef CONFIG_NTB_MSI
>> +	struct ntb_msi *msi;
>> +	#endif
> 
> I'd align the preprocessor condition to the leftmost position:
> +#ifdef CONFIG_NTB_MSI
> +	struct ntb_msi *msi;
> +#endif

Fixed for v3.


Logan
Serge Semin March 6, 2019, 11:13 p.m. UTC | #3
On Wed, Mar 06, 2019 at 02:35:53PM -0700, Logan Gunthorpe wrote:
> 
> 
> On 2019-03-06 1:26 p.m., Serge Semin wrote:
> > First of all, it might be unsafe to have some resources consumed by the NTB
> > MSI library (or some other library) without a simple way to warn NTB client
> > drivers about their attempts to access those resources, since it might lead
> > to random errors. When I thought about implementing a transport library based
> > on the Message/Spad+Doorbell registers, I had in mind creating an internal
> > bit-field array of resource busy-flags. If, for instance, some message or
> > scratchpad register is occupied by a library (MSI, transport or something else),
> > then it would be impossible to access that resource directly through the NTB
> > API methods. So an NTB client driver would get an error on an attempt to
> > write/read data to/from a busy message or scratchpad register, or on an attempt
> > to set an occupied doorbell bit. The same thing can be done for memory windows.
> 
> Yes, it would be nice to have a generic library to manage all the
> resources, but right now we don't and it's unfair to expect us to take
> on this work to get the features we care about merged. Right now, it's
> not at all unsafe as the client is quite capable of ensuring it has the
> resources for the MSI library. The changes for ntb_transport to ensure
> this are quite reasonable.
> 
> > The second, minor concern is about documentation. Since there is a dedicated
> > file for all NTB-related documentation, it would be good to have a description
> > of the NTB MSI library there as well:
> > Documentation/ntb.txt
> 
> Sure, I'll add a short blurb for v3. Though, I noticed it's quite out of
> date since your changes. Especially in the ntb_tool section...
> 

Ok. Thanks.
If you want, you can add some info to the ntb_tool section as well. If you
don't have time, I'll update it next time I submit anything new to the
subsystem.

-Sergey

> >> +	u32 *peer_mws[];
> > 
> > Shouldn't we use the __iomem attribute here since later the devm_ioremap() is
> > used to map MWs at these pointers?
> 
> Yes, will change for v3.
> 
> 
> > Simpler and faster cleanup-code would be:
> 
> > + unroll:
> > + 	for (--i; i >= 0; --i)
> > + 		devm_iounmap(&ntb->dev, ntb->msi->peer_mws[i]);
> 
> Faster, maybe, but I would not consider this simpler. It's much more
> complicated to reason about and ensure it's correct. I prefer my way
> because I don't care about speed, but I do care about readability.
> 
> 
> > Alas calling the ntb_mw_set_trans() method isn't enough to fully initialize
> > NTB Memory Windows. Yes, the library will work for Intel/AMD/Switchtec
> > (two-ports legacy configuration), but will fail for IDT due to being based on
> > the outbound MW xlat interface. So the library at this stage isn't portable
> > across all NTB hardware. In order to make it working the translation address is
> > supposed to be transferred to the peer side, where a peer code should call
> > ntb_peer_mw_set_trans() method with the retrieved xlat address.
> > See documentation for details:
> 
> > https://git.kernel.org/pub/scm/linux/kernel/git/stable/linux.git/tree/Documentation/ntb.txt
> > 
> > ntb_perf driver can be also used as a reference of the portable NTB MWs setup.
> 
> Gross. Well, given that ntb_transport doesn't even support this and we
> don't really have a sensible library to transfer this information, I'm
> going to leave it as is for now. Someone can update ntb_msi when they
> update ntb_transport, preferably after we have a nice library to handle
> the transfers for us seeing I absolutely do not want to replicate the
> mess in ntb_perf.
> 
> Actually, if we had a generic spad/msg communication library, it would
> probably be better to have a common ntb_mw_set_trans() function that
> uses the communications library to send the data and automatically call
> ntb_peer_mw_set_trans() on the peer. That way we don't have to push this
> mess into the clients.
> 
> > The same cleanup pattern can be utilized here:
> > +error_out:
> > +	for (--peer; peer >= 0; --peer) {
> > +		peer_widx = ntb_peer_highest_mw_idx(ntb, peer);
> > +		ntb_mw_clear_trans(ntb, i, peer_widx);
> > +	}
> > 
> > So you won't need "i" variable here anymore. You also don't need to check the
> > return value of ntb_peer_highest_mw_idx() in the cleanup loop because it
> > was already checked in the main algo code.
> 
> See above.
> 
> >> +EXPORT_SYMBOL(ntb_msi_clear_mws);
> >> +
> > 
> > Similarly something like ntb_msi_peer_clear_mws() should be added to
> > unset a translation address on the peer side.
> 
> Well, we can table that for when ntb_msi supports the peer MW setting
> functions.
> 
> >> +int ntb_msi_peer_trigger(struct ntb_dev *ntb, int peer,
> >> +			 struct ntb_msi_desc *desc)
> >> +{
> >> +	int idx;
> >> +
> >> +	if (!ntb->msi)
> >> +		return -EINVAL;
> >> +
> >> +	idx = desc->addr_offset / sizeof(*ntb->msi->peer_mws[peer]);
> >> +
> >> +	ntb->msi->peer_mws[peer][idx] = desc->data;
> >> +
> > 
> > Shouldn't we use iowrite32() here instead of direct access to the IO-memory?
> 
> Yes, as above I'll fix it for v3.
> 
> >> @@ -425,6 +427,10 @@ struct ntb_dev {
> >>  	spinlock_t			ctx_lock;
> >>  	/* block unregister until device is fully released */
> >>  	struct completion		released;
> >> +
> >> +	#ifdef CONFIG_NTB_MSI
> >> +	struct ntb_msi *msi;
> >> +	#endif
> > 
> > I'd align the macro-condition to the most left position:
> > +#ifdef CONFIG_NTB_MSI
> > +	struct ntb_msi *msi;
> > +#endif
> 
> Fixed for v3.
> 
> 
> Logan
> 

Patch

diff --git a/drivers/ntb/Kconfig b/drivers/ntb/Kconfig
index 95944e52fa36..5760764052be 100644
--- a/drivers/ntb/Kconfig
+++ b/drivers/ntb/Kconfig
@@ -12,6 +12,17 @@  menuconfig NTB
 
 if NTB
 
+config NTB_MSI
+	bool "MSI Interrupt Support"
+	depends on PCI_MSI
+	help
+	 Support using MSI interrupt forwarding instead of (or in addition to)
+	 hardware doorbells. MSI interrupts typically offer lower latency
+	 than doorbells and more MSI interrupts can be made available to
+	 clients. However this requires an extra memory window and support
+	 in the hardware driver for creating the MSI interrupts.
+
+	 If unsure, say N.
 source "drivers/ntb/hw/Kconfig"
 
 source "drivers/ntb/test/Kconfig"
diff --git a/drivers/ntb/Makefile b/drivers/ntb/Makefile
index 537226f8e78d..cc27ad2ef150 100644
--- a/drivers/ntb/Makefile
+++ b/drivers/ntb/Makefile
@@ -1,4 +1,5 @@ 
 obj-$(CONFIG_NTB) += ntb.o hw/ test/
 obj-$(CONFIG_NTB_TRANSPORT) += ntb_transport.o
 
-ntb-y := core.o
+ntb-y			:= core.o
+ntb-$(CONFIG_NTB_MSI)	+= msi.o
diff --git a/drivers/ntb/msi.c b/drivers/ntb/msi.c
new file mode 100644
index 000000000000..5d4bd7a63924
--- /dev/null
+++ b/drivers/ntb/msi.c
@@ -0,0 +1,415 @@ 
+// SPDX-License-Identifier: (GPL-2.0 OR BSD-3-Clause)
+
+#include <linux/irq.h>
+#include <linux/module.h>
+#include <linux/ntb.h>
+#include <linux/msi.h>
+#include <linux/pci.h>
+
+MODULE_LICENSE("Dual BSD/GPL");
+MODULE_VERSION("0.1");
+MODULE_AUTHOR("Logan Gunthorpe <logang@deltatee.com>");
+MODULE_DESCRIPTION("NTB MSI Interrupt Library");
+
+struct ntb_msi {
+	u64 base_addr;
+	u64 end_addr;
+
+	void (*desc_changed)(void *ctx);
+
+	u32 *peer_mws[];
+};
+
+/**
+ * ntb_msi_init() - Initialize the MSI context
+ * @ntb:	NTB device context
+ *
+ * This function must be called before any other ntb_msi function.
+ * It initializes the context for MSI operations and maps
+ * the peer memory windows.
+ *
+ * This function reserves the last N outbound memory windows (where N
+ * is the number of peers).
+ *
+ * Return: Zero on success, otherwise a negative error number.
+ */
+int ntb_msi_init(struct ntb_dev *ntb,
+		 void (*desc_changed)(void *ctx))
+{
+	phys_addr_t mw_phys_addr;
+	resource_size_t mw_size;
+	size_t struct_size;
+	int peer_widx;
+	int peers;
+	int ret;
+	int i;
+
+	peers = ntb_peer_port_count(ntb);
+	if (peers <= 0)
+		return -EINVAL;
+
+	struct_size = sizeof(*ntb->msi) + sizeof(*ntb->msi->peer_mws) * peers;
+
+	ntb->msi = devm_kzalloc(&ntb->dev, struct_size, GFP_KERNEL);
+	if (!ntb->msi)
+		return -ENOMEM;
+
+	ntb->msi->desc_changed = desc_changed;
+
+	for (i = 0; i < peers; i++) {
+		peer_widx = ntb_peer_mw_count(ntb) - 1 - i;
+
+		ret = ntb_peer_mw_get_addr(ntb, peer_widx, &mw_phys_addr,
+					   &mw_size);
+		if (ret)
+			goto unroll;
+
+		ntb->msi->peer_mws[i] = devm_ioremap(&ntb->dev, mw_phys_addr,
+						     mw_size);
+		if (!ntb->msi->peer_mws[i]) {
+			ret = -EFAULT;
+			goto unroll;
+		}
+	}
+
+	return 0;
+
+unroll:
+	for (i = 0; i < peers; i++)
+		if (ntb->msi->peer_mws[i])
+			devm_iounmap(&ntb->dev, ntb->msi->peer_mws[i]);
+
+	devm_kfree(&ntb->dev, ntb->msi);
+	ntb->msi = NULL;
+	return ret;
+}
+EXPORT_SYMBOL(ntb_msi_init);
+
+/**
+ * ntb_msi_setup_mws() - Initialize the MSI inbound memory windows
+ * @ntb:	NTB device context
+ *
+ * This function sets up the required inbound memory windows. It should be
+ * called from a work function after a link up event.
+ *
+ * Over the entire network, this function will reserve the last N
+ * inbound memory windows for each peer (where N is the number of peers).
+ *
+ * ntb_msi_init() must be called before this function.
+ *
+ * Return: Zero on success, otherwise a negative error number.
+ */
+int ntb_msi_setup_mws(struct ntb_dev *ntb)
+{
+	struct msi_desc *desc;
+	u64 addr;
+	int peer, peer_widx;
+	resource_size_t addr_align, size_align, size_max;
+	resource_size_t mw_size = SZ_32K;
+	resource_size_t mw_min_size = mw_size;
+	int i;
+	int ret;
+
+	if (!ntb->msi)
+		return -EINVAL;
+
+	desc = first_msi_entry(&ntb->pdev->dev);
+	addr = desc->msg.address_lo + ((uint64_t)desc->msg.address_hi << 32);
+
+	for (peer = 0; peer < ntb_peer_port_count(ntb); peer++) {
+		peer_widx = ntb_peer_highest_mw_idx(ntb, peer);
+		if (peer_widx < 0)
+			return peer_widx;
+
+		ret = ntb_mw_get_align(ntb, peer, peer_widx, &addr_align,
+				       NULL, NULL);
+		if (ret)
+			return ret;
+
+		addr &= ~(addr_align - 1);
+	}
+
+	for (peer = 0; peer < ntb_peer_port_count(ntb); peer++) {
+		peer_widx = ntb_peer_highest_mw_idx(ntb, peer);
+		if (peer_widx < 0) {
+			ret = peer_widx;
+			goto error_out;
+		}
+
+		ret = ntb_mw_get_align(ntb, peer, peer_widx, NULL,
+				       &size_align, &size_max);
+		if (ret)
+			goto error_out;
+
+		mw_size = round_up(mw_size, size_align);
+		mw_size = max(mw_size, size_max);
+		if (mw_size < mw_min_size)
+			mw_min_size = mw_size;
+
+		ret = ntb_mw_set_trans(ntb, peer, peer_widx,
+				       addr, mw_size);
+		if (ret)
+			goto error_out;
+	}
+
+	ntb->msi->base_addr = addr;
+	ntb->msi->end_addr = addr + mw_min_size;
+
+	return 0;
+
+error_out:
+	for (i = 0; i < peer; i++) {
+		peer_widx = ntb_peer_highest_mw_idx(ntb, i);
+		if (peer_widx < 0)
+			continue;
+
+		ntb_mw_clear_trans(ntb, i, peer_widx);
+	}
+
+	return ret;
+}
+EXPORT_SYMBOL(ntb_msi_setup_mws);
+
+/**
+ * ntb_msi_clear_mws() - Clear all inbound memory windows
+ * @ntb:	NTB device context
+ *
+ * This function tears down the resources used by ntb_msi_setup_mws().
+ */
+void ntb_msi_clear_mws(struct ntb_dev *ntb)
+{
+	int peer;
+	int peer_widx;
+
+	for (peer = 0; peer < ntb_peer_port_count(ntb); peer++) {
+		peer_widx = ntb_peer_highest_mw_idx(ntb, peer);
+		if (peer_widx < 0)
+			continue;
+
+		ntb_mw_clear_trans(ntb, peer, peer_widx);
+	}
+}
+EXPORT_SYMBOL(ntb_msi_clear_mws);
+
+struct ntb_msi_devres {
+	struct ntb_dev *ntb;
+	struct msi_desc *entry;
+	struct ntb_msi_desc *msi_desc;
+};
+
+static int ntb_msi_set_desc(struct ntb_dev *ntb, struct msi_desc *entry,
+			    struct ntb_msi_desc *msi_desc)
+{
+	u64 addr;
+
+	addr = entry->msg.address_lo +
+		((uint64_t)entry->msg.address_hi << 32);
+
+	if (addr < ntb->msi->base_addr || addr >= ntb->msi->end_addr) {
+		dev_warn_once(&ntb->dev,
+			      "IRQ %d: MSI Address not within the memory window (%llx, [%llx %llx])\n",
+			      entry->irq, addr, ntb->msi->base_addr,
+			      ntb->msi->end_addr);
+		return -EFAULT;
+	}
+
+	msi_desc->addr_offset = addr - ntb->msi->base_addr;
+	msi_desc->data = entry->msg.data;
+
+	return 0;
+}
+
+static void ntb_msi_write_msg(struct msi_desc *entry, void *data)
+{
+	struct ntb_msi_devres *dr = data;
+
+	WARN_ON(ntb_msi_set_desc(dr->ntb, entry, dr->msi_desc));
+
+	if (dr->ntb->msi->desc_changed)
+		dr->ntb->msi->desc_changed(dr->ntb->ctx);
+}
+
+static void ntbm_msi_callback_release(struct device *dev, void *res)
+{
+	struct ntb_msi_devres *dr = res;
+
+	dr->entry->write_msi_msg = NULL;
+	dr->entry->write_msi_msg_data = NULL;
+}
+
+static int ntbm_msi_setup_callback(struct ntb_dev *ntb, struct msi_desc *entry,
+				   struct ntb_msi_desc *msi_desc)
+{
+	struct ntb_msi_devres *dr;
+
+	dr = devres_alloc(ntbm_msi_callback_release,
+			  sizeof(struct ntb_msi_devres), GFP_KERNEL);
+	if (!dr)
+		return -ENOMEM;
+
+	dr->ntb = ntb;
+	dr->entry = entry;
+	dr->msi_desc = msi_desc;
+
+	devres_add(&ntb->dev, dr);
+
+	dr->entry->write_msi_msg = ntb_msi_write_msg;
+	dr->entry->write_msi_msg_data = dr;
+
+	return 0;
+}
+
+/**
+ * ntbm_msi_request_threaded_irq() - allocate an MSI interrupt
+ * @ntb:	NTB device context
+ * @handler:	Function to be called when the IRQ occurs
+ * @thread_fn:  Function to be called in a threaded interrupt context. NULL
+ *              for clients which handle everything in @handler
+ * @name:       An ascii name for the claiming device, dev_name(dev) if NULL
+ * @dev_id:     A cookie passed back to the handler function
+ *
+ * This function assigns an interrupt handler to an unused
+ * MSI interrupt and returns the descriptor used to trigger
+ * it. The descriptor can then be sent to a peer to trigger
+ * the interrupt.
+ *
+ * The interrupt resource is managed with devres so it will
+ * be automatically freed when the NTB device is torn down.
+ *
+ * If an IRQ allocated with this function needs to be freed
+ * separately, ntbm_msi_free_irq() must be used.
+ *
+ * Return: IRQ number assigned on success, otherwise a negative error number.
+ */
+int ntbm_msi_request_threaded_irq(struct ntb_dev *ntb, irq_handler_t handler,
+				  irq_handler_t thread_fn,
+				  const char *name, void *dev_id,
+				  struct ntb_msi_desc *msi_desc)
+{
+	struct msi_desc *entry;
+	struct irq_desc *desc;
+	int ret;
+
+	if (!ntb->msi)
+		return -EINVAL;
+
+	for_each_pci_msi_entry(entry, ntb->pdev) {
+		desc = irq_to_desc(entry->irq);
+		if (desc->action)
+			continue;
+
+		ret = devm_request_threaded_irq(&ntb->dev, entry->irq, handler,
+						thread_fn, 0, name, dev_id);
+		if (ret)
+			continue;
+
+		if (ntb_msi_set_desc(ntb, entry, msi_desc)) {
+			devm_free_irq(&ntb->dev, entry->irq, dev_id);
+			continue;
+		}
+
+		ret = ntbm_msi_setup_callback(ntb, entry, msi_desc);
+		if (ret) {
+			devm_free_irq(&ntb->dev, entry->irq, dev_id);
+			return ret;
+		}
+
+
+		return entry->irq;
+	}
+
+	return -ENODEV;
+}
+EXPORT_SYMBOL(ntbm_msi_request_threaded_irq);
+
+static int ntbm_msi_callback_match(struct device *dev, void *res, void *data)
+{
+	struct ntb_dev *ntb = dev_ntb(dev);
+	struct ntb_msi_devres *dr = res;
+
+	return dr->ntb == ntb && dr->entry == data;
+}
+
+/**
+ * ntbm_msi_free_irq() - free an interrupt
+ * @ntb:	NTB device context
+ * @irq:	Interrupt line to free
+ * @dev_id:	Device identity to free
+ *
+ * This function should be used to manually free IRQs allocated with
+ * ntbm_msi_request_[threaded_]irq().
+ */
+void ntbm_msi_free_irq(struct ntb_dev *ntb, unsigned int irq, void *dev_id)
+{
+	struct msi_desc *entry = irq_get_msi_desc(irq);
+
+	entry->write_msi_msg = NULL;
+	entry->write_msi_msg_data = NULL;
+
+	WARN_ON(devres_destroy(&ntb->dev, ntbm_msi_callback_release,
+			       ntbm_msi_callback_match, entry));
+
+	devm_free_irq(&ntb->dev, irq, dev_id);
+}
+EXPORT_SYMBOL(ntbm_msi_free_irq);
+
+/**
+ * ntb_msi_peer_trigger() - Trigger an interrupt handler on a peer
+ * @ntb:	NTB device context
+ * @peer:	Peer index
+ * @desc:	MSI descriptor data which triggers the interrupt
+ *
+ * This function triggers an interrupt on a peer. It requires
+ * the descriptor structure to have been passed from that peer
+ * by some other means.
+ *
+ * Return: Zero on success, otherwise a negative error number.
+ */
+int ntb_msi_peer_trigger(struct ntb_dev *ntb, int peer,
+			 struct ntb_msi_desc *desc)
+{
+	int idx;
+
+	if (!ntb->msi)
+		return -EINVAL;
+
+	idx = desc->addr_offset / sizeof(*ntb->msi->peer_mws[peer]);
+
+	ntb->msi->peer_mws[peer][idx] = desc->data;
+
+	return 0;
+}
+EXPORT_SYMBOL(ntb_msi_peer_trigger);
+
+/**
+ * ntb_msi_peer_addr() - Get the DMA address to trigger a peer's MSI interrupt
+ * @ntb:	NTB device context
+ * @peer:	Peer index
+ * @desc:	MSI descriptor data which triggers the interrupt
+ * @msi_addr:   Physical address to trigger the interrupt
+ *
+ * This function allows using DMA engines to trigger an interrupt
+ * (for example, trigger an interrupt to process the data after
+ * sending it). To trigger the interrupt, write @desc.data to the address
+ * returned in @msi_addr.
+ *
+ * Return: Zero on success, otherwise a negative error number.
+ */
+int ntb_msi_peer_addr(struct ntb_dev *ntb, int peer,
+		      struct ntb_msi_desc *desc,
+		      phys_addr_t *msi_addr)
+{
+	int peer_widx = ntb_peer_mw_count(ntb) - 1 - peer;
+	phys_addr_t mw_phys_addr;
+	int ret;
+
+	ret = ntb_peer_mw_get_addr(ntb, peer_widx, &mw_phys_addr, NULL);
+	if (ret)
+		return ret;
+
+	if (msi_addr)
+		*msi_addr = mw_phys_addr + desc->addr_offset;
+
+	return 0;
+}
+EXPORT_SYMBOL(ntb_msi_peer_addr);
diff --git a/include/linux/ntb.h b/include/linux/ntb.h
index f5c69d853489..b9c61ee3c734 100644
--- a/include/linux/ntb.h
+++ b/include/linux/ntb.h
@@ -58,9 +58,11 @@ 
 
 #include <linux/completion.h>
 #include <linux/device.h>
+#include <linux/interrupt.h>
 
 struct ntb_client;
 struct ntb_dev;
+struct ntb_msi;
 struct pci_dev;
 
 /**
@@ -425,6 +427,10 @@  struct ntb_dev {
 	spinlock_t			ctx_lock;
 	/* block unregister until device is fully released */
 	struct completion		released;
+
+	#ifdef CONFIG_NTB_MSI
+	struct ntb_msi *msi;
+	#endif
 };
 #define dev_ntb(__dev) container_of((__dev), struct ntb_dev, dev)
 
@@ -1572,4 +1578,71 @@  static inline int ntb_peer_highest_mw_idx(struct ntb_dev *ntb, int pidx)
 	return ntb_mw_count(ntb, pidx) - ret - 1;
 }
 
+struct ntb_msi_desc {
+	u32 addr_offset;
+	u32 data;
+};
+
+#ifdef CONFIG_NTB_MSI
+
+int ntb_msi_init(struct ntb_dev *ntb, void (*desc_changed)(void *ctx));
+int ntb_msi_setup_mws(struct ntb_dev *ntb);
+void ntb_msi_clear_mws(struct ntb_dev *ntb);
+int ntbm_msi_request_threaded_irq(struct ntb_dev *ntb, irq_handler_t handler,
+				  irq_handler_t thread_fn,
+				  const char *name, void *dev_id,
+				  struct ntb_msi_desc *msi_desc);
+void ntbm_msi_free_irq(struct ntb_dev *ntb, unsigned int irq, void *dev_id);
+int ntb_msi_peer_trigger(struct ntb_dev *ntb, int peer,
+			 struct ntb_msi_desc *desc);
+int ntb_msi_peer_addr(struct ntb_dev *ntb, int peer,
+		      struct ntb_msi_desc *desc,
+		      phys_addr_t *msi_addr);
+
+#else /* not CONFIG_NTB_MSI */
+
+static inline int ntb_msi_init(struct ntb_dev *ntb,
+			       void (*desc_changed)(void *ctx))
+{
+	return -EOPNOTSUPP;
+}
+static inline int ntb_msi_setup_mws(struct ntb_dev *ntb)
+{
+	return -EOPNOTSUPP;
+}
+static inline void ntb_msi_clear_mws(struct ntb_dev *ntb) {}
+static inline int ntbm_msi_request_threaded_irq(struct ntb_dev *ntb,
+						irq_handler_t handler,
+						irq_handler_t thread_fn,
+						const char *name, void *dev_id,
+						struct ntb_msi_desc *msi_desc)
+{
+	return -EOPNOTSUPP;
+}
+static inline void ntbm_msi_free_irq(struct ntb_dev *ntb, unsigned int irq,
+				     void *dev_id) {}
+static inline int ntb_msi_peer_trigger(struct ntb_dev *ntb, int peer,
+				       struct ntb_msi_desc *desc)
+{
+	return -EOPNOTSUPP;
+}
+static inline int ntb_msi_peer_addr(struct ntb_dev *ntb, int peer,
+				    struct ntb_msi_desc *desc,
+				    phys_addr_t *msi_addr)
+{
+	return -EOPNOTSUPP;
+
+}
+
+#endif /* CONFIG_NTB_MSI */
+
+static inline int ntbm_msi_request_irq(struct ntb_dev *ntb,
+				       irq_handler_t handler,
+				       const char *name, void *dev_id,
+				       struct ntb_msi_desc *msi_desc)
+{
+	return ntbm_msi_request_threaded_irq(ntb, handler, NULL, name,
+					     dev_id, msi_desc);
+}
+
 #endif