diff mbox

pci: Export pci device msi table via sysfs

Message ID 1303412267-1948-1-git-send-email-nhorman@tuxdriver.com (mailing list archive)
State Rejected, archived
Headers show

Commit Message

Neil Horman April 21, 2011, 6:57 p.m. UTC
I've been working on some improvements to how we balance irqs for high volume
interrupt sources.  The consensus so far has been that what would be really
helpful is a irqbalance mechanism that operates in a one shot mode in response
to the addition of high volume interrupt sources (i.e. network devices mainly).
In attempting to implement this, I've found that it would be really useful to
have 2 bits of information:

1) A clear correlation of which interrupts belong to which devices.  Parsing
/proc/interrupts is an exercize in guessing what naming pattern a given driver
follows, and requires some amount of stateful information to be kept in user
space, lest every device addition require a rebalancing of every interrupt in
the system.

2) A indicator as to which kind of interrupts a given device is using.  The irq
attribute for a pci device is always accurate in that it simply reads whats in
the appropriate pci config space register, but devices using msi interrupts have
no use for that register, and never request that interrupt number.

This patch adds the requisite information.  It creates two per-pci-device irq
attribute files:

a) irq_mode - identifies which kind of irqs the device in question is using,
intx/msi/msix

b) msi_table - populated only if msi(x) is enabled, it lists the irqs allocated
to the pci device

Using this information I can implement a stateless irq one-shot balancer that
reacts to various udev events quite well

Signed-off-by: Neil Horman <nhorman@tuxdriver.com>
CC: Jesse Barnes <jbarnes@virtuousgeek.org>
CC: linux-pci@vger.kernel.org
---
 drivers/pci/pci-sysfs.c |   33 +++++++++++++++++++++++++++++++++
 1 files changed, 33 insertions(+), 0 deletions(-)

Comments

Matthew Wilcox April 21, 2011, 7:10 p.m. UTC | #1
On Thu, Apr 21, 2011 at 02:57:47PM -0400, Neil Horman wrote:
> b) msi_table - populated only if msi(x) is enabled, it lists the irqs allocated
> to the pci device

You're not the first one to try this ... the problem is, you can easily
overflow a single 4k page.  A device can have up to 2k MSI-X entries,
and we might take up to 5 bytes for each one, so we'd need a 10k buffer.

> +#ifdef CONFIG_PCI_MSI
> +static ssize_t msi_list_show(struct device *dev,
> +			struct device_attribute *attr, char *buf)
> +{
> +	struct pci_dev *pdev = to_pci_dev(dev);
> +	struct msi_desc *entry;
> +	int first, last;
> +	ssize_t count = 0;
> +
> +	if (!(pdev->msi_enabled || pdev->msix_enabled))
> +		return 0;
> +
> +	list_for_each_entry(entry, &pdev->msi_list, list)
> +		count += sprintf(&buf[count], "%d ", entry->irq);
> +
> +	return count;
> +}
> +#endif

The fundamental problem is that the way Linux uses MSI-X is completely
bollocks.  I've got a few hours to myself on a plane coming up in six
weeks, and I hope to rewrite it then (I've already written my talk, so
what else am I going to do?  :-)
Neil Horman April 21, 2011, 8 p.m. UTC | #2
On Thu, Apr 21, 2011 at 01:10:41PM -0600, Matthew Wilcox wrote:
> On Thu, Apr 21, 2011 at 02:57:47PM -0400, Neil Horman wrote:
> > b) msi_table - populated only if msi(x) is enabled, it lists the irqs allocated
> > to the pci device
> 
> You're not the first one to try this ... the problem is, you can easily
> overflow a single 4k page.  A device can have up to 2k MSI-X entries,
> and we might take up to 5 bytes for each one, so we'd need a 10k buffer.
> 
Yeah, I was a bit worried about that - but I didn't think any sane device would
allocate en entire 2048 irqs (not that thats an excuse).  I had considered doing
an export format like local_cpulist where consecutive irq allocations are listed
as first-last, which would save space and keep us inside a page.  Thoughts?

> > +#ifdef CONFIG_PCI_MSI
> > +static ssize_t msi_list_show(struct device *dev,
> > +			struct device_attribute *attr, char *buf)
> > +{
> > +	struct pci_dev *pdev = to_pci_dev(dev);
> > +	struct msi_desc *entry;
> > +	int first, last;
> > +	ssize_t count = 0;
> > +
> > +	if (!(pdev->msi_enabled || pdev->msix_enabled))
> > +		return 0;
> > +
> > +	list_for_each_entry(entry, &pdev->msi_list, list)
> > +		count += sprintf(&buf[count], "%d ", entry->irq);
> > +
> > +	return count;
> > +}
> > +#endif
> 
> The fundamental problem is that the way Linux uses MSI-X is completely
> bollocks.  I've got a few hours to myself on a plane coming up in six
> weeks, and I hope to rewrite it then (I've already written my talk, so
> what else am I going to do?  :-)
> 
Fair enough, I assume when you say 'it' you are referring to (or including) the
way the sysfs attributes are exported, so that we can use seq_file more
flexibly, and break the page limitation?

Regards
Neil

> -- 
> Matthew Wilcox				Intel Open Source Technology Centre
> "Bill, look, we understand that you're interested in selling us this
> operating system, but compare it to ours.  We can't possibly take such
> a retrograde step."
> 
--
To unsubscribe from this list: send the line "unsubscribe linux-pci" in
the body of a message to majordomo@vger.kernel.org
More majordomo info at  http://vger.kernel.org/majordomo-info.html
Greg KH April 21, 2011, 8:26 p.m. UTC | #3
On Thu, Apr 21, 2011 at 02:57:47PM -0400, Neil Horman wrote:
> I've been working on some improvements to how we balance irqs for high volume
> interrupt sources.  The consensus so far has been that what would be really
> helpful is a irqbalance mechanism that operates in a one shot mode in response
> to the addition of high volume interrupt sources (i.e. network devices mainly).
> In attempting to implement this, I've found that it would be really useful to
> have 2 bits of information:
> 
> 1) A clear correlation of which interrupts belong to which devices.  Parsing
> /proc/interrupts is an exercize in guessing what naming pattern a given driver
> follows, and requires some amount of stateful information to be kept in user
> space, lest every device addition require a rebalancing of every interrupt in
> the system.
> 
> 2) A indicator as to which kind of interrupts a given device is using.  The irq
> attribute for a pci device is always accurate in that it simply reads whats in
> the appropriate pci config space register, but devices using msi interrupts have
> no use for that register, and never request that interrupt number.
> 
> This patch adds the requisite information.  It creates two per-pci-device irq
> attribute files:
> 
> a) irq_mode - identifies which kind of irqs the device in question is using,
> intx/msi/msix
> 
> b) msi_table - populated only if msi(x) is enabled, it lists the irqs allocated
> to the pci device
> 
> Using this information I can implement a stateless irq one-shot balancer that
> reacts to various udev events quite well
> 
> Signed-off-by: Neil Horman <nhorman@tuxdriver.com>
> CC: Jesse Barnes <jbarnes@virtuousgeek.org>
> CC: linux-pci@vger.kernel.org
> ---
>  drivers/pci/pci-sysfs.c |   33 +++++++++++++++++++++++++++++++++

The reason why this will not work has already been described by Matthew,
but I want to point out that if you ever add/delete/change a sysfs file
in the kernel, you also need to add/delete/change a Documentation/ABI/
file as well.

Writing that file would have shown you how this really isn't going to
work, which is usually a good exercise in itself :)

>  1 files changed, 33 insertions(+), 0 deletions(-)
> 
> diff --git a/drivers/pci/pci-sysfs.c b/drivers/pci/pci-sysfs.c
> index f8deb3e..1397dfb 100644
> --- a/drivers/pci/pci-sysfs.c
> +++ b/drivers/pci/pci-sysfs.c
> @@ -26,6 +26,7 @@
>  #include <linux/security.h>
>  #include <linux/pci-aspm.h>
>  #include <linux/slab.h>
> +#include <linux/msi.h>
>  #include "pci.h"
>  
>  static int sysfs_initialized;	/* = 0 */
> @@ -71,6 +72,34 @@ static ssize_t broken_parity_status_store(struct device *dev,
>  	return count;
>  }
>  
> +static ssize_t irq_mode_show(struct device *dev,
> +			struct device_attribute *attr, char *buf)
> +{
> +	struct pci_dev *pdev = to_pci_dev(dev);
> +
> +	return sprintf(buf, "%s\n", pdev->msix_enabled ? "msix" :
> +				   (pdev->msi_enabled ? "msi" : "intx"));
> +}
> +
> +#ifdef CONFIG_PCI_MSI
> +static ssize_t msi_list_show(struct device *dev,
> +			struct device_attribute *attr, char *buf)
> +{
> +	struct pci_dev *pdev = to_pci_dev(dev);
> +	struct msi_desc *entry;
> +	int first, last;
> +	ssize_t count = 0;
> +
> +	if (!(pdev->msi_enabled || pdev->msix_enabled))
> +		return 0;
> +
> +	list_for_each_entry(entry, &pdev->msi_list, list)
> +		count += sprintf(&buf[count], "%d ", entry->irq);

This breaks the "one-item-per-file" sysfs rule, so please never do this.


> +
> +	return count;
> +}
> +#endif
> +
>  static ssize_t local_cpus_show(struct device *dev,
>  			struct device_attribute *attr, char *buf)
>  {		
> @@ -328,6 +357,10 @@ struct device_attribute pci_dev_attrs[] = {
>  	__ATTR_RO(subsystem_device),
>  	__ATTR_RO(class),
>  	__ATTR_RO(irq),
> +	__ATTR_RO(irq_mode),
> +#ifdef CONFIG_PCI_MSI
> +	__ATTR_RO(msi_list),
> +#endif

Curious as to why you added this to the middle of the list?

thanks,

greg k-h
--
To unsubscribe from this list: send the line "unsubscribe linux-pci" in
the body of a message to majordomo@vger.kernel.org
More majordomo info at  http://vger.kernel.org/majordomo-info.html
Greg KH April 21, 2011, 8:34 p.m. UTC | #4
On Thu, Apr 21, 2011 at 04:00:39PM -0400, Neil Horman wrote:
> On Thu, Apr 21, 2011 at 01:10:41PM -0600, Matthew Wilcox wrote:
> > On Thu, Apr 21, 2011 at 02:57:47PM -0400, Neil Horman wrote:
> > > b) msi_table - populated only if msi(x) is enabled, it lists the irqs allocated
> > > to the pci device
> > 
> > You're not the first one to try this ... the problem is, you can easily
> > overflow a single 4k page.  A device can have up to 2k MSI-X entries,
> > and we might take up to 5 bytes for each one, so we'd need a 10k buffer.
> > 
> Yeah, I was a bit worried about that - but I didn't think any sane device would
> allocate en entire 2048 irqs (not that thats an excuse).  I had considered doing
> an export format like local_cpulist where consecutive irq allocations are listed
> as first-last, which would save space and keep us inside a page.  Thoughts?

Nope, again, sysfs is "one value per file" please.  So don't do it this
way.

If you look in the archives, I've posted how this could be done, but it
gets really messy quickly, as people have found out.  You should look
there first at those attempts before you reinvent them here (as you are
doing...)

thanks,

greg k-h
--
To unsubscribe from this list: send the line "unsubscribe linux-pci" in
the body of a message to majordomo@vger.kernel.org
More majordomo info at  http://vger.kernel.org/majordomo-info.html
diff mbox

Patch

diff --git a/drivers/pci/pci-sysfs.c b/drivers/pci/pci-sysfs.c
index f8deb3e..1397dfb 100644
--- a/drivers/pci/pci-sysfs.c
+++ b/drivers/pci/pci-sysfs.c
@@ -26,6 +26,7 @@ 
 #include <linux/security.h>
 #include <linux/pci-aspm.h>
 #include <linux/slab.h>
+#include <linux/msi.h>
 #include "pci.h"
 
 static int sysfs_initialized;	/* = 0 */
@@ -71,6 +72,34 @@  static ssize_t broken_parity_status_store(struct device *dev,
 	return count;
 }
 
+static ssize_t irq_mode_show(struct device *dev,
+			struct device_attribute *attr, char *buf)
+{
+	struct pci_dev *pdev = to_pci_dev(dev);
+
+	return sprintf(buf, "%s\n", pdev->msix_enabled ? "msix" :
+				   (pdev->msi_enabled ? "msi" : "intx"));
+}
+
+#ifdef CONFIG_PCI_MSI
+static ssize_t msi_list_show(struct device *dev,
+			struct device_attribute *attr, char *buf)
+{
+	struct pci_dev *pdev = to_pci_dev(dev);
+	struct msi_desc *entry;
+	int first, last;
+	ssize_t count = 0;
+
+	if (!(pdev->msi_enabled || pdev->msix_enabled))
+		return 0;
+
+	list_for_each_entry(entry, &pdev->msi_list, list)
+		count += sprintf(&buf[count], "%d ", entry->irq);
+
+	return count;
+}
+#endif
+
 static ssize_t local_cpus_show(struct device *dev,
 			struct device_attribute *attr, char *buf)
 {		
@@ -328,6 +357,10 @@  struct device_attribute pci_dev_attrs[] = {
 	__ATTR_RO(subsystem_device),
 	__ATTR_RO(class),
 	__ATTR_RO(irq),
+	__ATTR_RO(irq_mode),
+#ifdef CONFIG_PCI_MSI
+	__ATTR_RO(msi_list),
+#endif
 	__ATTR_RO(local_cpus),
 	__ATTR_RO(local_cpulist),
 	__ATTR_RO(modalias),