[RFC,v0.2] PCI: Add support for tango PCIe host bridge

Message ID	91db1f47-3024-9712-309a-fb4b21e42028@free.fr (mailing list archive)
State	New, archived
Delegated to:	Bjorn Helgaas
Headers	show Return-Path: <linux-pci-owner@kernel.org> sender: slash.tmp) by smtp5-g21.free.fr (Postfix) with ESMTPSA id 56E305FF9A; Thu, 23 Mar 2017 14:05:51 +0100 (CET) From: Mason <slash.tmp@free.fr> Subject: [RFC PATCH v0.2] PCI: Add support for tango PCIe host bridge To: Bjorn Helgaas <helgaas@kernel.org>, Marc Zyngier <marc.zyngier@arm.com>, Thomas Gleixner <tglx@linutronix.de> Cc: Robin Murphy <robin.murphy@arm.com>, Lorenzo Pieralisi <lorenzo.pieralisi@arm.com>, Liviu Dudau <liviu.dudau@arm.com>, David Laight <david.laight@aculab.com>, linux-pci <linux-pci@vger.kernel.org>, Linux ARM <linux-arm-kernel@lists.infradead.org>, Thibaud Cornic <thibaud_cornic@sigmadesigns.com>, Phuong Nguyen <phuong_nguyen@sigmadesigns.com>, LKML <linux-kernel@vger.kernel.org> Message-ID: <91db1f47-3024-9712-309a-fb4b21e42028@free.fr> Date: Thu, 23 Mar 2017 14:05:51 +0100 User-Agent: Mozilla/5.0 (X11; Linux x86_64; rv:51.0) Gecko/20100101 Firefox/51.0 SeaMonkey/2.48 MIME-Version: 1.0 Content-Type: text/plain; charset=ISO-8859-15 Content-Transfer-Encoding: 7bit Sender: linux-pci-owner@vger.kernel.org Precedence: bulk

Mason March 23, 2017, 1:05 p.m. UTC

I think this version is ready for review.
It has all the required bits and pieces.
I still have a few questions, embedded as comments in the code.
(Missing are ancillary changes to Kconfig, Makefile)
---
 drivers/pci/host/pcie-tango.c | 350 ++++++++++++++++++++++++++++++++++++++++++
 1 file changed, 350 insertions(+)
 create mode 100644 drivers/pci/host/pcie-tango.c

Marc Zyngier March 23, 2017, 2:22 p.m. UTC | #1

On 23/03/17 13:05, Mason wrote:
> I think this version is ready for review.
> It has all the required bits and pieces.
> I still have a few questions, embedded as comments in the code.
> (Missing are ancillary changes to Kconfig, Makefile)

May I suggest that if you think that a patch is ready for review, it
should really contain all the bits that make it an actual patch? That
would include an actual commit log and all what is required to actually
compile it. Not to mention a SoB.

We rely (at least I certainly do) on things like the kbuild robot
picking up stuff from the list and giving it a go. Also, it makes it a
much more efficient use of the reviewer time not to review the same
thing twice...

That being said:

> ---
>  drivers/pci/host/pcie-tango.c | 350 ++++++++++++++++++++++++++++++++++++++++++
>  1 file changed, 350 insertions(+)
>  create mode 100644 drivers/pci/host/pcie-tango.c
> 
> diff --git a/drivers/pci/host/pcie-tango.c b/drivers/pci/host/pcie-tango.c
> new file mode 100644
> index 000000000000..b2e6448aed2d
> --- /dev/null
> +++ b/drivers/pci/host/pcie-tango.c
> @@ -0,0 +1,350 @@
> +#include <linux/irqchip/chained_irq.h>
> +#include <linux/irqdomain.h>
> +#include <linux/pci-ecam.h>
> +#include <linux/msi.h>
> +
> +#define MSI_COUNT 32

Is this something that is hardcoded? Unlikely to ever change?

> +
> +struct tango_pcie {
> +	void __iomem *mux;
> +	void __iomem *msi_status;
> +	void __iomem *msi_mask;
> +	phys_addr_t msi_doorbell;
> +	struct mutex lock; /* lock for updating msi_mask */
> +	struct irq_domain *irq_domain;
> +	struct irq_domain *msi_domain;
> +	int irq;
> +};
> +
> +/*** MSI CONTROLLER SUPPORT ***/
> +
> +static void tango_msi_isr(struct irq_desc *desc)
> +{
> +	struct irq_chip *chip = irq_desc_get_chip(desc);
> +	struct tango_pcie *pcie;
> +	unsigned long status, virq;
> +	int pos;
> +
> +	chained_irq_enter(chip, desc);
> +	pcie = irq_desc_get_handler_data(desc);
> +
> +	status = readl_relaxed(pcie->msi_status);

Please use types that unambiguously match that of the MMIO accessor (u32
in this case). On a 64bit system, unsigned long is likely to be 64bit.
You can assign it to an unsigned long before calling the
for_each_set_bit operator.

> +	writel_relaxed(status, pcie->msi_status); /* clear IRQs */

Why isn't this your irq_ack method instead of open-coding it?

> +
> +	for_each_set_bit(pos, &status, MSI_COUNT) {
> +		virq = irq_find_mapping(pcie->irq_domain, pos);
> +		if (virq)
> +			generic_handle_irq(virq);
> +		else
> +			pr_err("Unhandled MSI: %d\n", pos);

Please rate-limit this.

> +	}
> +
> +	chained_irq_exit(chip, desc);
> +}
> +
> +static struct irq_chip tango_msi_irq_chip = {
> +	.name = "MSI",
> +	.irq_mask = pci_msi_mask_irq,
> +	.irq_unmask = pci_msi_unmask_irq,
> +};
> +
> +static struct msi_domain_info msi_domain_info = {
> +	.flags	= MSI_FLAG_USE_DEF_DOM_OPS | MSI_FLAG_USE_DEF_CHIP_OPS,

No support for MSI-X? Why?

> +	.chip	= &tango_msi_irq_chip,
> +};
> +
> +static void tango_compose_msi_msg(struct irq_data *data, struct msi_msg *msg)
> +{
> +	struct tango_pcie *pcie = irq_data_get_irq_chip_data(data);
> +
> +	msg->address_lo = lower_32_bits(pcie->msi_doorbell);
> +	msg->address_hi = upper_32_bits(pcie->msi_doorbell);
> +	msg->data = data->hwirq;
> +}
> +
> +static int tango_set_affinity(struct irq_data *irq_data,
> +		const struct cpumask *mask, bool force)
> +{
> +	 return -EINVAL;
> +}
> +
> +static struct irq_chip tango_msi_chip = {
> +	.name			= "MSI",
> +	.irq_compose_msi_msg	= tango_compose_msi_msg,
> +	.irq_set_affinity	= tango_set_affinity,
> +};
> +
> +static int tango_irq_domain_alloc(struct irq_domain *domain, unsigned int virq,
> +		unsigned int nr_irqs, void *args)
> +{
> +	struct tango_pcie *pcie = domain->host_data;
> +	int pos, err = 0;
> +	u32 mask;
> +
> +	if (nr_irqs != 1) /* When does that happen? */
> +		return -EINVAL;

Only if the end-point wants to use Multi-MSI. You don't advertise
support for it, so it should never happen.

> +
> +	mutex_lock(&pcie->lock);
> +
> +	mask = readl_relaxed(pcie->msi_mask);

Do you really need to read this from the HW each time you allocate an
interrupt? That feels pretty crazy. You're much better off having an
in-memory bitmap that will make things more efficient, and avoid the
following bug...

> +	pos = find_first_zero_bit(&mask, MSI_COUNT);

... where using a u32 as a bitmap is a very bad idea (because not the
whole world is a 32bit, little endian platform).

> +	if (pos < MSI_COUNT)
> +		writel(mask | BIT(pos), pcie->msi_mask);

And it would make a lot more sense to move this write (which should be
relaxed) to irq_unmask. Also, calling msi_mask for something that is an
enable register is a bit counter intuitive.

> +	else
> +		err = -ENOSPC;
> +
> +	mutex_unlock(&pcie->lock);
> +
> +	irq_domain_set_info(domain, virq, pos, &tango_msi_chip,
> +			domain->host_data, handle_simple_irq, NULL, NULL);

And here, you're polluting the domain even if you failed to allocate the
interrupt.

> +
> +	return err;
> +}
> +
> +static void tango_irq_domain_free(struct irq_domain *domain,
> +				   unsigned int virq, unsigned int nr_irqs)
> +{
> +	struct irq_data *d = irq_domain_get_irq_data(domain, virq);
> +	struct tango_pcie *pcie = irq_data_get_irq_chip_data(d);
> +	int pos = d->hwirq;
> +	u32 mask;
> +
> +	mutex_lock(&pcie->lock);
> +
> +	mask = readl(pcie->msi_mask);
> +	writel(mask & ~BIT(pos), pcie->msi_mask);

Same as above, please move this to the irq_unmask method.

> +
> +	mutex_unlock(&pcie->lock);
> +}
> +
> +static const struct irq_domain_ops msi_domain_ops = {
> +	.alloc	= tango_irq_domain_alloc,
> +	.free	= tango_irq_domain_free,
> +};
> +
> +static int tango_msi_remove(struct platform_device *pdev)
> +{
> +	struct tango_pcie *msi = platform_get_drvdata(pdev);
> +
> +	irq_set_chained_handler(msi->irq, NULL);
> +	irq_set_handler_data(msi->irq, NULL);
> +	/* irq_set_chained_handler_and_data(msi->irq, NULL, NULL); instead? */
> +
> +	irq_domain_remove(msi->msi_domain);
> +	irq_domain_remove(msi->irq_domain);
> +
> +	return 0;
> +}
> +
> +static int tango_msi_probe(struct platform_device *pdev, struct tango_pcie *pcie)
> +{
> +	int virq;
> +	struct fwnode_handle *fwnode = of_node_to_fwnode(pdev->dev.of_node);
> +	struct irq_domain *msi_dom, *irq_dom;
> +
> +	mutex_init(&pcie->lock);
> +	writel(0, pcie->msi_mask);
> +
> +	/* Why is fwnode for this call? */
> +	irq_dom = irq_domain_add_linear(NULL, MSI_COUNT, &msi_domain_ops, pcie);

Use irq_domain_create_linear, pass the same fwnode.

> +	if (!irq_dom) {
> +		pr_err("Failed to create IRQ domain\n");
> +		return -ENOMEM;
> +	}
> +
> +	msi_dom = pci_msi_create_irq_domain(fwnode, &msi_domain_info, irq_dom);
> +	if (!msi_dom) {
> +		pr_err("Failed to create MSI domain\n");
> +		irq_domain_remove(irq_dom);
> +		return -ENOMEM;
> +	}
> +
> +	virq = platform_get_irq(pdev, 1);

In the absence of a documented binding, it is hard to know if you're
doing the right thing.

> +	if (virq <= 0) {
> +		irq_domain_remove(msi_dom);
> +		irq_domain_remove(irq_dom);
> +		return -ENXIO;

Maybe add a message indicating what failed?

> +	}
> +
> +	pcie->irq_domain = irq_dom;
> +	pcie->msi_domain = msi_dom;
> +	pcie->irq = virq;
> +	irq_set_chained_handler_and_data(virq, tango_msi_isr, pcie);
> +
> +	return 0;
> +}
> +
> +/*** HOST BRIDGE SUPPORT ***/

[...]

I don't know much about PCIe itself, hence stopping here.

I'd like to see the MSI code as a separate patch, because it is pretty
much standalone. And please write a DT binding document for the whole
thing, because I end-up second guessing what you're trying to do...

Thanks,

	M.

Mason March 23, 2017, 5:03 p.m. UTC | #2

On 23/03/2017 15:22, Marc Zyngier wrote:

> On 23/03/17 13:05, Mason wrote:
> 
>> +#define MSI_COUNT 32
> 
> Is this something that is hardcoded? Unlikely to ever change?

The host bridge actually supports 256 MSIs.

IIUC, what you suggested on IRC is that I support 256 in the driver,
and only read the status for *enabled* MSIs.

Pseudo-code:

for every 32-bit blob in the enabled bitmap
  if the value is non-zero
    lookup the corresponding status reg

Problem is that a BITMAP is unsigned long (as you point out below).
So I'm not sure how to iterate 32-bits at a time over the BITMAP.




>> +static void tango_msi_isr(struct irq_desc *desc)
>> +{
>> +	struct irq_chip *chip = irq_desc_get_chip(desc);
>> +	struct tango_pcie *pcie;
>> +	unsigned long status, virq;
>> +	int pos;
>> +
>> +	chained_irq_enter(chip, desc);
>> +	pcie = irq_desc_get_handler_data(desc);
>> +
>> +	status = readl_relaxed(pcie->msi_status);
> 
> Please use types that unambiguously match that of the MMIO accessor (u32
> in this case). On a 64bit system, unsigned long is likely to be 64bit.
> You can assign it to an unsigned long before calling the
> for_each_set_bit operator.

OK. I'm aware that unsigned long is 64 bits on sane 64b platforms,
but since extending u32 to u64 would pad with zeros, I didn't expect
this to be an issue. I will change the code. Note: I copied the
code from the Altera driver.

>> +	writel_relaxed(status, pcie->msi_status); /* clear IRQs */
> 
> Why isn't this your irq_ack method instead of open-coding it?

I based my driver on the Altera driver, and I did it like
I thought they did. I will try fixing my code.


>> +	for_each_set_bit(pos, &status, MSI_COUNT) {
>> +		virq = irq_find_mapping(pcie->irq_domain, pos);
>> +		if (virq)
>> +			generic_handle_irq(virq);
>> +		else
>> +			pr_err("Unhandled MSI: %d\n", pos);
> 
> Please rate-limit this.

I'll use pr_err_ratelimited


>> +static struct msi_domain_info msi_domain_info = {
>> +	.flags	= MSI_FLAG_USE_DEF_DOM_OPS | MSI_FLAG_USE_DEF_CHIP_OPS,
> 
> No support for MSI-X? Why?

Good question.
https://en.wikipedia.org/wiki/Message_Signaled_Interrupts#MSI-X
My controller supports a single doorbell, and only 256 MSIs.
I thought that meant it didn't support MSI-X.


>> +static int tango_irq_domain_alloc(struct irq_domain *domain, unsigned int virq,
>> +		unsigned int nr_irqs, void *args)
>> +{
>> +	struct tango_pcie *pcie = domain->host_data;
>> +	int pos, err = 0;
>> +	u32 mask;
>> +
>> +	if (nr_irqs != 1) /* When does that happen? */
>> +		return -EINVAL;
> 
> Only if the end-point wants to use Multi-MSI. You don't advertise
> support for it, so it should never happen.

Should I keep the test or remove it?


>> +	mutex_lock(&pcie->lock);
>> +
>> +	mask = readl_relaxed(pcie->msi_mask);
> 
> Do you really need to read this from the HW each time you allocate an
> interrupt? That feels pretty crazy. You're much better off having an
> in-memory bitmap that will make things more efficient, and avoid the
> following bug...
> 
>> +	pos = find_first_zero_bit(&mask, MSI_COUNT);
> 
> ... where using a u32 as a bitmap is a very bad idea (because not the
> whole world is a 32bit, little endian platform).

I understand your point. This ties in to the ISR discussion.


>> +	if (pos < MSI_COUNT)
>> +		writel(mask | BIT(pos), pcie->msi_mask);
> 
> And it would make a lot more sense to move this write (which should be
> relaxed) to irq_unmask. Also, calling msi_mask for something that is an
> enable register is a bit counter intuitive.

I don't have as much experience as you.
I just used the names in the HW documentation.
I think it is the "mask" (as in bitmap) of enabled MSIs.
I will change "mask" to "enable".

Are you saying I should not use pci_msi_mask_irq and pci_msi_unmask_irq,
but register custom implementations? I should still call these in my
custom functions, right?


>> +	else
>> +		err = -ENOSPC;
>> +
>> +	mutex_unlock(&pcie->lock);
>> +
>> +	irq_domain_set_info(domain, virq, pos, &tango_msi_chip,
>> +			domain->host_data, handle_simple_irq, NULL, NULL);
> 
> And here, you're polluting the domain even if you failed to allocate the
> interrupt.

This bug is 100% mine. Will fix.

>> +
>> +	return err;
>> +}
>> +
>> +static void tango_irq_domain_free(struct irq_domain *domain,
>> +				   unsigned int virq, unsigned int nr_irqs)
>> +{
>> +	struct irq_data *d = irq_domain_get_irq_data(domain, virq);
>> +	struct tango_pcie *pcie = irq_data_get_irq_chip_data(d);
>> +	int pos = d->hwirq;
>> +	u32 mask;
>> +
>> +	mutex_lock(&pcie->lock);
>> +
>> +	mask = readl(pcie->msi_mask);
>> +	writel(mask & ~BIT(pos), pcie->msi_mask);
> 
> Same as above, please move this to the irq_unmask method.

This one should be irq_mask, no?

Even If I move the MMIO write, it should be done under lock,
I think. But I don't know in what context irq_unmask will
be called.
You said: not mutex, spinlock.


>> +static int tango_msi_remove(struct platform_device *pdev)
>> +{
>> +	struct tango_pcie *msi = platform_get_drvdata(pdev);
>> +
>> +	irq_set_chained_handler(msi->irq, NULL);
>> +	irq_set_handler_data(msi->irq, NULL);
>> +	/* irq_set_chained_handler_and_data(msi->irq, NULL, NULL); instead? */

Can I call irq_set_chained_handler_and_data(msi->irq, NULL, NULL);
instead of the two calls?


>> +	mutex_init(&pcie->lock);
>> +	writel(0, pcie->msi_mask);
>> +
>> +	/* Why is fwnode for this call? */
>> +	irq_dom = irq_domain_add_linear(NULL, MSI_COUNT, &msi_domain_ops, pcie);
> 
> Use irq_domain_create_linear, pass the same fwnode.

Will change that.


>> +	if (!irq_dom) {
>> +		pr_err("Failed to create IRQ domain\n");
>> +		return -ENOMEM;
>> +	}
>> +
>> +	msi_dom = pci_msi_create_irq_domain(fwnode, &msi_domain_info, irq_dom);
>> +	if (!msi_dom) {
>> +		pr_err("Failed to create MSI domain\n");
>> +		irq_domain_remove(irq_dom);
>> +		return -ENOMEM;
>> +	}
>> +
>> +	virq = platform_get_irq(pdev, 1);
> 
> In the absence of a documented binding, it is hard to know if you're
> doing the right thing.

		pcie@50000000 {
			compatible = "sigma,smp8759-pcie";
			reg = <0x50000000 SZ_64M>, <0x2e000 0x100>;
			device_type = "pci";
			bus-range = <0 63>;
			#size-cells = <2>;
			#address-cells = <3>;
			#interrupt-cells = <1>;
			ranges = <0x02000000 0x0 0x04000000  0x54000000  0x0 SZ_192M>;
			msi-controller;
			/* 54 for misc interrupts, 55 for MSI */
			interrupts = <54 IRQ_TYPE_LEVEL_HIGH>, <55 IRQ_TYPE_LEVEL_HIGH>;
		};

Note: I don't have an "interrupt-map" prop because rev1 doesn't support
legacy PCI interrupts (INTx). But I see the PCI framework wrongly mapping
intA to my system's interrupt #1, presumably because I am lacking an
interrupt-map?

Also I find the MSI interrupt number to be high:

# cat /proc/interrupts 
           CPU0       CPU1       
 19:      21171       1074     GIC-0  29 Edge      twd
 20:        116          0      irq0   1 Level     serial
 26:          7          0       MSI   0 Edge      aerdrv
 28:       3263          0       MSI 524288 Edge      xhci_hcd

524288 is 0x80000. Was this offset chosen by the intc core?
Or by my (lack of) DT?

>> +	if (virq <= 0) {
>> +		irq_domain_remove(msi_dom);
>> +		irq_domain_remove(irq_dom);
>> +		return -ENXIO;
> 
> Maybe add a message indicating what failed?

Will do.

Thanks again for the thorough review.

Regards.

Mason March 23, 2017, 11:40 p.m. UTC | #3

On 23/03/2017 18:03, Mason wrote:

> The host bridge actually supports 256 MSIs.
> 
> IIUC, what you suggested on IRC is that I support 256 in the driver,
> and only read the status for *enabled* MSIs.
> 
> Pseudo-code:
> 
> for every 32-bit blob in the enabled bitmap
>   if the value is non-zero
>     lookup the corresponding status reg
> 
> Problem is that a BITMAP is unsigned long (as you point out below).
> So I'm not sure how to iterate 32-bits at a time over the BITMAP.

Something along these lines:

	DECLARE_BITMAP(enabled, 256);

	unsigned int pos = 0;

	while ((pos = find_next_bit(enabled, 256, pos)) < 256) {
		int offset = (pos / 32) * 4;
		u32 status = readl_relaxed(status + offset);
		/* Handle each pos set in status */
		pos = round_up(pos, 32);
	}

You mentioned a bug in my code (due to the platform endianness)
when passing the result of readl_relaxed to the bitops routine...
How is one supposed to iterate over status?

I'm not yet seeing this endianness issue, since (status & BIT(i))
provides the status of MSI_i, irrespective of endianness.

Although I see that arch/arm/include/asm/bitops.h declares
BE and LE variants... I'm confused.

Regards.

Marc Zyngier March 24, 2017, 6:22 p.m. UTC | #4

On 23/03/17 23:40, Mason wrote:
> On 23/03/2017 18:03, Mason wrote:
> 
>> The host bridge actually supports 256 MSIs.
>>
>> IIUC, what you suggested on IRC is that I support 256 in the driver,
>> and only read the status for *enabled* MSIs.
>>
>> Pseudo-code:
>>
>> for every 32-bit blob in the enabled bitmap
>>   if the value is non-zero
>>     lookup the corresponding status reg
>>
>> Problem is that a BITMAP is unsigned long (as you point out below).
>> So I'm not sure how to iterate 32-bits at a time over the BITMAP.
> 
> Something along these lines:
> 
> 	DECLARE_BITMAP(enabled, 256);
> 
> 	unsigned int pos = 0;
> 
> 	while ((pos = find_next_bit(enabled, 256, pos)) < 256) {
> 		int offset = (pos / 32) * 4;
> 		u32 status = readl_relaxed(status + offset);
> 		/* Handle each pos set in status */
> 		pos = round_up(pos, 32);
> 	}

Something along those lines, yes.

> You mentioned a bug in my code (due to the platform endianness)
> when passing the result of readl_relaxed to the bitops routine...
> How is one supposed to iterate over status?

You cannot directly use a pointer to a u32 in any of the bitmap
operations. You need to copy the value to an unsigned long, and apply
the bitmap op on that.

> I'm not yet seeing this endianness issue, since (status & BIT(i))
> provides the status of MSI_i, irrespective of endianness.

We discussed this over IRC, but I was referring to the above case and
64bit BE platforms, which would not do what you expect.

> Although I see that arch/arm/include/asm/bitops.h declares
> BE and LE variants... I'm confused.

Nothing in this code should be ARM specific. The kernel gives you the
tools to write (mostly) architecture, endianness and word size agnostic
code.

Thanks,

	M.

Marc Zyngier March 24, 2017, 6:47 p.m. UTC | #5

On 23/03/17 17:03, Mason wrote:
> On 23/03/2017 15:22, Marc Zyngier wrote:
> 
>> On 23/03/17 13:05, Mason wrote:
>>
>>> +#define MSI_COUNT 32
>>
>> Is this something that is hardcoded? Unlikely to ever change?
> 
> The host bridge actually supports 256 MSIs.
> 
> IIUC, what you suggested on IRC is that I support 256 in the driver,
> and only read the status for *enabled* MSIs.
> 
> Pseudo-code:
> 
> for every 32-bit blob in the enabled bitmap
>   if the value is non-zero
>     lookup the corresponding status reg
> 
> Problem is that a BITMAP is unsigned long (as you point out below).
> So I'm not sure how to iterate 32-bits at a time over the BITMAP.

See my reply in a previous email.

>>> +static void tango_msi_isr(struct irq_desc *desc)
>>> +{
>>> +	struct irq_chip *chip = irq_desc_get_chip(desc);
>>> +	struct tango_pcie *pcie;
>>> +	unsigned long status, virq;
>>> +	int pos;
>>> +
>>> +	chained_irq_enter(chip, desc);
>>> +	pcie = irq_desc_get_handler_data(desc);
>>> +
>>> +	status = readl_relaxed(pcie->msi_status);
>>
>> Please use types that unambiguously match that of the MMIO accessor (u32
>> in this case). On a 64bit system, unsigned long is likely to be 64bit.
>> You can assign it to an unsigned long before calling the
>> for_each_set_bit operator.
> 
> OK. I'm aware that unsigned long is 64 bits on sane 64b platforms,
> but since extending u32 to u64 would pad with zeros, I didn't expect
> this to be an issue. I will change the code. Note: I copied the
> code from the Altera driver.

This is an issue when your system is 64bit BE, something that is not
that uncommon.

> 
>>> +	writel_relaxed(status, pcie->msi_status); /* clear IRQs */
>>
>> Why isn't this your irq_ack method instead of open-coding it?
> 
> I based my driver on the Altera driver, and I did it like
> I thought they did. I will try fixing my code.

Doesn't make it right, unfortunately. I wish you would try to understand
the API first instead of copy-pasting things (including potential bugs).

> 
>>> +	for_each_set_bit(pos, &status, MSI_COUNT) {
>>> +		virq = irq_find_mapping(pcie->irq_domain, pos);
>>> +		if (virq)
>>> +			generic_handle_irq(virq);
>>> +		else
>>> +			pr_err("Unhandled MSI: %d\n", pos);
>>
>> Please rate-limit this.
> 
> I'll use pr_err_ratelimited
> 
> 
>>> +static struct msi_domain_info msi_domain_info = {
>>> +	.flags	= MSI_FLAG_USE_DEF_DOM_OPS | MSI_FLAG_USE_DEF_CHIP_OPS,
>>
>> No support for MSI-X? Why?
> 
> Good question.
> https://en.wikipedia.org/wiki/Message_Signaled_Interrupts#MSI-X
> My controller supports a single doorbell, and only 256 MSIs.
> I thought that meant it didn't support MSI-X.

The "single doorbell" requirement is on the end-point, not on the
controller. Multi-MSI only has a single register for all possible
interrupts, while MSI-X allows one doorbell address per interrupt. In
your case, all interrupts will have the same doorbell address, which is
perfectly fine.

> 
> 
>>> +static int tango_irq_domain_alloc(struct irq_domain *domain, unsigned int virq,
>>> +		unsigned int nr_irqs, void *args)
>>> +{
>>> +	struct tango_pcie *pcie = domain->host_data;
>>> +	int pos, err = 0;
>>> +	u32 mask;
>>> +
>>> +	if (nr_irqs != 1) /* When does that happen? */
>>> +		return -EINVAL;
>>
>> Only if the end-point wants to use Multi-MSI. You don't advertise
>> support for it, so it should never happen.
> 
> Should I keep the test or remove it?

Up to you. Some people warn loudly, some other ignore it, you've chosen
a middle ground.

> 
> 
>>> +	mutex_lock(&pcie->lock);
>>> +
>>> +	mask = readl_relaxed(pcie->msi_mask);
>>
>> Do you really need to read this from the HW each time you allocate an
>> interrupt? That feels pretty crazy. You're much better off having an
>> in-memory bitmap that will make things more efficient, and avoid the
>> following bug...
>>
>>> +	pos = find_first_zero_bit(&mask, MSI_COUNT);
>>
>> ... where using a u32 as a bitmap is a very bad idea (because not the
>> whole world is a 32bit, little endian platform).
> 
> I understand your point. This ties in to the ISR discussion.
> 
> 
>>> +	if (pos < MSI_COUNT)
>>> +		writel(mask | BIT(pos), pcie->msi_mask);
>>
>> And it would make a lot more sense to move this write (which should be
>> relaxed) to irq_unmask. Also, calling msi_mask for something that is an
>> enable register is a bit counter intuitive.
> 
> I don't have as much experience as you.
> I just used the names in the HW documentation.
> I think it is the "mask" (as in bitmap) of enabled MSIs.
> I will change "mask" to "enable".
> 
> Are you saying I should not use pci_msi_mask_irq and pci_msi_unmask_irq,
> but register custom implementations? I should still call these in my
> custom functions, right?

You can call both in your own mask/unmask methods. They serve different
purpose (one is at the endpoint level, the other is at the MSI
controller level).

> 
> 
>>> +	else
>>> +		err = -ENOSPC;
>>> +
>>> +	mutex_unlock(&pcie->lock);
>>> +
>>> +	irq_domain_set_info(domain, virq, pos, &tango_msi_chip,
>>> +			domain->host_data, handle_simple_irq, NULL, NULL);
>>
>> And here, you're polluting the domain even if you failed to allocate the
>> interrupt.
> 
> This bug is 100% mine. Will fix.

Erm. In this file, all bugs are yours! :-)

> 
>>> +
>>> +	return err;
>>> +}
>>> +
>>> +static void tango_irq_domain_free(struct irq_domain *domain,
>>> +				   unsigned int virq, unsigned int nr_irqs)
>>> +{
>>> +	struct irq_data *d = irq_domain_get_irq_data(domain, virq);
>>> +	struct tango_pcie *pcie = irq_data_get_irq_chip_data(d);
>>> +	int pos = d->hwirq;
>>> +	u32 mask;
>>> +
>>> +	mutex_lock(&pcie->lock);
>>> +
>>> +	mask = readl(pcie->msi_mask);
>>> +	writel(mask & ~BIT(pos), pcie->msi_mask);
>>
>> Same as above, please move this to the irq_unmask method.
> 
> This one should be irq_mask, no?

Yes.

> 
> Even If I move the MMIO write, it should be done under lock,
> I think. But I don't know in what context irq_unmask will
> be called.
> You said: not mutex, spinlock.

It can be called from interrupt context, so it cannot be a mutex.

> 
> 
>>> +static int tango_msi_remove(struct platform_device *pdev)
>>> +{
>>> +	struct tango_pcie *msi = platform_get_drvdata(pdev);
>>> +
>>> +	irq_set_chained_handler(msi->irq, NULL);
>>> +	irq_set_handler_data(msi->irq, NULL);
>>> +	/* irq_set_chained_handler_and_data(msi->irq, NULL, NULL); instead? */
> 
> Can I call irq_set_chained_handler_and_data(msi->irq, NULL, NULL);
> instead of the two calls?

Probably. Reading the code should tell you.

> 
>>> +	mutex_init(&pcie->lock);
>>> +	writel(0, pcie->msi_mask);
>>> +
>>> +	/* Why is fwnode for this call? */
>>> +	irq_dom = irq_domain_add_linear(NULL, MSI_COUNT, &msi_domain_ops, pcie);
>>
>> Use irq_domain_create_linear, pass the same fwnode.
> 
> Will change that.
> 
> 
>>> +	if (!irq_dom) {
>>> +		pr_err("Failed to create IRQ domain\n");
>>> +		return -ENOMEM;
>>> +	}
>>> +
>>> +	msi_dom = pci_msi_create_irq_domain(fwnode, &msi_domain_info, irq_dom);
>>> +	if (!msi_dom) {
>>> +		pr_err("Failed to create MSI domain\n");
>>> +		irq_domain_remove(irq_dom);
>>> +		return -ENOMEM;
>>> +	}
>>> +
>>> +	virq = platform_get_irq(pdev, 1);
>>
>> In the absence of a documented binding, it is hard to know if you're
>> doing the right thing.
> 
> 		pcie@50000000 {
> 			compatible = "sigma,smp8759-pcie";
> 			reg = <0x50000000 SZ_64M>, <0x2e000 0x100>;
> 			device_type = "pci";
> 			bus-range = <0 63>;
> 			#size-cells = <2>;
> 			#address-cells = <3>;
> 			#interrupt-cells = <1>;
> 			ranges = <0x02000000 0x0 0x04000000  0x54000000  0x0 SZ_192M>;
> 			msi-controller;
> 			/* 54 for misc interrupts, 55 for MSI */
> 			interrupts = <54 IRQ_TYPE_LEVEL_HIGH>, <55 IRQ_TYPE_LEVEL_HIGH>;
> 		};

This is not a binding. This is an example from your DT. Look at
Documentation/devicetree/bindings/pci/ for examples of the required
documentation.

> 
> Note: I don't have an "interrupt-map" prop because rev1 doesn't support
> legacy PCI interrupts (INTx). But I see the PCI framework wrongly mapping
> intA to my system's interrupt #1, presumably because I am lacking an
> interrupt-map?

Probably. I don't think it is legal not to have an interrupt-map.

> 
> Also I find the MSI interrupt number to be high:
> 
> # cat /proc/interrupts 
>            CPU0       CPU1       
>  19:      21171       1074     GIC-0  29 Edge      twd
>  20:        116          0      irq0   1 Level     serial
>  26:          7          0       MSI   0 Edge      aerdrv
>  28:       3263          0       MSI 524288 Edge      xhci_hcd
> 
> 524288 is 0x80000. Was this offset chosen by the intc core?

By the PCI/MSI layer. See pci_msi_domain_calc_hwirq().

Thanks,

	M.

Mason March 27, 2017, 2:35 p.m. UTC | #6

On 24/03/2017 19:22, Marc Zyngier wrote:

> You cannot directly use a pointer to a u32 in any of the bitmap
> operations. You need to copy the value to an unsigned long, and
> apply the bitmap op on that.

On my platform, find_first_zero_bit() resolves to

  int _find_first_zero_bit_le(const void * p, unsigned size);

If the underlying implementation actually expects an unsigned long
pointer, should the function prototype be changed?

Regards.

Thomas Gleixner March 27, 2017, 2:46 p.m. UTC | #7

On Mon, 27 Mar 2017, Mason wrote:
> On 24/03/2017 19:22, Marc Zyngier wrote:
> 
> > You cannot directly use a pointer to a u32 in any of the bitmap
> > operations. You need to copy the value to an unsigned long, and
> > apply the bitmap op on that.
> 
> On my platform, find_first_zero_bit() resolves to
> 
>   int _find_first_zero_bit_le(const void * p, unsigned size);
> 
> If the underlying implementation actually expects an unsigned long
> pointer, should the function prototype be changed?

Errm? Why are you worrying about the underlying implementations?

find_first_zero_bit() is what you are supposed to use in your code. And
that explicitely takes a unsigned long pointer.

Thanks,

	tglx

Mason March 27, 2017, 3:18 p.m. UTC | #8

On 27/03/2017 16:46, Thomas Gleixner wrote:

> On Mon, 27 Mar 2017, Mason wrote:
>
>> On 24/03/2017 19:22, Marc Zyngier wrote:
>>
>>> You cannot directly use a pointer to a u32 in any of the bitmap
>>> operations. You need to copy the value to an unsigned long, and
>>> apply the bitmap op on that.
>>
>> On my platform, find_first_zero_bit() resolves to
>>
>>   int _find_first_zero_bit_le(const void * p, unsigned size);
>>
>> If the underlying implementation actually expects an unsigned long
>> pointer, should the function prototype be changed?
> 
> Errm? Why are you worrying about the underlying implementations?
> 
> find_first_zero_bit() is what you are supposed to use in your code. And
> that explicitely takes a unsigned long pointer.

I don't think so.

If the prototype for find_first_zero_bit() specified the first
argument as an unsigned long pointer, then the compiler would
have rejected my code like this:

  CC      drivers/pci/host/pcie-tango.o
In file included from ./include/linux/bitops.h:36:0,
                 from ./include/linux/kernel.h:10,
                 from ./include/linux/list.h:8,
                 from ./include/linux/smp.h:11,
                 from ./include/linux/irq.h:12,
                 from ./include/linux/irqchip/chained_irq.h:21,
                 from drivers/pci/host/pcie-tango.c:1:
drivers/pci/host/pcie-tango.c: In function 'tango_irq_domain_alloc':
drivers/pci/host/pcie-tango.c:122:28: error: passing argument 1 of '_find_first_zero_bit_le' from incompatible pointer type [-Werror=incompatible-pointer-types]
  pos = find_first_zero_bit(&mask, 32);
                            ^
./arch/arm/include/asm/bitops.h:199:59: note: in definition of macro 'find_first_zero_bit'
 #define find_first_zero_bit(p,sz) _find_first_zero_bit_le(p,sz)
                                                           ^
./arch/arm/include/asm/bitops.h:162:12: note: expected 'const long unsigned int *' but argument is of type 'u32 * {aka unsigned int *}'
 extern int _find_first_zero_bit_le(const unsigned long * p, unsigned size);
            ^
cc1: some warnings being treated as errors
make[1]: *** [drivers/pci/host/pcie-tango.o] Error 1
make: *** [drivers/pci/host/pcie-tango.o] Error 2


But, in fact, the compiler remained silent, specifically because
the situation on my platform is:

  #define find_first_zero_bit(p,sz)	_find_first_zero_bit_le(p,sz)
  int _find_first_zero_bit_le(const void * p, unsigned size);


So I asked if the prototype could/should be changed, to have the
compiler catch the error as early as possible.

Regards.

Mason March 27, 2017, 3:53 p.m. UTC | #9

On 24/03/2017 19:47, Marc Zyngier wrote:

> On 23/03/17 17:03, Mason wrote:
>
>> On 23/03/2017 15:22, Marc Zyngier wrote:
>>
>>> On 23/03/17 13:05, Mason wrote:
>>>
>>>> +	writel_relaxed(status, pcie->msi_status); /* clear IRQs */
>>>
>>> Why isn't this your irq_ack method instead of open-coding it?
>>
>> I based my driver on the Altera driver, and I did it like
>> I thought they did. I will try fixing my code.
> 
> Doesn't make it right, unfortunately. I wish you would try to understand
> the API first instead of copy-pasting things (including potential bugs).

So far, I have not been able to get the irqchip framework to
call the irq_ack functions I registered.

Should I pass a different handler than handle_simple_irq
to irq_domain_set_info?

	irq_domain_set_info(domain, virq, pos, &tango_msi_chip,
		domain->host_data, handle_simple_irq, NULL, NULL);

When an MSI packet arrives at the MSI doorbell address, the controller
reads the packet's data; this is the MSI number "num". It sets bit "num"
to 1 in the status regs, and raises IRQ line 55 on the system intc.
The IRQ signal remains high, until software clears it by writing 1
in bit "num" of the status regs.

Is this an edge interrupt or a level interrupt?

I was told if the interrupt request is triggered by an event, then
it is an edge interrupt. The reception of an MSI packet is an event.
But the IRQ remains high, so this feels like a level high.
I'm hopelessly confused :-(

>>>> +	mutex_lock(&pcie->lock);
>>>> +
>>>> +	mask = readl_relaxed(pcie->msi_mask);
>>>
>>> Do you really need to read this from the HW each time you allocate an
>>> interrupt? That feels pretty crazy. You're much better off having an
>>> in-memory bitmap that will make things more efficient [...]

I have one remaining issue with bitmaps.

My HW regs are 32b. How do I grab e.g. bits 96-127?
All I can think of is
  u32 val = ((u32 *)bitmap)[3];

Is this acceptable?

mrutland mentioned bitmap_to_u32array() but IIUC it is used to
copy an entire bitmap.

>>>> +	if (pos < MSI_COUNT)
>>>> +		writel(mask | BIT(pos), pcie->msi_mask);
>>>
>>> And it would make a lot more sense to move this write (which should be
>>> relaxed) to irq_unmask. Also, calling msi_mask for something that is an
>>> enable register is a bit counter intuitive.
>>
>> I don't have as much experience as you.
>> I just used the names in the HW documentation.
>> I think it is the "mask" (as in bitmap) of enabled MSIs.
>> I will change "mask" to "enable".
>>
>> Are you saying I should not use pci_msi_mask_irq and pci_msi_unmask_irq,
>> but register custom implementations? I should still call these in my
>> custom functions, right?
> 
> You can call both in your own mask/unmask methods. They serve different
> purpose (one is at the endpoint level, the other is at the MSI
> controller level).

So, if I understand correctly, I should check for an available MSIs
using the in-memory bitmap in tango_irq_domain_alloc(), but I would
defer actually enabling the MSI until irq_unmask?

I think work on bitmap and on the underlying HW regs need to be
protected under the same spinlock, correct?

>> Note: I don't have an "interrupt-map" prop because rev1 doesn't support
>> legacy PCI interrupts (INTx). But I see the PCI framework wrongly mapping
>> intA to my system's interrupt #1, presumably because I am lacking an
>> interrupt-map?
> 
> Probably. I don't think it is legal not to have an interrupt-map.

My understanding is that the interrupt-map actually specifies how to
map the legacy IRQs. My platform does not support legacy IRQs; maybe
there is some binding to say that? Maybe this is more a question for
the PCI folks.

Regards.

Marc Zyngier March 27, 2017, 5:09 p.m. UTC | #10

On 27/03/17 16:53, Mason wrote:
> On 24/03/2017 19:47, Marc Zyngier wrote:
> 
>> On 23/03/17 17:03, Mason wrote:
>>
>>> On 23/03/2017 15:22, Marc Zyngier wrote:
>>>
>>>> On 23/03/17 13:05, Mason wrote:
>>>>
>>>>> +	writel_relaxed(status, pcie->msi_status); /* clear IRQs */
>>>>
>>>> Why isn't this your irq_ack method instead of open-coding it?
>>>
>>> I based my driver on the Altera driver, and I did it like
>>> I thought they did. I will try fixing my code.
>>
>> Doesn't make it right, unfortunately. I wish you would try to understand
>> the API first instead of copy-pasting things (including potential bugs).
> 
> So far, I have not been able to get the irqchip framework to
> call the irq_ack functions I registered.
> 
> Should I pass a different handler than handle_simple_irq
> to irq_domain_set_info?
> 
> 	irq_domain_set_info(domain, virq, pos, &tango_msi_chip,
> 		domain->host_data, handle_simple_irq, NULL, NULL);

Only you can answer that question. But looking at the documentation is 
always a good start:

/**
 *      handle_simple_irq - Simple and software-decoded IRQs.
 *      @desc:  the interrupt description structure for this irq
 *
 *      Simple interrupts are either sent from a demultiplexing interrupt
 *      handler or come from hardware, where no interrupt hardware control
 *      is necessary.
 *
 *      Note: The caller is expected to handle the ack, clear, mask and
 *      unmask issues if necessary.
 */

I'd tend to infer that this is not what you want. On the other hand,
something called handle_edge_irq sounds promising when you deal
with edge interrupts. But again, you are writing the driver, I cannot
guide your hand.

> When an MSI packet arrives at the MSI doorbell address, the controller
> reads the packet's data; this is the MSI number "num". It sets bit "num"
> to 1 in the status regs, and raises IRQ line 55 on the system intc.
> The IRQ signal remains high, until software clears it by writing 1
> in bit "num" of the status regs.
> 
> Is this an edge interrupt or a level interrupt?
> 
> I was told if the interrupt request is triggered by an event, then
> it is an edge interrupt. The reception of an MSI packet is an event.
> But the IRQ remains high, so this feels like a level high.
> I'm hopelessly confused :-(

Here's what your system looks like:

PCI-EP -------> MSI Controller ------> INTC
         MSI                    IRQ

A PCI MSI is always edge. No ifs, no buts. That's what it is, and nothing
else. Now, your MSI controller signals its output using a level interrupt,
since you need to whack it on the head so that it lowers its line.

There is not a single trigger, because there is not a single interrupt.

>>>>> +	mutex_lock(&pcie->lock);
>>>>> +
>>>>> +	mask = readl_relaxed(pcie->msi_mask);
>>>>
>>>> Do you really need to read this from the HW each time you allocate an
>>>> interrupt? That feels pretty crazy. You're much better off having an
>>>> in-memory bitmap that will make things more efficient [...]
> 
> I have one remaining issue with bitmaps.
> 
> My HW regs are 32b. How do I grab e.g. bits 96-127?
> All I can think of is
>   u32 val = ((u32 *)bitmap)[3];
> 
> Is this acceptable?

No.

> 
> mrutland mentioned bitmap_to_u32array() but IIUC it is used to
> copy an entire bitmap.

The real question is "Why do you need such a thing?".

> 
> 
>>>>> +	if (pos < MSI_COUNT)
>>>>> +		writel(mask | BIT(pos), pcie->msi_mask);
>>>>
>>>> And it would make a lot more sense to move this write (which should be
>>>> relaxed) to irq_unmask. Also, calling msi_mask for something that is an
>>>> enable register is a bit counter intuitive.
>>>
>>> I don't have as much experience as you.
>>> I just used the names in the HW documentation.
>>> I think it is the "mask" (as in bitmap) of enabled MSIs.
>>> I will change "mask" to "enable".
>>>
>>> Are you saying I should not use pci_msi_mask_irq and pci_msi_unmask_irq,
>>> but register custom implementations? I should still call these in my
>>> custom functions, right?
>>
>> You can call both in your own mask/unmask methods. They serve different
>> purpose (one is at the endpoint level, the other is at the MSI
>> controller level).
> 
> So, if I understand correctly, I should check for an available MSIs
> using the in-memory bitmap in tango_irq_domain_alloc(), but I would
> defer actually enabling the MSI until irq_unmask?

Yes.

> I think work on bitmap and on the underlying HW regs need to be
> protected under the same spinlock, correct?

Yes.

> 
> 
>>> Note: I don't have an "interrupt-map" prop because rev1 doesn't support
>>> legacy PCI interrupts (INTx). But I see the PCI framework wrongly mapping
>>> intA to my system's interrupt #1, presumably because I am lacking an
>>> interrupt-map?
>>
>> Probably. I don't think it is legal not to have an interrupt-map.
> 
> My understanding is that the interrupt-map actually specifies how to
> map the legacy IRQs. My platform does not support legacy IRQs; maybe
> there is some binding to say that? Maybe this is more a question for
> the PCI folks.

Probably. But it looks to me that a a PCIe RC that doesn't support legacy
interrupts is not a correct implementation. We can probably work around it,
but that feels pretty wrong.

	M.

Mason March 27, 2017, 7:44 p.m. UTC | #11

On 27/03/2017 19:09, Marc Zyngier wrote:

> On 27/03/17 16:53, Mason wrote:
> 
>> I have one remaining issue with bitmaps.
>>
>> My HW regs are 32b. How do I grab e.g. bits 96-127?
>> All I can think of is
>>   u32 val = ((u32 *)bitmap)[3];
>> 
>> Is this acceptable?
> 
> No.
> 
>> mrutland mentioned bitmap_to_u32array() but IIUC it is used to
>> copy an entire bitmap.
> 
> The real question is "Why do you need such a thing?".

You told me to use an in-memory version of the "unmasked"
bitmap, yes? In that case, I need to be able to grab
a piece of said bitmap, to update the corresponding piece
of the HW bitmap.

For example, assume the first 3 MSIs are unmasked,
in other words, unmasked = 0x7

A new user comes along and wants to assign an MSI.
Scan "unmasked", found bit at pos 3.
Update "unmasked" to 0xf.
At some point, I need to write 0xf to some HW reg.
So I need to grab a piece of "unmasked" (bits 0-31 in my example)

  pos = find_first_zero_bit(unmasked, COUNT);
  __set_bit(pos, unmasked);
  int reg_index = pos / 32;
  u32 val = ((u32 *)unmasked)[reg_index];
  writel_relaxed(val, pcie->enabled + reg_index * 4);

Or did I miss something in your suggestion?

Regards.

Marc Zyngier March 27, 2017, 9:07 p.m. UTC | #12

On Mon, Mar 27 2017 at 08:44:08 PM, Mason <slash.tmp@free.fr> wrote:
> On 27/03/2017 19:09, Marc Zyngier wrote:
>
>> On 27/03/17 16:53, Mason wrote:
>> 
>>> I have one remaining issue with bitmaps.
>>>
>>> My HW regs are 32b. How do I grab e.g. bits 96-127?
>>> All I can think of is
>>>   u32 val = ((u32 *)bitmap)[3];
>>> 
>>> Is this acceptable?
>> 
>> No.
>> 
>>> mrutland mentioned bitmap_to_u32array() but IIUC it is used to
>>> copy an entire bitmap.
>> 
>> The real question is "Why do you need such a thing?".
>
> You told me to use an in-memory version of the "unmasked"
> bitmap, yes? In that case, I need to be able to grab
> a piece of said bitmap, to update the corresponding piece
> of the HW bitmap.
>
> For example, assume the first 3 MSIs are unmasked,
> in other words, unmasked = 0x7
>
> A new user comes along and wants to assign an MSI.
> Scan "unmasked", found bit at pos 3.
> Update "unmasked" to 0xf.
> At some point, I need to write 0xf to some HW reg.
> So I need to grab a piece of "unmasked" (bits 0-31 in my example)
>
>   pos = find_first_zero_bit(unmasked, COUNT);
>   __set_bit(pos, unmasked);
>   int reg_index = pos / 32;
>   u32 val = ((u32 *)unmasked)[reg_index];
>   writel_relaxed(val, pcie->enabled + reg_index * 4);
>
> Or did I miss something in your suggestion?

I don't know, I'm sightly taken aback by your question. Completely
puzzled, actually. "Read Modify Write" is a fairly obvious construct.

     val = readl_relaxed(pcie->enabled + reg_index);
     writel_relaxed(val | BIT(pos % 32), pcie->enabled + reg_index);

I never realized this could be such a novel concept. Replace pos with
hwirq, add a spinlock, and that's your irq_unmask.

My suggestion was to use a bitmap in order not to perform extra MMIO
accesses on the fast path. It doesn't mean that you cannot read from the
register under any circumstances. It just means that you don't do it
when there are more efficient ways to do it.

	M.

Mason March 27, 2017, 10:04 p.m. UTC | #13

On 27/03/2017 23:07, Marc Zyngier wrote:

> On Mon, Mar 27 2017 at 08:44:08 PM, Mason wrote:
>
>> On 27/03/2017 19:09, Marc Zyngier wrote:
>>
>>> On 27/03/17 16:53, Mason wrote:
>>>
>>>> I have one remaining issue with bitmaps.
>>>>
>>>> My HW regs are 32b. How do I grab e.g. bits 96-127?
>>>> All I can think of is
>>>>   u32 val = ((u32 *)bitmap)[3];
>>>>
>>>> Is this acceptable?
>>>
>>> No.
>>>
>>>> mrutland mentioned bitmap_to_u32array() but IIUC it is used to
>>>> copy an entire bitmap.
>>>
>>> The real question is "Why do you need such a thing?".
>>
>> You told me to use an in-memory version of the "unmasked"
>> bitmap, yes? In that case, I need to be able to grab
>> a piece of said bitmap, to update the corresponding piece
>> of the HW bitmap.
>>
>> For example, assume the first 3 MSIs are unmasked,
>> in other words, unmasked = 0x7
>>
>> A new user comes along and wants to assign an MSI.
>> Scan "unmasked", found bit at pos 3.
>> Update "unmasked" to 0xf.
>> At some point, I need to write 0xf to some HW reg.
>> So I need to grab a piece of "unmasked" (bits 0-31 in my example)
>>
>>   pos = find_first_zero_bit(unmasked, COUNT);
>>   __set_bit(pos, unmasked);
>>   int reg_index = pos / 32;
>>   u32 val = ((u32 *)unmasked)[reg_index];
>>   writel_relaxed(val, pcie->enabled + reg_index * 4);
>>
>> Or did I miss something in your suggestion?
> 
> I don't know, I'm sightly taken aback by your question. Completely
> puzzled, actually. "Read Modify Write" is a fairly obvious construct.
> 
>      val = readl_relaxed(pcie->enabled + reg_index);
>      writel_relaxed(val | BIT(pos % 32), pcie->enabled + reg_index);
> 
> I never realized this could be such a novel concept. Replace pos with
> hwirq, add a spinlock, and that's your irq_unmask.

I'm not sure why you would think I'm not familiar with RMW.

My original code was:

	mask = readl_relaxed(pcie->msi_mask);
	pos = find_first_zero_bit(&mask, 32);
	writel(mask | BIT(pos), pcie->msi_mask);

To which you replied:
"Do you really need to read this from the HW each time you allocate an
interrupt? That feels pretty crazy. You're much better off having an
in-memory bitmap that will make things more efficient"

This is why I was trying to avoid a MMIO read from the HW
each time I allocate an interrupt.

> My suggestion was to use a bitmap in order not to perform extra MMIO
> accesses on the fast path. It doesn't mean that you cannot read from the
> register under any circumstances. It just means that you don't do it
> when there are more efficient ways to do it.

The fast path is the interrupt handler, right?
(I didn't read the interrupt mask in the ISR.)

I understand your underlying point. It is fine to read
from MMIO in the slow path, but try as hard as possible
NOT to do so in the fast path.

Regards.

Marc Zyngier March 28, 2017, 8:21 a.m. UTC | #14

On 27/03/17 23:04, Mason wrote:
> On 27/03/2017 23:07, Marc Zyngier wrote:
> 
>> On Mon, Mar 27 2017 at 08:44:08 PM, Mason wrote:
>>
>>> On 27/03/2017 19:09, Marc Zyngier wrote:
>>>
>>>> On 27/03/17 16:53, Mason wrote:
>>>>
>>>>> I have one remaining issue with bitmaps.
>>>>>
>>>>> My HW regs are 32b. How do I grab e.g. bits 96-127?
>>>>> All I can think of is
>>>>>   u32 val = ((u32 *)bitmap)[3];
>>>>>
>>>>> Is this acceptable?
>>>>
>>>> No.
>>>>
>>>>> mrutland mentioned bitmap_to_u32array() but IIUC it is used to
>>>>> copy an entire bitmap.
>>>>
>>>> The real question is "Why do you need such a thing?".
>>>
>>> You told me to use an in-memory version of the "unmasked"
>>> bitmap, yes? In that case, I need to be able to grab
>>> a piece of said bitmap, to update the corresponding piece
>>> of the HW bitmap.
>>>
>>> For example, assume the first 3 MSIs are unmasked,
>>> in other words, unmasked = 0x7
>>>
>>> A new user comes along and wants to assign an MSI.
>>> Scan "unmasked", found bit at pos 3.
>>> Update "unmasked" to 0xf.
>>> At some point, I need to write 0xf to some HW reg.
>>> So I need to grab a piece of "unmasked" (bits 0-31 in my example)
>>>
>>>   pos = find_first_zero_bit(unmasked, COUNT);
>>>   __set_bit(pos, unmasked);
>>>   int reg_index = pos / 32;
>>>   u32 val = ((u32 *)unmasked)[reg_index];
>>>   writel_relaxed(val, pcie->enabled + reg_index * 4);
>>>
>>> Or did I miss something in your suggestion?
>>
>> I don't know, I'm sightly taken aback by your question. Completely
>> puzzled, actually. "Read Modify Write" is a fairly obvious construct.
>>
>>      val = readl_relaxed(pcie->enabled + reg_index);
>>      writel_relaxed(val | BIT(pos % 32), pcie->enabled + reg_index);
>>
>> I never realized this could be such a novel concept. Replace pos with
>> hwirq, add a spinlock, and that's your irq_unmask.
> 
> I'm not sure why you would think I'm not familiar with RMW.
> 
> My original code was:
> 
> 	mask = readl_relaxed(pcie->msi_mask);
> 	pos = find_first_zero_bit(&mask, 32);
> 	writel(mask | BIT(pos), pcie->msi_mask);
> 
> To which you replied:
> "Do you really need to read this from the HW each time you allocate an
> interrupt? That feels pretty crazy. You're much better off having an
> in-memory bitmap that will make things more efficient"
> 
> This is why I was trying to avoid a MMIO read from the HW
> each time I allocate an interrupt.

Yes, and I stand by this comment. I also said, in the bit that you
didn't quote, that the RMW was to be moved to the irq_unmask method.
This ensures separation of IRQ allocation and IRQ chip operations.

>> My suggestion was to use a bitmap in order not to perform extra MMIO
>> accesses on the fast path. It doesn't mean that you cannot read from the
>> register under any circumstances. It just means that you don't do it
>> when there are more efficient ways to do it.
> 
> The fast path is the interrupt handler, right?
> (I didn't read the interrupt mask in the ISR.)

You didn't then, and yet you have to, since you need to support more
than 32 MSIs, requiring you to either read these mask regs in the HW, or
(tata!) look into the allocation bitmap.

	M.

Mason March 29, 2017, 11:39 a.m. UTC | #15

On 23/03/2017 14:05, Mason wrote:

> I think this version is ready for review.
> It has all the required bits and pieces.
> I still have a few questions, embedded as comments in the code.

Superseded by [PATCH v3 0/2] Tango PCIe controller support

The v0.2 series remains relevant for one aspect: as an example
of the upcoming support for controller rev2.

Regards.

Mason March 30, 2017, 11:09 a.m. UTC | #16

On 23/03/2017 14:05, Mason wrote:

> +/*
> + * This should probably be module_platform_driver ?
> + */
> +builtin_platform_driver(tango_pcie_driver);

For the sake of completeness,

ERROR: "pci_ecam_map_bus" [drivers/pci/host/pcie-tango.ko] undefined!
ERROR: "pci_host_common_probe" [drivers/pci/host/pcie-tango.ko] undefined!

The two above functions are currently not exported to modules.

I don't know whether this is intentional.

Regards.

Mason April 11, 2017, 3:13 p.m. UTC | #17

On 27/03/2017 19:09, Marc Zyngier wrote:

> Here's what your system looks like:
> 
> PCI-EP -------> MSI Controller ------> INTC
>          MSI                    IRQ
> 
> A PCI MSI is always edge. No ifs, no buts. That's what it is, and nothing
> else. Now, your MSI controller signals its output using a level interrupt,
> since you need to whack it on the head so that it lowers its line.
> 
> There is not a single trigger, because there is not a single interrupt.

Hello Marc,

I was hoping you or Thomas might help clear some confusion
in my mind around IRQ domains (struct irq_domain).

I have read https://www.kernel.org/doc/Documentation/IRQ-domain.txt

IIUC, there should be one IRQ domain per IRQ controller.

I have this MSI controller handling 256 interrupts, so I should
have *one* domain for all possible MSIs. Yet the Altera driver
registers *two* domains (msi_domain and inner_domain).

Could I make everything work with a single IRQ domain?

I'm confused by the sequence:

  irq_dom = irq_domain_create_linear(fwnode, MSI_COUNT, &msi_dom_ops, pcie);
  msi_dom = pci_msi_create_irq_domain(fwnode, &msi_domain_info, irq_dom);

It seems I should be able to call only pci_msi_create_irq_domain...
And the parent of the MSI controller is given in the DT.

I'm not quite sure how I tell pci_msi_create_irq_domain how many
MSIs it's supposed to manage. Nor how I pass my private struct.

I will keep looking.

Regards.

Marc Zyngier April 11, 2017, 3:49 p.m. UTC | #18

On 11/04/17 16:13, Mason wrote:
> On 27/03/2017 19:09, Marc Zyngier wrote:
> 
>> Here's what your system looks like:
>>
>> PCI-EP -------> MSI Controller ------> INTC
>>          MSI                    IRQ
>>
>> A PCI MSI is always edge. No ifs, no buts. That's what it is, and nothing
>> else. Now, your MSI controller signals its output using a level interrupt,
>> since you need to whack it on the head so that it lowers its line.
>>
>> There is not a single trigger, because there is not a single interrupt.
> 
> Hello Marc,
> 
> I was hoping you or Thomas might help clear some confusion
> in my mind around IRQ domains (struct irq_domain).
> 
> I have read https://www.kernel.org/doc/Documentation/IRQ-domain.txt
> 
> IIUC, there should be one IRQ domain per IRQ controller.
> 
> I have this MSI controller handling 256 interrupts, so I should
> have *one* domain for all possible MSIs. Yet the Altera driver
> registers *two* domains (msi_domain and inner_domain).
> 
> Could I make everything work with a single IRQ domain?

No, because you have two irqchips. One that deals with the HW, and the
other that deals with the MSIs how they are presented to the kernel,
depending on the bus (PCI or something else). The fact that it doesn't
really drive any HW doesn't make it irrelevant.

> I'm confused by the sequence:
> 
>   irq_dom = irq_domain_create_linear(fwnode, MSI_COUNT, &msi_dom_ops, pcie);
>   msi_dom = pci_msi_create_irq_domain(fwnode, &msi_domain_info, irq_dom);
> 
> It seems I should be able to call only pci_msi_create_irq_domain...
> And the parent of the MSI controller is given in the DT.

No, see above.

> I'm not quite sure how I tell pci_msi_create_irq_domain how many
> MSIs it's supposed to manage. Nor how I pass my private struct.

You don't need to tell it anything about the number of interrupts you
manage. As for your private structure, you've already given it to your
low level domain, and there is no need to propagate it any further.

Thanks,

	M.

Mason April 11, 2017, 4:26 p.m. UTC | #19

On 11/04/2017 17:49, Marc Zyngier wrote:
> On 11/04/17 16:13, Mason wrote:
>> On 27/03/2017 19:09, Marc Zyngier wrote:
>>
>>> Here's what your system looks like:
>>>
>>> PCI-EP -------> MSI Controller ------> INTC
>>>          MSI                    IRQ
>>>
>>> A PCI MSI is always edge. No ifs, no buts. That's what it is, and nothing
>>> else. Now, your MSI controller signals its output using a level interrupt,
>>> since you need to whack it on the head so that it lowers its line.
>>>
>>> There is not a single trigger, because there is not a single interrupt.
>>
>> Hello Marc,
>>
>> I was hoping you or Thomas might help clear some confusion
>> in my mind around IRQ domains (struct irq_domain).
>>
>> I have read https://www.kernel.org/doc/Documentation/IRQ-domain.txt
>>
>> IIUC, there should be one IRQ domain per IRQ controller.
>>
>> I have this MSI controller handling 256 interrupts, so I should
>> have *one* domain for all possible MSIs. Yet the Altera driver
>> registers *two* domains (msi_domain and inner_domain).
>>
>> Could I make everything work with a single IRQ domain?
> 
> No, because you have two irqchips. One that deals with the HW, and the
> other that deals with the MSIs how they are presented to the kernel,
> depending on the bus (PCI or something else). The fact that it doesn't
> really drive any HW doesn't make it irrelevant.

The example given in IRQ-domain.txt is

  Device --> IOAPIC -> Interrupt remapping Controller -> Local APIC -> CPU

with an irq_domain for each interrupt controller.


On my system I have:

  PCI-EP -> MSI controller -> System INTC -> GIC -> CPU

The driver for System INTC is drivers/irqchip/irq-tango.c
I think it has only one domain.

For the GIC, drivers/irqchip/irq-gic.c
I see a call to irq_domain_create_linear()

Is the handling of MSI different, and that is why we need
two domains? (Sorry, I did not understand that part well.)

When I looked at drivers/pci/host/pci-hyperv.c
they seem to have a single pci_msi_create_irq_domain call,
no call to domain_add or domain_create.
And they have a single struct irq_chip.

> You don't need to tell it anything about the number of interrupts you
> manage. As for your private structure, you've already given it to your
> low level domain, and there is no need to propagate it any further.

My main issue is that in the ack callback, I was in the "wrong"
domain, in that d->hwirq was not the MSI number. So I thought
I needed a single irq_domain.

Is there a function to map virq to the hwirq in any domain?

Regards.

Marc Zyngier April 11, 2017, 4:43 p.m. UTC | #20

On 11/04/17 17:26, Mason wrote:
> On 11/04/2017 17:49, Marc Zyngier wrote:
>> On 11/04/17 16:13, Mason wrote:
>>> On 27/03/2017 19:09, Marc Zyngier wrote:
>>>
>>>> Here's what your system looks like:
>>>>
>>>> PCI-EP -------> MSI Controller ------> INTC
>>>>          MSI                    IRQ
>>>>
>>>> A PCI MSI is always edge. No ifs, no buts. That's what it is, and nothing
>>>> else. Now, your MSI controller signals its output using a level interrupt,
>>>> since you need to whack it on the head so that it lowers its line.
>>>>
>>>> There is not a single trigger, because there is not a single interrupt.
>>>
>>> Hello Marc,
>>>
>>> I was hoping you or Thomas might help clear some confusion
>>> in my mind around IRQ domains (struct irq_domain).
>>>
>>> I have read https://www.kernel.org/doc/Documentation/IRQ-domain.txt
>>>
>>> IIUC, there should be one IRQ domain per IRQ controller.
>>>
>>> I have this MSI controller handling 256 interrupts, so I should
>>> have *one* domain for all possible MSIs. Yet the Altera driver
>>> registers *two* domains (msi_domain and inner_domain).
>>>
>>> Could I make everything work with a single IRQ domain?
>>
>> No, because you have two irqchips. One that deals with the HW, and the
>> other that deals with the MSIs how they are presented to the kernel,
>> depending on the bus (PCI or something else). The fact that it doesn't
>> really drive any HW doesn't make it irrelevant.
> 
> The example given in IRQ-domain.txt is
> 
>   Device --> IOAPIC -> Interrupt remapping Controller -> Local APIC -> CPU
> 
> with an irq_domain for each interrupt controller.

Which doesn't use the generic MSI layer the way arm/arm64 do, so that's
the wrong example.

> 
> 
> On my system I have:
> 
>   PCI-EP -> MSI controller -> System INTC -> GIC -> CPU
> 
> The driver for System INTC is drivers/irqchip/irq-tango.c
> I think it has only one domain.
> 
> For the GIC, drivers/irqchip/irq-gic.c
> I see a call to irq_domain_create_linear()

Can we please stick to the problem at hand and not drift into other
considerations which do not matter at all?

> Is the handling of MSI different, and that is why we need
> two domains? (Sorry, I did not understand that part well.)

Let me repeat it again, then:
- You have a top-level MSI domain that is completely virtual, mapping a
virtual hwirq to the virtual interrupt. Nothing to see here.
- You have your own irqdomain, associated with your own irq_chip, which
does what it needs to do talking to the HW and allocating interrupts.

> When I looked at drivers/pci/host/pci-hyperv.c
> they seem to have a single pci_msi_create_irq_domain call,
> no call to domain_add or domain_create.
> And they have a single struct irq_chip.

Which is not using the generic MSI layer the way we do either.

> 
>> You don't need to tell it anything about the number of interrupts you
>> manage. As for your private structure, you've already given it to your
>> low level domain, and there is no need to propagate it any further.
> 
> My main issue is that in the ack callback, I was in the "wrong"
> domain, in that d->hwirq was not the MSI number. So I thought
> I needed a single irq_domain.

No. You need two, but you only need to manage yours.

> Is there a function to map virq to the hwirq in any domain?

Be more precise. If you want the hwirq associated with the view of a
virq in a given domain, that's the hwirq field in the corresponding
irq_data structure. Or are you after something else?

	M.

Mason April 11, 2017, 5:52 p.m. UTC | #21

On 11/04/2017 18:43, Marc Zyngier wrote:

> On 11/04/17 17:26, Mason wrote:
>
>> Is there a function to map virq to the hwirq in any domain?
> 
> Be more precise. If you want the hwirq associated with the view of a
> virq in a given domain, that's the hwirq field in the corresponding
> irq_data structure. Or are you after something else?

I registered an unmask method for my irq_chip.
(IIUC, I'm supposed to unmask a specific MSI in this callback.)

# cat /proc/interrupts 
           CPU0       CPU1       
 30:          0          0    MSIfoo      0 Edge      aerdrv
 34:          0          0    MSIfoo 524288 Edge      xhci_hcd
 35:          0          0    MSIfoo 524289 Edge      xhci_hcd
 36:          0          0    MSIfoo 524290 Edge      xhci_hcd


void foo_unmask(struct irq_data *data)
{
	int xx,yy;
	struct irq_domain *dom = data->domain;
	printk("%s: irq_data=%p irq=%u hwirq=%lu chip=%p dom=%p pdata=%p data=%p\n",
		__func__, data, data->irq, data->hwirq, data->chip, data->domain,
		data->parent_data, data->chip_data);
	printk("%s ops=%p data=%p parent=%p\n\n",
		dom->name, dom->ops, dom->host_data, dom->parent);
	printk("pcie=%p\n", dom->parent->host_data);
	//dump_stack();
	pci_msi_unmask_irq(data);
	struct tango_pcie *pcie = data->domain->parent->host_data;
	printk("\n%s: pcie=%p irq=%u hwirq=%lu\n\n",
		__func__, pcie, data->irq, data->hwirq);
	xx = irq_find_mapping(pcie->irq_domain, data->hwirq);
	yy = irq_find_mapping(pcie->msi_domain, data->hwirq);
	printk("xx=%d yy=%d\n", xx, yy);
}

so data->irq is the virq (30, 34, 35, 36)
and data->hwirq is the domain hwirq (0, 524288, 524289, 524290)

Is there a way to map hwirq 524288 to MSI 0, hwirq 524289 to MSI 1, etc?

Regards.

Marc Zyngier April 12, 2017, 8:08 a.m. UTC | #22

On 11/04/17 18:52, Mason wrote:
> On 11/04/2017 18:43, Marc Zyngier wrote:
> 
>> On 11/04/17 17:26, Mason wrote:
>>
>>> Is there a function to map virq to the hwirq in any domain?
>>
>> Be more precise. If you want the hwirq associated with the view of a
>> virq in a given domain, that's the hwirq field in the corresponding
>> irq_data structure. Or are you after something else?
> 
> I registered an unmask method for my irq_chip.
> (IIUC, I'm supposed to unmask a specific MSI in this callback.)
> 
> # cat /proc/interrupts 
>            CPU0       CPU1       
>  30:          0          0    MSIfoo      0 Edge      aerdrv
>  34:          0          0    MSIfoo 524288 Edge      xhci_hcd
>  35:          0          0    MSIfoo 524289 Edge      xhci_hcd
>  36:          0          0    MSIfoo 524290 Edge      xhci_hcd
> 
> 
> void foo_unmask(struct irq_data *data)
> {
> 	int xx,yy;
> 	struct irq_domain *dom = data->domain;
> 	printk("%s: irq_data=%p irq=%u hwirq=%lu chip=%p dom=%p pdata=%p data=%p\n",
> 		__func__, data, data->irq, data->hwirq, data->chip, data->domain,
> 		data->parent_data, data->chip_data);
> 	printk("%s ops=%p data=%p parent=%p\n\n",
> 		dom->name, dom->ops, dom->host_data, dom->parent);
> 	printk("pcie=%p\n", dom->parent->host_data);
> 	//dump_stack();
> 	pci_msi_unmask_irq(data);
> 	struct tango_pcie *pcie = data->domain->parent->host_data;
> 	printk("\n%s: pcie=%p irq=%u hwirq=%lu\n\n",
> 		__func__, pcie, data->irq, data->hwirq);
> 	xx = irq_find_mapping(pcie->irq_domain, data->hwirq);
> 	yy = irq_find_mapping(pcie->msi_domain, data->hwirq);

Do you realize that hwirq is *per domain*, and only makes sense to *one*
irqchip? What you've written here doesn't make any sense.

> 	printk("xx=%d yy=%d\n", xx, yy);
> }
> 
> so data->irq is the virq (30, 34, 35, 36)
> and data->hwirq is the domain hwirq (0, 524288, 524289, 524290)
> 
> Is there a way to map hwirq 524288 to MSI 0, hwirq 524289 to MSI 1, etc?

Why would you need to do such a thing?
- In MSI domain: IRQ34 -> hwirq 524288
- In foo domain: IRQ34 -> hwirq [whatever your driver has allocated]

The data is already there, at your fingertips. Just deal with with your
"foo" irqchip.

	M.

Mason April 12, 2017, 9:50 a.m. UTC | #23

On 12/04/2017 10:08, Marc Zyngier wrote:

> On 11/04/17 18:52, Mason wrote:
> 
>> so data->irq is the virq (30, 34, 35, 36)
>> and data->hwirq is the domain hwirq (0, 524288, 524289, 524290)
>>
>> Is there a way to map hwirq 524288 to MSI 0, hwirq 524289 to MSI 1, etc?
> 
> Why would you need to do such a thing?
> - In MSI domain: IRQ34 -> hwirq 524288
> - In foo domain: IRQ34 -> hwirq [whatever your driver has allocated]
> 
> The data is already there, at your fingertips. Just deal with with your
> "foo" irqchip.

OK, let me take a step back, I may have missed the forest for
the trees.

In my original code (copied from the Altera driver) I unmasked
a given MSI in the tango_irq_domain_alloc() function. You said
this was the wrong place, as it should be done in the irqchip
unmask callback. Did I understand correctly, so far?

I have registered custom mask/unmask callbacks for both irq_chips.

The unmask callback that gets called is the one where hwirqs are
524288 and up. But what I need at that point is the "real" MSI
number, because I need to set bit i in some HW register.

Did I somehow mix up domains in one of the init functions?

Regards.



Current debug code posted below, because natural language is often
too ambiguous.

#define MSI_COUNT 32

struct tango_pcie {
	void __iomem *mux;
	void __iomem *msi_status;
	void __iomem *msi_mask;
	phys_addr_t msi_doorbell;
	struct mutex lock; /* lock for updating msi_mask */
	struct irq_domain *irq_domain;
	struct irq_domain *msi_domain;
	int irq;
};

static void tango_msi_isr(struct irq_desc *desc)
{
	unsigned long status;
	unsigned int pos, virq;
	struct irq_chip *chip = irq_desc_get_chip(desc);
	struct tango_pcie *pcie = irq_desc_get_handler_data(desc);

	chained_irq_enter(chip, desc);

	status = readl_relaxed(pcie->msi_status);
	writel_relaxed(status, pcie->msi_status); /* clear IRQs */

	for_each_set_bit(pos, &status, MSI_COUNT) {
		virq = irq_find_mapping(pcie->irq_domain, pos);
		generic_handle_irq(virq);
	}

	chained_irq_exit(chip, desc);
}

#define HOP printk("\nENTER %s\n", __func__); dump_stack()

void foo_ack(struct irq_data *data)	{ HOP; }
void foo_mask(struct irq_data *data)	{ HOP; pci_msi_mask_irq(data); }
void foo_unmask(struct irq_data *data)	{ HOP; pci_msi_unmask_irq(data); }

static struct irq_chip tango_msi_irq_chip = {
	.name = "MSI_foo",
	.irq_ack = foo_ack,
	.irq_mask = foo_mask,
	.irq_unmask = foo_unmask,
};

static struct msi_domain_info msi_domain_info = {
	.flags	= MSI_FLAG_PCI_MSIX | MSI_FLAG_USE_DEF_DOM_OPS | MSI_FLAG_USE_DEF_CHIP_OPS,
	.chip	= &tango_msi_irq_chip,
};

static void tango_compose_msi_msg(struct irq_data *data, struct msi_msg *msg)
{
	struct tango_pcie *pcie = irq_data_get_irq_chip_data(data);

	msg->address_lo = lower_32_bits(pcie->msi_doorbell);
	msg->address_hi = upper_32_bits(pcie->msi_doorbell);
	msg->data = data->hwirq;
}

static int tango_set_affinity(struct irq_data *irq_data,
		const struct cpumask *mask, bool force)
{
	 return -EINVAL;
}

void bar_ack(struct irq_data *data)	{ HOP; }
void bar_mask(struct irq_data *data)	{ HOP; }
void bar_unmask(struct irq_data *data)	{ HOP; }

static struct irq_chip tango_msi_chip = {
	.name			= "MSI_bar",
	.irq_compose_msi_msg	= tango_compose_msi_msg,
	.irq_set_affinity	= tango_set_affinity,
	.irq_ack = bar_ack,
	.irq_mask = bar_mask,
	.irq_unmask = bar_unmask,
};

static int find_free_msi(struct irq_domain *dom, unsigned int virq)
{
	unsigned int pos;
	unsigned long mask;
	struct tango_pcie *pcie = dom->host_data;

	mask = readl_relaxed(pcie->msi_mask);
	pos = find_first_zero_bit(&mask, MSI_COUNT);
	if (pos >= MSI_COUNT)
		return -ENOSPC;
	writel_relaxed(mask | BIT(pos), pcie->msi_mask);

	irq_domain_set_info(dom, virq, pos, &tango_msi_chip,
			dom->host_data, handle_edge_irq, NULL, NULL);

	return 0;
}

static int tango_irq_domain_alloc(struct irq_domain *dom,
		unsigned int virq, unsigned int nr_irqs, void *args)
{
	int err;
	struct tango_pcie *pcie = dom->host_data;

	mutex_lock(&pcie->lock);
	err = find_free_msi(dom, virq);
	mutex_unlock(&pcie->lock);

	return err;
}

static void tango_irq_domain_free(struct irq_domain *dom,
		unsigned int virq, unsigned int nr_irqs)
{
	struct irq_data *d = irq_domain_get_irq_data(dom, virq);
	struct tango_pcie *pcie = irq_data_get_irq_chip_data(d);
	unsigned int mask, pos = d->hwirq;

	mutex_lock(&pcie->lock);
	mask = readl_relaxed(pcie->msi_mask);
	writel_relaxed(mask & ~BIT(pos), pcie->msi_mask);
	mutex_unlock(&pcie->lock);
}

static const struct irq_domain_ops msi_dom_ops = {
	.alloc	= tango_irq_domain_alloc,
	.free	= tango_irq_domain_free,
};

static int tango_msi_remove(struct platform_device *pdev)
{
	struct tango_pcie *pcie = platform_get_drvdata(pdev);

	irq_set_chained_handler_and_data(pcie->irq, NULL, NULL);
	irq_domain_remove(pcie->msi_domain);
	irq_domain_remove(pcie->irq_domain);

	return 0;
}

static int tango_msi_probe(struct platform_device *pdev, struct tango_pcie *pcie)
{
	int virq;
	struct fwnode_handle *fwnode = of_node_to_fwnode(pdev->dev.of_node);
	struct irq_domain *msi_dom, *irq_dom;

	mutex_init(&pcie->lock);
	writel_relaxed(0, pcie->msi_mask);

	irq_dom = irq_domain_create_linear(fwnode, MSI_COUNT, &msi_dom_ops, pcie);
	if (!irq_dom) {
		pr_err("Failed to create IRQ domain\n");
		return -ENOMEM;
	}

	msi_dom = pci_msi_create_irq_domain(fwnode, &msi_domain_info, irq_dom);
	if (!msi_dom) {
		pr_err("Failed to create MSI domain\n");
		irq_domain_remove(irq_dom);
		return -ENOMEM;
	}

	virq = platform_get_irq(pdev, 1);
	if (virq <= 0) {
		pr_err("Failed to map IRQ\n");
		irq_domain_remove(msi_dom);
		irq_domain_remove(irq_dom);
		return -ENXIO;
	}

	pcie->irq_domain = irq_dom;
	pcie->msi_domain = msi_dom;
	pcie->irq = virq;
	irq_set_chained_handler_and_data(virq, tango_msi_isr, pcie);

	return 0;
}

Marc Zyngier April 12, 2017, 9:59 a.m. UTC | #24

On 12/04/17 10:50, Mason wrote:
> On 12/04/2017 10:08, Marc Zyngier wrote:
> 
>> On 11/04/17 18:52, Mason wrote:
>>
>>> so data->irq is the virq (30, 34, 35, 36)
>>> and data->hwirq is the domain hwirq (0, 524288, 524289, 524290)
>>>
>>> Is there a way to map hwirq 524288 to MSI 0, hwirq 524289 to MSI 1, etc?
>>
>> Why would you need to do such a thing?
>> - In MSI domain: IRQ34 -> hwirq 524288
>> - In foo domain: IRQ34 -> hwirq [whatever your driver has allocated]
>>
>> The data is already there, at your fingertips. Just deal with with your
>> "foo" irqchip.
> 
> OK, let me take a step back, I may have missed the forest for
> the trees.
> 
> In my original code (copied from the Altera driver) I unmasked
> a given MSI in the tango_irq_domain_alloc() function. You said
> this was the wrong place, as it should be done in the irqchip
> unmask callback. Did I understand correctly, so far?

Yes.

> I have registered custom mask/unmask callbacks for both irq_chips.

And that's *wrong*. I've repeatedly said that you only need to deal with
*your* irqchip. End of story. Nowhere else. Please leave the MSI irqchip
alone, it is very unlikely that you have to provide anything at all to it.

	M.

Mason April 19, 2017, 11:19 a.m. UTC | #25

On 12/04/2017 11:59, Marc Zyngier wrote:

> And that's *wrong*. I've repeatedly said that you only need to deal with
> *your* irqchip. End of story. Nowhere else. Please leave the MSI irqchip
> alone, it is very unlikely that you have to provide anything at all to it.

Hello Marc,

I am hoping you can help me pinpoint what's missing in my code,
to have it behave the way you describe.

As you have stated, I have *two* struct irq_chip:
one is mine (tango_chip), the other not_mine (msi_chip).

static struct irq_chip tango_chip = {
	.name			= "TANGO IRQ_CHIP",
	.irq_compose_msi_msg	= tango_compose_msi_msg,
	.irq_set_affinity	= tango_set_affinity,
	.irq_ack		= tango_ack,
	.irq_mask		= tango_mask,
	.irq_unmask		= tango_unmask,
};

static struct irq_chip msi_chip = {
	.name = "NOT MY IRQ_CHIP",
	.irq_ack = not_mine_ack,
	.irq_mask = not_mine_mask,
	.irq_unmask = not_mine_unmask,
};


A struct msi_domain_info links to the msi_chip.

static struct msi_domain_info msi_dom_info = {
	.flags	= MSI_FLAG_PCI_MSIX | MSI_FLAG_USE_DEF_DOM_OPS | MSI_FLAG_USE_DEF_CHIP_OPS,
	.chip	= &msi_chip,
};

irq_dom = irq_domain_create_linear(fwnode, MSI_COUNT, &irq_dom_ops, pcie);
msi_dom = pci_msi_create_irq_domain(fwnode, &msi_dom_info, irq_dom);


I believe irq_dom and tango_chip are tied together by:

	irq_domain_set_info(dom, virq, pos, &tango_chip, pcie, handle_edge_irq, NULL, NULL);


My biggest problem is that tango_unmask() is never called.


Here is the log of function calls as they happen on my system:

[    0.361586] TANGO irq_chip=c0d10a00 MSI irq_chip=c0d10aa8
[    0.361595] IRQ_DOM=cf9db300 MSI_DOM=cf9b5080

[    0.363214] find_free_msi: dom=cf9db300 virq=30 hwirq=0 chip=c0d10a00 cdata=cfb0a410
[    0.363314] [<c0316e54>] (tango_irq_domain_alloc) from [<c0163468>] (msi_domain_alloc+0x68/0x128)

[    0.363543] ENTER tango_compose_msi_msg
[    0.363552] TANGO IRQ_CHIP: irq_data=cfaf4180 irq=30 hwirq=0 chip=c0d10a00 dom=cf9db300 pdata=  (null) cdata=cfb0a410
[    0.363609] [<c0316ca4>] (tango_compose_msi_msg) from [<c015f654>] (irq_chip_compose_msi_msg+0x48/0x58)

[    0.363928] ENTER not_mine_unmask
[    0.363943] NOT MY IRQ_CHIP: irq_data=cf940110 irq=30 hwirq=0 chip=c0d10aa8 dom=cf9b5080 pdata=cfaf4180 cdata=  (null)
[    0.364012] [<c0316c48>] (not_mine_unmask) from [<c015edc0>] (irq_enable+0x30/0x44)

[    2.475111] find_free_msi: dom=cf9db300 virq=34 hwirq=1 chip=c0d10a00 cdata=cfb0a410
[    2.516895] [<c0316e54>] (tango_irq_domain_alloc) from [<c0163468>] (msi_domain_alloc+0x68/0x128)

[    2.686593] find_free_msi: dom=cf9db300 virq=35 hwirq=2 chip=c0d10a00 cdata=cfb0a410
[    2.728334] [<c0316e54>] (tango_irq_domain_alloc) from [<c0163468>] (msi_domain_alloc+0x68/0x128)

[    2.897946] find_free_msi: dom=cf9db300 virq=36 hwirq=3 chip=c0d10a00 cdata=cfb0a410
[    2.939680] [<c0316e54>] (tango_irq_domain_alloc) from [<c0163468>] (msi_domain_alloc+0x68/0x128)

[    3.109254] ENTER tango_compose_msi_msg
[    3.114603] TANGO IRQ_CHIP: irq_data=cfb4b500 irq=34 hwirq=1 chip=c0d10a00 dom=cf9db300 pdata=  (null) cdata=cfb0a410
[    3.160621] [<c0316ca4>] (tango_compose_msi_msg) from [<c015f654>] (irq_chip_compose_msi_msg+0x48/0x58)

[    3.340161] ENTER tango_compose_msi_msg
[    3.345507] TANGO IRQ_CHIP: irq_data=cfb4b440 irq=35 hwirq=2 chip=c0d10a00 dom=cf9db300 pdata=  (null) cdata=cfb0a410
[    3.391522] [<c0316ca4>] (tango_compose_msi_msg) from [<c015f654>] (irq_chip_compose_msi_msg+0x48/0x58)

[    3.571148] ENTER tango_compose_msi_msg
[    3.576503] TANGO IRQ_CHIP: irq_data=cfb4b380 irq=36 hwirq=3 chip=c0d10a00 dom=cf9db300 pdata=  (null) cdata=cfb0a410
[    3.631966] [<c015f654>] (irq_chip_compose_msi_msg) from [<c016362c>] (msi_domain_activate+0x18/0x40)

[    3.802121] ENTER not_mine_unmask
[    3.806937] NOT MY IRQ_CHIP: irq_data=cf95c210 irq=34 hwirq=524288 chip=c0d10aa8 dom=cf9b5080 pdata=cfb4b500 cdata=  (null)
[    3.852943] [<c0316c48>] (not_mine_unmask) from [<c015edc0>] (irq_enable+0x30/0x44)

[    3.998596] ENTER not_mine_unmask
[    4.003410] NOT MY IRQ_CHIP: irq_data=cf95c010 irq=35 hwirq=524289 chip=c0d10aa8 dom=cf9b5080 pdata=cfb4b440 cdata=  (null)
[    4.049415] [<c0316c48>] (not_mine_unmask) from [<c015edc0>] (irq_enable+0x30/0x44)

[    4.195055] ENTER not_mine_unmask
[    4.199870] NOT MY IRQ_CHIP: irq_data=cf97cf10 irq=36 hwirq=524290 chip=c0d10aa8 dom=cf9b5080 pdata=cfb4b380 cdata=  (null)
[    4.245874] [<c0316c48>] (not_mine_unmask) from [<c015edc0>] (irq_enable+0x30/0x44)



It's driving me nuts. Do you know what might be wrong?

Regards.

Mason April 20, 2017, 8:20 a.m. UTC | #26

On 19/04/2017 13:19, Mason wrote:

> My biggest problem is that tango_unmask() is never called.

FTR, the missing incantation was:
Explicitly calling tango_{mask/unmask/ack} from the corresponding msi_{mask/unmask/ack}

Marc, I have one nagging doubt, wrt splitting MSI line selection
and MSI enable.

tango_irq_domain_alloc : finds an available MSI 'j' to allocate
tango_irq_domain_free : release MSI 'j'
tango_unmask : enable MSI 'j'
tango_mask : disable MSI 'j'

Is the following scenario guaranteed to never happen?

tango_irq_domain_alloc // alloc 0
tango_irq_domain_free  // free 0
tango_irq_domain_alloc // alloc 0
tango_unmask // enable 0
tango_unmask // enable 0 = NOP
tango_mask   // disable 0

In this theoretical scenario, we would be left with a non-functional
MSI 0.

Regards.

Marc Zyngier April 20, 2017, 9:43 a.m. UTC | #27

On 20/04/17 09:20, Mason wrote:
> On 19/04/2017 13:19, Mason wrote:
> 
>> My biggest problem is that tango_unmask() is never called.
> 
> FTR, the missing incantation was:
> Explicitly calling tango_{mask/unmask/ack} from the corresponding msi_{mask/unmask/ack}

Using irq_chip_mask_parent and co, you mean?

> Marc, I have one nagging doubt, wrt splitting MSI line selection
> and MSI enable.
> 
> tango_irq_domain_alloc : finds an available MSI 'j' to allocate
> tango_irq_domain_free : release MSI 'j'
> tango_unmask : enable MSI 'j'
> tango_mask : disable MSI 'j'
> 
> Is the following scenario guaranteed to never happen?
> 
> tango_irq_domain_alloc // alloc 0
> tango_irq_domain_free  // free 0
> tango_irq_domain_alloc // alloc 0
> tango_unmask // enable 0
> tango_unmask // enable 0 = NOP
> tango_mask   // disable 0
> 
> In this theoretical scenario, we would be left with a non-functional
> MSI 0.

I'm not sure I get the example above, and what the various alloc/free
calls have to do with anything. If you have unbalanced
enable/disable_irq, you loose. Don't do that.

	M.

[RFC,v0.2] PCI: Add support for tango PCIe host bridge

Commit Message

Comments

Patch