[v2] spi: bcm2835: Enable shared interrupt support

Message ID	20200604212819.715-1-f.fainelli@gmail.com (mailing list archive)
State	New, archived
Headers	show Return-Path: <SRS0=yMTu=7R=vger.kernel.org=linux-spi-owner@kernel.org> From: Florian Fainelli <f.fainelli@gmail.com> To: linux-kernel@vger.kernel.org Cc: Florian Fainelli <f.fainelli@gmail.com>, Mark Brown <broonie@kernel.org>, Rob Herring <robh+dt@kernel.org>, Nicolas Saenz Julienne <nsaenzjulienne@suse.de>, Ray Jui <rjui@broadcom.com>, Scott Branden <sbranden@broadcom.com>, bcm-kernel-feedback-list@broadcom.com (maintainer:BROADCOM BCM281XX/BCM11XXX/BCM216XX ARM ARCHITE...), linux-spi@vger.kernel.org (open list:SPI SUBSYSTEM), devicetree@vger.kernel.org (open list:OPEN FIRMWARE AND FLATTENED DEVICE TREE BINDINGS), linux-rpi-kernel@lists.infradead.org (moderated list:BROADCOM BCM2711/BCM2835 ARM ARCHITECTURE), linux-arm-kernel@lists.infradead.org (moderated list:BROADCOM BCM2711/BCM2835 ARM ARCHITECTURE), Martin Sperl <kernel@martin.sperl.org>, lukas@wunner.de Subject: [PATCH v2] spi: bcm2835: Enable shared interrupt support Date: Thu, 4 Jun 2020 14:28:19 -0700 Message-Id: <20200604212819.715-1-f.fainelli@gmail.com> Sender: linux-spi-owner@vger.kernel.org Precedence: bulk
Series	[v2] spi: bcm2835: Enable shared interrupt support \| expand [v2] spi: bcm2835: Enable shared interrupt support

Florian Fainelli June 4, 2020, 9:28 p.m. UTC

The 4 SPI controller instances added in BCM2711 and BCM7211 SoCs (SPI3,
SPI4, SPI5 and SPI6) share the same interrupt line with SPI0.

For the BCM2835 case which is deemed performance critical, we would like
to continue using an interrupt handler which does not have the extra
comparison on BCM2835_SPI_CS_INTR.

To support that requirement the common interrupt handling code between
the shared and non-shared interrupt paths is split into a
bcm2835_spi_interrupt_common() and both bcm2835_spi_interrupt() as well
as bcm2835_spi_shared_interrupt() make use of it.

During probe, we determine if there is at least another instance of this
SPI controller, and if there is, then we install a shared interrupt
handler.

Signed-off-by: Florian Fainelli <f.fainelli@gmail.com>
---
Changes in v2:

- identify other available SPI nodes to determine if we need to set-up
  interrupt sharing. This needs to happen for the very first instance
  since we cannot know for the first instance whether interrupt sharing
  is needed or not.

 drivers/spi/spi-bcm2835.c | 61 ++++++++++++++++++++++++++++++++-------
 1 file changed, 50 insertions(+), 11 deletions(-)

Nicolas Saenz Julienne June 5, 2020, 8:46 a.m. UTC | #1

Hi Florian,
Thanks for taking over this!

On Thu, 2020-06-04 at 14:28 -0700, Florian Fainelli wrote:
> The 4 SPI controller instances added in BCM2711 and BCM7211 SoCs (SPI3,
> SPI4, SPI5 and SPI6) share the same interrupt line with SPI0.

I think this isn't 100% correct. SPI0 has its own interrupt, but SPI[3-6] share
the same interrupt.

> For the BCM2835 case which is deemed performance critical, we would like
> to continue using an interrupt handler which does not have the extra
> comparison on BCM2835_SPI_CS_INTR.
> 
> To support that requirement the common interrupt handling code between
> the shared and non-shared interrupt paths is split into a
> bcm2835_spi_interrupt_common() and both bcm2835_spi_interrupt() as well
> as bcm2835_spi_shared_interrupt() make use of it.
> 
> During probe, we determine if there is at least another instance of this
> SPI controller, and if there is, then we install a shared interrupt
> handler.

As there was pushback to use a different compatible string for an otherwise
identical IP, I think it's a good compromise.

> 
> Signed-off-by: Florian Fainelli <f.fainelli@gmail.com>
> ---
> Changes in v2:
> 
> - identify other available SPI nodes to determine if we need to set-up
>   interrupt sharing. This needs to happen for the very first instance
>   since we cannot know for the first instance whether interrupt sharing
>   is needed or not.
> 
>  drivers/spi/spi-bcm2835.c | 61 ++++++++++++++++++++++++++++++++-------
>  1 file changed, 50 insertions(+), 11 deletions(-)
> 
> diff --git a/drivers/spi/spi-bcm2835.c b/drivers/spi/spi-bcm2835.c
> index 237bd306c268..0288b5b3de1e 100644
> --- a/drivers/spi/spi-bcm2835.c
> +++ b/drivers/spi/spi-bcm2835.c
> @@ -361,11 +361,10 @@ static void bcm2835_spi_reset_hw(struct spi_controller
> *ctlr)
>  	bcm2835_wr(bs, BCM2835_SPI_DLEN, 0);
>  }
>  
> -static irqreturn_t bcm2835_spi_interrupt(int irq, void *dev_id)
> +static inline irqreturn_t bcm2835_spi_interrupt_common(struct spi_controller
> *ctlr,
> +						       u32 cs)

Keep in mind the new 100 character limit.

>  {
> -	struct spi_controller *ctlr = dev_id;
>  	struct bcm2835_spi *bs = spi_controller_get_devdata(ctlr);
> -	u32 cs = bcm2835_rd(bs, BCM2835_SPI_CS);
>  
>  	/*
>  	 * An interrupt is signaled either if DONE is set (TX FIFO empty)
> @@ -394,6 +393,27 @@ static irqreturn_t bcm2835_spi_interrupt(int irq, void
> *dev_id)
>  	return IRQ_HANDLED;
>  }
>  
> +static irqreturn_t bcm2835_spi_interrupt(int irq, void *dev_id)
> +{
> +	struct spi_controller *ctlr = dev_id;
> +	struct bcm2835_spi *bs = spi_controller_get_devdata(ctlr);
> +	u32 cs = bcm2835_rd(bs, BCM2835_SPI_CS);
> +
> +	return bcm2835_spi_interrupt_common(ctlr, cs);
> +}
> +
> +static irqreturn_t bcm2835_spi_shared_interrupt(int irq, void *dev_id)
> +{
> +	struct spi_controller *ctlr = dev_id;
> +	struct bcm2835_spi *bs = spi_controller_get_devdata(ctlr);
> +	u32 cs = bcm2835_rd(bs, BCM2835_SPI_CS);
> +
> +	if (!(cs & BCM2835_SPI_CS_INTR))
> +		return IRQ_NONE;
> +
> +	return bcm2835_spi_interrupt_common(ctlr, cs);
> +}
> +
>  static int bcm2835_spi_transfer_one_irq(struct spi_controller *ctlr,
>  					struct spi_device *spi,
>  					struct spi_transfer *tfr,
> @@ -1287,12 +1307,37 @@ static int bcm2835_spi_setup(struct spi_device *spi)
>  	return 0;
>  }
>  
> +static const struct of_device_id bcm2835_spi_match[] = {
> +	{ .compatible = "brcm,bcm2835-spi", },
> +	{}
> +};
> +MODULE_DEVICE_TABLE(of, bcm2835_spi_match);
> +
>  static int bcm2835_spi_probe(struct platform_device *pdev)
>  {
> +	irq_handler_t bcm2835_spi_isr_func = bcm2835_spi_interrupt;
>  	struct spi_controller *ctlr;
> +	unsigned long flags = 0;
> +	struct device_node *dn;
>  	struct bcm2835_spi *bs;
>  	int err;
>  
> +	/* On BCM2711 there can be multiple SPI controllers enabled sharing the
> +	 * same interrupt line, but we also want to minimize the overhead if
> +	 * there is no need to support interrupt sharing. If we find at least
> +	 * another available instane (not counting the one we are probed from),

"instance"

> +	 * then we assume that interrupt sharing is necessary.
> +	 */
> +	for_each_compatible_node(dn, NULL, bcm2835_spi_match[0].compatible) {
> +		err = of_device_is_available(dn) && dn != pdev->dev.of_node;

nit: maybe err is not the ideal variable name here.

> +		of_node_put(dn);
> +		if (err) {
> +			flags = IRQF_SHARED;
> +			bcm2835_spi_isr_func = bcm2835_spi_shared_interrupt;
> +			break;
> +		}
> +	}
> +
>  	ctlr = spi_alloc_master(&pdev->dev, ALIGN(sizeof(*bs),
>  						  dma_get_cache_alignment()));
>  	if (!ctlr)
> @@ -1344,8 +1389,8 @@ static int bcm2835_spi_probe(struct platform_device
> *pdev)
>  	bcm2835_wr(bs, BCM2835_SPI_CS,
>  		   BCM2835_SPI_CS_CLEAR_RX | BCM2835_SPI_CS_CLEAR_TX);
>  
> -	err = devm_request_irq(&pdev->dev, bs->irq, bcm2835_spi_interrupt, 0,
> -			       dev_name(&pdev->dev), ctlr);
> +	err = devm_request_irq(&pdev->dev, bs->irq, bcm2835_spi_isr_func,
> +			       flags, dev_name(&pdev->dev), ctlr);
>  	if (err) {
>  		dev_err(&pdev->dev, "could not request IRQ: %d\n", err);
>  		goto out_dma_release;
> @@ -1400,12 +1445,6 @@ static void bcm2835_spi_shutdown(struct platform_device
> *pdev)
>  		dev_err(&pdev->dev, "failed to shutdown\n");
>  }
>  
> -static const struct of_device_id bcm2835_spi_match[] = {
> -	{ .compatible = "brcm,bcm2835-spi", },
> -	{}
> -};
> -MODULE_DEVICE_TABLE(of, bcm2835_spi_match);
> -
>  static struct platform_driver bcm2835_spi_driver = {
>  	.driver		= {
>  		.name		= DRV_NAME,

Lukas Wunner June 5, 2020, 10:51 a.m. UTC | #2

On Thu, Jun 04, 2020 at 02:28:19PM -0700, Florian Fainelli wrote:
> The 4 SPI controller instances added in BCM2711 and BCM7211 SoCs (SPI3,
> SPI4, SPI5 and SPI6) share the same interrupt line with SPI0.
> 
> For the BCM2835 case which is deemed performance critical, we would like
> to continue using an interrupt handler which does not have the extra
> comparison on BCM2835_SPI_CS_INTR.
> 
> To support that requirement the common interrupt handling code between
> the shared and non-shared interrupt paths is split into a
> bcm2835_spi_interrupt_common() and both bcm2835_spi_interrupt() as well
> as bcm2835_spi_shared_interrupt() make use of it.
> 
> During probe, we determine if there is at least another instance of this
> SPI controller, and if there is, then we install a shared interrupt
> handler.
> 
> Signed-off-by: Florian Fainelli <f.fainelli@gmail.com>

Reviewed-by: Lukas Wunner <lukas@wunner.de>

Mark Brown June 5, 2020, 10:52 a.m. UTC | #3

On Fri, Jun 05, 2020 at 10:46:57AM +0200, Nicolas Saenz Julienne wrote:

> > -static irqreturn_t bcm2835_spi_interrupt(int irq, void *dev_id)
> > +static inline irqreturn_t bcm2835_spi_interrupt_common(struct spi_controller
> > *ctlr,
> > +						       u32 cs)

> Keep in mind the new 100 character limit.

That's more about stopping people doing awful contortions to shut
checkpatch up than saying that it's a particularly good idea to lengthen
lines.

Nicolas Saenz Julienne June 5, 2020, 10:58 a.m. UTC | #4

On Fri, 2020-06-05 at 10:46 +0200, Nicolas Saenz Julienne wrote:
> Hi Florian,
> Thanks for taking over this!
> 
> On Thu, 2020-06-04 at 14:28 -0700, Florian Fainelli wrote:
> > The 4 SPI controller instances added in BCM2711 and BCM7211 SoCs (SPI3,
> > SPI4, SPI5 and SPI6) share the same interrupt line with SPI0.
> 
> I think this isn't 100% correct. SPI0 has its own interrupt, but SPI[3-6]
> share
> the same interrupt.

I'm wrong here, I missed this in bcm2711.dtsi:

&spi {
	interrupts = <GIC_SPI 118 IRQ_TYPE_LEVEL_HIGH>;
};

Sorry for the noise.

Regards,
Nicolas

Mark Brown June 5, 2020, 11:14 a.m. UTC | #5

On Thu, 4 Jun 2020 14:28:19 -0700, Florian Fainelli wrote:
> The 4 SPI controller instances added in BCM2711 and BCM7211 SoCs (SPI3,
> SPI4, SPI5 and SPI6) share the same interrupt line with SPI0.
> 
> For the BCM2835 case which is deemed performance critical, we would like
> to continue using an interrupt handler which does not have the extra
> comparison on BCM2835_SPI_CS_INTR.
> 
> [...]

Applied to

   https://git.kernel.org/pub/scm/linux/kernel/git/broonie/spi.git for-next

Thanks!

[1/1] spi: bcm2835: Enable shared interrupt support
      commit: ecfbd3cf3b8bb73ac6a80ddf430b5912fd4402a6

All being well this means that it will be integrated into the linux-next
tree (usually sometime in the next 24 hours) and sent to Linus during
the next merge window (or sooner if it is a bug fix), however if
problems are discovered then the patch may be dropped or reverted.

You may get further e-mails resulting from automated or manual testing
and review of the tree, please engage with people reporting problems and
send followup patches addressing any issues that are reported if needed.

If any updates are required or you are submitting further changes they
should be sent as incremental updates against current git, existing
patches will not be replaced.

Please add any relevant lists and maintainers to the CCs when replying
to this mail.

Thanks,
Mark

Robin Murphy June 5, 2020, 11:34 a.m. UTC | #6

On 2020-06-04 22:28, Florian Fainelli wrote:
> The 4 SPI controller instances added in BCM2711 and BCM7211 SoCs (SPI3,
> SPI4, SPI5 and SPI6) share the same interrupt line with SPI0.
> 
> For the BCM2835 case which is deemed performance critical, we would like
> to continue using an interrupt handler which does not have the extra
> comparison on BCM2835_SPI_CS_INTR.

FWIW, if I'm reading the patch correctly, then with sensible codegen 
that "overhead" should amount to a bit test on a live register plus a 
not-taken conditional branch - according to the 1176 TRM that should add 
up to a whopping 2 cycles. If that's really significant then I'd have to 
wonder whether you want to be at the mercy of the whole generic IRQ 
stack at all, and should perhaps consider using FIQ instead.

> To support that requirement the common interrupt handling code between
> the shared and non-shared interrupt paths is split into a
> bcm2835_spi_interrupt_common() and both bcm2835_spi_interrupt() as well
> as bcm2835_spi_shared_interrupt() make use of it.
> 
> During probe, we determine if there is at least another instance of this
> SPI controller, and if there is, then we install a shared interrupt
> handler.
> 
> Signed-off-by: Florian Fainelli <f.fainelli@gmail.com>
> ---
> Changes in v2:
> 
> - identify other available SPI nodes to determine if we need to set-up
>    interrupt sharing. This needs to happen for the very first instance
>    since we cannot know for the first instance whether interrupt sharing
>    is needed or not.
> 
>   drivers/spi/spi-bcm2835.c | 61 ++++++++++++++++++++++++++++++++-------
>   1 file changed, 50 insertions(+), 11 deletions(-)
> 
> diff --git a/drivers/spi/spi-bcm2835.c b/drivers/spi/spi-bcm2835.c
> index 237bd306c268..0288b5b3de1e 100644
> --- a/drivers/spi/spi-bcm2835.c
> +++ b/drivers/spi/spi-bcm2835.c
> @@ -361,11 +361,10 @@ static void bcm2835_spi_reset_hw(struct spi_controller *ctlr)
>   	bcm2835_wr(bs, BCM2835_SPI_DLEN, 0);
>   }
>   
> -static irqreturn_t bcm2835_spi_interrupt(int irq, void *dev_id)
> +static inline irqreturn_t bcm2835_spi_interrupt_common(struct spi_controller *ctlr,
> +						       u32 cs)
>   {
> -	struct spi_controller *ctlr = dev_id;
>   	struct bcm2835_spi *bs = spi_controller_get_devdata(ctlr);
> -	u32 cs = bcm2835_rd(bs, BCM2835_SPI_CS);
>   
>   	/*
>   	 * An interrupt is signaled either if DONE is set (TX FIFO empty)
> @@ -394,6 +393,27 @@ static irqreturn_t bcm2835_spi_interrupt(int irq, void *dev_id)
>   	return IRQ_HANDLED;
>   }
>   
> +static irqreturn_t bcm2835_spi_interrupt(int irq, void *dev_id)
> +{
> +	struct spi_controller *ctlr = dev_id;
> +	struct bcm2835_spi *bs = spi_controller_get_devdata(ctlr);
> +	u32 cs = bcm2835_rd(bs, BCM2835_SPI_CS);
> +
> +	return bcm2835_spi_interrupt_common(ctlr, cs);
> +}
> +
> +static irqreturn_t bcm2835_spi_shared_interrupt(int irq, void *dev_id)
> +{
> +	struct spi_controller *ctlr = dev_id;
> +	struct bcm2835_spi *bs = spi_controller_get_devdata(ctlr);
> +	u32 cs = bcm2835_rd(bs, BCM2835_SPI_CS);
> +
> +	if (!(cs & BCM2835_SPI_CS_INTR))
> +		return IRQ_NONE;
> +
> +	return bcm2835_spi_interrupt_common(ctlr, cs);
> +}
> +
>   static int bcm2835_spi_transfer_one_irq(struct spi_controller *ctlr,
>   					struct spi_device *spi,
>   					struct spi_transfer *tfr,
> @@ -1287,12 +1307,37 @@ static int bcm2835_spi_setup(struct spi_device *spi)
>   	return 0;
>   }
>   
> +static const struct of_device_id bcm2835_spi_match[] = {
> +	{ .compatible = "brcm,bcm2835-spi", },
> +	{}
> +};
> +MODULE_DEVICE_TABLE(of, bcm2835_spi_match);
> +
>   static int bcm2835_spi_probe(struct platform_device *pdev)
>   {
> +	irq_handler_t bcm2835_spi_isr_func = bcm2835_spi_interrupt;
>   	struct spi_controller *ctlr;
> +	unsigned long flags = 0;
> +	struct device_node *dn;
>   	struct bcm2835_spi *bs;
>   	int err;
>   
> +	/* On BCM2711 there can be multiple SPI controllers enabled sharing the
> +	 * same interrupt line, but we also want to minimize the overhead if
> +	 * there is no need to support interrupt sharing. If we find at least
> +	 * another available instane (not counting the one we are probed from),
> +	 * then we assume that interrupt sharing is necessary.
> +	 */
> +	for_each_compatible_node(dn, NULL, bcm2835_spi_match[0].compatible) {
> +		err = of_device_is_available(dn) && dn != pdev->dev.of_node;
> +		of_node_put(dn);

This is in the wrong place - it should only be where you terminate the 
loop early and thus bypass the "of_node_put(from)" call in the iterator 
itself.

> +		if (err) {
> +			flags = IRQF_SHARED;

Is there really any harm to setting IRQF_SHARED even when the interrupt 
isn't shared in hardware? Sure, it means you lose a degree of API-level 
validation and some other driver might also be able to simultaneously 
claim it on bcm283x, but in that case the DT would be wrong and 
*something* isn't going to work correctly anyway, so does this one 
driver really need to care about trying to be the DT police?

Robin.

> +			bcm2835_spi_isr_func = bcm2835_spi_shared_interrupt;
> +			break;
> +		}
> +	}
> +
>   	ctlr = spi_alloc_master(&pdev->dev, ALIGN(sizeof(*bs),
>   						  dma_get_cache_alignment()));
>   	if (!ctlr)
> @@ -1344,8 +1389,8 @@ static int bcm2835_spi_probe(struct platform_device *pdev)
>   	bcm2835_wr(bs, BCM2835_SPI_CS,
>   		   BCM2835_SPI_CS_CLEAR_RX | BCM2835_SPI_CS_CLEAR_TX);
>   
> -	err = devm_request_irq(&pdev->dev, bs->irq, bcm2835_spi_interrupt, 0,
> -			       dev_name(&pdev->dev), ctlr);
> +	err = devm_request_irq(&pdev->dev, bs->irq, bcm2835_spi_isr_func,
> +			       flags, dev_name(&pdev->dev), ctlr);
>   	if (err) {
>   		dev_err(&pdev->dev, "could not request IRQ: %d\n", err);
>   		goto out_dma_release;
> @@ -1400,12 +1445,6 @@ static void bcm2835_spi_shutdown(struct platform_device *pdev)
>   		dev_err(&pdev->dev, "failed to shutdown\n");
>   }
>   
> -static const struct of_device_id bcm2835_spi_match[] = {
> -	{ .compatible = "brcm,bcm2835-spi", },
> -	{}
> -};
> -MODULE_DEVICE_TABLE(of, bcm2835_spi_match);
> -
>   static struct platform_driver bcm2835_spi_driver = {
>   	.driver		= {
>   		.name		= DRV_NAME,
>

Mark Brown June 5, 2020, 12:20 p.m. UTC | #7

On Fri, Jun 05, 2020 at 12:14:07PM +0100, Mark Brown wrote:

> [1/1] spi: bcm2835: Enable shared interrupt support
>       commit: ecfbd3cf3b8bb73ac6a80ddf430b5912fd4402a6

Eh, sorry - this was me fat fingering another fix.  At the very least
this needs to wait for the end of the merge window.

Mark Brown June 5, 2020, 1:20 p.m. UTC | #8

On Fri, Jun 05, 2020 at 12:34:36PM +0100, Robin Murphy wrote:
> On 2020-06-04 22:28, Florian Fainelli wrote:

> > For the BCM2835 case which is deemed performance critical, we would like
> > to continue using an interrupt handler which does not have the extra
> > comparison on BCM2835_SPI_CS_INTR.

> FWIW, if I'm reading the patch correctly, then with sensible codegen that
> "overhead" should amount to a bit test on a live register plus a not-taken
> conditional branch - according to the 1176 TRM that should add up to a
> whopping 2 cycles. If that's really significant then I'd have to wonder
> whether you want to be at the mercy of the whole generic IRQ stack at all,
> and should perhaps consider using FIQ instead.

Yes, and indeed the compiler does seem to manage that.  It *is* non-zero
overhead though.

Robin Murphy June 5, 2020, 1:46 p.m. UTC | #9

On 2020-06-05 14:20, Mark Brown wrote:
> On Fri, Jun 05, 2020 at 12:34:36PM +0100, Robin Murphy wrote:
>> On 2020-06-04 22:28, Florian Fainelli wrote:
> 
>>> For the BCM2835 case which is deemed performance critical, we would like
>>> to continue using an interrupt handler which does not have the extra
>>> comparison on BCM2835_SPI_CS_INTR.
> 
>> FWIW, if I'm reading the patch correctly, then with sensible codegen that
>> "overhead" should amount to a bit test on a live register plus a not-taken
>> conditional branch - according to the 1176 TRM that should add up to a
>> whopping 2 cycles. If that's really significant then I'd have to wonder
>> whether you want to be at the mercy of the whole generic IRQ stack at all,
>> and should perhaps consider using FIQ instead.
> 
> Yes, and indeed the compiler does seem to manage that.  It *is* non-zero
> overhead though.

True, but so's the existing level of pointer-chasing indirection that 
with some straightforward refactoring could be taken right out of the 
critical path and confined to just the conditional complete() call. 
That's the kind of thing leaving me unconvinced that this is code where 
every single cycle counts ;)

Robin.

Robin Murphy June 5, 2020, 2:41 p.m. UTC | #10

On 2020-06-05 14:46, Robin Murphy wrote:
> On 2020-06-05 14:20, Mark Brown wrote:
>> On Fri, Jun 05, 2020 at 12:34:36PM +0100, Robin Murphy wrote:
>>> On 2020-06-04 22:28, Florian Fainelli wrote:
>>
>>>> For the BCM2835 case which is deemed performance critical, we would 
>>>> like
>>>> to continue using an interrupt handler which does not have the extra
>>>> comparison on BCM2835_SPI_CS_INTR.
>>
>>> FWIW, if I'm reading the patch correctly, then with sensible codegen 
>>> that
>>> "overhead" should amount to a bit test on a live register plus a 
>>> not-taken
>>> conditional branch - according to the 1176 TRM that should add up to a
>>> whopping 2 cycles. If that's really significant then I'd have to wonder
>>> whether you want to be at the mercy of the whole generic IRQ stack at 
>>> all,
>>> and should perhaps consider using FIQ instead.
>>
>> Yes, and indeed the compiler does seem to manage that.  It *is* non-zero
>> overhead though.
> 
> True, but so's the existing level of pointer-chasing indirection that 
> with some straightforward refactoring could be taken right out of the 
> critical path and confined to just the conditional complete() call. 
> That's the kind of thing leaving me unconvinced that this is code where 
> every single cycle counts ;)

Ha, and in fact having checked a build out of curiosity, this patch 
as-is actually stands to make things considerably worse. At least with 
GCC 8.3 and bcm2835_defconfig, bcm2835_spi_interrupt_common() doesn't 
get inlined, which means bcm2835_spi_interrupt() pushes/pops a stack 
frame and makes an out-of-line call to bcm2835_spi_interrupt_common(), 
resulting in massively *more* work than the extra two instructions of 
simply inlining the test.

So yes, the overhead of inlining the test vs. the alternative is indeed 
non-zero. It's just also negative :D

Robin.

Mark Brown June 5, 2020, 3:27 p.m. UTC | #11

On Fri, Jun 05, 2020 at 03:41:27PM +0100, Robin Murphy wrote:

> Ha, and in fact having checked a build out of curiosity, this patch as-is
> actually stands to make things considerably worse. At least with GCC 8.3 and
> bcm2835_defconfig, bcm2835_spi_interrupt_common() doesn't get inlined, which
> means bcm2835_spi_interrupt() pushes/pops a stack frame and makes an
> out-of-line call to bcm2835_spi_interrupt_common(), resulting in massively
> *more* work than the extra two instructions of simply inlining the test.

Whichever compiler I was using (clang-11 probably) did manage to inline
the tail call so it really was the two instructions but yeah, in general
this approach is going to be fragile.

> So yes, the overhead of inlining the test vs. the alternative is indeed
> non-zero. It's just also negative :D

And variable!

Florian Fainelli June 5, 2020, 10:04 p.m. UTC | #12

On 6/5/2020 7:41 AM, Robin Murphy wrote:
> On 2020-06-05 14:46, Robin Murphy wrote:
>> On 2020-06-05 14:20, Mark Brown wrote:
>>> On Fri, Jun 05, 2020 at 12:34:36PM +0100, Robin Murphy wrote:
>>>> On 2020-06-04 22:28, Florian Fainelli wrote:
>>>
>>>>> For the BCM2835 case which is deemed performance critical, we would
>>>>> like
>>>>> to continue using an interrupt handler which does not have the extra
>>>>> comparison on BCM2835_SPI_CS_INTR.
>>>
>>>> FWIW, if I'm reading the patch correctly, then with sensible codegen
>>>> that
>>>> "overhead" should amount to a bit test on a live register plus a
>>>> not-taken
>>>> conditional branch - according to the 1176 TRM that should add up to a
>>>> whopping 2 cycles. If that's really significant then I'd have to wonder
>>>> whether you want to be at the mercy of the whole generic IRQ stack
>>>> at all,
>>>> and should perhaps consider using FIQ instead.
>>>
>>> Yes, and indeed the compiler does seem to manage that.  It *is* non-zero
>>> overhead though.
>>
>> True, but so's the existing level of pointer-chasing indirection that
>> with some straightforward refactoring could be taken right out of the
>> critical path and confined to just the conditional complete() call.
>> That's the kind of thing leaving me unconvinced that this is code
>> where every single cycle counts ;)
> 
> Ha, and in fact having checked a build out of curiosity, this patch
> as-is actually stands to make things considerably worse. At least with
> GCC 8.3 and bcm2835_defconfig, bcm2835_spi_interrupt_common() doesn't
> get inlined, which means bcm2835_spi_interrupt() pushes/pops a stack
> frame and makes an out-of-line call to bcm2835_spi_interrupt_common(),
> resulting in massively *more* work than the extra two instructions of
> simply inlining the test.
> 
> So yes, the overhead of inlining the test vs. the alternative is indeed
> non-zero. It's just also negative :D

Is it reliable across compiler versions if we use __always_inline?

The only other alternative that I can think of is using a static key to
eliminate the test for the single controller case. This feels highly
over engineered, but if that proves more reliable and gets everybody
their cookie, why not.

Lukas, do you have any way to test with the conditional being present
that the performance or latency does not suffer so much that it becomes
unacceptable for your use cases?

Robin Murphy June 8, 2020, 11:11 a.m. UTC | #13

On 2020-06-05 23:04, Florian Fainelli wrote:
> On 6/5/2020 7:41 AM, Robin Murphy wrote:
>> On 2020-06-05 14:46, Robin Murphy wrote:
>>> On 2020-06-05 14:20, Mark Brown wrote:
>>>> On Fri, Jun 05, 2020 at 12:34:36PM +0100, Robin Murphy wrote:
>>>>> On 2020-06-04 22:28, Florian Fainelli wrote:
>>>>
>>>>>> For the BCM2835 case which is deemed performance critical, we would
>>>>>> like
>>>>>> to continue using an interrupt handler which does not have the extra
>>>>>> comparison on BCM2835_SPI_CS_INTR.
>>>>
>>>>> FWIW, if I'm reading the patch correctly, then with sensible codegen
>>>>> that
>>>>> "overhead" should amount to a bit test on a live register plus a
>>>>> not-taken
>>>>> conditional branch - according to the 1176 TRM that should add up to a
>>>>> whopping 2 cycles. If that's really significant then I'd have to wonder
>>>>> whether you want to be at the mercy of the whole generic IRQ stack
>>>>> at all,
>>>>> and should perhaps consider using FIQ instead.
>>>>
>>>> Yes, and indeed the compiler does seem to manage that.  It *is* non-zero
>>>> overhead though.
>>>
>>> True, but so's the existing level of pointer-chasing indirection that
>>> with some straightforward refactoring could be taken right out of the
>>> critical path and confined to just the conditional complete() call.
>>> That's the kind of thing leaving me unconvinced that this is code
>>> where every single cycle counts ;)
>>
>> Ha, and in fact having checked a build out of curiosity, this patch
>> as-is actually stands to make things considerably worse. At least with
>> GCC 8.3 and bcm2835_defconfig, bcm2835_spi_interrupt_common() doesn't
>> get inlined, which means bcm2835_spi_interrupt() pushes/pops a stack
>> frame and makes an out-of-line call to bcm2835_spi_interrupt_common(),
>> resulting in massively *more* work than the extra two instructions of
>> simply inlining the test.
>>
>> So yes, the overhead of inlining the test vs. the alternative is indeed
>> non-zero. It's just also negative :D
> 
> Is it reliable across compiler versions if we use __always_inline?
> 
> The only other alternative that I can think of is using a static key to
> eliminate the test for the single controller case. This feels highly
> over engineered, but if that proves more reliable and gets everybody
> their cookie, why not.

Again, 2 cycles. The overhead of a static key alone is at least 50% of 
that. And that's not even considering whether the change in code layout 
caused by doubling up the IRQ handler might affect I-cache or branch 
predictor behaviour, where a single miss stands to more than wipe out 
any perceived saving. And all in code that has at least one obvious 
inefficiency left on the table either way.

This thread truly epitomises Knuth's "premature optimisation" quote... ;)

Robin.

> 
> Lukas, do you have any way to test with the conditional being present
> that the performance or latency does not suffer so much that it becomes
> unacceptable for your use cases?
>

Mark Brown June 8, 2020, 11:28 a.m. UTC | #14

On Mon, Jun 08, 2020 at 12:11:11PM +0100, Robin Murphy wrote:

> Again, 2 cycles. The overhead of a static key alone is at least 50% of that.
> And that's not even considering whether the change in code layout caused by
> doubling up the IRQ handler might affect I-cache or branch predictor
> behaviour, where a single miss stands to more than wipe out any perceived
> saving. And all in code that has at least one obvious inefficiency left on
> the table either way.

> This thread truly epitomises Knuth's "premature optimisation" quote... ;)

In fairness the main reason this driver is so heavily tuned already (and
has lead to some really nice improvements in the core) is that there are
a number of users hitting 100% CPU utilization driving SPI devices on
some of the older RPi hardware, IIRC around IIO type applications
mostly.  I do tend to agree that this particular optimization is a bit
marginal but there has been a lot of effort put into this.

Lukas Wunner June 8, 2020, 11:41 a.m. UTC | #15

On Mon, Jun 08, 2020 at 12:11:11PM +0100, Robin Murphy wrote:
> And all in code that has at least one obvious inefficiency left on
> the table either way.

Care to submit a patch to overcome that inefficiency?

> This thread truly epitomises Knuth's "premature optimisation" quote... ;)

The thread came about because it can be determined at compile time
whether the interrupt is going to be shared:

On the BCM2835 (Raspberry Pi 1), CONFIG_ARCH_MULTI_V6 is set and this
SoC doesn't have multiple bcm2835-spi instances, so no shared interrupt.

The question is how to discern BCM2836/BCM2837 (Raspberry Pi 2/3), which
do not have multiple instances, and BCM2711 (Raspberry Pi 4) which does.

The Raspberry Pi Foundation compiles BCM2711 kernels with CONFIG_ARM_LPAE=y,
but Florian considered that kludgy as a discriminator and opted for
runtime-detection via the compatible string instead.  If you've got
a better idea please come forward.

Is "optimize shared IRQ support away if IS_ENABLED(CONFIG_ARCH_MULTI_V6),
else leave it in" the best we can do?

Thanks,

Lukas

Florian Fainelli June 15, 2020, 4:34 p.m. UTC | #16

On 6/8/2020 4:28 AM, Mark Brown wrote:
> On Mon, Jun 08, 2020 at 12:11:11PM +0100, Robin Murphy wrote:
> 
>> Again, 2 cycles. The overhead of a static key alone is at least 50% of that.
>> And that's not even considering whether the change in code layout caused by
>> doubling up the IRQ handler might affect I-cache or branch predictor
>> behaviour, where a single miss stands to more than wipe out any perceived
>> saving. And all in code that has at least one obvious inefficiency left on
>> the table either way.
> 
>> This thread truly epitomises Knuth's "premature optimisation" quote... ;)
> 
> In fairness the main reason this driver is so heavily tuned already (and
> has lead to some really nice improvements in the core) is that there are
> a number of users hitting 100% CPU utilization driving SPI devices on
> some of the older RPi hardware, IIRC around IIO type applications
> mostly.  I do tend to agree that this particular optimization is a bit
> marginal but there has been a lot of effort put into this.

OK, so this has been dropped for spi/for-next right? How do we move from
there?

Mark Brown June 15, 2020, 5 p.m. UTC | #17

On Mon, Jun 15, 2020 at 09:34:58AM -0700, Florian Fainelli wrote:

> OK, so this has been dropped for spi/for-next right? How do we move from
> there?

Well, I actually have it queued up for applying so unless I pull it
before my scripts get that far through the stuff I queued over the merge
window it'll go in (I dropped it due to it not being a bugfix).  If it
were me I'd go with the two instruction hit from checking the flag TBH
but otherwise I guess __always_inline should work for compilers that
misoptimize.  None of this is getting in the way of the framework so if
everyone involved in the driver is happy to spend time optimising it
and dealing with the fragility then it's fine by me.

Florian Fainelli June 15, 2020, 5:04 p.m. UTC | #18

On 6/15/2020 10:00 AM, Mark Brown wrote:
> On Mon, Jun 15, 2020 at 09:34:58AM -0700, Florian Fainelli wrote:
> 
>> OK, so this has been dropped for spi/for-next right? How do we move from
>> there?
> 
> Well, I actually have it queued up for applying so unless I pull it
> before my scripts get that far through the stuff I queued over the merge
> window it'll go in (I dropped it due to it not being a bugfix).  If it
> were me I'd go with the two instruction hit from checking the flag TBH
> but otherwise I guess __always_inline should work for compilers that
> misoptimize.  None of this is getting in the way of the framework so if
> everyone involved in the driver is happy to spend time optimising it
> and dealing with the fragility then it's fine by me.

OK, how about I send you an increment patch (would a fixup be okay?)
that adds __always_inline since we know from this thread that some
compilers may mis-optimize the function inlining?

Mark Brown June 15, 2020, 5:30 p.m. UTC | #19

On Mon, Jun 15, 2020 at 10:04:46AM -0700, Florian Fainelli wrote:

> OK, how about I send you an increment patch (would a fixup be okay?)
> that adds __always_inline since we know from this thread that some
> compilers may mis-optimize the function inlining?

That's fine for me.

Robin Murphy June 15, 2020, 5:31 p.m. UTC | #20

On 2020-06-15 18:04, Florian Fainelli wrote:
> 
> 
> On 6/15/2020 10:00 AM, Mark Brown wrote:
>> On Mon, Jun 15, 2020 at 09:34:58AM -0700, Florian Fainelli wrote:
>>
>>> OK, so this has been dropped for spi/for-next right? How do we move from
>>> there?
>>
>> Well, I actually have it queued up for applying so unless I pull it
>> before my scripts get that far through the stuff I queued over the merge
>> window it'll go in (I dropped it due to it not being a bugfix).  If it
>> were me I'd go with the two instruction hit from checking the flag TBH
>> but otherwise I guess __always_inline should work for compilers that
>> misoptimize.  None of this is getting in the way of the framework so if
>> everyone involved in the driver is happy to spend time optimising it
>> and dealing with the fragility then it's fine by me.
> 
> OK, how about I send you an increment patch (would a fixup be okay?)
> that adds __always_inline since we know from this thread that some
> compilers may mis-optimize the function inlining?

Now that I've been inclined to go and look up the documentation, are we 
sure this so-very-contentious check is even correct? From my reading of 
things we're checking whether the RXR interrupt function is *enabled*, 
which still says nothing about whether either condition for the 
interrupt being *asserted* is true (RXR = 1 or DONE = 1). Thus if more 
than one SPI instance is active at once we could still end up trying to 
service an IRQ on a controller that didn't raise it.

Robin.

Robin Murphy June 15, 2020, 7:09 p.m. UTC | #21

On 2020-06-08 12:41, Lukas Wunner wrote:
> On Mon, Jun 08, 2020 at 12:11:11PM +0100, Robin Murphy wrote:
>> And all in code that has at least one obvious inefficiency left on
>> the table either way.
> 
> Care to submit a patch to overcome that inefficiency?

I'll have a quick go, but without any way to measure performance impact 
(or even test for correctness) I don't fancy going too deep based purely 
on disassembly and ARM11 cycle timings.

>> This thread truly epitomises Knuth's "premature optimisation" quote... ;)
> 
> The thread came about because it can be determined at compile time
> whether the interrupt is going to be shared:

...which is exactly my point - "because it can be" is anything but proof 
that avoiding a trivial check makes enough measurable difference to 
justify putting in the effort to do so.

> On the BCM2835 (Raspberry Pi 1), CONFIG_ARCH_MULTI_V6 is set and this
> SoC doesn't have multiple bcm2835-spi instances, so no shared interrupt.
> 
> The question is how to discern BCM2836/BCM2837 (Raspberry Pi 2/3), which
> do not have multiple instances, and BCM2711 (Raspberry Pi 4) which does.

Hmm, how much relative importance does that have? On a 700MHz ARM11 it's 
obviously desirable to spend as little time in the IRQ handler as 
possible in order to have time left to do anything else, but on the 
other SoCs even if the IRQ remains permanently asserted it can still 
only consume 25% of the available CPU capacity, at which point the 
impact of 2-3 cycles either way at 1GHz+ seems pretty much immeasurable.

> The Raspberry Pi Foundation compiles BCM2711 kernels with CONFIG_ARM_LPAE=y,
> but Florian considered that kludgy as a discriminator and opted for
> runtime-detection via the compatible string instead.  If you've got
> a better idea please come forward.
> 
> Is "optimize shared IRQ support away if IS_ENABLED(CONFIG_ARCH_MULTI_V6),
> else leave it in" the best we can do?

In all honesty I'm starting to think it seriously might be :)

Robin.

Mark Brown June 15, 2020, 7:26 p.m. UTC | #22

On Mon, Jun 15, 2020 at 06:31:58PM +0100, Robin Murphy wrote:

> Now that I've been inclined to go and look up the documentation, are we sure
> this so-very-contentious check is even correct? From my reading of things
> we're checking whether the RXR interrupt function is *enabled*, which still
> says nothing about whether either condition for the interrupt being
> *asserted* is true (RXR = 1 or DONE = 1). Thus if more than one SPI instance
> is active at once we could still end up trying to service an IRQ on a
> controller that didn't raise it.

OK, I've pulled the patch from the queue for now :/

Florian Fainelli June 15, 2020, 7:42 p.m. UTC | #23

On 6/15/2020 12:09 PM, Robin Murphy wrote:
> On 2020-06-08 12:41, Lukas Wunner wrote:
>> On Mon, Jun 08, 2020 at 12:11:11PM +0100, Robin Murphy wrote:
>>> And all in code that has at least one obvious inefficiency left on
>>> the table either way.
>>
>> Care to submit a patch to overcome that inefficiency?
> 
> I'll have a quick go, but without any way to measure performance impact
> (or even test for correctness) I don't fancy going too deep based purely
> on disassembly and ARM11 cycle timings.
> 
>>> This thread truly epitomises Knuth's "premature optimisation"
>>> quote... ;)
>>
>> The thread came about because it can be determined at compile time
>> whether the interrupt is going to be shared:
> 
> ...which is exactly my point - "because it can be" is anything but proof
> that avoiding a trivial check makes enough measurable difference to
> justify putting in the effort to do so.
> 
>> On the BCM2835 (Raspberry Pi 1), CONFIG_ARCH_MULTI_V6 is set and this
>> SoC doesn't have multiple bcm2835-spi instances, so no shared interrupt.
>>
>> The question is how to discern BCM2836/BCM2837 (Raspberry Pi 2/3), which
>> do not have multiple instances, and BCM2711 (Raspberry Pi 4) which does.
> 
> Hmm, how much relative importance does that have? On a 700MHz ARM11 it's
> obviously desirable to spend as little time in the IRQ handler as
> possible in order to have time left to do anything else, but on the
> other SoCs even if the IRQ remains permanently asserted it can still
> only consume 25% of the available CPU capacity, at which point the
> impact of 2-3 cycles either way at 1GHz+ seems pretty much immeasurable.
> 
>> The Raspberry Pi Foundation compiles BCM2711 kernels with
>> CONFIG_ARM_LPAE=y,
>> but Florian considered that kludgy as a discriminator and opted for
>> runtime-detection via the compatible string instead.  If you've got
>> a better idea please come forward.
>>
>> Is "optimize shared IRQ support away if IS_ENABLED(CONFIG_ARCH_MULTI_V6),
>> else leave it in" the best we can do?
> 
> In all honesty I'm starting to think it seriously might be :)

Or how about this: we slightly re-structure the interrupt handler such
that all possible interrupt conditions are explicitly handled and
terminate with a return IRQ_HANDLED, and those which are not, including
in the case of a "spurious" (because the interrupt was triggered for
another SPI controller instance), then we finish the function with
IRQ_NONE. This would not impact the performance for the BCM2835/36/37
which would still have a single controller/single interrupt line to handle.

Mark Brown June 15, 2020, 8:48 p.m. UTC | #24

On Mon, Jun 15, 2020 at 12:42:50PM -0700, Florian Fainelli wrote:

> Or how about this: we slightly re-structure the interrupt handler such
> that all possible interrupt conditions are explicitly handled and
> terminate with a return IRQ_HANDLED, and those which are not, including
> in the case of a "spurious" (because the interrupt was triggered for
> another SPI controller instance), then we finish the function with
> IRQ_NONE. This would not impact the performance for the BCM2835/36/37
> which would still have a single controller/single interrupt line to handle.

That seems sensible - it's generally a good way to structure interrupt
handlers.  It's almost there already.

[v2] spi: bcm2835: Enable shared interrupt support

Commit Message

Comments

Patch