Message ID | 20200604212819.715-1-f.fainelli@gmail.com (mailing list archive) |
---|---|
State | New, archived |
Headers | show |
Series | [v2] spi: bcm2835: Enable shared interrupt support | expand |
Hi Florian, Thanks for taking over this! On Thu, 2020-06-04 at 14:28 -0700, Florian Fainelli wrote: > The 4 SPI controller instances added in BCM2711 and BCM7211 SoCs (SPI3, > SPI4, SPI5 and SPI6) share the same interrupt line with SPI0. I think this isn't 100% correct. SPI0 has its own interrupt, but SPI[3-6] share the same interrupt. > For the BCM2835 case which is deemed performance critical, we would like > to continue using an interrupt handler which does not have the extra > comparison on BCM2835_SPI_CS_INTR. > > To support that requirement the common interrupt handling code between > the shared and non-shared interrupt paths is split into a > bcm2835_spi_interrupt_common() and both bcm2835_spi_interrupt() as well > as bcm2835_spi_shared_interrupt() make use of it. > > During probe, we determine if there is at least another instance of this > SPI controller, and if there is, then we install a shared interrupt > handler. As there was pushback to use a different compatible string for an otherwise identical IP, I think it's a good compromise. > > Signed-off-by: Florian Fainelli <f.fainelli@gmail.com> > --- > Changes in v2: > > - identify other available SPI nodes to determine if we need to set-up > interrupt sharing. This needs to happen for the very first instance > since we cannot know for the first instance whether interrupt sharing > is needed or not. > > drivers/spi/spi-bcm2835.c | 61 ++++++++++++++++++++++++++++++++------- > 1 file changed, 50 insertions(+), 11 deletions(-) > > diff --git a/drivers/spi/spi-bcm2835.c b/drivers/spi/spi-bcm2835.c > index 237bd306c268..0288b5b3de1e 100644 > --- a/drivers/spi/spi-bcm2835.c > +++ b/drivers/spi/spi-bcm2835.c > @@ -361,11 +361,10 @@ static void bcm2835_spi_reset_hw(struct spi_controller > *ctlr) > bcm2835_wr(bs, BCM2835_SPI_DLEN, 0); > } > > -static irqreturn_t bcm2835_spi_interrupt(int irq, void *dev_id) > +static inline irqreturn_t bcm2835_spi_interrupt_common(struct spi_controller > *ctlr, > + u32 cs) Keep in mind the new 100 character limit. > { > - struct spi_controller *ctlr = dev_id; > struct bcm2835_spi *bs = spi_controller_get_devdata(ctlr); > - u32 cs = bcm2835_rd(bs, BCM2835_SPI_CS); > > /* > * An interrupt is signaled either if DONE is set (TX FIFO empty) > @@ -394,6 +393,27 @@ static irqreturn_t bcm2835_spi_interrupt(int irq, void > *dev_id) > return IRQ_HANDLED; > } > > +static irqreturn_t bcm2835_spi_interrupt(int irq, void *dev_id) > +{ > + struct spi_controller *ctlr = dev_id; > + struct bcm2835_spi *bs = spi_controller_get_devdata(ctlr); > + u32 cs = bcm2835_rd(bs, BCM2835_SPI_CS); > + > + return bcm2835_spi_interrupt_common(ctlr, cs); > +} > + > +static irqreturn_t bcm2835_spi_shared_interrupt(int irq, void *dev_id) > +{ > + struct spi_controller *ctlr = dev_id; > + struct bcm2835_spi *bs = spi_controller_get_devdata(ctlr); > + u32 cs = bcm2835_rd(bs, BCM2835_SPI_CS); > + > + if (!(cs & BCM2835_SPI_CS_INTR)) > + return IRQ_NONE; > + > + return bcm2835_spi_interrupt_common(ctlr, cs); > +} > + > static int bcm2835_spi_transfer_one_irq(struct spi_controller *ctlr, > struct spi_device *spi, > struct spi_transfer *tfr, > @@ -1287,12 +1307,37 @@ static int bcm2835_spi_setup(struct spi_device *spi) > return 0; > } > > +static const struct of_device_id bcm2835_spi_match[] = { > + { .compatible = "brcm,bcm2835-spi", }, > + {} > +}; > +MODULE_DEVICE_TABLE(of, bcm2835_spi_match); > + > static int bcm2835_spi_probe(struct platform_device *pdev) > { > + irq_handler_t bcm2835_spi_isr_func = bcm2835_spi_interrupt; > struct spi_controller *ctlr; > + unsigned long flags = 0; > + struct device_node *dn; > struct bcm2835_spi *bs; > int err; > > + /* On BCM2711 there can be multiple SPI controllers enabled sharing the > + * same interrupt line, but we also want to minimize the overhead if > + * there is no need to support interrupt sharing. If we find at least > + * another available instane (not counting the one we are probed from), "instance" > + * then we assume that interrupt sharing is necessary. > + */ > + for_each_compatible_node(dn, NULL, bcm2835_spi_match[0].compatible) { > + err = of_device_is_available(dn) && dn != pdev->dev.of_node; nit: maybe err is not the ideal variable name here. > + of_node_put(dn); > + if (err) { > + flags = IRQF_SHARED; > + bcm2835_spi_isr_func = bcm2835_spi_shared_interrupt; > + break; > + } > + } > + > ctlr = spi_alloc_master(&pdev->dev, ALIGN(sizeof(*bs), > dma_get_cache_alignment())); > if (!ctlr) > @@ -1344,8 +1389,8 @@ static int bcm2835_spi_probe(struct platform_device > *pdev) > bcm2835_wr(bs, BCM2835_SPI_CS, > BCM2835_SPI_CS_CLEAR_RX | BCM2835_SPI_CS_CLEAR_TX); > > - err = devm_request_irq(&pdev->dev, bs->irq, bcm2835_spi_interrupt, 0, > - dev_name(&pdev->dev), ctlr); > + err = devm_request_irq(&pdev->dev, bs->irq, bcm2835_spi_isr_func, > + flags, dev_name(&pdev->dev), ctlr); > if (err) { > dev_err(&pdev->dev, "could not request IRQ: %d\n", err); > goto out_dma_release; > @@ -1400,12 +1445,6 @@ static void bcm2835_spi_shutdown(struct platform_device > *pdev) > dev_err(&pdev->dev, "failed to shutdown\n"); > } > > -static const struct of_device_id bcm2835_spi_match[] = { > - { .compatible = "brcm,bcm2835-spi", }, > - {} > -}; > -MODULE_DEVICE_TABLE(of, bcm2835_spi_match); > - > static struct platform_driver bcm2835_spi_driver = { > .driver = { > .name = DRV_NAME,
On Thu, Jun 04, 2020 at 02:28:19PM -0700, Florian Fainelli wrote: > The 4 SPI controller instances added in BCM2711 and BCM7211 SoCs (SPI3, > SPI4, SPI5 and SPI6) share the same interrupt line with SPI0. > > For the BCM2835 case which is deemed performance critical, we would like > to continue using an interrupt handler which does not have the extra > comparison on BCM2835_SPI_CS_INTR. > > To support that requirement the common interrupt handling code between > the shared and non-shared interrupt paths is split into a > bcm2835_spi_interrupt_common() and both bcm2835_spi_interrupt() as well > as bcm2835_spi_shared_interrupt() make use of it. > > During probe, we determine if there is at least another instance of this > SPI controller, and if there is, then we install a shared interrupt > handler. > > Signed-off-by: Florian Fainelli <f.fainelli@gmail.com> Reviewed-by: Lukas Wunner <lukas@wunner.de>
On Fri, Jun 05, 2020 at 10:46:57AM +0200, Nicolas Saenz Julienne wrote: > > -static irqreturn_t bcm2835_spi_interrupt(int irq, void *dev_id) > > +static inline irqreturn_t bcm2835_spi_interrupt_common(struct spi_controller > > *ctlr, > > + u32 cs) > Keep in mind the new 100 character limit. That's more about stopping people doing awful contortions to shut checkpatch up than saying that it's a particularly good idea to lengthen lines.
On Fri, 2020-06-05 at 10:46 +0200, Nicolas Saenz Julienne wrote: > Hi Florian, > Thanks for taking over this! > > On Thu, 2020-06-04 at 14:28 -0700, Florian Fainelli wrote: > > The 4 SPI controller instances added in BCM2711 and BCM7211 SoCs (SPI3, > > SPI4, SPI5 and SPI6) share the same interrupt line with SPI0. > > I think this isn't 100% correct. SPI0 has its own interrupt, but SPI[3-6] > share > the same interrupt. I'm wrong here, I missed this in bcm2711.dtsi: &spi { interrupts = <GIC_SPI 118 IRQ_TYPE_LEVEL_HIGH>; }; Sorry for the noise. Regards, Nicolas
On Thu, 4 Jun 2020 14:28:19 -0700, Florian Fainelli wrote: > The 4 SPI controller instances added in BCM2711 and BCM7211 SoCs (SPI3, > SPI4, SPI5 and SPI6) share the same interrupt line with SPI0. > > For the BCM2835 case which is deemed performance critical, we would like > to continue using an interrupt handler which does not have the extra > comparison on BCM2835_SPI_CS_INTR. > > [...] Applied to https://git.kernel.org/pub/scm/linux/kernel/git/broonie/spi.git for-next Thanks! [1/1] spi: bcm2835: Enable shared interrupt support commit: ecfbd3cf3b8bb73ac6a80ddf430b5912fd4402a6 All being well this means that it will be integrated into the linux-next tree (usually sometime in the next 24 hours) and sent to Linus during the next merge window (or sooner if it is a bug fix), however if problems are discovered then the patch may be dropped or reverted. You may get further e-mails resulting from automated or manual testing and review of the tree, please engage with people reporting problems and send followup patches addressing any issues that are reported if needed. If any updates are required or you are submitting further changes they should be sent as incremental updates against current git, existing patches will not be replaced. Please add any relevant lists and maintainers to the CCs when replying to this mail. Thanks, Mark
On 2020-06-04 22:28, Florian Fainelli wrote: > The 4 SPI controller instances added in BCM2711 and BCM7211 SoCs (SPI3, > SPI4, SPI5 and SPI6) share the same interrupt line with SPI0. > > For the BCM2835 case which is deemed performance critical, we would like > to continue using an interrupt handler which does not have the extra > comparison on BCM2835_SPI_CS_INTR. FWIW, if I'm reading the patch correctly, then with sensible codegen that "overhead" should amount to a bit test on a live register plus a not-taken conditional branch - according to the 1176 TRM that should add up to a whopping 2 cycles. If that's really significant then I'd have to wonder whether you want to be at the mercy of the whole generic IRQ stack at all, and should perhaps consider using FIQ instead. > To support that requirement the common interrupt handling code between > the shared and non-shared interrupt paths is split into a > bcm2835_spi_interrupt_common() and both bcm2835_spi_interrupt() as well > as bcm2835_spi_shared_interrupt() make use of it. > > During probe, we determine if there is at least another instance of this > SPI controller, and if there is, then we install a shared interrupt > handler. > > Signed-off-by: Florian Fainelli <f.fainelli@gmail.com> > --- > Changes in v2: > > - identify other available SPI nodes to determine if we need to set-up > interrupt sharing. This needs to happen for the very first instance > since we cannot know for the first instance whether interrupt sharing > is needed or not. > > drivers/spi/spi-bcm2835.c | 61 ++++++++++++++++++++++++++++++++------- > 1 file changed, 50 insertions(+), 11 deletions(-) > > diff --git a/drivers/spi/spi-bcm2835.c b/drivers/spi/spi-bcm2835.c > index 237bd306c268..0288b5b3de1e 100644 > --- a/drivers/spi/spi-bcm2835.c > +++ b/drivers/spi/spi-bcm2835.c > @@ -361,11 +361,10 @@ static void bcm2835_spi_reset_hw(struct spi_controller *ctlr) > bcm2835_wr(bs, BCM2835_SPI_DLEN, 0); > } > > -static irqreturn_t bcm2835_spi_interrupt(int irq, void *dev_id) > +static inline irqreturn_t bcm2835_spi_interrupt_common(struct spi_controller *ctlr, > + u32 cs) > { > - struct spi_controller *ctlr = dev_id; > struct bcm2835_spi *bs = spi_controller_get_devdata(ctlr); > - u32 cs = bcm2835_rd(bs, BCM2835_SPI_CS); > > /* > * An interrupt is signaled either if DONE is set (TX FIFO empty) > @@ -394,6 +393,27 @@ static irqreturn_t bcm2835_spi_interrupt(int irq, void *dev_id) > return IRQ_HANDLED; > } > > +static irqreturn_t bcm2835_spi_interrupt(int irq, void *dev_id) > +{ > + struct spi_controller *ctlr = dev_id; > + struct bcm2835_spi *bs = spi_controller_get_devdata(ctlr); > + u32 cs = bcm2835_rd(bs, BCM2835_SPI_CS); > + > + return bcm2835_spi_interrupt_common(ctlr, cs); > +} > + > +static irqreturn_t bcm2835_spi_shared_interrupt(int irq, void *dev_id) > +{ > + struct spi_controller *ctlr = dev_id; > + struct bcm2835_spi *bs = spi_controller_get_devdata(ctlr); > + u32 cs = bcm2835_rd(bs, BCM2835_SPI_CS); > + > + if (!(cs & BCM2835_SPI_CS_INTR)) > + return IRQ_NONE; > + > + return bcm2835_spi_interrupt_common(ctlr, cs); > +} > + > static int bcm2835_spi_transfer_one_irq(struct spi_controller *ctlr, > struct spi_device *spi, > struct spi_transfer *tfr, > @@ -1287,12 +1307,37 @@ static int bcm2835_spi_setup(struct spi_device *spi) > return 0; > } > > +static const struct of_device_id bcm2835_spi_match[] = { > + { .compatible = "brcm,bcm2835-spi", }, > + {} > +}; > +MODULE_DEVICE_TABLE(of, bcm2835_spi_match); > + > static int bcm2835_spi_probe(struct platform_device *pdev) > { > + irq_handler_t bcm2835_spi_isr_func = bcm2835_spi_interrupt; > struct spi_controller *ctlr; > + unsigned long flags = 0; > + struct device_node *dn; > struct bcm2835_spi *bs; > int err; > > + /* On BCM2711 there can be multiple SPI controllers enabled sharing the > + * same interrupt line, but we also want to minimize the overhead if > + * there is no need to support interrupt sharing. If we find at least > + * another available instane (not counting the one we are probed from), > + * then we assume that interrupt sharing is necessary. > + */ > + for_each_compatible_node(dn, NULL, bcm2835_spi_match[0].compatible) { > + err = of_device_is_available(dn) && dn != pdev->dev.of_node; > + of_node_put(dn); This is in the wrong place - it should only be where you terminate the loop early and thus bypass the "of_node_put(from)" call in the iterator itself. > + if (err) { > + flags = IRQF_SHARED; Is there really any harm to setting IRQF_SHARED even when the interrupt isn't shared in hardware? Sure, it means you lose a degree of API-level validation and some other driver might also be able to simultaneously claim it on bcm283x, but in that case the DT would be wrong and *something* isn't going to work correctly anyway, so does this one driver really need to care about trying to be the DT police? Robin. > + bcm2835_spi_isr_func = bcm2835_spi_shared_interrupt; > + break; > + } > + } > + > ctlr = spi_alloc_master(&pdev->dev, ALIGN(sizeof(*bs), > dma_get_cache_alignment())); > if (!ctlr) > @@ -1344,8 +1389,8 @@ static int bcm2835_spi_probe(struct platform_device *pdev) > bcm2835_wr(bs, BCM2835_SPI_CS, > BCM2835_SPI_CS_CLEAR_RX | BCM2835_SPI_CS_CLEAR_TX); > > - err = devm_request_irq(&pdev->dev, bs->irq, bcm2835_spi_interrupt, 0, > - dev_name(&pdev->dev), ctlr); > + err = devm_request_irq(&pdev->dev, bs->irq, bcm2835_spi_isr_func, > + flags, dev_name(&pdev->dev), ctlr); > if (err) { > dev_err(&pdev->dev, "could not request IRQ: %d\n", err); > goto out_dma_release; > @@ -1400,12 +1445,6 @@ static void bcm2835_spi_shutdown(struct platform_device *pdev) > dev_err(&pdev->dev, "failed to shutdown\n"); > } > > -static const struct of_device_id bcm2835_spi_match[] = { > - { .compatible = "brcm,bcm2835-spi", }, > - {} > -}; > -MODULE_DEVICE_TABLE(of, bcm2835_spi_match); > - > static struct platform_driver bcm2835_spi_driver = { > .driver = { > .name = DRV_NAME, >
On Fri, Jun 05, 2020 at 12:14:07PM +0100, Mark Brown wrote: > [1/1] spi: bcm2835: Enable shared interrupt support > commit: ecfbd3cf3b8bb73ac6a80ddf430b5912fd4402a6 Eh, sorry - this was me fat fingering another fix. At the very least this needs to wait for the end of the merge window.
On Fri, Jun 05, 2020 at 12:34:36PM +0100, Robin Murphy wrote: > On 2020-06-04 22:28, Florian Fainelli wrote: > > For the BCM2835 case which is deemed performance critical, we would like > > to continue using an interrupt handler which does not have the extra > > comparison on BCM2835_SPI_CS_INTR. > FWIW, if I'm reading the patch correctly, then with sensible codegen that > "overhead" should amount to a bit test on a live register plus a not-taken > conditional branch - according to the 1176 TRM that should add up to a > whopping 2 cycles. If that's really significant then I'd have to wonder > whether you want to be at the mercy of the whole generic IRQ stack at all, > and should perhaps consider using FIQ instead. Yes, and indeed the compiler does seem to manage that. It *is* non-zero overhead though.
On 2020-06-05 14:20, Mark Brown wrote: > On Fri, Jun 05, 2020 at 12:34:36PM +0100, Robin Murphy wrote: >> On 2020-06-04 22:28, Florian Fainelli wrote: > >>> For the BCM2835 case which is deemed performance critical, we would like >>> to continue using an interrupt handler which does not have the extra >>> comparison on BCM2835_SPI_CS_INTR. > >> FWIW, if I'm reading the patch correctly, then with sensible codegen that >> "overhead" should amount to a bit test on a live register plus a not-taken >> conditional branch - according to the 1176 TRM that should add up to a >> whopping 2 cycles. If that's really significant then I'd have to wonder >> whether you want to be at the mercy of the whole generic IRQ stack at all, >> and should perhaps consider using FIQ instead. > > Yes, and indeed the compiler does seem to manage that. It *is* non-zero > overhead though. True, but so's the existing level of pointer-chasing indirection that with some straightforward refactoring could be taken right out of the critical path and confined to just the conditional complete() call. That's the kind of thing leaving me unconvinced that this is code where every single cycle counts ;) Robin.
On 2020-06-05 14:46, Robin Murphy wrote: > On 2020-06-05 14:20, Mark Brown wrote: >> On Fri, Jun 05, 2020 at 12:34:36PM +0100, Robin Murphy wrote: >>> On 2020-06-04 22:28, Florian Fainelli wrote: >> >>>> For the BCM2835 case which is deemed performance critical, we would >>>> like >>>> to continue using an interrupt handler which does not have the extra >>>> comparison on BCM2835_SPI_CS_INTR. >> >>> FWIW, if I'm reading the patch correctly, then with sensible codegen >>> that >>> "overhead" should amount to a bit test on a live register plus a >>> not-taken >>> conditional branch - according to the 1176 TRM that should add up to a >>> whopping 2 cycles. If that's really significant then I'd have to wonder >>> whether you want to be at the mercy of the whole generic IRQ stack at >>> all, >>> and should perhaps consider using FIQ instead. >> >> Yes, and indeed the compiler does seem to manage that. It *is* non-zero >> overhead though. > > True, but so's the existing level of pointer-chasing indirection that > with some straightforward refactoring could be taken right out of the > critical path and confined to just the conditional complete() call. > That's the kind of thing leaving me unconvinced that this is code where > every single cycle counts ;) Ha, and in fact having checked a build out of curiosity, this patch as-is actually stands to make things considerably worse. At least with GCC 8.3 and bcm2835_defconfig, bcm2835_spi_interrupt_common() doesn't get inlined, which means bcm2835_spi_interrupt() pushes/pops a stack frame and makes an out-of-line call to bcm2835_spi_interrupt_common(), resulting in massively *more* work than the extra two instructions of simply inlining the test. So yes, the overhead of inlining the test vs. the alternative is indeed non-zero. It's just also negative :D Robin.
On Fri, Jun 05, 2020 at 03:41:27PM +0100, Robin Murphy wrote: > Ha, and in fact having checked a build out of curiosity, this patch as-is > actually stands to make things considerably worse. At least with GCC 8.3 and > bcm2835_defconfig, bcm2835_spi_interrupt_common() doesn't get inlined, which > means bcm2835_spi_interrupt() pushes/pops a stack frame and makes an > out-of-line call to bcm2835_spi_interrupt_common(), resulting in massively > *more* work than the extra two instructions of simply inlining the test. Whichever compiler I was using (clang-11 probably) did manage to inline the tail call so it really was the two instructions but yeah, in general this approach is going to be fragile. > So yes, the overhead of inlining the test vs. the alternative is indeed > non-zero. It's just also negative :D And variable!
On 6/5/2020 7:41 AM, Robin Murphy wrote: > On 2020-06-05 14:46, Robin Murphy wrote: >> On 2020-06-05 14:20, Mark Brown wrote: >>> On Fri, Jun 05, 2020 at 12:34:36PM +0100, Robin Murphy wrote: >>>> On 2020-06-04 22:28, Florian Fainelli wrote: >>> >>>>> For the BCM2835 case which is deemed performance critical, we would >>>>> like >>>>> to continue using an interrupt handler which does not have the extra >>>>> comparison on BCM2835_SPI_CS_INTR. >>> >>>> FWIW, if I'm reading the patch correctly, then with sensible codegen >>>> that >>>> "overhead" should amount to a bit test on a live register plus a >>>> not-taken >>>> conditional branch - according to the 1176 TRM that should add up to a >>>> whopping 2 cycles. If that's really significant then I'd have to wonder >>>> whether you want to be at the mercy of the whole generic IRQ stack >>>> at all, >>>> and should perhaps consider using FIQ instead. >>> >>> Yes, and indeed the compiler does seem to manage that. It *is* non-zero >>> overhead though. >> >> True, but so's the existing level of pointer-chasing indirection that >> with some straightforward refactoring could be taken right out of the >> critical path and confined to just the conditional complete() call. >> That's the kind of thing leaving me unconvinced that this is code >> where every single cycle counts ;) > > Ha, and in fact having checked a build out of curiosity, this patch > as-is actually stands to make things considerably worse. At least with > GCC 8.3 and bcm2835_defconfig, bcm2835_spi_interrupt_common() doesn't > get inlined, which means bcm2835_spi_interrupt() pushes/pops a stack > frame and makes an out-of-line call to bcm2835_spi_interrupt_common(), > resulting in massively *more* work than the extra two instructions of > simply inlining the test. > > So yes, the overhead of inlining the test vs. the alternative is indeed > non-zero. It's just also negative :D Is it reliable across compiler versions if we use __always_inline? The only other alternative that I can think of is using a static key to eliminate the test for the single controller case. This feels highly over engineered, but if that proves more reliable and gets everybody their cookie, why not. Lukas, do you have any way to test with the conditional being present that the performance or latency does not suffer so much that it becomes unacceptable for your use cases?
On 2020-06-05 23:04, Florian Fainelli wrote: > On 6/5/2020 7:41 AM, Robin Murphy wrote: >> On 2020-06-05 14:46, Robin Murphy wrote: >>> On 2020-06-05 14:20, Mark Brown wrote: >>>> On Fri, Jun 05, 2020 at 12:34:36PM +0100, Robin Murphy wrote: >>>>> On 2020-06-04 22:28, Florian Fainelli wrote: >>>> >>>>>> For the BCM2835 case which is deemed performance critical, we would >>>>>> like >>>>>> to continue using an interrupt handler which does not have the extra >>>>>> comparison on BCM2835_SPI_CS_INTR. >>>> >>>>> FWIW, if I'm reading the patch correctly, then with sensible codegen >>>>> that >>>>> "overhead" should amount to a bit test on a live register plus a >>>>> not-taken >>>>> conditional branch - according to the 1176 TRM that should add up to a >>>>> whopping 2 cycles. If that's really significant then I'd have to wonder >>>>> whether you want to be at the mercy of the whole generic IRQ stack >>>>> at all, >>>>> and should perhaps consider using FIQ instead. >>>> >>>> Yes, and indeed the compiler does seem to manage that. It *is* non-zero >>>> overhead though. >>> >>> True, but so's the existing level of pointer-chasing indirection that >>> with some straightforward refactoring could be taken right out of the >>> critical path and confined to just the conditional complete() call. >>> That's the kind of thing leaving me unconvinced that this is code >>> where every single cycle counts ;) >> >> Ha, and in fact having checked a build out of curiosity, this patch >> as-is actually stands to make things considerably worse. At least with >> GCC 8.3 and bcm2835_defconfig, bcm2835_spi_interrupt_common() doesn't >> get inlined, which means bcm2835_spi_interrupt() pushes/pops a stack >> frame and makes an out-of-line call to bcm2835_spi_interrupt_common(), >> resulting in massively *more* work than the extra two instructions of >> simply inlining the test. >> >> So yes, the overhead of inlining the test vs. the alternative is indeed >> non-zero. It's just also negative :D > > Is it reliable across compiler versions if we use __always_inline? > > The only other alternative that I can think of is using a static key to > eliminate the test for the single controller case. This feels highly > over engineered, but if that proves more reliable and gets everybody > their cookie, why not. Again, 2 cycles. The overhead of a static key alone is at least 50% of that. And that's not even considering whether the change in code layout caused by doubling up the IRQ handler might affect I-cache or branch predictor behaviour, where a single miss stands to more than wipe out any perceived saving. And all in code that has at least one obvious inefficiency left on the table either way. This thread truly epitomises Knuth's "premature optimisation" quote... ;) Robin. > > Lukas, do you have any way to test with the conditional being present > that the performance or latency does not suffer so much that it becomes > unacceptable for your use cases? >
On Mon, Jun 08, 2020 at 12:11:11PM +0100, Robin Murphy wrote: > Again, 2 cycles. The overhead of a static key alone is at least 50% of that. > And that's not even considering whether the change in code layout caused by > doubling up the IRQ handler might affect I-cache or branch predictor > behaviour, where a single miss stands to more than wipe out any perceived > saving. And all in code that has at least one obvious inefficiency left on > the table either way. > This thread truly epitomises Knuth's "premature optimisation" quote... ;) In fairness the main reason this driver is so heavily tuned already (and has lead to some really nice improvements in the core) is that there are a number of users hitting 100% CPU utilization driving SPI devices on some of the older RPi hardware, IIRC around IIO type applications mostly. I do tend to agree that this particular optimization is a bit marginal but there has been a lot of effort put into this.
On Mon, Jun 08, 2020 at 12:11:11PM +0100, Robin Murphy wrote: > And all in code that has at least one obvious inefficiency left on > the table either way. Care to submit a patch to overcome that inefficiency? > This thread truly epitomises Knuth's "premature optimisation" quote... ;) The thread came about because it can be determined at compile time whether the interrupt is going to be shared: On the BCM2835 (Raspberry Pi 1), CONFIG_ARCH_MULTI_V6 is set and this SoC doesn't have multiple bcm2835-spi instances, so no shared interrupt. The question is how to discern BCM2836/BCM2837 (Raspberry Pi 2/3), which do not have multiple instances, and BCM2711 (Raspberry Pi 4) which does. The Raspberry Pi Foundation compiles BCM2711 kernels with CONFIG_ARM_LPAE=y, but Florian considered that kludgy as a discriminator and opted for runtime-detection via the compatible string instead. If you've got a better idea please come forward. Is "optimize shared IRQ support away if IS_ENABLED(CONFIG_ARCH_MULTI_V6), else leave it in" the best we can do? Thanks, Lukas
On 6/8/2020 4:28 AM, Mark Brown wrote: > On Mon, Jun 08, 2020 at 12:11:11PM +0100, Robin Murphy wrote: > >> Again, 2 cycles. The overhead of a static key alone is at least 50% of that. >> And that's not even considering whether the change in code layout caused by >> doubling up the IRQ handler might affect I-cache or branch predictor >> behaviour, where a single miss stands to more than wipe out any perceived >> saving. And all in code that has at least one obvious inefficiency left on >> the table either way. > >> This thread truly epitomises Knuth's "premature optimisation" quote... ;) > > In fairness the main reason this driver is so heavily tuned already (and > has lead to some really nice improvements in the core) is that there are > a number of users hitting 100% CPU utilization driving SPI devices on > some of the older RPi hardware, IIRC around IIO type applications > mostly. I do tend to agree that this particular optimization is a bit > marginal but there has been a lot of effort put into this. OK, so this has been dropped for spi/for-next right? How do we move from there?
On Mon, Jun 15, 2020 at 09:34:58AM -0700, Florian Fainelli wrote: > OK, so this has been dropped for spi/for-next right? How do we move from > there? Well, I actually have it queued up for applying so unless I pull it before my scripts get that far through the stuff I queued over the merge window it'll go in (I dropped it due to it not being a bugfix). If it were me I'd go with the two instruction hit from checking the flag TBH but otherwise I guess __always_inline should work for compilers that misoptimize. None of this is getting in the way of the framework so if everyone involved in the driver is happy to spend time optimising it and dealing with the fragility then it's fine by me.
On 6/15/2020 10:00 AM, Mark Brown wrote: > On Mon, Jun 15, 2020 at 09:34:58AM -0700, Florian Fainelli wrote: > >> OK, so this has been dropped for spi/for-next right? How do we move from >> there? > > Well, I actually have it queued up for applying so unless I pull it > before my scripts get that far through the stuff I queued over the merge > window it'll go in (I dropped it due to it not being a bugfix). If it > were me I'd go with the two instruction hit from checking the flag TBH > but otherwise I guess __always_inline should work for compilers that > misoptimize. None of this is getting in the way of the framework so if > everyone involved in the driver is happy to spend time optimising it > and dealing with the fragility then it's fine by me. OK, how about I send you an increment patch (would a fixup be okay?) that adds __always_inline since we know from this thread that some compilers may mis-optimize the function inlining?
On Mon, Jun 15, 2020 at 10:04:46AM -0700, Florian Fainelli wrote: > OK, how about I send you an increment patch (would a fixup be okay?) > that adds __always_inline since we know from this thread that some > compilers may mis-optimize the function inlining? That's fine for me.
On 2020-06-15 18:04, Florian Fainelli wrote: > > > On 6/15/2020 10:00 AM, Mark Brown wrote: >> On Mon, Jun 15, 2020 at 09:34:58AM -0700, Florian Fainelli wrote: >> >>> OK, so this has been dropped for spi/for-next right? How do we move from >>> there? >> >> Well, I actually have it queued up for applying so unless I pull it >> before my scripts get that far through the stuff I queued over the merge >> window it'll go in (I dropped it due to it not being a bugfix). If it >> were me I'd go with the two instruction hit from checking the flag TBH >> but otherwise I guess __always_inline should work for compilers that >> misoptimize. None of this is getting in the way of the framework so if >> everyone involved in the driver is happy to spend time optimising it >> and dealing with the fragility then it's fine by me. > > OK, how about I send you an increment patch (would a fixup be okay?) > that adds __always_inline since we know from this thread that some > compilers may mis-optimize the function inlining? Now that I've been inclined to go and look up the documentation, are we sure this so-very-contentious check is even correct? From my reading of things we're checking whether the RXR interrupt function is *enabled*, which still says nothing about whether either condition for the interrupt being *asserted* is true (RXR = 1 or DONE = 1). Thus if more than one SPI instance is active at once we could still end up trying to service an IRQ on a controller that didn't raise it. Robin.
On 2020-06-08 12:41, Lukas Wunner wrote: > On Mon, Jun 08, 2020 at 12:11:11PM +0100, Robin Murphy wrote: >> And all in code that has at least one obvious inefficiency left on >> the table either way. > > Care to submit a patch to overcome that inefficiency? I'll have a quick go, but without any way to measure performance impact (or even test for correctness) I don't fancy going too deep based purely on disassembly and ARM11 cycle timings. >> This thread truly epitomises Knuth's "premature optimisation" quote... ;) > > The thread came about because it can be determined at compile time > whether the interrupt is going to be shared: ...which is exactly my point - "because it can be" is anything but proof that avoiding a trivial check makes enough measurable difference to justify putting in the effort to do so. > On the BCM2835 (Raspberry Pi 1), CONFIG_ARCH_MULTI_V6 is set and this > SoC doesn't have multiple bcm2835-spi instances, so no shared interrupt. > > The question is how to discern BCM2836/BCM2837 (Raspberry Pi 2/3), which > do not have multiple instances, and BCM2711 (Raspberry Pi 4) which does. Hmm, how much relative importance does that have? On a 700MHz ARM11 it's obviously desirable to spend as little time in the IRQ handler as possible in order to have time left to do anything else, but on the other SoCs even if the IRQ remains permanently asserted it can still only consume 25% of the available CPU capacity, at which point the impact of 2-3 cycles either way at 1GHz+ seems pretty much immeasurable. > The Raspberry Pi Foundation compiles BCM2711 kernels with CONFIG_ARM_LPAE=y, > but Florian considered that kludgy as a discriminator and opted for > runtime-detection via the compatible string instead. If you've got > a better idea please come forward. > > Is "optimize shared IRQ support away if IS_ENABLED(CONFIG_ARCH_MULTI_V6), > else leave it in" the best we can do? In all honesty I'm starting to think it seriously might be :) Robin.
On Mon, Jun 15, 2020 at 06:31:58PM +0100, Robin Murphy wrote: > Now that I've been inclined to go and look up the documentation, are we sure > this so-very-contentious check is even correct? From my reading of things > we're checking whether the RXR interrupt function is *enabled*, which still > says nothing about whether either condition for the interrupt being > *asserted* is true (RXR = 1 or DONE = 1). Thus if more than one SPI instance > is active at once we could still end up trying to service an IRQ on a > controller that didn't raise it. OK, I've pulled the patch from the queue for now :/
On 6/15/2020 12:09 PM, Robin Murphy wrote: > On 2020-06-08 12:41, Lukas Wunner wrote: >> On Mon, Jun 08, 2020 at 12:11:11PM +0100, Robin Murphy wrote: >>> And all in code that has at least one obvious inefficiency left on >>> the table either way. >> >> Care to submit a patch to overcome that inefficiency? > > I'll have a quick go, but without any way to measure performance impact > (or even test for correctness) I don't fancy going too deep based purely > on disassembly and ARM11 cycle timings. > >>> This thread truly epitomises Knuth's "premature optimisation" >>> quote... ;) >> >> The thread came about because it can be determined at compile time >> whether the interrupt is going to be shared: > > ...which is exactly my point - "because it can be" is anything but proof > that avoiding a trivial check makes enough measurable difference to > justify putting in the effort to do so. > >> On the BCM2835 (Raspberry Pi 1), CONFIG_ARCH_MULTI_V6 is set and this >> SoC doesn't have multiple bcm2835-spi instances, so no shared interrupt. >> >> The question is how to discern BCM2836/BCM2837 (Raspberry Pi 2/3), which >> do not have multiple instances, and BCM2711 (Raspberry Pi 4) which does. > > Hmm, how much relative importance does that have? On a 700MHz ARM11 it's > obviously desirable to spend as little time in the IRQ handler as > possible in order to have time left to do anything else, but on the > other SoCs even if the IRQ remains permanently asserted it can still > only consume 25% of the available CPU capacity, at which point the > impact of 2-3 cycles either way at 1GHz+ seems pretty much immeasurable. > >> The Raspberry Pi Foundation compiles BCM2711 kernels with >> CONFIG_ARM_LPAE=y, >> but Florian considered that kludgy as a discriminator and opted for >> runtime-detection via the compatible string instead. If you've got >> a better idea please come forward. >> >> Is "optimize shared IRQ support away if IS_ENABLED(CONFIG_ARCH_MULTI_V6), >> else leave it in" the best we can do? > > In all honesty I'm starting to think it seriously might be :) Or how about this: we slightly re-structure the interrupt handler such that all possible interrupt conditions are explicitly handled and terminate with a return IRQ_HANDLED, and those which are not, including in the case of a "spurious" (because the interrupt was triggered for another SPI controller instance), then we finish the function with IRQ_NONE. This would not impact the performance for the BCM2835/36/37 which would still have a single controller/single interrupt line to handle.
On Mon, Jun 15, 2020 at 12:42:50PM -0700, Florian Fainelli wrote: > Or how about this: we slightly re-structure the interrupt handler such > that all possible interrupt conditions are explicitly handled and > terminate with a return IRQ_HANDLED, and those which are not, including > in the case of a "spurious" (because the interrupt was triggered for > another SPI controller instance), then we finish the function with > IRQ_NONE. This would not impact the performance for the BCM2835/36/37 > which would still have a single controller/single interrupt line to handle. That seems sensible - it's generally a good way to structure interrupt handlers. It's almost there already.
diff --git a/drivers/spi/spi-bcm2835.c b/drivers/spi/spi-bcm2835.c index 237bd306c268..0288b5b3de1e 100644 --- a/drivers/spi/spi-bcm2835.c +++ b/drivers/spi/spi-bcm2835.c @@ -361,11 +361,10 @@ static void bcm2835_spi_reset_hw(struct spi_controller *ctlr) bcm2835_wr(bs, BCM2835_SPI_DLEN, 0); } -static irqreturn_t bcm2835_spi_interrupt(int irq, void *dev_id) +static inline irqreturn_t bcm2835_spi_interrupt_common(struct spi_controller *ctlr, + u32 cs) { - struct spi_controller *ctlr = dev_id; struct bcm2835_spi *bs = spi_controller_get_devdata(ctlr); - u32 cs = bcm2835_rd(bs, BCM2835_SPI_CS); /* * An interrupt is signaled either if DONE is set (TX FIFO empty) @@ -394,6 +393,27 @@ static irqreturn_t bcm2835_spi_interrupt(int irq, void *dev_id) return IRQ_HANDLED; } +static irqreturn_t bcm2835_spi_interrupt(int irq, void *dev_id) +{ + struct spi_controller *ctlr = dev_id; + struct bcm2835_spi *bs = spi_controller_get_devdata(ctlr); + u32 cs = bcm2835_rd(bs, BCM2835_SPI_CS); + + return bcm2835_spi_interrupt_common(ctlr, cs); +} + +static irqreturn_t bcm2835_spi_shared_interrupt(int irq, void *dev_id) +{ + struct spi_controller *ctlr = dev_id; + struct bcm2835_spi *bs = spi_controller_get_devdata(ctlr); + u32 cs = bcm2835_rd(bs, BCM2835_SPI_CS); + + if (!(cs & BCM2835_SPI_CS_INTR)) + return IRQ_NONE; + + return bcm2835_spi_interrupt_common(ctlr, cs); +} + static int bcm2835_spi_transfer_one_irq(struct spi_controller *ctlr, struct spi_device *spi, struct spi_transfer *tfr, @@ -1287,12 +1307,37 @@ static int bcm2835_spi_setup(struct spi_device *spi) return 0; } +static const struct of_device_id bcm2835_spi_match[] = { + { .compatible = "brcm,bcm2835-spi", }, + {} +}; +MODULE_DEVICE_TABLE(of, bcm2835_spi_match); + static int bcm2835_spi_probe(struct platform_device *pdev) { + irq_handler_t bcm2835_spi_isr_func = bcm2835_spi_interrupt; struct spi_controller *ctlr; + unsigned long flags = 0; + struct device_node *dn; struct bcm2835_spi *bs; int err; + /* On BCM2711 there can be multiple SPI controllers enabled sharing the + * same interrupt line, but we also want to minimize the overhead if + * there is no need to support interrupt sharing. If we find at least + * another available instane (not counting the one we are probed from), + * then we assume that interrupt sharing is necessary. + */ + for_each_compatible_node(dn, NULL, bcm2835_spi_match[0].compatible) { + err = of_device_is_available(dn) && dn != pdev->dev.of_node; + of_node_put(dn); + if (err) { + flags = IRQF_SHARED; + bcm2835_spi_isr_func = bcm2835_spi_shared_interrupt; + break; + } + } + ctlr = spi_alloc_master(&pdev->dev, ALIGN(sizeof(*bs), dma_get_cache_alignment())); if (!ctlr) @@ -1344,8 +1389,8 @@ static int bcm2835_spi_probe(struct platform_device *pdev) bcm2835_wr(bs, BCM2835_SPI_CS, BCM2835_SPI_CS_CLEAR_RX | BCM2835_SPI_CS_CLEAR_TX); - err = devm_request_irq(&pdev->dev, bs->irq, bcm2835_spi_interrupt, 0, - dev_name(&pdev->dev), ctlr); + err = devm_request_irq(&pdev->dev, bs->irq, bcm2835_spi_isr_func, + flags, dev_name(&pdev->dev), ctlr); if (err) { dev_err(&pdev->dev, "could not request IRQ: %d\n", err); goto out_dma_release; @@ -1400,12 +1445,6 @@ static void bcm2835_spi_shutdown(struct platform_device *pdev) dev_err(&pdev->dev, "failed to shutdown\n"); } -static const struct of_device_id bcm2835_spi_match[] = { - { .compatible = "brcm,bcm2835-spi", }, - {} -}; -MODULE_DEVICE_TABLE(of, bcm2835_spi_match); - static struct platform_driver bcm2835_spi_driver = { .driver = { .name = DRV_NAME,
The 4 SPI controller instances added in BCM2711 and BCM7211 SoCs (SPI3, SPI4, SPI5 and SPI6) share the same interrupt line with SPI0. For the BCM2835 case which is deemed performance critical, we would like to continue using an interrupt handler which does not have the extra comparison on BCM2835_SPI_CS_INTR. To support that requirement the common interrupt handling code between the shared and non-shared interrupt paths is split into a bcm2835_spi_interrupt_common() and both bcm2835_spi_interrupt() as well as bcm2835_spi_shared_interrupt() make use of it. During probe, we determine if there is at least another instance of this SPI controller, and if there is, then we install a shared interrupt handler. Signed-off-by: Florian Fainelli <f.fainelli@gmail.com> --- Changes in v2: - identify other available SPI nodes to determine if we need to set-up interrupt sharing. This needs to happen for the very first instance since we cannot know for the first instance whether interrupt sharing is needed or not. drivers/spi/spi-bcm2835.c | 61 ++++++++++++++++++++++++++++++++------- 1 file changed, 50 insertions(+), 11 deletions(-)