[RFC,v2,00/11] Optimize spi_sync path

Message ID: 20220615124634.3302867-1-david@protonic.nl

Message

David Jander June 15, 2022, 12:46 p.m. UTC
These patches optimize the spi_sync call for the common case that the
worker thread is idle and the queue is empty. They also open up the
possibility of further optimizing the async path, since it no longer
needs to take the direct sync path into account.
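
In a nutshell, the fast path boils down to something like the following
(simplified sketch for illustration only, not the literal patch code; the
two helpers are made up):

#include <linux/spi/spi.h>

/* Made-up helpers standing in for the real implementation details: */
static int __spi_transfer_message_direct(struct spi_controller *ctlr,
                                         struct spi_message *msg);
static int __spi_queue_and_wait(struct spi_device *spi,
                                struct spi_message *msg);

static int __spi_sync_sketch(struct spi_device *spi, struct spi_message *msg)
{
        struct spi_controller *ctlr = spi->controller;
        int ret;

        mutex_lock(&ctlr->io_mutex);
        if (list_empty(&ctlr->queue) && !ctlr->busy) {
                /* Worker idle and queue empty: do the transfer directly in
                 * the calling context, without a context switch to the
                 * worker thread. */
                ret = __spi_transfer_message_direct(ctlr, msg);
                mutex_unlock(&ctlr->io_mutex);
                return ret;
        }
        mutex_unlock(&ctlr->io_mutex);

        /* Otherwise take the normal queued path and wait for the worker. */
        return __spi_queue_and_wait(spi, msg);
}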

As an example for the performance gain, on an i.MX8MM SoC with a SPI CAN
controller attached (MCP2518FD), the time the interrupt line stays
active (which corresponds roughly with the time it takes to send 3
relatively short consecutive spi_sync messages) is reduced from 98us to
only 72us by this patch.

A note about message ordering:

This patch series should not change the behavior of message ordering when
coming from the same context. This means that if a client driver issues
one or more spi_async() messages immediately followed by a spi_sync()
message in the same context, it can still rely on these messages being
sent out in the order they were fired.
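
In other words, a client doing something like the following (hypothetical
driver snippet, payload made up) can still rely on the async message hitting
the bus before the sync one:

#include <linux/spi/spi.h>

static int queue_async_then_sync(struct spi_device *spi,
                                 struct spi_message *async_msg)
{
        u8 cmd[] = { 0xaa, 0x55 };      /* made-up payload */
        int ret;

        /* async_msg was prepared by the caller (spi_message_init() plus
         * spi_message_add_tail()) and has its own ->complete callback. */
        ret = spi_async(spi, async_msg);
        if (ret)
                return ret;

        /* spi_sync() (here via the spi_write() helper) from the same
         * context is still put on the bus after async_msg. */
        return spi_write(spi, cmd, sizeof(cmd));
}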

---
v2:
 - Avoid API change to spi_finalize_current_message
 - Fixed NULL-pointer dereference for drivers that rely on ctlr->cur_msg
 - Removed intentional printk() statement
 - Split out into small patches to document how code is morphed
 
David Jander (11):
  spi: Move ctlr->cur_msg_prepared to struct spi_message
  spi: Don't use the message queue if possible in spi_sync
  spi: Lock controller idling transition inside the io_mutex
  spi: __spi_pump_messages: Consolidate spin_unlocks to goto target
  spi: Remove check for controller idling in spi sync path
  spi: Remove check for idling in __spi_pump_messages()
  spi: Remove the now unused ctlr->idling flag
  spi: Remove unneeded READ_ONCE for ctlr->busy flag
  spi: Set ctlr->cur_msg also in the sync transfer case
  spi: Ensure the io_mutex is held until spi_finalize_current_message()
  spi: opportunistically skip ctlr->cur_msg_completion

 drivers/spi/spi.c       | 305 ++++++++++++++++++++++++----------------
 include/linux/spi/spi.h |  24 +++-
 2 files changed, 202 insertions(+), 127 deletions(-)

Comments

Marc Kleine-Budde June 15, 2022, 1:31 p.m. UTC | #1
On 15.06.2022 14:46:23, David Jander wrote:
> These patches optimize the spi_sync call for the common case that the
> worker thread is idle and the queue is empty. It also opens the
> possibility to potentially further optimize the async path also, since
> it doesn't need to take into account the direct sync path anymore.
> 
> As an example for the performance gain, on an i.MX8MM SoC with a SPI CAN
> controller attached (MCP2518FD), the time the interrupt line stays
> active (which corresponds roughly with the time it takes to send 3
> relatively short consecutive spi_sync messages) is reduced from 98us to
> only 72us by this patch.
> 
> A note about message ordering:
> 
> This patch series should not change the behavior of message ordering when
> coming from the same context. This means that if a client driver issues
> one or more spi_async() messages immediately followed by a spi_sync()
> message in the same context, it can still rely on these messages being
> sent out in the order they were fired.

Which git branch to use as the base?

Marc
David Jander June 15, 2022, 2:13 p.m. UTC | #2
On Wed, 15 Jun 2022 15:31:13 +0200
Marc Kleine-Budde <mkl@pengutronix.de> wrote:

> On 15.06.2022 14:46:23, David Jander wrote:
> > These patches optimize the spi_sync call for the common case that the
> > worker thread is idle and the queue is empty. It also opens the
> > possibility to potentially further optimize the async path also, since
> > it doesn't need to take into account the direct sync path anymore.
> > 
> > As an example for the performance gain, on an i.MX8MM SoC with a SPI CAN
> > controller attached (MCP2518FD), the time the interrupt line stays
> > active (which corresponds roughly with the time it takes to send 3
> > relatively short consecutive spi_sync messages) is reduced from 98us to
> > only 72us by this patch.
> > 
> > A note about message ordering:
> > 
> > This patch series should not change the behavior of message ordering when
> > coming from the same context. This means that if a client driver issues
> > one or more spi_async() messages immediately followed by a spi_sync()
> > message in the same context, it can still rely on these messages being
> > sent out in the order they were fired.  
> 
> Which git branch to use as the base?

Sorry, forgot to mention: spi for-next:
git://git.kernel.org/pub/scm/linux/kernel/git/broonie/spi.git

This is because it relies on previous patches that have been applied there.
Mark Brown June 16, 2022, 1:22 p.m. UTC | #3
On Wed, Jun 15, 2022 at 02:46:23PM +0200, David Jander wrote:
> These patches optimize the spi_sync call for the common case that the
> worker thread is idle and the queue is empty. It also opens the
> possibility to potentially further optimize the async path also, since
> it doesn't need to take into account the direct sync path anymore.

I've given this a first pass and it looks sensible so far - I'll need to
give it a more thorough look but I'd expect it should be fine.  The
numbers certainly look good.
David Jander June 16, 2022, 2:13 p.m. UTC | #4
On Thu, 16 Jun 2022 14:22:13 +0100
Mark Brown <broonie@kernel.org> wrote:

> On Wed, Jun 15, 2022 at 02:46:23PM +0200, David Jander wrote:
> > These patches optimize the spi_sync call for the common case that the
> > worker thread is idle and the queue is empty. It also opens the
> > possibility to potentially further optimize the async path also, since
> > it doesn't need to take into account the direct sync path anymore.  
> 
> I've given this a first pass and it looks sensible so far - I'll need to
> give it a more thorough look but I'd expect it should be fine.  The
> numbers certainly look good.

Thanks!
The current patch set probably needs to get partly squashed, since there are a
few patches that undo changes from a previous patch. I left them like this in
the hope of making the step-by-step transformation of the code clearer for review.

I had some doubts about patch 11, since it introduces 2 new members to struct
spi_controller. I was trying to keep the pollution down, but I couldn't find a
better way to do this optimization. Any suggestions? Maybe a better name/place
for these flags?
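
For reference, the idea those two flags implement is roughly the following
(heavily simplified sketch; the flag names here are made up, and I'm leaving
out the READ_ONCE()/WRITE_ONCE() and barrier details the real patch needs to
make this race-free):

/* In the sync fast path, with the io_mutex held: */
        ctlr->sync_msg_done = false;            /* made-up flag */
        ctlr->sync_needs_wakeup = false;        /* made-up flag */
        reinit_completion(&ctlr->cur_msg_completion);

        ret = ctlr->transfer_one_message(ctlr, msg);

        ctlr->sync_needs_wakeup = true;
        if (!ctlr->sync_msg_done)
                /* Controller completes asynchronously: only now do we sleep. */
                wait_for_completion(&ctlr->cur_msg_completion);

/* In spi_finalize_current_message(): */
        ctlr->sync_msg_done = true;
        if (ctlr->sync_needs_wakeup)
                complete(&ctlr->cur_msg_completion);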

Ideally this would get as much different hardware testing as possible before
going further upstream. Do you have access to some platforms suitable for
stressing SPI with multiple clients simultaneously? Known "problematic"
controllers maybe?

Looking forward to your comments.

Best regards,
Mark Brown June 16, 2022, 2:55 p.m. UTC | #5
On Thu, Jun 16, 2022 at 04:13:23PM +0200, David Jander wrote:
> Mark Brown <broonie@kernel.org> wrote:
> > On Wed, Jun 15, 2022 at 02:46:23PM +0200, David Jander wrote:

> > I've given this a first pass and it looks sensible so far - I'll need to
> > give it a more thorough look but I'd expect it should be fine.  The
> > numbers certainly look good.

> The current patch set probably needs to get partly squashed, since there are a
> few patches that undo changes from a previous patch. I left them like this in
> order to hopefully make the step by step mutation more clear for review.

Yes, there's a bit of stuff.  I think it's based off your previous
proposed patch too?

> I had some doubts about patch 11, since it introduces 2 new members to struct
> spi_controller. I was trying to keep the pollution down, but I couldn't find a
> better way to do this optimization. Any suggestions? Maybe a better name/place
> for these flags?

Not really - I'm not too concerned about individual flags since we don't
have so many SPI controllers in a system, it's not like it's a per task
overhead or similar.

> Ideally this would get as much different hardware testing as possible before
> going further upstream. Do you have access to some platforms suitable for
> stressing SPI with multiple clients simultaneously? Known "problematic"
> controllers maybe?

Well, the fastest way to get it into a wide range of CI is for me to
apply it so people who test -next will start covering it...  I was going
to kick it into my test branch in KernelCI once I've got it reviewed
properly, which will get at least some boot testing on a bunch of
platforms.

For testing, the main thing that'd be nice would probably be coverage of
controllers that don't block in transfer_one_message() and those that
complete in interrupt context while blocking in there.
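
The first kind looks roughly like the sketch below (a completely made-up
driver, just to show the shape): transfer_one_message() starts the hardware
and returns, and spi_finalize_current_message() is called later from the
interrupt handler.

#include <linux/interrupt.h>
#include <linux/spi/spi.h>

struct foo_spi {                        /* made-up driver state */
        struct spi_controller   *ctlr;
        struct spi_message      *cur_msg;
};

/* Hypothetical: program the hardware to clock out the whole message. */
static void foo_start_hw_transfer(struct foo_spi *fs, struct spi_message *msg);

static int foo_transfer_one_message(struct spi_controller *ctlr,
                                    struct spi_message *msg)
{
        struct foo_spi *fs = spi_controller_get_devdata(ctlr);

        fs->cur_msg = msg;
        foo_start_hw_transfer(fs, msg);
        return 0;               /* returns before the transfer has finished */
}

static irqreturn_t foo_spi_irq(int irq, void *dev_id)
{
        struct foo_spi *fs = dev_id;

        fs->cur_msg->status = 0;
        fs->cur_msg = NULL;
        /* The message is completed here, in interrupt context. */
        spi_finalize_current_message(fs->ctlr);
        return IRQ_HANDLED;
}

The second kind instead blocks inside transfer_one_message() on a
driver-local completion that its IRQ handler completes.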
David Jander June 16, 2022, 3:30 p.m. UTC | #6
On Thu, 16 Jun 2022 15:55:29 +0100
Mark Brown <broonie@kernel.org> wrote:

> On Thu, Jun 16, 2022 at 04:13:23PM +0200, David Jander wrote:
> > Mark Brown <broonie@kernel.org> wrote:  
> > > On Wed, Jun 15, 2022 at 02:46:23PM +0200, David Jander wrote:  
> 
> > > I've given this a first pass and it looks sensible so far - I'll need to
> > > give it a more thorough look but I'd expect it should be fine.  The
> > > numbers certainly look good.  
> 
> > The current patch set probably needs to get partly squashed, since there are a
> > few patches that undo changes from a previous patch. I left them like this in
> > order to hopefully make the step by step mutation more clear for review.  
> 
> Yes, there's a bit of stuff.  I think it's based off your previous
> proposed patch too?

Yes, in large part. I removed the API change, and all further optimizations and
improvements are done step by step on top of it, like your suggestion to
introduce the completion in __spi_pump_messages and the further optimizations
after that. Ideally I should have tried to split up patch 2 a bit more.

> > I had some doubts about patch 11, since it introduces 2 new members to struct
> > spi_controller. I was trying to keep the pollution down, but I couldn't find a
> > better way to do this optimization. Any suggestions? Maybe a better name/place
> > for these flags?  
> 
> Not really - I'm not too concerned about individual flags since we don't
> have so many SPI controllers in a system, it's not like it's a per task
> overhead or similar.

Ok, then we leave it as is. I was looking for a place that groups "private"
or "internal" struct members, but couldn't really find one. I guess it
wouldn't make much sense for SPI drivers to look at these anyway.

> > Ideally this would get as much different hardware testing as possible before
> > going further upstream. Do you have access to some platforms suitable for
> > stressing SPI with multiple clients simultaneously? Known "problematic"
> > controllers maybe?  
> 
> Well, the fastest way to get it into a wide range of CI is for me to
> apply it so people who test -next will start covering it...  I was going
> to kick it into my test branch KernelCI once I've got it reviewed
> properly which will get at least some boot testing on a bunch of
> platforms.

Ah, great. I will see if I can get it tested on some more platforms from
our side.

> For testing the main thing that'd be nice for testing would probably be
> coverage of controllers that don't block in transfer_one_message() and
> those that complete in interrupt context while blocking in there.

Ah, yes, that would be ideal. spi-pl022.c and spi-axi-spi-engine.c do this
AFAIK.
Also, if someone could make some independent performance comparisons of
before/after this series and the per-cpu stats patch, that would be very
interesting. I don't like people having to just take my word for the
gains ;-)

Best regards,
David Jander June 17, 2022, 12:08 p.m. UTC | #7
On Thu, 16 Jun 2022 17:30:03 +0200
David Jander <david@protonic.nl> wrote:
> [...]
> > > Ideally this would get as much different hardware testing as possible before
> > > going further upstream. Do you have access to some platforms suitable for
> > > stressing SPI with multiple clients simultaneously? Known "problematic"
> > > controllers maybe?    
> > 
> > Well, the fastest way to get it into a wide range of CI is for me to
> > apply it so people who test -next will start covering it...  I was going
> > to kick it into my test branch KernelCI once I've got it reviewed
> > properly which will get at least some boot testing on a bunch of
> > platforms.  
> 
> Ah, great. I will see if I can get it tested on some more other platforms from
> our side.

Ok, I have managed to test kernel 5.19-rc2 with this patch set on an i.MX6DL
machine. This is a 32-bit dual-core Arm Cortex-A9, but it has the same
spi-imx controller as my 64-bit board. It has two SPI peripherals, each
unfortunately on a separate bus: a tsc2046e touchscreen controller using
the IIO ADC backend driver, and a SPI NOR flash chip.
Everything works fine. Hammering the NOR flash and the touchscreen
simultaneously works as expected. But I have some interesting performance
figures for several different kernel configurations.

First of all, the spi-imx driver seems to perform really, really badly in DMA
mode in all cases. Probably something wrong with SDMA on this board, or the
transfers are too small for DMA? The driver has two module parameters: use_dma
and polling_limit_us. Disabling DMA gets the driver to "normal" performance
levels, and setting polling_limit_us to a very big value forces the driver to
work in polling mode, i.e. to not use IRQs. Since the system isn't doing
anything apart from these tests, forced polling mode is expected to give the
highest throughput, but could be considered cheating.

Test procedure:

First, create some continuous stress on the SPI bus:

$ (while true; do dd if=/dev/mtdblock2 of=/dev/null bs=1024; done) >/dev/null 2>&1 &

The histogram shows this:

$ for t in /sys/bus/spi/devices/spi0.0/statistics/transfer_bytes_histo_*; do printf "%s:\t%d\n" $t `cat $t`; done
/sys/bus/spi/devices/spi0.0/statistics/transfer_bytes_histo_0-1:        228583
/sys/bus/spi/devices/spi0.0/statistics/transfer_bytes_histo_1024-2047:  0
/sys/bus/spi/devices/spi0.0/statistics/transfer_bytes_histo_128-255:    1
/sys/bus/spi/devices/spi0.0/statistics/transfer_bytes_histo_16-31:      1
/sys/bus/spi/devices/spi0.0/statistics/transfer_bytes_histo_16384-32767:        0
/sys/bus/spi/devices/spi0.0/statistics/transfer_bytes_histo_2-3:        228579
/sys/bus/spi/devices/spi0.0/statistics/transfer_bytes_histo_2048-4095:  0
/sys/bus/spi/devices/spi0.0/statistics/transfer_bytes_histo_256-511:    0
/sys/bus/spi/devices/spi0.0/statistics/transfer_bytes_histo_32-63:      1
/sys/bus/spi/devices/spi0.0/statistics/transfer_bytes_histo_32768-65535:        0
/sys/bus/spi/devices/spi0.0/statistics/transfer_bytes_histo_4-7:        1
/sys/bus/spi/devices/spi0.0/statistics/transfer_bytes_histo_4096-8191:  0
/sys/bus/spi/devices/spi0.0/statistics/transfer_bytes_histo_512-1023:   228576
/sys/bus/spi/devices/spi0.0/statistics/transfer_bytes_histo_64-127:     0
/sys/bus/spi/devices/spi0.0/statistics/transfer_bytes_histo_65536+:     0
/sys/bus/spi/devices/spi0.0/statistics/transfer_bytes_histo_8-15:       0
/sys/bus/spi/devices/spi0.0/statistics/transfer_bytes_histo_8192-16383: 0

There are no spi_async() transfers involved. The spi-nor driver apparently
only issues sync calls in this scenario, which makes sense. Note that even
though I specified "bs=1024" to the dd command, the spi-nor driver seems to
want to do 512 byte transfers anyway. Making "bs" very small doesn't change
this either, so unfortunately we cannot simulate many very small transfers
this way. This means that the performance impact is likely very limited in
this scenario, since the per-transfer overhead is small compared to the size
of each transfer.

The data below shows the CPU load distribution while running the above command
(this is the topmost line of the "top" tool I have on this board) and the
throughput of a single call to dd:

Tested kernel builds:

 1. Kernel 5.19-rc2 vanilla
 1.1. use_dma = true (default):

CPU:  0.4% usr 20.4% sys  0.0% nic  0.0% idle 77.5% io  0.0% irq  1.5% sirq
3604480 bytes (3.4MB) copied, 3.685375 seconds, 955.1KB/s

 1.2. use_dma = false:

CPU:  0.5% usr 55.9% sys  0.0% nic  0.0% idle 43.2% io  0.0% irq  0.2% sirq
3604480 bytes (3.4MB) copied, 2.017914 seconds, 1.7MB/s

 1.3. forced polling:

CPU:  0.3% usr 99.4% sys  0.0% nic  0.0% idle  0.0% io  0.0% irq  0.1% sirq
3604480 bytes (3.4MB) copied, 1.330517 seconds, 2.6MB/s

 2. Kernel 5.19-rc2 with only the per-cpu stat patch:
 2.1. use_dma = true (default):

CPU:  0.2% usr 20.1% sys  0.0% nic  0.0% idle 78.3% io  0.0% irq  1.3% sirq
3604480 bytes (3.4MB) copied, 3.682393 seconds, 955.9KB/s

 2.2. use_dma = false:

CPU:  0.5% usr 56.8% sys  0.0% nic  0.0% idle 42.3% io  0.0% irq  0.2% sirq
3604480 bytes (3.4MB) copied, 2.015509 seconds, 1.7MB/s

 2.3. forced polling:

CPU:  0.3% usr 99.4% sys  0.0% nic  0.0% idle  0.0% io  0.0% irq  0.1% sirq
3604480 bytes (3.4MB) copied, 1.319247 seconds, 2.6MB/s

 3. Kernel 5.19-rc2 with both per-cpu stats and optimized sync path patches:
 3.1. use_dma = true (default):

CPU:  0.0% usr 17.8% sys  0.0% nic  0.0% idle 81.0% io  0.0% irq  1.1% sirq
3604480 bytes (3.4MB) copied, 3.650646 seconds, 964.2KB/s

 3.2. use_dma = false:

CPU:  0.6% usr 43.8% sys  0.0% nic  0.0% idle 55.2% io  0.0% irq  0.3% sirq
3604480 bytes (3.4MB) copied, 1.971478 seconds, 1.7MB/s

 3.3. forced polling:

CPU:  0.3% usr 99.6% sys  0.0% nic  0.0% idle  0.0% io  0.0% irq  0.0% sirq
3604480 bytes (3.4MB) copied, 1.314353 seconds, 2.6MB/s

Of course I measured each value several times. Results were very consistent,
and I picked the median result for each of the cases above.
The differences are small, as expected given that the data transfers are
rather big, but especially if you look at the "seconds" value returned by
"dd", there is an incremental improvement for each of the patches. The overall
improvement seems to be around 1.5-2% less CPU overhead... which IMHO is
still remarkable given how insensitive this test is to transfer overhead.

Testing the impact on the touchscreen driver doesn't make much sense though,
since it runs at a 1MHz clock rate and is already optimized to do only very
large single SPI transfers, so any overhead difference will be lost in the noise.

Will report more results if I can test on other platforms.

Best regards,
Mark Brown June 20, 2022, 6:15 p.m. UTC | #8
On Wed, Jun 15, 2022 at 04:13:56PM +0200, David Jander wrote:
> Marc Kleine-Budde <mkl@pengutronix.de> wrote:

> > Which git branch to use as the base?

> Sorry, forgot to mention: spi for-next:
> git://git.kernel.org/pub/scm/linux/kernel/git/broonie/spi.git

> This is because it relies on previous patches that have been applied there.

It looks like something changed underneath the series unfortunately -
I'm getting failures applying even the first patch.  Can you rebase
please?  It looks good to start trying to throw at CI, ideally we'd get
more on list review but most of the active work recently has been around
the MTD stuff which is all about trying to use hardware offload for
flash operations and therefore should be minimally affected by the
actual SPI data path.
David Jander June 21, 2022, 6:15 a.m. UTC | #9
On Mon, 20 Jun 2022 19:15:13 +0100
Mark Brown <broonie@kernel.org> wrote:

> On Wed, Jun 15, 2022 at 04:13:56PM +0200, David Jander wrote:
> > Marc Kleine-Budde <mkl@pengutronix.de> wrote:  
> 
> > > Which git branch to use as the base?  
> 
> > Sorry, forgot to mention: spi for-next:
> > git://git.kernel.org/pub/scm/linux/kernel/git/broonie/spi.git  
> 
> > This is because it relies on previous patches that have been applied there.  
> 
> It looks like something changed underneath the series unfortunately -
> I'm getting failures applying even the first patch.  Can you rebase
> please?  It looks good to start trying to throw at CI, ideally we'd get
> more on list review but most of the active work recently has been around
> the MTD stuff which is all about trying to use hardware offload for
> flash operations and therefore should be minimally affected by the
> actual SPI data path.

I just rebased v3 onto the current spi for-next branch. No other changes.

Best regards,