Message ID | 20220615124634.3302867-1-david@protonic.nl (mailing list archive)
---|---
Series | Optimize spi_sync path
On 15.06.2022 14:46:23, David Jander wrote:
> These patches optimize the spi_sync call for the common case that the
> worker thread is idle and the queue is empty. It also opens the
> possibility to potentially further optimize the async path also, since
> it doesn't need to take into account the direct sync path anymore.
>
> As an example for the performance gain, on an i.MX8MM SoC with a SPI CAN
> controller attached (MCP2518FD), the time the interrupt line stays
> active (which corresponds roughly with the time it takes to send 3
> relatively short consecutive spi_sync messages) is reduced from 98us to
> only 72us by this patch.
>
> A note about message ordering:
>
> This patch series should not change the behavior of message ordering when
> coming from the same context. This means that if a client driver issues
> one or more spi_async() messages immediately followed by a spi_sync()
> message in the same context, it can still rely on these messages being
> sent out in the order they were fired.

Which git branch to use as the base?

Marc
On Wed, 15 Jun 2022 15:31:13 +0200
Marc Kleine-Budde <mkl@pengutronix.de> wrote:

> On 15.06.2022 14:46:23, David Jander wrote:
> > These patches optimize the spi_sync call for the common case that the
> > worker thread is idle and the queue is empty. It also opens the
> > possibility to potentially further optimize the async path also, since
> > it doesn't need to take into account the direct sync path anymore.
> >
> > As an example for the performance gain, on an i.MX8MM SoC with a SPI CAN
> > controller attached (MCP2518FD), the time the interrupt line stays
> > active (which corresponds roughly with the time it takes to send 3
> > relatively short consecutive spi_sync messages) is reduced from 98us to
> > only 72us by this patch.
> >
> > A note about message ordering:
> >
> > This patch series should not change the behavior of message ordering when
> > coming from the same context. This means that if a client driver issues
> > one or more spi_async() messages immediately followed by a spi_sync()
> > message in the same context, it can still rely on these messages being
> > sent out in the order they were fired.
>
> Which git branch to use as the base?

Sorry, forgot to mention: spi for-next:

  git://git.kernel.org/pub/scm/linux/kernel/git/broonie/spi.git

This is because it relies on previous patches that have been applied there.
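To make the ordering note from the cover letter concrete: a client driver
that mixes the two calls could look roughly like the sketch below. This is
an illustration only, assuming a hypothetical demo driver (the demo_* names
and the private struct are made up); spi_async(), spi_sync() and the message
helpers are the real client API.

/*
 * Illustration only: the demo_* names and the driver-private struct are
 * hypothetical, but spi_async(), spi_sync() and the message helpers are
 * the real client API.
 */
#include <linux/spi/spi.h>
#include <linux/string.h>

struct demo_priv {
	struct spi_device *spi;
	/* async message and buffer must stay valid until the completion
	 * callback has run; priv is assumed to be kzalloc()'d so the
	 * buffer is DMA-safe */
	struct spi_message async_msg;
	struct spi_transfer async_xfer;
	u8 async_buf[8];
};

/* called by the SPI core once the async message has finished */
static void demo_async_done(void *context)
{
}

static int demo_send(struct demo_priv *priv, void *sync_buf, size_t len)
{
	struct spi_transfer sync_xfer = {
		.tx_buf = sync_buf,	/* caller-provided, DMA-safe */
		.len = len,
	};
	struct spi_message sync_msg;
	int ret;

	/* 1) queue an asynchronous message */
	memset(&priv->async_xfer, 0, sizeof(priv->async_xfer));
	priv->async_xfer.tx_buf = priv->async_buf;
	priv->async_xfer.len = sizeof(priv->async_buf);
	spi_message_init_with_transfers(&priv->async_msg,
					&priv->async_xfer, 1);
	priv->async_msg.complete = demo_async_done;
	priv->async_msg.context = priv;
	ret = spi_async(priv->spi, &priv->async_msg);
	if (ret)
		return ret;

	/*
	 * 2) synchronous message from the same context: per the cover
	 * letter it is still expected to hit the bus after the async
	 * message above, fast path or not.
	 */
	spi_message_init_with_transfers(&sync_msg, &sync_xfer, 1);
	return spi_sync(priv->spi, &sync_msg);
}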
On Wed, Jun 15, 2022 at 02:46:23PM +0200, David Jander wrote:
> These patches optimize the spi_sync call for the common case that the
> worker thread is idle and the queue is empty. It also opens the
> possibility to potentially further optimize the async path also, since
> it doesn't need to take into account the direct sync path anymore.

I've given this a first pass and it looks sensible so far - I'll need to
give it a more thorough look but I'd expect it should be fine. The
numbers certainly look good.
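The core idea under review here is that spi_sync() can skip the message
queue and the worker kthread entirely when nothing is in flight. The sketch
below is purely conceptual and heavily simplified: the sketch_* helpers are
hypothetical stand-ins, not the functions the patches add to
drivers/spi/spi.c, and locking, bus_lock handling and statistics are
elided.

/*
 * Purely conceptual sketch of that fast path, not the code in the
 * patches: the sketch_* helpers are hypothetical stand-ins, and the
 * real implementation in drivers/spi/spi.c also has to deal with
 * locking, bus_lock handling and statistics.
 */
#include <linux/spi/spi.h>
#include <linux/list.h>
#include <linux/errno.h>

/* hypothetical stand-in: hand the message to the hardware right here,
 * in the caller's context (body elided) */
static int sketch_transfer_here(struct spi_controller *ctlr,
				struct spi_message *msg)
{
	return -ENOSYS;
}

/* hypothetical stand-in for the pre-existing queued path: queue the
 * message for the worker kthread and wait for it (body elided) */
static int sketch_queue_and_wait(struct spi_controller *ctlr,
				 struct spi_message *msg)
{
	return -ENOSYS;
}

static int sketch_spi_sync(struct spi_controller *ctlr,
			   struct spi_message *msg)
{
	/*
	 * Common case: the worker kthread is idle and nothing is queued,
	 * so run the transfer directly in the calling context and skip
	 * the queue/kthread round trip and its context switches. (The
	 * real code makes this check race-free; the raw field access
	 * here is only for illustration.)
	 */
	if (!ctlr->busy && list_empty(&ctlr->queue))
		return sketch_transfer_here(ctlr, msg);

	/* Otherwise fall back to the normal queued path and wait. */
	return sketch_queue_and_wait(ctlr, msg);
}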
On Thu, 16 Jun 2022 14:22:13 +0100
Mark Brown <broonie@kernel.org> wrote:

> On Wed, Jun 15, 2022 at 02:46:23PM +0200, David Jander wrote:
> > These patches optimize the spi_sync call for the common case that the
> > worker thread is idle and the queue is empty. It also opens the
> > possibility to potentially further optimize the async path also, since
> > it doesn't need to take into account the direct sync path anymore.
>
> I've given this a first pass and it looks sensible so far - I'll need to
> give it a more thorough look but I'd expect it should be fine. The
> numbers certainly look good.

Thanks! The current patch set probably needs to get partly squashed, since
there are a few patches that undo changes from a previous patch. I left
them like this in order to hopefully make the step by step mutation more
clear for review.

I had some doubts about patch 11, since it introduces 2 new members to
struct spi_controller. I was trying to keep the pollution down, but I
couldn't find a better way to do this optimization. Any suggestions?
Maybe a better name/place for these flags?

Ideally this would get as much different hardware testing as possible
before going further upstream. Do you have access to some platforms
suitable for stressing SPI with multiple clients simultaneously? Known
"problematic" controllers maybe?

Looking forward to your comments.

Best regards,
On Thu, Jun 16, 2022 at 04:13:23PM +0200, David Jander wrote:
> Mark Brown <broonie@kernel.org> wrote:
> > On Wed, Jun 15, 2022 at 02:46:23PM +0200, David Jander wrote:
> > I've given this a first pass and it looks sensible so far - I'll need to
> > give it a more thorough look but I'd expect it should be fine. The
> > numbers certainly look good.

> The current patch set probably needs to get partly squashed, since there are a
> few patches that undo changes from a previous patch. I left them like this in
> order to hopefully make the step by step mutation more clear for review.

Yes, there's a bit of stuff. I think it's based off your previous
proposed patch too?

> I had some doubts about patch 11, since it introduces 2 new members to struct
> spi_controller. I was trying to keep the pollution down, but I couldn't find a
> better way to do this optimization. Any suggestions? Maybe a better name/place
> for these flags?

Not really - I'm not too concerned about individual flags since we don't
have so many SPI controllers in a system, it's not like it's a per task
overhead or similar.

> Ideally this would get as much different hardware testing as possible before
> going further upstream. Do you have access to some platforms suitable for
> stressing SPI with multiple clients simultaneously? Known "problematic"
> controllers maybe?

Well, the fastest way to get it into a wide range of CI is for me to
apply it so people who test -next will start covering it... I was going
to kick it into my test branch KernelCI once I've got it reviewed
properly which will get at least some boot testing on a bunch of
platforms.

The main thing that'd be nice for testing would probably be coverage of
controllers that don't block in transfer_one_message() and those that
complete in interrupt context while blocking in there.
On Thu, 16 Jun 2022 15:55:29 +0100
Mark Brown <broonie@kernel.org> wrote:

> On Thu, Jun 16, 2022 at 04:13:23PM +0200, David Jander wrote:
> > Mark Brown <broonie@kernel.org> wrote:
> > > On Wed, Jun 15, 2022 at 02:46:23PM +0200, David Jander wrote:
> > > I've given this a first pass and it looks sensible so far - I'll need to
> > > give it a more thorough look but I'd expect it should be fine. The
> > > numbers certainly look good.
>
> > The current patch set probably needs to get partly squashed, since there are a
> > few patches that undo changes from a previous patch. I left them like this in
> > order to hopefully make the step by step mutation more clear for review.
>
> Yes, there's a bit of stuff. I think it's based off your previous
> proposed patch too?

Yes, in large part. I removed the API change, and all further
optimizations and improvements are done step by step on top, like your
suggestion to introduce the completion in __pump_messages and after that
optimizing it further. Ideally I should maybe have tried to split up
patch 2 a bit more.

> > I had some doubts about patch 11, since it introduces 2 new members to struct
> > spi_controller. I was trying to keep the pollution down, but I couldn't find a
> > better way to do this optimization. Any suggestions? Maybe a better name/place
> > for these flags?
>
> Not really - I'm not too concerned about individual flags since we don't
> have so many SPI controllers in a system, it's not like it's a per task
> overhead or similar.

Ok, then we leave it as is. I was looking for a place that grouped
"private" or "internal" struct members, but couldn't find one really.
SPI drivers looking at these wouldn't make sense I guess.

> > Ideally this would get as much different hardware testing as possible before
> > going further upstream. Do you have access to some platforms suitable for
> > stressing SPI with multiple clients simultaneously? Known "problematic"
> > controllers maybe?
>
> Well, the fastest way to get it into a wide range of CI is for me to
> apply it so people who test -next will start covering it... I was going
> to kick it into my test branch KernelCI once I've got it reviewed
> properly which will get at least some boot testing on a bunch of
> platforms.

Ah, great. I will see if I can get it tested on some other platforms from
our side.

> The main thing that'd be nice for testing would probably be coverage of
> controllers that don't block in transfer_one_message() and those that
> complete in interrupt context while blocking in there.

Ah, yes, that would be ideal. spi-pl022.c and spi-axi-spi-engine.c do
this AFAIK.

Also, if someone could make some independent performance comparisons of
before/after this series and the per-cpu stats patch, that would be very
interesting. I don't like people having to take my word for the gains ;-)

Best regards,
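The two controller completion models discussed above might look roughly
like this in a driver. The demo_* names and the hardware access are
hypothetical; transfer_one_message() and spi_finalize_current_message() are
the real spi_controller hooks, and the distinction is whether the message
is finalized before transfer_one_message() returns or later from the
interrupt handler.

/*
 * Rough sketch only: the demo_* names and the hardware access are
 * hypothetical, while transfer_one_message() and
 * spi_finalize_current_message() are the real spi_controller hooks.
 */
#include <linux/spi/spi.h>
#include <linux/interrupt.h>
#include <linux/completion.h>

struct demo_ctlr_priv {
	struct spi_controller *ctlr;
	struct spi_message *cur_msg;
	struct completion done;
};

/* hypothetical: program the FIFOs/DMA and enable the interrupt */
static void demo_hw_start(struct demo_ctlr_priv *priv,
			  struct spi_message *msg)
{
}

/* Model A: block inside transfer_one_message() until the IRQ handler
 * signals completion, then finalize before returning. */
static int demo_transfer_one_message_blocking(struct spi_controller *ctlr,
					      struct spi_message *msg)
{
	struct demo_ctlr_priv *priv = spi_controller_get_devdata(ctlr);

	reinit_completion(&priv->done);
	demo_hw_start(priv, msg);
	wait_for_completion(&priv->done);	/* completed from the IRQ */
	msg->status = 0;
	spi_finalize_current_message(ctlr);
	return 0;
}

/* Model B: return immediately and let the IRQ handler finalize the
 * message later, outside transfer_one_message(). */
static int demo_transfer_one_message_nonblocking(struct spi_controller *ctlr,
						 struct spi_message *msg)
{
	struct demo_ctlr_priv *priv = spi_controller_get_devdata(ctlr);

	priv->cur_msg = msg;
	demo_hw_start(priv, msg);
	return 0;		/* message completes in interrupt context */
}

static irqreturn_t demo_irq(int irq, void *dev_id)
{
	struct demo_ctlr_priv *priv = dev_id;

	if (priv->cur_msg) {			/* model B */
		priv->cur_msg->status = 0;
		priv->cur_msg = NULL;
		spi_finalize_current_message(priv->ctlr);
	} else {				/* model A */
		complete(&priv->done);
	}
	return IRQ_HANDLED;
}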
On Thu, 16 Jun 2022 17:30:03 +0200
David Jander <david@protonic.nl> wrote:

> [...]
> > > Ideally this would get as much different hardware testing as possible before
> > > going further upstream. Do you have access to some platforms suitable for
> > > stressing SPI with multiple clients simultaneously? Known "problematic"
> > > controllers maybe?
> >
> > Well, the fastest way to get it into a wide range of CI is for me to
> > apply it so people who test -next will start covering it... I was going
> > to kick it into my test branch KernelCI once I've got it reviewed
> > properly which will get at least some boot testing on a bunch of
> > platforms.
>
> Ah, great. I will see if I can get it tested on some other platforms from
> our side.

Ok, I have managed to test kernel 5.19-rc2 with this patch set on an
i.MX6DL machine. This is an ARM Cortex-A9 dual core (32 bit), but it has
the same spi-imx controller as my 64-bit board. It has two SPI
peripherals, each on a separate bus unfortunately: a tsc2046e
touchscreen controller using the IIO ADC backend driver, and a SPI NOR
flash chip.

Everything works fine. Hammering the NOR flash and the touchscreen
simultaneously works as expected. But I have some interesting
performance figures for several different kernel configurations.

First of all, the spi-imx driver seems to perform really, really badly
in DMA mode in all cases. Probably something wrong with SDMA on this
board, or the transfers are too small for DMA? The driver has two module
parameters: use_dma and polling_limit_us. Disabling DMA gets the driver
to "normal" performance levels, and setting polling_limit_us to a very
big value forces the driver to work in polling mode, i.e. to not use
IRQs. Since the system isn't doing anything apart from these tests, it
is expected that forced polling mode will likely give the highest
throughput, but it could be considered cheating.

Test procedure:

First create some continuous stress on the SPI bus:

 $ (while true; do dd if=/dev/mtdblock2 of=/dev/null bs=1024; done) >/dev/null 2>&1 &

The histogram shows this:

 $ for t in /sys/bus/spi/devices/spi0.0/statistics/transfer_bytes_histo_*; do printf "%s:\t%d\n" $t `cat $t`; done
 /sys/bus/spi/devices/spi0.0/statistics/transfer_bytes_histo_0-1:          228583
 /sys/bus/spi/devices/spi0.0/statistics/transfer_bytes_histo_1024-2047:    0
 /sys/bus/spi/devices/spi0.0/statistics/transfer_bytes_histo_128-255:      1
 /sys/bus/spi/devices/spi0.0/statistics/transfer_bytes_histo_16-31:        1
 /sys/bus/spi/devices/spi0.0/statistics/transfer_bytes_histo_16384-32767:  0
 /sys/bus/spi/devices/spi0.0/statistics/transfer_bytes_histo_2-3:          228579
 /sys/bus/spi/devices/spi0.0/statistics/transfer_bytes_histo_2048-4095:    0
 /sys/bus/spi/devices/spi0.0/statistics/transfer_bytes_histo_256-511:      0
 /sys/bus/spi/devices/spi0.0/statistics/transfer_bytes_histo_32-63:        1
 /sys/bus/spi/devices/spi0.0/statistics/transfer_bytes_histo_32768-65535:  0
 /sys/bus/spi/devices/spi0.0/statistics/transfer_bytes_histo_4-7:          1
 /sys/bus/spi/devices/spi0.0/statistics/transfer_bytes_histo_4096-8191:    0
 /sys/bus/spi/devices/spi0.0/statistics/transfer_bytes_histo_512-1023:     228576
 /sys/bus/spi/devices/spi0.0/statistics/transfer_bytes_histo_64-127:       0
 /sys/bus/spi/devices/spi0.0/statistics/transfer_bytes_histo_65536+:       0
 /sys/bus/spi/devices/spi0.0/statistics/transfer_bytes_histo_8-15:         0
 /sys/bus/spi/devices/spi0.0/statistics/transfer_bytes_histo_8192-16383:   0

There are no spi_async() transfers involved. The spi-nor driver
apparently only issues sync calls in this scenario, which makes sense.
Note that even though I specified "bs=1024" to the dd command, the
spi-nor driver seems to want to do 512 byte transfers anyway. Making
"bs" very small doesn't change this either, so we cannot simulate many
very small transfers this way unfortunately. This means that the
performance impact is likely very limited in this scenario, since the
overhead of a single transfer isn't that much compared to its size.

The data below shows the CPU load distribution while running the above
command (this is the topmost line of the "top" tool I have on this
board) and the throughput of a single call to dd.

Tested kernel builds:

1. Kernel 5.19-rc2 vanilla

 1.1. use_dma = true (default):
      CPU: 0.4% usr 20.4% sys 0.0% nic 0.0% idle 77.5% io 0.0% irq 1.5% sirq
      3604480 bytes (3.4MB) copied, 3.685375 seconds, 955.1KB/s

 1.2. use_dma = false:
      CPU: 0.5% usr 55.9% sys 0.0% nic 0.0% idle 43.2% io 0.0% irq 0.2% sirq
      3604480 bytes (3.4MB) copied, 2.017914 seconds, 1.7MB/s

 1.3. forced polling:
      CPU: 0.3% usr 99.4% sys 0.0% nic 0.0% idle 0.0% io 0.0% irq 0.1% sirq
      3604480 bytes (3.4MB) copied, 1.330517 seconds, 2.6MB/s

2. Kernel 5.19-rc2 with only the per-cpu stats patch:

 2.1. use_dma = true (default):
      CPU: 0.2% usr 20.1% sys 0.0% nic 0.0% idle 78.3% io 0.0% irq 1.3% sirq
      3604480 bytes (3.4MB) copied, 3.682393 seconds, 955.9KB/s

 2.2. use_dma = false:
      CPU: 0.5% usr 56.8% sys 0.0% nic 0.0% idle 42.3% io 0.0% irq 0.2% sirq
      3604480 bytes (3.4MB) copied, 2.015509 seconds, 1.7MB/s

 2.3. forced polling:
      CPU: 0.3% usr 99.4% sys 0.0% nic 0.0% idle 0.0% io 0.0% irq 0.1% sirq
      3604480 bytes (3.4MB) copied, 1.319247 seconds, 2.6MB/s

3. Kernel 5.19-rc2 with both per-cpu stats and optimized sync path patches:

 3.1. use_dma = true (default):
      CPU: 0.0% usr 17.8% sys 0.0% nic 0.0% idle 81.0% io 0.0% irq 1.1% sirq
      3604480 bytes (3.4MB) copied, 3.650646 seconds, 964.2KB/s

 3.2. use_dma = false:
      CPU: 0.6% usr 43.8% sys 0.0% nic 0.0% idle 55.2% io 0.0% irq 0.3% sirq
      3604480 bytes (3.4MB) copied, 1.971478 seconds, 1.7MB/s

 3.3. forced polling:
      CPU: 0.3% usr 99.6% sys 0.0% nic 0.0% idle 0.0% io 0.0% irq 0.0% sirq
      3604480 bytes (3.4MB) copied, 1.314353 seconds, 2.6MB/s

Of course I measured each value several times. The results were very
consistent, and I picked the median result for each of the cases above.
The differences are small, as expected given that the data transfers are
rather big, but especially if you look at the "seconds" value returned
by "dd" there is an incremental improvement with each of the patches.
The total improvement seems to be around 1.5-2% less CPU overhead...
which IMHO is still remarkable given how insensitive this test is to
transfer overhead.

Testing the impact on the touchscreen driver doesn't make much sense
though, since it runs at a 1MHz clock rate and is already optimized to
only do very large single SPI transfers, so the overhead will definitely
be noise at best.

Will report more results if I can test on other platforms.

Best regards,
On Wed, Jun 15, 2022 at 04:13:56PM +0200, David Jander wrote:
> Marc Kleine-Budde <mkl@pengutronix.de> wrote:
> > Which git branch to use as the base?

> Sorry, forgot to mention: spi for-next:
> git://git.kernel.org/pub/scm/linux/kernel/git/broonie/spi.git

> This is because it relies on previous patches that have been applied there.

It looks like something changed underneath the series unfortunately -
I'm getting failures applying even the first patch. Can you rebase
please?

It looks good to start trying to throw at CI. Ideally we'd get more
on-list review, but most of the active work recently has been around the
MTD stuff, which is all about trying to use hardware offload for flash
operations and therefore should be minimally affected by the actual SPI
data path.
On Mon, 20 Jun 2022 19:15:13 +0100
Mark Brown <broonie@kernel.org> wrote:

> On Wed, Jun 15, 2022 at 04:13:56PM +0200, David Jander wrote:
> > Marc Kleine-Budde <mkl@pengutronix.de> wrote:
> > > Which git branch to use as the base?
>
> > Sorry, forgot to mention: spi for-next:
> > git://git.kernel.org/pub/scm/linux/kernel/git/broonie/spi.git
>
> > This is because it relies on previous patches that have been applied there.
>
> It looks like something changed underneath the series unfortunately -
> I'm getting failures applying even the first patch. Can you rebase
> please?
>
> It looks good to start trying to throw at CI. Ideally we'd get more
> on-list review, but most of the active work recently has been around the
> MTD stuff, which is all about trying to use hardware offload for flash
> operations and therefore should be minimally affected by the actual SPI
> data path.

I just rebased v3 onto the current spi for-next branch. No other changes.

Best regards,