Message ID | cover.1564825752.git.lukas@wunner.de (mailing list archive) |
---|---|
Headers | show |
Series | Raspberry Pi SPI speedups | expand |
Den 03.08.2019 12.10, skrev Lukas Wunner: > So far the BCM2835 SPI driver cannot cope with TX-only and RX-only > transfers (rx_buf or tx_buf is NULL) when using DMA: It relies on > the SPI core to convert them to full-duplex transfers by allocating > and DMA-mapping a dummy rx_buf or tx_buf. This costs performance. > > Resolve by pre-allocating reusable DMA descriptors which cyclically > clear the RX FIFO (for TX-only transfers) or zero-fill the TX FIFO > (for RX-only transfers). Patch [07/10] provides some numbers for > the achieved latency improvement and CPU time reduction with an > SPI Ethernet controller. SPI displays should see a similar speedup. > I've also made an effort to reduce peripheral and memory bus accesses. > > The series is meant to be applied on top of broonie/for-next. > It can be applied to Linus' current tree if commit > 8d8bef503658 ("spi: bcm2835: Fix 3-wire mode if DMA is enabled") > is cherry-picked from broonie's repo beforehand. > > Please review and test. Thank you. > Tested-by: Noralf Trønnes <noralf@tronnes.org> I've tested on a 320x240 RGB565 display, flipping 2 framebuffers as fast as possible. The buffers are 320x240x2=150kB which are maxsize split by spi-bcm2835. I see a small increase in speed from ~34.75 to ~35.55 fps using this series. Details: $ ~/libdrm/tests/modetest/modetest -M mi0283qt -s 31:320x240@RG16 -v setting mode 320x240-0Hz@RG16 on connectors 31, crtc 34 freq: 30.36Hz freq: 35.32Hz freq: 35.56Hz freq: 35.57Hz freq: 35.54Hz freq: 35.55Hz freq: 35.60Hz freq: 35.62Hz freq: 35.50Hz freq: 35.23Hz freq: 35.49Hz freq: 35.44Hz freq: 35.39Hz freq: 35.58Hz $ od --endian big -tu4 -An /sys/bus/spi/devices/spi0.0/of_node/spi-max-frequency 64000000 <skip those that are zero> pi@pi2835:~$ tail -n +1 /sys/bus/spi/devices/spi0.0/statistics/* ==> /sys/bus/spi/devices/spi0.0/statistics/bytes <== 131644720 ==> /sys/bus/spi/devices/spi0.0/statistics/bytes_rx <== 2 ==> /sys/bus/spi/devices/spi0.0/statistics/bytes_tx <== 131644718 ==> /sys/bus/spi/devices/spi0.0/statistics/messages <== 5188 ==> /sys/bus/spi/devices/spi0.0/statistics/spi_sync <== 5188 ==> /sys/bus/spi/devices/spi0.0/statistics/spi_sync_immediate <== 5188 ==> /sys/bus/spi/devices/spi0.0/statistics/transfer_bytes_histo_0-1 <== 2609 ==> /sys/bus/spi/devices/spi0.0/statistics/transfer_bytes_histo_16384-32767 <== 857 ==> /sys/bus/spi/devices/spi0.0/statistics/transfer_bytes_histo_2-3 <== 5 ==> /sys/bus/spi/devices/spi0.0/statistics/transfer_bytes_histo_32768-65535 <== 1714 ==> /sys/bus/spi/devices/spi0.0/statistics/transfer_bytes_histo_4-7 <== 1717 ==> /sys/bus/spi/devices/spi0.0/statistics/transfer_bytes_histo_8-15 <== 2 ==> /sys/bus/spi/devices/spi0.0/statistics/transfers <== 6904 ==> /sys/bus/spi/devices/spi0.0/statistics/transfers_split_maxsize <== 857 Noralf.
Hi Lukas, Am 03.08.19 um 12:10 schrieb Lukas Wunner: > So far the BCM2835 SPI driver cannot cope with TX-only and RX-only > transfers (rx_buf or tx_buf is NULL) when using DMA: It relies on > the SPI core to convert them to full-duplex transfers by allocating > and DMA-mapping a dummy rx_buf or tx_buf. This costs performance. > > Resolve by pre-allocating reusable DMA descriptors which cyclically > clear the RX FIFO (for TX-only transfers) or zero-fill the TX FIFO > (for RX-only transfers). Patch [07/10] provides some numbers for > the achieved latency improvement and CPU time reduction with an > SPI Ethernet controller. SPI displays should see a similar speedup. > I've also made an effort to reduce peripheral and memory bus accesses. i know the BCM2711 / Raspberry Pi 4 isn't upstreamed yet, but this series hasn't been tested with RPi 4? I only want to know.
On Sun, Aug 11, 2019 at 09:50:17PM +0200, Stefan Wahren wrote: > Am 03.08.19 um 12:10 schrieb Lukas Wunner: > > So far the BCM2835 SPI driver cannot cope with TX-only and RX-only > > transfers (rx_buf or tx_buf is NULL) when using DMA: It relies on > > the SPI core to convert them to full-duplex transfers by allocating > > and DMA-mapping a dummy rx_buf or tx_buf. This costs performance. > > > > Resolve by pre-allocating reusable DMA descriptors which cyclically > > clear the RX FIFO (for TX-only transfers) or zero-fill the TX FIFO > > (for RX-only transfers). Patch [07/10] provides some numbers for > > the achieved latency improvement and CPU time reduction with an > > SPI Ethernet controller. SPI displays should see a similar speedup. > > I've also made an effort to reduce peripheral and memory bus accesses. > > i know the BCM2711 / Raspberry Pi 4 isn't upstreamed yet, but this > series hasn't been tested with RPi 4? No, I only have the CM1 and CM3 at my disposal for testing, the series seemed to work fine on both. Thanks, Lukas
Am 03.08.19 um 12:10 schrieb Lukas Wunner: > So far the BCM2835 SPI driver cannot cope with TX-only and RX-only > transfers (rx_buf or tx_buf is NULL) when using DMA: It relies on > the SPI core to convert them to full-duplex transfers by allocating > and DMA-mapping a dummy rx_buf or tx_buf. This costs performance. > > Resolve by pre-allocating reusable DMA descriptors which cyclically > clear the RX FIFO (for TX-only transfers) or zero-fill the TX FIFO > (for RX-only transfers). Patch [07/10] provides some numbers for > the achieved latency improvement and CPU time reduction with an > SPI Ethernet controller. SPI displays should see a similar speedup. > I've also made an effort to reduce peripheral and memory bus accesses. > > The series is meant to be applied on top of broonie/for-next. > It can be applied to Linus' current tree if commit > 8d8bef503658 ("spi: bcm2835: Fix 3-wire mode if DMA is enabled") > is cherry-picked from broonie's repo beforehand. > > Please review and test. Thank you. > > Lukas Wunner (10): > dmaengine: bcm2835: Allow reusable descriptors > dmaengine: bcm2835: Allow cyclic transactions without interrupt > spi: Guarantee cacheline alignment of driver-private data > spi: bcm2835: Drop dma_pending flag > spi: bcm2835: Work around DONE bit erratum > spi: bcm2835: Cache CS register value for ->prepare_message() > spi: bcm2835: Speed up TX-only DMA transfers by clearing RX FIFO > dmaengine: bcm2835: Document struct bcm2835_dmadev > dmaengine: bcm2835: Avoid accessing memory when copying zeroes > spi: bcm2835: Speed up RX-only DMA transfers by zero-filling TX FIFO > Acked-by: Stefan Wahren <wahrenst@gmx.net> Sorry, for this late reply
> On 03.08.2019, at 12:10, Lukas Wunner <lukas@wunner.de> wrote: > > Lukas Wunner (10): > dmaengine: bcm2835: Allow reusable descriptors > dmaengine: bcm2835: Allow cyclic transactions without interrupt > spi: Guarantee cacheline alignment of driver-private data > spi: bcm2835: Drop dma_pending flag > spi: bcm2835: Work around DONE bit erratum > spi: bcm2835: Cache CS register value for ->prepare_message() > spi: bcm2835: Speed up TX-only DMA transfers by clearing RX FIFO > dmaengine: bcm2835: Document struct bcm2835_dmadev > dmaengine: bcm2835: Avoid accessing memory when copying zeroes > spi: bcm2835: Speed up RX-only DMA transfers by zero-filling TX FIFO > Acked-by: Martin Sperl <kernel@martin.sperl.org>
Dear Marc, to alleviate you of having to add all the Acked-by and Tested-by tags to this series, I've prepared a "prêt-à-porter" branch complete with all tags which you can cherry-pick or merge from. If you decide to instead apply the patches yourself, you can double-check the result by comparing it to my branch with "git range-diff". If you have any comments on the series or would like to have anything changed, please let me know. Thanks! ---------------------------------------------------------------- The following changes since commit c55be305915974db160ce6472722ff74f45b8d4e: spi: spi-fsl-dspi: Use poll mode in case the platform IRQ is missing (2019-08-23 12:01:44 +0100) are available in the git repository at: https://github.com/l1k/linux bcm2835_spi_simplex_v1 for you to fetch changes up to 37ad33d4bee27d9a24f1deffd675e327d1bb899e: spi: bcm2835: Speed up RX-only DMA transfers by zero-filling TX FIFO (2019-08-24 11:54:11 +0200) ---------------------------------------------------------------- So far the BCM2835 SPI driver cannot cope with TX-only and RX-only transfers (rx_buf or tx_buf is NULL) when using DMA: It relies on the SPI core to convert them to full-duplex transfers by allocating and DMA-mapping a dummy rx_buf or tx_buf. This costs performance. Resolve by pre-allocating reusable DMA descriptors which cyclically clear the RX FIFO (for TX-only transfers) or zero-fill the TX FIFO (for RX-only transfers). Commit fecf4ba3f248 provides some numbers for the achieved latency improvement and CPU time reduction with an SPI Ethernet controller. SPI displays should see a similar speedup. I've also made an effort to reduce peripheral and memory bus accesses. ---------------------------------------------------------------- Lukas Wunner (10): dmaengine: bcm2835: Allow reusable descriptors dmaengine: bcm2835: Allow cyclic transactions without interrupt spi: Guarantee cacheline alignment of driver-private data spi: bcm2835: Drop dma_pending flag spi: bcm2835: Work around DONE bit erratum spi: bcm2835: Cache CS register value for ->prepare_message() spi: bcm2835: Speed up TX-only DMA transfers by clearing RX FIFO dmaengine: bcm2835: Document struct bcm2835_dmadev dmaengine: bcm2835: Avoid accessing memory when copying zeroes spi: bcm2835: Speed up RX-only DMA transfers by zero-filling TX FIFO drivers/dma/bcm2835-dma.c | 38 ++++- drivers/spi/spi-bcm2835.c | 408 ++++++++++++++++++++++++++++++++++++++-------- drivers/spi/spi.c | 18 +- 3 files changed, 390 insertions(+), 74 deletions(-)
Dear Mark, On Sat, Aug 03, 2019 at 12:10:00PM +0200, Lukas Wunner wrote: > So far the BCM2835 SPI driver cannot cope with TX-only and RX-only > transfers (rx_buf or tx_buf is NULL) when using DMA: It relies on > the SPI core to convert them to full-duplex transfers by allocating > and DMA-mapping a dummy rx_buf or tx_buf. This costs performance. > > Resolve by pre-allocating reusable DMA descriptors which cyclically > clear the RX FIFO (for TX-only transfers) or zero-fill the TX FIFO > (for RX-only transfers). Patch [07/10] provides some numbers for > the achieved latency improvement and CPU time reduction with an > SPI Ethernet controller. SPI displays should see a similar speedup. > I've also made an effort to reduce peripheral and memory bus accesses. Just a gentle ping, this patch set was posted to the list 5 weeks ago, has all necessary acks and has been tested successfully by 2 people besides myself. Do you have any thoughts on it? Any objections? In case the patches no longer apply cleanly I've prepared this branch based on your for-next branch of Aug 23 from which you can merge if you prefer that: https://github.com/l1k/linux/commits/bcm2835_spi_simplex_v1 However I can also repost if necessary. (PS: Apologies for misspelling your name as "Marc" in my e-mail of Aug 24.) Thanks, Lukas > Lukas Wunner (10): > dmaengine: bcm2835: Allow reusable descriptors > dmaengine: bcm2835: Allow cyclic transactions without interrupt > spi: Guarantee cacheline alignment of driver-private data > spi: bcm2835: Drop dma_pending flag > spi: bcm2835: Work around DONE bit erratum > spi: bcm2835: Cache CS register value for ->prepare_message() > spi: bcm2835: Speed up TX-only DMA transfers by clearing RX FIFO > dmaengine: bcm2835: Document struct bcm2835_dmadev > dmaengine: bcm2835: Avoid accessing memory when copying zeroes > spi: bcm2835: Speed up RX-only DMA transfers by zero-filling TX FIFO > > drivers/dma/bcm2835-dma.c | 38 +++- > drivers/spi/spi-bcm2835.c | 408 ++++++++++++++++++++++++++++++++------ > drivers/spi/spi.c | 18 +- > 3 files changed, 390 insertions(+), 74 deletions(-) > > -- > 2.20.1
On Sat, Sep 07, 2019 at 11:06:37AM +0200, Lukas Wunner wrote: > Just a gentle ping, this patch set was posted to the list 5 weeks ago, > has all necessary acks and has been tested successfully by 2 people > besides myself. Please don't send content free pings and please allow a reasonable time for review. People get busy, go on holiday, attend conferences and so on so unless there is some reason for urgency (like critical bug fixes) please allow at least a couple of weeks for review. If there have been review comments then people may be waiting for those to be addressed. Sending content free pings adds to the mail volume (if they are seen at all) which is often the problem and since they can't be reviewed directly if something has gone wrong you'll have to resend the patches anyway, so sending again is generally a better approach though there are some other maintainers who like them - if in doubt look at how patches for the subsystem are normally handled.
On Sat, Sep 07, 2019 at 11:06:37AM +0200, Lukas Wunner wrote:
> Do you have any thoughts on it? Any objections?
Having found this in the archives something went quite wrong with
the posting, the patches appear to be in a random non-sequential
order:
20466 N T 08/03 Lukas Wunner (1.7K) [PATCH 00/10] Raspberry Pi SPI speedup
20467 N T 08/03 Lukas Wunner (2.0K) ├─>[PATCH 02/10] dmaengine: bcm2835: A
20468 N C 08/08 Vinod Koul (2.2K) │ └─>
20469 N T 08/03 Lukas Wunner (1.1K) ├─>[PATCH 01/10] dmaengine: bcm2835: A
20470 N C 08/08 Vinod Koul (1.3K) │ └─>
20471 N T 08/03 Lukas Wunner (3.9K) ├─>[PATCH 04/10] spi: bcm2835: Drop dm
20472 N T 08/03 Lukas Wunner (3.5K) ├─>[PATCH 05/10] spi: bcm2835: Work ar
I have no recollection of what happened with this but I'd expect
it's something to do with that and the lengthy discussion of some
bits.
It also doesn't apply any more, please resend. In addition it
looks like patch 5 is a fix but it's in the middle of the series
for some reason, in general bug fixes should always be sent as
the first so that they don't depend on anything else.
To repeat what I said in the standard mail please don't send
pings, they're just noise - I can't apply a ping.