Message ID | 20240219-mainline-spi-precook-message-v2-0-4a762c6701b9@baylibre.com (mailing list archive) |
---|---|
Headers | show |
Series | spi: add support for pre-cooking messages | expand |
On Mon, 19 Feb 2024 16:33:17 -0600, David Lechner wrote: > This is a follow-up to [1] where it was suggested to break down the > proposed SPI offload support into smaller series. > > This takes on the first suggested task of introducing an API to > "pre-cook" SPI messages. This idea was first discussed extensively in > 2013 [2][3] and revisited more briefly 2022 [4]. > > [...] Applied to https://git.kernel.org/pub/scm/linux/kernel/git/broonie/spi.git for-next Thanks! [1/5] spi: add spi_optimize_message() APIs commit: 7b1d87af14d9ae902ed0c5dc5fabf4eea5abdf02 [2/5] spi: move splitting transfers to spi_optimize_message() commit: fab53fea21a909e4e0656764a8ee7c356fe89d6f [3/5] spi: stm32: move splitting transfers to optimize_message commit: c2bcfe7c6edf418d5adf731a7d60a8abd81e952f [4/5] spi: axi-spi-engine: move message compile to optimize_message commit: 7dba2adb063bcf7a293eacb88980e0975b1fb1fd All being well this means that it will be integrated into the linux-next tree (usually sometime in the next 24 hours) and sent to Linus during the next merge window (or sooner if it is a bug fix), however if problems are discovered then the patch may be dropped or reverted. You may get further e-mails resulting from automated or manual testing and review of the tree, please engage with people reporting problems and send followup patches addressing any issues that are reported if needed. If any updates are required or you are submitting further changes they should be sent as incremental updates against current git, existing patches will not be replaced. Please add any relevant lists and maintainers to the CCs when replying to this mail. Thanks, Mark
Hi David, On Mon, Feb 19, 2024 at 04:33:17PM -0600, David Lechner wrote: > This is a follow-up to [1] where it was suggested to break down the > proposed SPI offload support into smaller series. > > This takes on the first suggested task of introducing an API to > "pre-cook" SPI messages. This idea was first discussed extensively in > 2013 [2][3] and revisited more briefly 2022 [4]. > > The goal here is to be able to improve performance (higher throughput, > and reduced CPU usage) by allowing peripheral drivers that use the > same struct spi_message repeatedly to "pre-cook" the message once to > avoid repeating the same validation, and possibly other operations each > time the message is sent. > > This series includes __spi_validate() and the automatic splitting of > xfers in the optimizations. Another frequently suggested optimization > is doing DMA mapping only once. This is not included in this series, but > can be added later (preferably by someone with a real use case for it). > > To show how this all works and get some real-world measurements, this > series includes the core changes, optimization of a SPI controller > driver, and optimization of an ADC driver. This test case was only able > to take advantage of the single validation optimization, since it didn't > require splitting transfers. With these changes, CPU usage of the > threaded interrupt handler, which calls spi_sync(), was reduced from > 83% to 73% while at the same time the sample rate (frequency of SPI > xfers) was increased from 20kHz to 25kHz. > > [1]: https://lore.kernel.org/linux-spi/20240109-axi-spi-engine-series-3-v1-1-e42c6a986580@baylibre.com/T/ > [2]: https://lore.kernel.org/linux-spi/E81F4810-48DD-41EE-B110-D0D848B8A510@martin.sperl.org/T/ > [3]: https://lore.kernel.org/linux-spi/39DEC004-10A1-47EF-9D77-276188D2580C@martin.sperl.org/T/ > [4]: https://lore.kernel.org/linux-spi/20220525163946.48ea40c9@erd992/T/ I have a regression after this patch set. My system explodes with following trace: Unable to handle kernel paging request at virtual address 0064f69bfd319200 Mem abort info: ESR = 0x0000000086000004 EC = 0x21: IABT (current EL), IL = 32 bits SET = 0, FnV = 0 EA = 0, S1PTW = 0 FSC = 0x04: level 0 translation fault [0064f69bfd319200] address between user and kernel address ranges Internal error: Oops: 0000000086000004 [#1] PREEMPT SMP Modules linked in: lan78xx etnaviv crct10dif_ce gpu_sched fsl_imx8_ddr_perf caam ina2xx ina3221 rtc_snvs error imx8mm_thermal imx_cpufreq_dt fuse CPU: 2 PID: 103 Comm: spi2 Tainted: G W 6.8.0-rc3-00118-g82b98fb8cd33 #36 Hardware name: Philips i.MX8MPlus Delta Evaluation Kit (DT) pstate: 20000005 (nzCv daif -PAN -UAO -TCO -DIT -SSBS BTYPE=--) pc : 0x64f69bfd319200 lr : spi_res_release+0x68/0xb8 sp : ffff800083a6bcb0 x29: ffff800083a6bcb0 x28: ffff800081906008 x27: ffff8000809feea8 x26: 0000000000000001 x25: dead000000000122 x24: dead000000000100 x23: ffff800083c7ba38 x22: ffff0000c3ca5000 x21: ffff800083c7b9e0 x20: ffff800083c7ba40 x19: ffff800083c7bb48 x18: 0020000000000000 x17: 0000000000901400 x16: 00000000b0891400 x15: 049d200e412d4100 x14: 00000090fff6c1f4 x13: 000000280b412d41 x12: 000ed3ddde0a5202 x11: 6e0194029343039e x10: 00000000000014c0 x9 : ffff8000809fd5dc x8 : ffff800081906008 x7 : ffff0000c3c9c41c x6 : ffff8000800f34b0 x5 : 0000000000000000 x4 : ffff800083a6bab8 x3 : 8f64f69bfd319200 x2 : ffff800083c7bb60 x1 : ffff800083c7b9e0 x0 : ffff0000c3ca5000 Call trace: 0x64f69bfd319200 spi_async+0xac/0x110 9H0[]6 spi_mux_transfer_one_message+0xb8/0xf0 __spi_pump_transfer_message+0x26c/0x650 __spi_pump_messages+0xe0/0x228 spi_pump_messages+0x20/0x38 kthread_worker_fn+0xe0/0x2a8 kthread+0x10c/0x120 ret_from_fork+0x10/0x20 Code: ???????? ???????? ???????? ???????? (????????) ---[ end trace 0000000000000000 ]--- Kernel panic - not syncing: Oops: Fatal exception SMP: stopping secondary CPUs Kernel Offset: disabled CPU features: 0x0,00000008,00020000,2100421b Memory Limit: none Rebooting in 10 seconds.. It seems to be spi_mux specific. We have seen similar trace on other system with spi_mux. Regards, Oleksij
On 28.06.2024 11:49:38, Oleksij Rempel wrote: > It seems to be spi_mux specific. We have seen similar trace on other system > with spi_mux. Here is the other backtrace from another imx8mp system with a completely different workload. Both have in common that they use a spi_mux on the spi-imx driver. Unable to handle kernel NULL pointer dereference at virtual address 0000000000000dd0 Mem abort info: ESR = 0x0000000096000004 EC = 0x25: DABT (current EL), IL = 32 bits SET = 0, FnV = 0 EA = 0, S1PTW = 0 FSC = 0x04: level 0 translation fault Data abort info: ISV = 0, ISS = 0x00000004, ISS2 = 0x00000000 CM = 0, WnR = 0, TnD = 0, TagAccess = 0 GCS = 0, Overlay = 0, DirtyBit = 0, Xs = 0 user pgtable: 4k pages, 48-bit VAs, pgdp=0000000046760000 [0000000000000dd0] pgd=0000000000000000, p4d=0000000000000000 Internal error: Oops: 0000000096000004 [#1] PREEMPT SMP Modules linked in: can_raw can ti_ads7950 industrialio_triggered_buffer kfifo_buf spi_mux fsl_imx8_ddr_perf at24 flexcan caam can_dev error rtc_snvs imx8mm_thermal spi_imx capture_events_irq cfg80211 iio_trig_hrtimer industrialio_sw_trigger ind> CPU: 3 PID: 177 Comm: spi5 Not tainted 6.9.0 #1 Hardware name: xxxxxxxxxxxxxxxx (xxxxxxxxx) (DT) pstate: 60000005 (nZCv daif -PAN -UAO -TCO -DIT -SSBS BTYPE=--) pc : spi_res_release+0x24/0xb8 lr : spi_async+0xac/0x118 sp : ffff8000823fbcc0 x29: ffff8000823fbcc0 x28: 0000000000000000 x27: 0000000000000000 x26: ffff8000807bef88 x25: ffff80008115c008 x24: 0000000000000000 x23: ffff8000826c3938 x22: 0000000000000000 x21: ffff0000076a9800 x20: 0000000000000000 x19: 0000000000000dc8 x18: 0000000000000000 x17: 0000000000000000 x16: 0000000000000000 x15: 0000ffff88c0e760 x14: 0000000000000000 x13: 0000000000000000 x12: 0000000000000000 x11: ffff8000815a1f98 x10: ffff8000823fbb40 x9 : ffff8000807b8420 x8 : ffff800081508000 x7 : 0000000000000004 x6 : 0000000003ce4c66 x5 : 0000000001000000 x4 : 0000000000000000 x3 : 0000000001000000 x2 : 0000000000000000 x1 : ffff8000826c38e0 x0 : ffff0000076a9800 Call trace: spi_res_release+0x24/0xb8 spi_async+0xac/0x118 spi_mux_transfer_one_message+0xb8/0xf0 [spi_mux] __spi_pump_transfer_message+0x260/0x5d8 __spi_pump_messages+0xdc/0x320 spi_pump_messages+0x20/0x38 kthread_worker_fn+0xdc/0x220 kthread+0x118/0x128 ret_from_fork+0x10/0x20 Code: a90153f3 a90363f7 91016037 f9403033 (f9400674) ---[ end trace 0000000000000000 ]--- regards, Marc
On Fri, Jun 28, 2024 at 11:49:38AM +0200, Oleksij Rempel wrote: > On Mon, Feb 19, 2024 at 04:33:17PM -0600, David Lechner wrote: > I have a regression after this patch set. My system explodes with > following trace: Not looked at this at all but were you able to try a bisect? > Call trace: > 0x64f69bfd319200 > spi_async+0xac/0x110 > 9H0[]6 spi_mux_transfer_one_message+0xb8/0xf0 Could you run this through addr2line please?
On 6/28/24 5:16 AM, Marc Kleine-Budde wrote: > On 28.06.2024 11:49:38, Oleksij Rempel wrote: >> It seems to be spi_mux specific. We have seen similar trace on other system >> with spi_mux. > > Here is the other backtrace from another imx8mp system with a completely > different workload. Both have in common that they use a spi_mux on the > spi-imx driver. > > Unable to handle kernel NULL pointer dereference at virtual address 0000000000000dd0 > Mem abort info: > ESR = 0x0000000096000004 > EC = 0x25: DABT (current EL), IL = 32 bits > SET = 0, FnV = 0 > EA = 0, S1PTW = 0 > FSC = 0x04: level 0 translation fault > Data abort info: > ISV = 0, ISS = 0x00000004, ISS2 = 0x00000000 > CM = 0, WnR = 0, TnD = 0, TagAccess = 0 > GCS = 0, Overlay = 0, DirtyBit = 0, Xs = 0 > user pgtable: 4k pages, 48-bit VAs, pgdp=0000000046760000 > [0000000000000dd0] pgd=0000000000000000, p4d=0000000000000000 > Internal error: Oops: 0000000096000004 [#1] PREEMPT SMP > Modules linked in: can_raw can ti_ads7950 industrialio_triggered_buffer kfifo_buf spi_mux fsl_imx8_ddr_perf at24 flexcan caam can_dev error rtc_snvs imx8mm_thermal spi_imx capture_events_irq cfg80211 iio_trig_hrtimer industrialio_sw_trigger ind> > CPU: 3 PID: 177 Comm: spi5 Not tainted 6.9.0 #1 > Hardware name: xxxxxxxxxxxxxxxx (xxxxxxxxx) (DT) > pstate: 60000005 (nZCv daif -PAN -UAO -TCO -DIT -SSBS BTYPE=--) > pc : spi_res_release+0x24/0xb8 > lr : spi_async+0xac/0x118 > sp : ffff8000823fbcc0 > x29: ffff8000823fbcc0 x28: 0000000000000000 x27: 0000000000000000 > x26: ffff8000807bef88 x25: ffff80008115c008 x24: 0000000000000000 > x23: ffff8000826c3938 x22: 0000000000000000 x21: ffff0000076a9800 > x20: 0000000000000000 x19: 0000000000000dc8 x18: 0000000000000000 > x17: 0000000000000000 x16: 0000000000000000 x15: 0000ffff88c0e760 > x14: 0000000000000000 x13: 0000000000000000 x12: 0000000000000000 > x11: ffff8000815a1f98 x10: ffff8000823fbb40 x9 : ffff8000807b8420 > x8 : ffff800081508000 x7 : 0000000000000004 x6 : 0000000003ce4c66 > x5 : 0000000001000000 x4 : 0000000000000000 x3 : 0000000001000000 > x2 : 0000000000000000 x1 : ffff8000826c38e0 x0 : ffff0000076a9800 > Call trace: > spi_res_release+0x24/0xb8 > spi_async+0xac/0x118 > spi_mux_transfer_one_message+0xb8/0xf0 [spi_mux] > __spi_pump_transfer_message+0x260/0x5d8 > __spi_pump_messages+0xdc/0x320 > spi_pump_messages+0x20/0x38 > kthread_worker_fn+0xdc/0x220 > kthread+0x118/0x128 > ret_from_fork+0x10/0x20 > Code: a90153f3 a90363f7 91016037 f9403033 (f9400674) > ---[ end trace 0000000000000000 ]--- > > regards, > Marc > Hi Oleksij and Marc, I'm supposed to be on vacation so I didn't look into this deeply yet but I can see what is happening here. spi_mux_transfer_one_message() is calling spi_async() which is calling __spi_optimize_message() on an already optimized message. Then it also calls __spi_unoptimize_message() which tries to release resources. But this fails because the spi-mux driver has swapped out the pointer to the device in the SPI message. This causes the wrong ctlr to be passed to spi_res_release(), causing the crash. I don't know if a proper fix could be quite so simple, but here is something you could try (untested): --- diff --git a/drivers/spi/spi-mux.c b/drivers/spi/spi-mux.c index 5d72e3d59df8..ec837e28183d 100644 --- a/drivers/spi/spi-mux.c +++ b/drivers/spi/spi-mux.c @@ -42,6 +42,7 @@ struct spi_mux_priv { void (*child_msg_complete)(void *context); void *child_msg_context; struct spi_device *child_msg_dev; + bool child_msg_pre_optimized; struct mux_control *mux; }; @@ -94,6 +95,7 @@ static void spi_mux_complete_cb(void *context) m->complete = priv->child_msg_complete; m->context = priv->child_msg_context; m->spi = priv->child_msg_dev; + m->pre_optimized = priv->child_msg_pre_optimized; spi_finalize_current_message(ctlr); mux_control_deselect(priv->mux); } @@ -116,10 +118,12 @@ static int spi_mux_transfer_one_message(struct spi_controller *ctlr, priv->child_msg_complete = m->complete; priv->child_msg_context = m->context; priv->child_msg_dev = m->spi; + priv->child_msg_pre_optimized = m->pre_optimized; m->complete = spi_mux_complete_cb; m->context = priv; m->spi = priv->spi; + m->pre_optimized = true; /* do the transfer */ return spi_async(priv->spi, m);
Hi David, On Fri, Jun 28, 2024 at 10:27:28AM -0500, David Lechner wrote: > Hi Oleksij and Marc, > > I'm supposed to be on vacation so I didn't look into this deeply yet > but I can see what is happening here. > > spi_mux_transfer_one_message() is calling spi_async() which is calling > __spi_optimize_message() on an already optimized message. > > Then it also calls __spi_unoptimize_message() which tries to release > resources. But this fails because the spi-mux driver has swapped > out the pointer to the device in the SPI message. This causes the > wrong ctlr to be passed to spi_res_release(), causing the crash. > > I don't know if a proper fix could be quite so simple, but here is > something you could try (untested): Thx! I'll test it at Monday. Regards, Oleksij
On Fri, Jun 28, 2024 at 10:27:28AM -0500, David Lechner wrote: > On 6/28/24 5:16 AM, Marc Kleine-Budde wrote: > > On 28.06.2024 11:49:38, Oleksij Rempel wrote: > >> It seems to be spi_mux specific. We have seen similar trace on other system > >> with spi_mux. > > > > Here is the other backtrace from another imx8mp system with a completely > > different workload. Both have in common that they use a spi_mux on the > > spi-imx driver. > > > > Unable to handle kernel NULL pointer dereference at virtual address 0000000000000dd0 > > Mem abort info: > > ESR = 0x0000000096000004 > > EC = 0x25: DABT (current EL), IL = 32 bits > > SET = 0, FnV = 0 > > EA = 0, S1PTW = 0 > > FSC = 0x04: level 0 translation fault > > Data abort info: > > ISV = 0, ISS = 0x00000004, ISS2 = 0x00000000 > > CM = 0, WnR = 0, TnD = 0, TagAccess = 0 > > GCS = 0, Overlay = 0, DirtyBit = 0, Xs = 0 > > user pgtable: 4k pages, 48-bit VAs, pgdp=0000000046760000 > > [0000000000000dd0] pgd=0000000000000000, p4d=0000000000000000 > > Internal error: Oops: 0000000096000004 [#1] PREEMPT SMP > > Modules linked in: can_raw can ti_ads7950 industrialio_triggered_buffer kfifo_buf spi_mux fsl_imx8_ddr_perf at24 flexcan caam can_dev error rtc_snvs imx8mm_thermal spi_imx capture_events_irq cfg80211 iio_trig_hrtimer industrialio_sw_trigger ind> > > CPU: 3 PID: 177 Comm: spi5 Not tainted 6.9.0 #1 > > Hardware name: xxxxxxxxxxxxxxxx (xxxxxxxxx) (DT) > > pstate: 60000005 (nZCv daif -PAN -UAO -TCO -DIT -SSBS BTYPE=--) > > pc : spi_res_release+0x24/0xb8 > > lr : spi_async+0xac/0x118 > > sp : ffff8000823fbcc0 > > x29: ffff8000823fbcc0 x28: 0000000000000000 x27: 0000000000000000 > > x26: ffff8000807bef88 x25: ffff80008115c008 x24: 0000000000000000 > > x23: ffff8000826c3938 x22: 0000000000000000 x21: ffff0000076a9800 > > x20: 0000000000000000 x19: 0000000000000dc8 x18: 0000000000000000 > > x17: 0000000000000000 x16: 0000000000000000 x15: 0000ffff88c0e760 > > x14: 0000000000000000 x13: 0000000000000000 x12: 0000000000000000 > > x11: ffff8000815a1f98 x10: ffff8000823fbb40 x9 : ffff8000807b8420 > > x8 : ffff800081508000 x7 : 0000000000000004 x6 : 0000000003ce4c66 > > x5 : 0000000001000000 x4 : 0000000000000000 x3 : 0000000001000000 > > x2 : 0000000000000000 x1 : ffff8000826c38e0 x0 : ffff0000076a9800 > > Call trace: > > spi_res_release+0x24/0xb8 > > spi_async+0xac/0x118 > > spi_mux_transfer_one_message+0xb8/0xf0 [spi_mux] > > __spi_pump_transfer_message+0x260/0x5d8 > > __spi_pump_messages+0xdc/0x320 > > spi_pump_messages+0x20/0x38 > > kthread_worker_fn+0xdc/0x220 > > kthread+0x118/0x128 > > ret_from_fork+0x10/0x20 > > Code: a90153f3 a90363f7 91016037 f9403033 (f9400674) > > ---[ end trace 0000000000000000 ]--- > > > > regards, > > Marc > > > > > Hi Oleksij and Marc, > > I'm supposed to be on vacation so I didn't look into this deeply yet > but I can see what is happening here. > > spi_mux_transfer_one_message() is calling spi_async() which is calling > __spi_optimize_message() on an already optimized message. > > Then it also calls __spi_unoptimize_message() which tries to release > resources. But this fails because the spi-mux driver has swapped > out the pointer to the device in the SPI message. This causes the > wrong ctlr to be passed to spi_res_release(), causing the crash. > > I don't know if a proper fix could be quite so simple, but here is > something you could try (untested): This issue is still reproducible with following trace: [ 19.566433] Call trace: [ 19.568882] spi_async+0x90/0x118 [ 19.572204] spi_mux_transfer_one_message+0xd4/0x110 [ 19.577175] __spi_pump_transfer_message+0x2bc/0x6e8 [ 19.582149] __spi_pump_messages+0xe0/0x228 [ 19.586339] spi_pump_messages+0x20/0x38 [ 19.590271] kthread_worker_fn+0xe0/0x2e8 [ 19.594286] kthread+0x10c/0x120 [ 19.597518] ret_from_fork+0x10/0x20 addr2line for spi_async+0x90/0x118 -> drivers/spi/spi.c:2136 2132 static void __spi_unoptimize_message(struct spi_message *msg) 2133 { 2134 struct spi_controller *ctlr = msg->spi->controller; 2135 2136 if (ctlr->unoptimize_message) 2137 ctlr->unoptimize_message(msg); 2138 2139 spi_res_release(ctlr, msg); 2140 2141 msg->optimized = false; 2142 msg->opt_state = NULL; 2143 } Regards, Oleksij