Message ID | 20250102-b4-rkisp-noncoherent-v1-1-bba164f7132c@gmail.com (mailing list archive) |
---|---|
State | New |
Headers | show |
Series | media: rkisp1: allow non-coherent video capture buffers | expand |
On Thu, Jan 02, 2025 at 06:35:00PM +0300, Mikhail Rudenko wrote: > Currently, the rkisp1 driver always uses coherent DMA allocations for > video capture buffers. However, on some platforms, using non-coherent > buffers can improve performance, especially when CPU processing of > MMAP'ed video buffers is required. > > For example, on the Rockchip RK3399 running at maximum CPU frequency, > the time to memcpy a frame from a 1280x720 XRGB32 MMAP'ed buffer to a > malloc'ed userspace buffer decreases from 7.7 ms to 1.1 ms when using > non-coherent DMA allocation. CPU usage also decreases accordingly. What's the time taken by the cache management operations ? > This change allows userspace to request the allocation of non-coherent > buffers. Note that the behavior for existing users will remain unchanged > unless they explicitly set the V4L2_MEMORY_FLAG_NON_COHERENT flag when > allocating buffers. > > Signed-off-by: Mikhail Rudenko <mike.rudenko@gmail.com> > --- > drivers/media/platform/rockchip/rkisp1/rkisp1-capture.c | 1 + > 1 file changed, 1 insertion(+) > > diff --git a/drivers/media/platform/rockchip/rkisp1/rkisp1-capture.c b/drivers/media/platform/rockchip/rkisp1/rkisp1-capture.c > index 6dcefd144d5abe358323e37ac6133c6134ac636e..c94f7d1d73a92646457a27da20726ec6f92e7717 100644 > --- a/drivers/media/platform/rockchip/rkisp1/rkisp1-capture.c > +++ b/drivers/media/platform/rockchip/rkisp1/rkisp1-capture.c > @@ -1563,6 +1563,7 @@ static int rkisp1_register_capture(struct rkisp1_capture *cap) > q->timestamp_flags = V4L2_BUF_FLAG_TIMESTAMP_MONOTONIC; > q->lock = &node->vlock; > q->dev = cap->rkisp1->dev; > + q->allow_cache_hints = 1; > ret = vb2_queue_init(q); > if (ret) { > dev_err(cap->rkisp1->dev, > > --- > base-commit: 40ed9e9b2808beeb835bd0ed971fb364c285d39c > change-id: 20241231-b4-rkisp-noncoherent-ad6e7c7a68ba
Hi Laurent, On 2025-01-03 at 17:23 +02, Laurent Pinchart <laurent.pinchart@ideasonboard.com> wrote: > On Thu, Jan 02, 2025 at 06:35:00PM +0300, Mikhail Rudenko wrote: >> Currently, the rkisp1 driver always uses coherent DMA allocations for >> video capture buffers. However, on some platforms, using non-coherent >> buffers can improve performance, especially when CPU processing of >> MMAP'ed video buffers is required. >> >> For example, on the Rockchip RK3399 running at maximum CPU frequency, >> the time to memcpy a frame from a 1280x720 XRGB32 MMAP'ed buffer to a >> malloc'ed userspace buffer decreases from 7.7 ms to 1.1 ms when using >> non-coherent DMA allocation. CPU usage also decreases accordingly. > > What's the time taken by the cache management operations ? Sorry for the late reply, your question turned out a little more interesting than I expected initially. :) When capturing using Yavta with MMAP buffers under the conditions mentioned in the commit message, ftrace gives 437.6 +- 1.1 us for dma_sync_sgtable_for_cpu and 409 +- 14 us for dma_sync_sgtable_for_device. Thus, it looks like using non-coherent buffers in this case is more CPU-efficient even when considering cache management overhead. When trying to do the same measurements with libcamera, I failed. In a typical libcamera use case when MMAP buffers are allocated from a device, exported as dmabufs and then used for capture on the same device with DMABUF memory type, cache management in kernel is skipped [1] [2]. Also, vb2_dc_dmabuf_ops_{begin,end}_cpu_access are no-ops [3], so DMA_BUF_IOCTL_SYNC from userspace does not work either. So it looks like to make this change really useful, the above issue of cache management for libcamera/DMABUF/videobuf2-dma-contig has to be solved. I'm not an expert in this area, so any advice is kindly welcome. :) [1] https://git.linuxtv.org/media.git/tree/drivers/media/common/videobuf2/videobuf2-core.c?id=94794b5ce4d90ab134b0b101a02fddf6e74c437d#n411 [2] https://git.linuxtv.org/media.git/tree/drivers/media/common/videobuf2/videobuf2-core.c?id=94794b5ce4d90ab134b0b101a02fddf6e74c437d#n829 [3] https://git.linuxtv.org/media.git/tree/drivers/media/common/videobuf2/videobuf2-dma-contig.c?id=94794b5ce4d90ab134b0b101a02fddf6e74c437d#n426 -- Best regards, Mikhail Rudenko
Hi Mikhail and Laurent, On Wed, Jan 15, 2025 at 2:07 AM Mikhail Rudenko <mike.rudenko@gmail.com> wrote: > > > Hi Laurent, > > On 2025-01-03 at 17:23 +02, Laurent Pinchart <laurent.pinchart@ideasonboard.com> wrote: > > > On Thu, Jan 02, 2025 at 06:35:00PM +0300, Mikhail Rudenko wrote: > >> Currently, the rkisp1 driver always uses coherent DMA allocations for > >> video capture buffers. However, on some platforms, using non-coherent > >> buffers can improve performance, especially when CPU processing of > >> MMAP'ed video buffers is required. > >> > >> For example, on the Rockchip RK3399 running at maximum CPU frequency, > >> the time to memcpy a frame from a 1280x720 XRGB32 MMAP'ed buffer to a > >> malloc'ed userspace buffer decreases from 7.7 ms to 1.1 ms when using > >> non-coherent DMA allocation. CPU usage also decreases accordingly. > > > > What's the time taken by the cache management operations ? > > Sorry for the late reply, your question turned out a little more > interesting than I expected initially. :) > > When capturing using Yavta with MMAP buffers under the conditions mentioned > in the commit message, ftrace gives 437.6 +- 1.1 us for > dma_sync_sgtable_for_cpu and 409 +- 14 us for > dma_sync_sgtable_for_device. Thus, it looks like using non-coherent > buffers in this case is more CPU-efficient even when considering cache > management overhead. > > When trying to do the same measurements with libcamera, I failed. In a > typical libcamera use case when MMAP buffers are allocated from a > device, exported as dmabufs and then used for capture on the same device > with DMABUF memory type, cache management in kernel is skipped [1] > [2]. Also, vb2_dc_dmabuf_ops_{begin,end}_cpu_access are no-ops [3], so > DMA_BUF_IOCTL_SYNC from userspace does not work either. Oops, so I believe this is a bug. When an MMAP buffer is allocated in the non-coherent mode, those ops should perform proper cache maintenance. Let me send a patch to fix this in a couple of days unless someone does it earlier. Best regards, Tomasz > > So it looks like to make this change really useful, the above issue of > cache management for libcamera/DMABUF/videobuf2-dma-contig has to be > solved. I'm not an expert in this area, so any advice is kindly welcome. :) > > [1] https://git.linuxtv.org/media.git/tree/drivers/media/common/videobuf2/videobuf2-core.c?id=94794b5ce4d90ab134b0b101a02fddf6e74c437d#n411 > [2] https://git.linuxtv.org/media.git/tree/drivers/media/common/videobuf2/videobuf2-core.c?id=94794b5ce4d90ab134b0b101a02fddf6e74c437d#n829 > [3] https://git.linuxtv.org/media.git/tree/drivers/media/common/videobuf2/videobuf2-dma-contig.c?id=94794b5ce4d90ab134b0b101a02fddf6e74c437d#n426 > > -- > Best regards, > Mikhail Rudenko >
Hi Tomasz, On 2025-01-15 at 17:31 +09, Tomasz Figa <tfiga@chromium.org> wrote: > Hi Mikhail and Laurent, > > On Wed, Jan 15, 2025 at 2:07 AM Mikhail Rudenko <mike.rudenko@gmail.com> wrote: >> >> >> Hi Laurent, >> >> On 2025-01-03 at 17:23 +02, Laurent Pinchart <laurent.pinchart@ideasonboard.com> wrote: >> >> > On Thu, Jan 02, 2025 at 06:35:00PM +0300, Mikhail Rudenko wrote: >> >> Currently, the rkisp1 driver always uses coherent DMA allocations for >> >> video capture buffers. However, on some platforms, using non-coherent >> >> buffers can improve performance, especially when CPU processing of >> >> MMAP'ed video buffers is required. >> >> >> >> For example, on the Rockchip RK3399 running at maximum CPU frequency, >> >> the time to memcpy a frame from a 1280x720 XRGB32 MMAP'ed buffer to a >> >> malloc'ed userspace buffer decreases from 7.7 ms to 1.1 ms when using >> >> non-coherent DMA allocation. CPU usage also decreases accordingly. >> > >> > What's the time taken by the cache management operations ? >> >> Sorry for the late reply, your question turned out a little more >> interesting than I expected initially. :) >> >> When capturing using Yavta with MMAP buffers under the conditions mentioned >> in the commit message, ftrace gives 437.6 +- 1.1 us for >> dma_sync_sgtable_for_cpu and 409 +- 14 us for >> dma_sync_sgtable_for_device. Thus, it looks like using non-coherent >> buffers in this case is more CPU-efficient even when considering cache >> management overhead. >> >> When trying to do the same measurements with libcamera, I failed. In a >> typical libcamera use case when MMAP buffers are allocated from a >> device, exported as dmabufs and then used for capture on the same device >> with DMABUF memory type, cache management in kernel is skipped [1] >> [2]. Also, vb2_dc_dmabuf_ops_{begin,end}_cpu_access are no-ops [3], so >> DMA_BUF_IOCTL_SYNC from userspace does not work either. > > Oops, so I believe this is a bug. When an MMAP buffer is allocated in > the non-coherent mode, those ops should perform proper cache > maintenance. Thanks for pointing this out! > Let me send a patch to fix this in a couple of days unless someone > does it earlier. Now that we know that this is a bug, not an API misuse from my side, I can fix this myself and send a v2. Would this be okay for you? > Best regards, > Tomasz > >> >> So it looks like to make this change really useful, the above issue of >> cache management for libcamera/DMABUF/videobuf2-dma-contig has to be >> solved. I'm not an expert in this area, so any advice is kindly welcome. :) >> >> [1] https://git.linuxtv.org/media.git/tree/drivers/media/common/videobuf2/videobuf2-core.c?id=94794b5ce4d90ab134b0b101a02fddf6e74c437d#n411 >> [2] https://git.linuxtv.org/media.git/tree/drivers/media/common/videobuf2/videobuf2-core.c?id=94794b5ce4d90ab134b0b101a02fddf6e74c437d#n829 >> [3] https://git.linuxtv.org/media.git/tree/drivers/media/common/videobuf2/videobuf2-dma-contig.c?id=94794b5ce4d90ab134b0b101a02fddf6e74c437d#n426 >> >> -- >> Best regards, >> Mikhail Rudenko >> -- Best regards, Mikhail Rudenko
On Wed, Jan 15, 2025 at 10:30 PM Mikhail Rudenko <mike.rudenko@gmail.com> wrote: > > Hi Tomasz, > > On 2025-01-15 at 17:31 +09, Tomasz Figa <tfiga@chromium.org> wrote: > > > Hi Mikhail and Laurent, > > > > On Wed, Jan 15, 2025 at 2:07 AM Mikhail Rudenko <mike.rudenko@gmail.com> wrote: > >> > >> > >> Hi Laurent, > >> > >> On 2025-01-03 at 17:23 +02, Laurent Pinchart <laurent.pinchart@ideasonboard.com> wrote: > >> > >> > On Thu, Jan 02, 2025 at 06:35:00PM +0300, Mikhail Rudenko wrote: > >> >> Currently, the rkisp1 driver always uses coherent DMA allocations for > >> >> video capture buffers. However, on some platforms, using non-coherent > >> >> buffers can improve performance, especially when CPU processing of > >> >> MMAP'ed video buffers is required. > >> >> > >> >> For example, on the Rockchip RK3399 running at maximum CPU frequency, > >> >> the time to memcpy a frame from a 1280x720 XRGB32 MMAP'ed buffer to a > >> >> malloc'ed userspace buffer decreases from 7.7 ms to 1.1 ms when using > >> >> non-coherent DMA allocation. CPU usage also decreases accordingly. > >> > > >> > What's the time taken by the cache management operations ? > >> > >> Sorry for the late reply, your question turned out a little more > >> interesting than I expected initially. :) > >> > >> When capturing using Yavta with MMAP buffers under the conditions mentioned > >> in the commit message, ftrace gives 437.6 +- 1.1 us for > >> dma_sync_sgtable_for_cpu and 409 +- 14 us for > >> dma_sync_sgtable_for_device. Thus, it looks like using non-coherent > >> buffers in this case is more CPU-efficient even when considering cache > >> management overhead. > >> > >> When trying to do the same measurements with libcamera, I failed. In a > >> typical libcamera use case when MMAP buffers are allocated from a > >> device, exported as dmabufs and then used for capture on the same device > >> with DMABUF memory type, cache management in kernel is skipped [1] > >> [2]. Also, vb2_dc_dmabuf_ops_{begin,end}_cpu_access are no-ops [3], so > >> DMA_BUF_IOCTL_SYNC from userspace does not work either. > > > > Oops, so I believe this is a bug. When an MMAP buffer is allocated in > > the non-coherent mode, those ops should perform proper cache > > maintenance. > > Thanks for pointing this out! > > > Let me send a patch to fix this in a couple of days unless someone > > does it earlier. > > Now that we know that this is a bug, not an API misuse from my side, I > can fix this myself and send a v2. Would this be okay for you? I'd be more than happy :) > > > Best regards, > > Tomasz > > > >> > >> So it looks like to make this change really useful, the above issue of > >> cache management for libcamera/DMABUF/videobuf2-dma-contig has to be > >> solved. I'm not an expert in this area, so any advice is kindly welcome. :) > >> > >> [1] https://git.linuxtv.org/media.git/tree/drivers/media/common/videobuf2/videobuf2-core.c?id=94794b5ce4d90ab134b0b101a02fddf6e74c437d#n411 > >> [2] https://git.linuxtv.org/media.git/tree/drivers/media/common/videobuf2/videobuf2-core.c?id=94794b5ce4d90ab134b0b101a02fddf6e74c437d#n829 > >> [3] https://git.linuxtv.org/media.git/tree/drivers/media/common/videobuf2/videobuf2-dma-contig.c?id=94794b5ce4d90ab134b0b101a02fddf6e74c437d#n426 > >> > >> -- > >> Best regards, > >> Mikhail Rudenko > >> > > > -- > Best regards, > Mikhail Rudenko
On 2025-01-15 at 23:46 +09, Tomasz Figa <tfiga@chromium.org> wrote: > On Wed, Jan 15, 2025 at 10:30 PM Mikhail Rudenko <mike.rudenko@gmail.com> wrote: >> >> Hi Tomasz, >> >> On 2025-01-15 at 17:31 +09, Tomasz Figa <tfiga@chromium.org> wrote: >> >> > Hi Mikhail and Laurent, >> > >> > On Wed, Jan 15, 2025 at 2:07 AM Mikhail Rudenko <mike.rudenko@gmail.com> wrote: >> >> >> >> >> >> Hi Laurent, >> >> >> >> On 2025-01-03 at 17:23 +02, Laurent Pinchart <laurent.pinchart@ideasonboard.com> wrote: >> >> >> >> > On Thu, Jan 02, 2025 at 06:35:00PM +0300, Mikhail Rudenko wrote: >> >> >> Currently, the rkisp1 driver always uses coherent DMA allocations for >> >> >> video capture buffers. However, on some platforms, using non-coherent >> >> >> buffers can improve performance, especially when CPU processing of >> >> >> MMAP'ed video buffers is required. >> >> >> >> >> >> For example, on the Rockchip RK3399 running at maximum CPU frequency, >> >> >> the time to memcpy a frame from a 1280x720 XRGB32 MMAP'ed buffer to a >> >> >> malloc'ed userspace buffer decreases from 7.7 ms to 1.1 ms when using >> >> >> non-coherent DMA allocation. CPU usage also decreases accordingly. >> >> > >> >> > What's the time taken by the cache management operations ? >> >> >> >> Sorry for the late reply, your question turned out a little more >> >> interesting than I expected initially. :) >> >> >> >> When capturing using Yavta with MMAP buffers under the conditions mentioned >> >> in the commit message, ftrace gives 437.6 +- 1.1 us for >> >> dma_sync_sgtable_for_cpu and 409 +- 14 us for >> >> dma_sync_sgtable_for_device. Thus, it looks like using non-coherent >> >> buffers in this case is more CPU-efficient even when considering cache >> >> management overhead. >> >> >> >> When trying to do the same measurements with libcamera, I failed. In a >> >> typical libcamera use case when MMAP buffers are allocated from a >> >> device, exported as dmabufs and then used for capture on the same device >> >> with DMABUF memory type, cache management in kernel is skipped [1] >> >> [2]. Also, vb2_dc_dmabuf_ops_{begin,end}_cpu_access are no-ops [3], so >> >> DMA_BUF_IOCTL_SYNC from userspace does not work either. >> > >> > Oops, so I believe this is a bug. When an MMAP buffer is allocated in >> > the non-coherent mode, those ops should perform proper cache >> > maintenance. >> >> Thanks for pointing this out! >> >> > Let me send a patch to fix this in a couple of days unless someone >> > does it earlier. >> >> Now that we know that this is a bug, not an API misuse from my side, I >> can fix this myself and send a v2. Would this be okay for you? > > I'd be more than happy :) Done, see [1]. A review would be appreciated. :) [1] https://lore.kernel.org/all/20250115-b4-rkisp-noncoherent-v2-0-0853e1a24012@gmail.com/ -- Best regards, Mikhail Rudenko
Le mardi 14 janvier 2025 à 19:00 +0300, Mikhail Rudenko a écrit : > Hi Laurent, > > On 2025-01-03 at 17:23 +02, Laurent Pinchart <laurent.pinchart@ideasonboard.com> wrote: > > > On Thu, Jan 02, 2025 at 06:35:00PM +0300, Mikhail Rudenko wrote: > > > Currently, the rkisp1 driver always uses coherent DMA allocations for > > > video capture buffers. However, on some platforms, using non-coherent > > > buffers can improve performance, especially when CPU processing of > > > MMAP'ed video buffers is required. > > > > > > For example, on the Rockchip RK3399 running at maximum CPU frequency, > > > the time to memcpy a frame from a 1280x720 XRGB32 MMAP'ed buffer to a > > > malloc'ed userspace buffer decreases from 7.7 ms to 1.1 ms when using > > > non-coherent DMA allocation. CPU usage also decreases accordingly. > > > > What's the time taken by the cache management operations ? > > Sorry for the late reply, your question turned out a little more > interesting than I expected initially. :) > > When capturing using Yavta with MMAP buffers under the conditions mentioned > in the commit message, ftrace gives 437.6 +- 1.1 us for > dma_sync_sgtable_for_cpu and 409 +- 14 us for > dma_sync_sgtable_for_device. Thus, it looks like using non-coherent > buffers in this case is more CPU-efficient even when considering cache > management overhead. > > When trying to do the same measurements with libcamera, I failed. In a > typical libcamera use case when MMAP buffers are allocated from a > device, exported as dmabufs and then used for capture on the same device > with DMABUF memory type, cache management in kernel is skipped [1] > [2]. Also, vb2_dc_dmabuf_ops_{begin,end}_cpu_access are no-ops [3], so > DMA_BUF_IOCTL_SYNC from userspace does not work either. > > So it looks like to make this change really useful, the above issue of > cache management for libcamera/DMABUF/videobuf2-dma-contig has to be > solved. I'm not an expert in this area, so any advice is kindly welcome. :) The manual coherency hints are not implemented for DMABuf, and libcamera only do dmabuf. Someone will have to look into that. This is also why this API have very low adoption, its breaks easily. > > [1] https://git.linuxtv.org/media.git/tree/drivers/media/common/videobuf2/videobuf2-core.c?id=94794b5ce4d90ab134b0b101a02fddf6e74c437d#n411 > [2] https://git.linuxtv.org/media.git/tree/drivers/media/common/videobuf2/videobuf2-core.c?id=94794b5ce4d90ab134b0b101a02fddf6e74c437d#n829 > [3] https://git.linuxtv.org/media.git/tree/drivers/media/common/videobuf2/videobuf2-dma-contig.c?id=94794b5ce4d90ab134b0b101a02fddf6e74c437d#n426 > > -- > Best regards, > Mikhail Rudenko >
diff --git a/drivers/media/platform/rockchip/rkisp1/rkisp1-capture.c b/drivers/media/platform/rockchip/rkisp1/rkisp1-capture.c index 6dcefd144d5abe358323e37ac6133c6134ac636e..c94f7d1d73a92646457a27da20726ec6f92e7717 100644 --- a/drivers/media/platform/rockchip/rkisp1/rkisp1-capture.c +++ b/drivers/media/platform/rockchip/rkisp1/rkisp1-capture.c @@ -1563,6 +1563,7 @@ static int rkisp1_register_capture(struct rkisp1_capture *cap) q->timestamp_flags = V4L2_BUF_FLAG_TIMESTAMP_MONOTONIC; q->lock = &node->vlock; q->dev = cap->rkisp1->dev; + q->allow_cache_hints = 1; ret = vb2_queue_init(q); if (ret) { dev_err(cap->rkisp1->dev,
Currently, the rkisp1 driver always uses coherent DMA allocations for video capture buffers. However, on some platforms, using non-coherent buffers can improve performance, especially when CPU processing of MMAP'ed video buffers is required. For example, on the Rockchip RK3399 running at maximum CPU frequency, the time to memcpy a frame from a 1280x720 XRGB32 MMAP'ed buffer to a malloc'ed userspace buffer decreases from 7.7 ms to 1.1 ms when using non-coherent DMA allocation. CPU usage also decreases accordingly. This change allows userspace to request the allocation of non-coherent buffers. Note that the behavior for existing users will remain unchanged unless they explicitly set the V4L2_MEMORY_FLAG_NON_COHERENT flag when allocating buffers. Signed-off-by: Mikhail Rudenko <mike.rudenko@gmail.com> --- drivers/media/platform/rockchip/rkisp1/rkisp1-capture.c | 1 + 1 file changed, 1 insertion(+) --- base-commit: 40ed9e9b2808beeb835bd0ed971fb364c285d39c change-id: 20241231-b4-rkisp-noncoherent-ad6e7c7a68ba Best regards,