diff mbox series

media: rkisp1: allow non-coherent video capture buffers

Message ID 20250102-b4-rkisp-noncoherent-v1-1-bba164f7132c@gmail.com (mailing list archive)
State New
Headers show
Series media: rkisp1: allow non-coherent video capture buffers | expand

Commit Message

Mikhail Rudenko Jan. 2, 2025, 3:35 p.m. UTC
Currently, the rkisp1 driver always uses coherent DMA allocations for
video capture buffers. However, on some platforms, using non-coherent
buffers can improve performance, especially when CPU processing of
MMAP'ed video buffers is required.

For example, on the Rockchip RK3399 running at maximum CPU frequency,
the time to memcpy a frame from a 1280x720 XRGB32 MMAP'ed buffer to a
malloc'ed userspace buffer decreases from 7.7 ms to 1.1 ms when using
non-coherent DMA allocation. CPU usage also decreases accordingly.

This change allows userspace to request the allocation of non-coherent
buffers. Note that the behavior for existing users will remain unchanged
unless they explicitly set the V4L2_MEMORY_FLAG_NON_COHERENT flag when
allocating buffers.

Signed-off-by: Mikhail Rudenko <mike.rudenko@gmail.com>
---
 drivers/media/platform/rockchip/rkisp1/rkisp1-capture.c | 1 +
 1 file changed, 1 insertion(+)


---
base-commit: 40ed9e9b2808beeb835bd0ed971fb364c285d39c
change-id: 20241231-b4-rkisp-noncoherent-ad6e7c7a68ba

Best regards,

Comments

Laurent Pinchart Jan. 3, 2025, 3:23 p.m. UTC | #1
On Thu, Jan 02, 2025 at 06:35:00PM +0300, Mikhail Rudenko wrote:
> Currently, the rkisp1 driver always uses coherent DMA allocations for
> video capture buffers. However, on some platforms, using non-coherent
> buffers can improve performance, especially when CPU processing of
> MMAP'ed video buffers is required.
> 
> For example, on the Rockchip RK3399 running at maximum CPU frequency,
> the time to memcpy a frame from a 1280x720 XRGB32 MMAP'ed buffer to a
> malloc'ed userspace buffer decreases from 7.7 ms to 1.1 ms when using
> non-coherent DMA allocation. CPU usage also decreases accordingly.

What's the time taken by the cache management operations ?

> This change allows userspace to request the allocation of non-coherent
> buffers. Note that the behavior for existing users will remain unchanged
> unless they explicitly set the V4L2_MEMORY_FLAG_NON_COHERENT flag when
> allocating buffers.
> 
> Signed-off-by: Mikhail Rudenko <mike.rudenko@gmail.com>
> ---
>  drivers/media/platform/rockchip/rkisp1/rkisp1-capture.c | 1 +
>  1 file changed, 1 insertion(+)
> 
> diff --git a/drivers/media/platform/rockchip/rkisp1/rkisp1-capture.c b/drivers/media/platform/rockchip/rkisp1/rkisp1-capture.c
> index 6dcefd144d5abe358323e37ac6133c6134ac636e..c94f7d1d73a92646457a27da20726ec6f92e7717 100644
> --- a/drivers/media/platform/rockchip/rkisp1/rkisp1-capture.c
> +++ b/drivers/media/platform/rockchip/rkisp1/rkisp1-capture.c
> @@ -1563,6 +1563,7 @@ static int rkisp1_register_capture(struct rkisp1_capture *cap)
>  	q->timestamp_flags = V4L2_BUF_FLAG_TIMESTAMP_MONOTONIC;
>  	q->lock = &node->vlock;
>  	q->dev = cap->rkisp1->dev;
> +	q->allow_cache_hints = 1;
>  	ret = vb2_queue_init(q);
>  	if (ret) {
>  		dev_err(cap->rkisp1->dev,
> 
> ---
> base-commit: 40ed9e9b2808beeb835bd0ed971fb364c285d39c
> change-id: 20241231-b4-rkisp-noncoherent-ad6e7c7a68ba
Mikhail Rudenko Jan. 14, 2025, 4 p.m. UTC | #2
Hi Laurent,

On 2025-01-03 at 17:23 +02, Laurent Pinchart <laurent.pinchart@ideasonboard.com> wrote:

> On Thu, Jan 02, 2025 at 06:35:00PM +0300, Mikhail Rudenko wrote:
>> Currently, the rkisp1 driver always uses coherent DMA allocations for
>> video capture buffers. However, on some platforms, using non-coherent
>> buffers can improve performance, especially when CPU processing of
>> MMAP'ed video buffers is required.
>>
>> For example, on the Rockchip RK3399 running at maximum CPU frequency,
>> the time to memcpy a frame from a 1280x720 XRGB32 MMAP'ed buffer to a
>> malloc'ed userspace buffer decreases from 7.7 ms to 1.1 ms when using
>> non-coherent DMA allocation. CPU usage also decreases accordingly.
>
> What's the time taken by the cache management operations ?

Sorry for the late reply, your question turned out a little more
interesting than I expected initially. :)

When capturing using Yavta with MMAP buffers under the conditions mentioned
in the commit message, ftrace gives 437.6 +- 1.1 us for
dma_sync_sgtable_for_cpu and 409 +- 14 us for
dma_sync_sgtable_for_device. Thus, it looks like using non-coherent
buffers in this case is more CPU-efficient even when considering cache
management overhead.

When trying to do the same measurements with libcamera, I failed. In a
typical libcamera use case when MMAP buffers are allocated from a
device, exported as dmabufs and then used for capture on the same device
with DMABUF memory type, cache management in kernel is skipped [1]
[2]. Also, vb2_dc_dmabuf_ops_{begin,end}_cpu_access are no-ops [3], so
DMA_BUF_IOCTL_SYNC from userspace does not work either.

So it looks like to make this change really useful, the above issue of
cache management for libcamera/DMABUF/videobuf2-dma-contig has to be
solved. I'm not an expert in this area, so any advice is kindly welcome. :)

[1] https://git.linuxtv.org/media.git/tree/drivers/media/common/videobuf2/videobuf2-core.c?id=94794b5ce4d90ab134b0b101a02fddf6e74c437d#n411
[2] https://git.linuxtv.org/media.git/tree/drivers/media/common/videobuf2/videobuf2-core.c?id=94794b5ce4d90ab134b0b101a02fddf6e74c437d#n829
[3] https://git.linuxtv.org/media.git/tree/drivers/media/common/videobuf2/videobuf2-dma-contig.c?id=94794b5ce4d90ab134b0b101a02fddf6e74c437d#n426

--
Best regards,
Mikhail Rudenko
Tomasz Figa Jan. 15, 2025, 8:31 a.m. UTC | #3
Hi Mikhail and Laurent,

On Wed, Jan 15, 2025 at 2:07 AM Mikhail Rudenko <mike.rudenko@gmail.com> wrote:
>
>
> Hi Laurent,
>
> On 2025-01-03 at 17:23 +02, Laurent Pinchart <laurent.pinchart@ideasonboard.com> wrote:
>
> > On Thu, Jan 02, 2025 at 06:35:00PM +0300, Mikhail Rudenko wrote:
> >> Currently, the rkisp1 driver always uses coherent DMA allocations for
> >> video capture buffers. However, on some platforms, using non-coherent
> >> buffers can improve performance, especially when CPU processing of
> >> MMAP'ed video buffers is required.
> >>
> >> For example, on the Rockchip RK3399 running at maximum CPU frequency,
> >> the time to memcpy a frame from a 1280x720 XRGB32 MMAP'ed buffer to a
> >> malloc'ed userspace buffer decreases from 7.7 ms to 1.1 ms when using
> >> non-coherent DMA allocation. CPU usage also decreases accordingly.
> >
> > What's the time taken by the cache management operations ?
>
> Sorry for the late reply, your question turned out a little more
> interesting than I expected initially. :)
>
> When capturing using Yavta with MMAP buffers under the conditions mentioned
> in the commit message, ftrace gives 437.6 +- 1.1 us for
> dma_sync_sgtable_for_cpu and 409 +- 14 us for
> dma_sync_sgtable_for_device. Thus, it looks like using non-coherent
> buffers in this case is more CPU-efficient even when considering cache
> management overhead.
>
> When trying to do the same measurements with libcamera, I failed. In a
> typical libcamera use case when MMAP buffers are allocated from a
> device, exported as dmabufs and then used for capture on the same device
> with DMABUF memory type, cache management in kernel is skipped [1]
> [2]. Also, vb2_dc_dmabuf_ops_{begin,end}_cpu_access are no-ops [3], so
> DMA_BUF_IOCTL_SYNC from userspace does not work either.

Oops, so I believe this is a bug. When an MMAP buffer is allocated in
the non-coherent mode, those ops should perform proper cache
maintenance.

Let me send a patch to fix this in a couple of days unless someone
does it earlier.

Best regards,
Tomasz

>
> So it looks like to make this change really useful, the above issue of
> cache management for libcamera/DMABUF/videobuf2-dma-contig has to be
> solved. I'm not an expert in this area, so any advice is kindly welcome. :)
>
> [1] https://git.linuxtv.org/media.git/tree/drivers/media/common/videobuf2/videobuf2-core.c?id=94794b5ce4d90ab134b0b101a02fddf6e74c437d#n411
> [2] https://git.linuxtv.org/media.git/tree/drivers/media/common/videobuf2/videobuf2-core.c?id=94794b5ce4d90ab134b0b101a02fddf6e74c437d#n829
> [3] https://git.linuxtv.org/media.git/tree/drivers/media/common/videobuf2/videobuf2-dma-contig.c?id=94794b5ce4d90ab134b0b101a02fddf6e74c437d#n426
>
> --
> Best regards,
> Mikhail Rudenko
>
Mikhail Rudenko Jan. 15, 2025, 1:24 p.m. UTC | #4
Hi Tomasz,

On 2025-01-15 at 17:31 +09, Tomasz Figa <tfiga@chromium.org> wrote:

> Hi Mikhail and Laurent,
>
> On Wed, Jan 15, 2025 at 2:07 AM Mikhail Rudenko <mike.rudenko@gmail.com> wrote:
>>
>>
>> Hi Laurent,
>>
>> On 2025-01-03 at 17:23 +02, Laurent Pinchart <laurent.pinchart@ideasonboard.com> wrote:
>>
>> > On Thu, Jan 02, 2025 at 06:35:00PM +0300, Mikhail Rudenko wrote:
>> >> Currently, the rkisp1 driver always uses coherent DMA allocations for
>> >> video capture buffers. However, on some platforms, using non-coherent
>> >> buffers can improve performance, especially when CPU processing of
>> >> MMAP'ed video buffers is required.
>> >>
>> >> For example, on the Rockchip RK3399 running at maximum CPU frequency,
>> >> the time to memcpy a frame from a 1280x720 XRGB32 MMAP'ed buffer to a
>> >> malloc'ed userspace buffer decreases from 7.7 ms to 1.1 ms when using
>> >> non-coherent DMA allocation. CPU usage also decreases accordingly.
>> >
>> > What's the time taken by the cache management operations ?
>>
>> Sorry for the late reply, your question turned out a little more
>> interesting than I expected initially. :)
>>
>> When capturing using Yavta with MMAP buffers under the conditions mentioned
>> in the commit message, ftrace gives 437.6 +- 1.1 us for
>> dma_sync_sgtable_for_cpu and 409 +- 14 us for
>> dma_sync_sgtable_for_device. Thus, it looks like using non-coherent
>> buffers in this case is more CPU-efficient even when considering cache
>> management overhead.
>>
>> When trying to do the same measurements with libcamera, I failed. In a
>> typical libcamera use case when MMAP buffers are allocated from a
>> device, exported as dmabufs and then used for capture on the same device
>> with DMABUF memory type, cache management in kernel is skipped [1]
>> [2]. Also, vb2_dc_dmabuf_ops_{begin,end}_cpu_access are no-ops [3], so
>> DMA_BUF_IOCTL_SYNC from userspace does not work either.
>
> Oops, so I believe this is a bug. When an MMAP buffer is allocated in
> the non-coherent mode, those ops should perform proper cache
> maintenance.

Thanks for pointing this out!

> Let me send a patch to fix this in a couple of days unless someone
> does it earlier.

Now that we know that this is a bug, not an API misuse from my side, I
can fix this myself and send a v2. Would this be okay for you?

> Best regards,
> Tomasz
>
>>
>> So it looks like to make this change really useful, the above issue of
>> cache management for libcamera/DMABUF/videobuf2-dma-contig has to be
>> solved. I'm not an expert in this area, so any advice is kindly welcome. :)
>>
>> [1] https://git.linuxtv.org/media.git/tree/drivers/media/common/videobuf2/videobuf2-core.c?id=94794b5ce4d90ab134b0b101a02fddf6e74c437d#n411
>> [2] https://git.linuxtv.org/media.git/tree/drivers/media/common/videobuf2/videobuf2-core.c?id=94794b5ce4d90ab134b0b101a02fddf6e74c437d#n829
>> [3] https://git.linuxtv.org/media.git/tree/drivers/media/common/videobuf2/videobuf2-dma-contig.c?id=94794b5ce4d90ab134b0b101a02fddf6e74c437d#n426
>>
>> --
>> Best regards,
>> Mikhail Rudenko
>>


--
Best regards,
Mikhail Rudenko
Tomasz Figa Jan. 15, 2025, 2:46 p.m. UTC | #5
On Wed, Jan 15, 2025 at 10:30 PM Mikhail Rudenko <mike.rudenko@gmail.com> wrote:
>
> Hi Tomasz,
>
> On 2025-01-15 at 17:31 +09, Tomasz Figa <tfiga@chromium.org> wrote:
>
> > Hi Mikhail and Laurent,
> >
> > On Wed, Jan 15, 2025 at 2:07 AM Mikhail Rudenko <mike.rudenko@gmail.com> wrote:
> >>
> >>
> >> Hi Laurent,
> >>
> >> On 2025-01-03 at 17:23 +02, Laurent Pinchart <laurent.pinchart@ideasonboard.com> wrote:
> >>
> >> > On Thu, Jan 02, 2025 at 06:35:00PM +0300, Mikhail Rudenko wrote:
> >> >> Currently, the rkisp1 driver always uses coherent DMA allocations for
> >> >> video capture buffers. However, on some platforms, using non-coherent
> >> >> buffers can improve performance, especially when CPU processing of
> >> >> MMAP'ed video buffers is required.
> >> >>
> >> >> For example, on the Rockchip RK3399 running at maximum CPU frequency,
> >> >> the time to memcpy a frame from a 1280x720 XRGB32 MMAP'ed buffer to a
> >> >> malloc'ed userspace buffer decreases from 7.7 ms to 1.1 ms when using
> >> >> non-coherent DMA allocation. CPU usage also decreases accordingly.
> >> >
> >> > What's the time taken by the cache management operations ?
> >>
> >> Sorry for the late reply, your question turned out a little more
> >> interesting than I expected initially. :)
> >>
> >> When capturing using Yavta with MMAP buffers under the conditions mentioned
> >> in the commit message, ftrace gives 437.6 +- 1.1 us for
> >> dma_sync_sgtable_for_cpu and 409 +- 14 us for
> >> dma_sync_sgtable_for_device. Thus, it looks like using non-coherent
> >> buffers in this case is more CPU-efficient even when considering cache
> >> management overhead.
> >>
> >> When trying to do the same measurements with libcamera, I failed. In a
> >> typical libcamera use case when MMAP buffers are allocated from a
> >> device, exported as dmabufs and then used for capture on the same device
> >> with DMABUF memory type, cache management in kernel is skipped [1]
> >> [2]. Also, vb2_dc_dmabuf_ops_{begin,end}_cpu_access are no-ops [3], so
> >> DMA_BUF_IOCTL_SYNC from userspace does not work either.
> >
> > Oops, so I believe this is a bug. When an MMAP buffer is allocated in
> > the non-coherent mode, those ops should perform proper cache
> > maintenance.
>
> Thanks for pointing this out!
>
> > Let me send a patch to fix this in a couple of days unless someone
> > does it earlier.
>
> Now that we know that this is a bug, not an API misuse from my side, I
> can fix this myself and send a v2. Would this be okay for you?

I'd be more than happy :)

>
> > Best regards,
> > Tomasz
> >
> >>
> >> So it looks like to make this change really useful, the above issue of
> >> cache management for libcamera/DMABUF/videobuf2-dma-contig has to be
> >> solved. I'm not an expert in this area, so any advice is kindly welcome. :)
> >>
> >> [1] https://git.linuxtv.org/media.git/tree/drivers/media/common/videobuf2/videobuf2-core.c?id=94794b5ce4d90ab134b0b101a02fddf6e74c437d#n411
> >> [2] https://git.linuxtv.org/media.git/tree/drivers/media/common/videobuf2/videobuf2-core.c?id=94794b5ce4d90ab134b0b101a02fddf6e74c437d#n829
> >> [3] https://git.linuxtv.org/media.git/tree/drivers/media/common/videobuf2/videobuf2-dma-contig.c?id=94794b5ce4d90ab134b0b101a02fddf6e74c437d#n426
> >>
> >> --
> >> Best regards,
> >> Mikhail Rudenko
> >>
>
>
> --
> Best regards,
> Mikhail Rudenko
Mikhail Rudenko Jan. 15, 2025, 5:29 p.m. UTC | #6
On 2025-01-15 at 23:46 +09, Tomasz Figa <tfiga@chromium.org> wrote:

> On Wed, Jan 15, 2025 at 10:30 PM Mikhail Rudenko <mike.rudenko@gmail.com> wrote:
>>
>> Hi Tomasz,
>>
>> On 2025-01-15 at 17:31 +09, Tomasz Figa <tfiga@chromium.org> wrote:
>>
>> > Hi Mikhail and Laurent,
>> >
>> > On Wed, Jan 15, 2025 at 2:07 AM Mikhail Rudenko <mike.rudenko@gmail.com> wrote:
>> >>
>> >>
>> >> Hi Laurent,
>> >>
>> >> On 2025-01-03 at 17:23 +02, Laurent Pinchart <laurent.pinchart@ideasonboard.com> wrote:
>> >>
>> >> > On Thu, Jan 02, 2025 at 06:35:00PM +0300, Mikhail Rudenko wrote:
>> >> >> Currently, the rkisp1 driver always uses coherent DMA allocations for
>> >> >> video capture buffers. However, on some platforms, using non-coherent
>> >> >> buffers can improve performance, especially when CPU processing of
>> >> >> MMAP'ed video buffers is required.
>> >> >>
>> >> >> For example, on the Rockchip RK3399 running at maximum CPU frequency,
>> >> >> the time to memcpy a frame from a 1280x720 XRGB32 MMAP'ed buffer to a
>> >> >> malloc'ed userspace buffer decreases from 7.7 ms to 1.1 ms when using
>> >> >> non-coherent DMA allocation. CPU usage also decreases accordingly.
>> >> >
>> >> > What's the time taken by the cache management operations ?
>> >>
>> >> Sorry for the late reply, your question turned out a little more
>> >> interesting than I expected initially. :)
>> >>
>> >> When capturing using Yavta with MMAP buffers under the conditions mentioned
>> >> in the commit message, ftrace gives 437.6 +- 1.1 us for
>> >> dma_sync_sgtable_for_cpu and 409 +- 14 us for
>> >> dma_sync_sgtable_for_device. Thus, it looks like using non-coherent
>> >> buffers in this case is more CPU-efficient even when considering cache
>> >> management overhead.
>> >>
>> >> When trying to do the same measurements with libcamera, I failed. In a
>> >> typical libcamera use case when MMAP buffers are allocated from a
>> >> device, exported as dmabufs and then used for capture on the same device
>> >> with DMABUF memory type, cache management in kernel is skipped [1]
>> >> [2]. Also, vb2_dc_dmabuf_ops_{begin,end}_cpu_access are no-ops [3], so
>> >> DMA_BUF_IOCTL_SYNC from userspace does not work either.
>> >
>> > Oops, so I believe this is a bug. When an MMAP buffer is allocated in
>> > the non-coherent mode, those ops should perform proper cache
>> > maintenance.
>>
>> Thanks for pointing this out!
>>
>> > Let me send a patch to fix this in a couple of days unless someone
>> > does it earlier.
>>
>> Now that we know that this is a bug, not an API misuse from my side, I
>> can fix this myself and send a v2. Would this be okay for you?
>
> I'd be more than happy :)

Done, see [1]. A review would be appreciated. :)

[1] https://lore.kernel.org/all/20250115-b4-rkisp-noncoherent-v2-0-0853e1a24012@gmail.com/


--
Best regards,
Mikhail Rudenko
Nicolas Dufresne Jan. 15, 2025, 7:13 p.m. UTC | #7
Le mardi 14 janvier 2025 à 19:00 +0300, Mikhail Rudenko a écrit :
> Hi Laurent,
> 
> On 2025-01-03 at 17:23 +02, Laurent Pinchart <laurent.pinchart@ideasonboard.com> wrote:
> 
> > On Thu, Jan 02, 2025 at 06:35:00PM +0300, Mikhail Rudenko wrote:
> > > Currently, the rkisp1 driver always uses coherent DMA allocations for
> > > video capture buffers. However, on some platforms, using non-coherent
> > > buffers can improve performance, especially when CPU processing of
> > > MMAP'ed video buffers is required.
> > > 
> > > For example, on the Rockchip RK3399 running at maximum CPU frequency,
> > > the time to memcpy a frame from a 1280x720 XRGB32 MMAP'ed buffer to a
> > > malloc'ed userspace buffer decreases from 7.7 ms to 1.1 ms when using
> > > non-coherent DMA allocation. CPU usage also decreases accordingly.
> > 
> > What's the time taken by the cache management operations ?
> 
> Sorry for the late reply, your question turned out a little more
> interesting than I expected initially. :)
> 
> When capturing using Yavta with MMAP buffers under the conditions mentioned
> in the commit message, ftrace gives 437.6 +- 1.1 us for
> dma_sync_sgtable_for_cpu and 409 +- 14 us for
> dma_sync_sgtable_for_device. Thus, it looks like using non-coherent
> buffers in this case is more CPU-efficient even when considering cache
> management overhead.
> 
> When trying to do the same measurements with libcamera, I failed. In a
> typical libcamera use case when MMAP buffers are allocated from a
> device, exported as dmabufs and then used for capture on the same device
> with DMABUF memory type, cache management in kernel is skipped [1]
> [2]. Also, vb2_dc_dmabuf_ops_{begin,end}_cpu_access are no-ops [3], so
> DMA_BUF_IOCTL_SYNC from userspace does not work either.
> 
> So it looks like to make this change really useful, the above issue of
> cache management for libcamera/DMABUF/videobuf2-dma-contig has to be
> solved. I'm not an expert in this area, so any advice is kindly welcome. :)


The manual coherency hints are not implemented for DMABuf, and libcamera only do
dmabuf. Someone will have to look into that. This is also why this API have very
low adoption, its breaks easily.

> 
> [1] https://git.linuxtv.org/media.git/tree/drivers/media/common/videobuf2/videobuf2-core.c?id=94794b5ce4d90ab134b0b101a02fddf6e74c437d#n411
> [2] https://git.linuxtv.org/media.git/tree/drivers/media/common/videobuf2/videobuf2-core.c?id=94794b5ce4d90ab134b0b101a02fddf6e74c437d#n829
> [3] https://git.linuxtv.org/media.git/tree/drivers/media/common/videobuf2/videobuf2-dma-contig.c?id=94794b5ce4d90ab134b0b101a02fddf6e74c437d#n426
> 
> --
> Best regards,
> Mikhail Rudenko
>
diff mbox series

Patch

diff --git a/drivers/media/platform/rockchip/rkisp1/rkisp1-capture.c b/drivers/media/platform/rockchip/rkisp1/rkisp1-capture.c
index 6dcefd144d5abe358323e37ac6133c6134ac636e..c94f7d1d73a92646457a27da20726ec6f92e7717 100644
--- a/drivers/media/platform/rockchip/rkisp1/rkisp1-capture.c
+++ b/drivers/media/platform/rockchip/rkisp1/rkisp1-capture.c
@@ -1563,6 +1563,7 @@  static int rkisp1_register_capture(struct rkisp1_capture *cap)
 	q->timestamp_flags = V4L2_BUF_FLAG_TIMESTAMP_MONOTONIC;
 	q->lock = &node->vlock;
 	q->dev = cap->rkisp1->dev;
+	q->allow_cache_hints = 1;
 	ret = vb2_queue_init(q);
 	if (ret) {
 		dev_err(cap->rkisp1->dev,