[7/7] media: cedrus: Improve H264 memory efficiency

Message ID	20190530211516.1891-8-jernej.skrabec@siol.net (mailing list archive)
State	New, archived
Headers	show Return-Path: <linux-arm-kernel-bounces+patchwork-linux-arm=patchwork.kernel.org@lists.infradead.org> sender: 031275009) by mail.siol.net (Postfix) with ESMTPSA id C8CFD522867; Thu, 30 May 2019 23:15:44 +0200 (CEST) From: Jernej Skrabec <jernej.skrabec@siol.net> To: paul.kocialkowski@bootlin.com, maxime.ripard@bootlin.com Subject: [PATCH 7/7] media: cedrus: Improve H264 memory efficiency Date: Thu, 30 May 2019 23:15:16 +0200 Message-Id: <20190530211516.1891-8-jernej.skrabec@siol.net> In-Reply-To: <20190530211516.1891-1-jernej.skrabec@siol.net> References: <20190530211516.1891-1-jernej.skrabec@siol.net> MIME-Version: 1.0 Precedence: list Cc: devel@driverdev.osuosl.org, gregkh@linuxfoundation.org, linux-kernel@vger.kernel.org, wens@csie.org, mchehab@kernel.org, linux-arm-kernel@lists.infradead.org, linux-media@vger.kernel.org Content-Type: text/plain; charset="us-ascii" Content-Transfer-Encoding: 7bit Sender: "linux-arm-kernel" <linux-arm-kernel-bounces@lists.infradead.org> Errors-To: linux-arm-kernel-bounces+patchwork-linux-arm=patchwork.kernel.org@lists.infradead.org
Series	media: cedrus: Improvements/cleanup \| expand [0/7] media: cedrus: Improvements/cleanup [1/7] media: cedrus: Disable engine after each slice decoding [2/7] media: cedrus: Fix H264 default reference index count [3/7] media: cedrus: Fix decoding for some H264 videos [4/7] media: cedrus: Remove dst_bufs from context [5/7] media: cedrus: Don't set chroma size for scale & rotation [6/7] media: cedrus: Add infra for extra buffers connected to capture buffers [7/7] media: cedrus: Improve H264 memory efficiency

Message ID

20190530211516.1891-8-jernej.skrabec@siol.net (mailing list archive)

State

New, archived

Headers

From: Jernej Skrabec <jernej.skrabec@siol.net>
To: paul.kocialkowski@bootlin.com,
	maxime.ripard@bootlin.com
Subject: [PATCH 7/7] media: cedrus: Improve H264 memory efficiency
Date: Thu, 30 May 2019 23:15:16 +0200
Message-Id: <20190530211516.1891-8-jernej.skrabec@siol.net>
In-Reply-To: <20190530211516.1891-1-jernej.skrabec@siol.net>
References: <20190530211516.1891-1-jernej.skrabec@siol.net>
MIME-Version: 1.0
Precedence: list
Cc: devel@driverdev.osuosl.org, gregkh@linuxfoundation.org,
 linux-kernel@vger.kernel.org, wens@csie.org, mchehab@kernel.org,
 linux-arm-kernel@lists.infradead.org, linux-media@vger.kernel.org
Content-Type: text/plain; charset="us-ascii"
Content-Transfer-Encoding: 7bit
Sender: "linux-arm-kernel" <linux-arm-kernel-bounces@lists.infradead.org>
Errors-To: 
 linux-arm-kernel-bounces+patchwork-linux-arm=patchwork.kernel.org@lists.infradead.org

Series

media: cedrus: Improvements/cleanup | expand

Commit Message

Jernej Škrabec May 30, 2019, 9:15 p.m. UTC

H264 decoder driver preallocated pretty big worst case mv col buffer
pool. It turns out that pool is most of the time much bigger than it
needs to be.

Solution implemented here is to allocate memory only if capture buffer
is actually used and only as much as it is really necessary.

This is also preparation for 4K video decoding support, which will be
implemented later.

Signed-off-by: Jernej Skrabec <jernej.skrabec@siol.net>
---
 drivers/staging/media/sunxi/cedrus/cedrus.h   |  4 -
 .../staging/media/sunxi/cedrus/cedrus_h264.c  | 81 +++++++------------
 2 files changed, 28 insertions(+), 57 deletions(-)

Comments

Maxime Ripard June 3, 2019, 12:23 p.m. UTC | #1

On Thu, May 30, 2019 at 11:15:16PM +0200, Jernej Skrabec wrote:
> H264 decoder driver preallocated pretty big worst case mv col buffer
> pool. It turns out that pool is most of the time much bigger than it
> needs to be.
>
> Solution implemented here is to allocate memory only if capture buffer
> is actually used and only as much as it is really necessary.
>
> This is also preparation for 4K video decoding support, which will be
> implemented later.

What is it doing exactly to prepare for 4k?

> Signed-off-by: Jernej Skrabec <jernej.skrabec@siol.net>
> ---
>  drivers/staging/media/sunxi/cedrus/cedrus.h   |  4 -
>  .../staging/media/sunxi/cedrus/cedrus_h264.c  | 81 +++++++------------
>  2 files changed, 28 insertions(+), 57 deletions(-)
>
> diff --git a/drivers/staging/media/sunxi/cedrus/cedrus.h b/drivers/staging/media/sunxi/cedrus/cedrus.h
> index 16c1bdfd243a..fcbbbef65494 100644
> --- a/drivers/staging/media/sunxi/cedrus/cedrus.h
> +++ b/drivers/staging/media/sunxi/cedrus/cedrus.h
> @@ -106,10 +106,6 @@ struct cedrus_ctx {
>
>  	union {
>  		struct {
> -			void		*mv_col_buf;
> -			dma_addr_t	mv_col_buf_dma;
> -			ssize_t		mv_col_buf_field_size;
> -			ssize_t		mv_col_buf_size;
>  			void		*pic_info_buf;
>  			dma_addr_t	pic_info_buf_dma;
>  			void		*neighbor_info_buf;
> diff --git a/drivers/staging/media/sunxi/cedrus/cedrus_h264.c b/drivers/staging/media/sunxi/cedrus/cedrus_h264.c
> index b2290f98d81a..758fd0049e8f 100644
> --- a/drivers/staging/media/sunxi/cedrus/cedrus_h264.c
> +++ b/drivers/staging/media/sunxi/cedrus/cedrus_h264.c
> @@ -54,17 +54,14 @@ static void cedrus_h264_write_sram(struct cedrus_dev *dev,
>  		cedrus_write(dev, VE_AVC_SRAM_PORT_DATA, *buffer++);
>  }
>
> -static dma_addr_t cedrus_h264_mv_col_buf_addr(struct cedrus_ctx *ctx,
> -					      unsigned int position,
> +static dma_addr_t cedrus_h264_mv_col_buf_addr(struct cedrus_buffer *buf,
>  					      unsigned int field)
>  {
> -	dma_addr_t addr = ctx->codec.h264.mv_col_buf_dma;
> -
> -	/* Adjust for the position */
> -	addr += position * ctx->codec.h264.mv_col_buf_field_size * 2;
> +	dma_addr_t addr = buf->extra_buf_dma;
>
>  	/* Adjust for the field */
> -	addr += field * ctx->codec.h264.mv_col_buf_field_size;
> +	if (field)
> +		addr += buf->extra_buf_size / 2;
>
>  	return addr;
>  }
> @@ -76,7 +73,6 @@ static void cedrus_fill_ref_pic(struct cedrus_ctx *ctx,
>  				struct cedrus_h264_sram_ref_pic *pic)
>  {
>  	struct vb2_buffer *vbuf = &buf->m2m_buf.vb.vb2_buf;
> -	unsigned int position = buf->codec.h264.position;
>
>  	pic->top_field_order_cnt = cpu_to_le32(top_field_order_cnt);
>  	pic->bottom_field_order_cnt = cpu_to_le32(bottom_field_order_cnt);
> @@ -84,10 +80,8 @@ static void cedrus_fill_ref_pic(struct cedrus_ctx *ctx,
>
>  	pic->luma_ptr = cpu_to_le32(cedrus_buf_addr(vbuf, &ctx->dst_fmt, 0));
>  	pic->chroma_ptr = cpu_to_le32(cedrus_buf_addr(vbuf, &ctx->dst_fmt, 1));
> -	pic->mv_col_top_ptr =
> -		cpu_to_le32(cedrus_h264_mv_col_buf_addr(ctx, position, 0));
> -	pic->mv_col_bot_ptr =
> -		cpu_to_le32(cedrus_h264_mv_col_buf_addr(ctx, position, 1));
> +	pic->mv_col_top_ptr = cpu_to_le32(cedrus_h264_mv_col_buf_addr(buf, 0));
> +	pic->mv_col_bot_ptr = cpu_to_le32(cedrus_h264_mv_col_buf_addr(buf, 1));
>  }
>
>  static void cedrus_write_frame_list(struct cedrus_ctx *ctx,
> @@ -142,6 +136,28 @@ static void cedrus_write_frame_list(struct cedrus_ctx *ctx,
>  	output_buf = vb2_to_cedrus_buffer(&run->dst->vb2_buf);
>  	output_buf->codec.h264.position = position;
>
> +	if (!output_buf->extra_buf_size) {
> +		const struct v4l2_ctrl_h264_sps *sps = run->h264.sps;
> +		unsigned int field_size;
> +
> +		field_size = DIV_ROUND_UP(ctx->src_fmt.width, 16) *
> +			DIV_ROUND_UP(ctx->src_fmt.height, 16) * 16;
> +		if (!(sps->flags & V4L2_H264_SPS_FLAG_DIRECT_8X8_INFERENCE))
> +			field_size = field_size * 2;
> +		if (!(sps->flags & V4L2_H264_SPS_FLAG_FRAME_MBS_ONLY))
> +			field_size = field_size * 2;
> +
> +		output_buf->extra_buf_size = field_size * 2;
> +		output_buf->extra_buf =
> +			dma_alloc_coherent(dev->dev,
> +					   output_buf->extra_buf_size,
> +					   &output_buf->extra_buf_dma,
> +					   GFP_KERNEL);
> +
> +		if (!output_buf->extra_buf)
> +			output_buf->extra_buf_size = 0;
> +	}
> +

That also means that instead of allocating that buffer exactly once,
you now allocate it for each output buffer?

I guess that it will cleaned up by your previous patch at
buffer_cleanup time, so after it's no longer a reference frame?

What is the average memory usage before, and after that change during
a playback, and what is the runtime cost of doing it multiple times
instead of once?

Maxime

--
Maxime Ripard, Bootlin
Embedded Linux and Kernel engineering
https://bootlin.com

Jernej Škrabec June 3, 2019, 4:37 p.m. UTC | #2

Dne ponedeljek, 03. junij 2019 ob 14:23:28 CEST je Maxime Ripard napisal(a):
> On Thu, May 30, 2019 at 11:15:16PM +0200, Jernej Skrabec wrote:
> > H264 decoder driver preallocated pretty big worst case mv col buffer
> > pool. It turns out that pool is most of the time much bigger than it
> > needs to be.
> > 
> > Solution implemented here is to allocate memory only if capture buffer
> > is actually used and only as much as it is really necessary.
> > 
> > This is also preparation for 4K video decoding support, which will be
> > implemented later.
> 
> What is it doing exactly to prepare for 4k?

Well, with that change 4K videos can be actually watched with 256 MiB CMA 
pool, but I can drop this statement in next version.

I concentrated on 256 MiB CMA pool, because it's maximum memory size supported 
by older VPU versions, but I think they don't support 4K decoding. I don't 
have them, so I can't test that hypothesis.

> 
> > Signed-off-by: Jernej Skrabec <jernej.skrabec@siol.net>
> > ---
> > 
> >  drivers/staging/media/sunxi/cedrus/cedrus.h   |  4 -
> >  .../staging/media/sunxi/cedrus/cedrus_h264.c  | 81 +++++++------------
> >  2 files changed, 28 insertions(+), 57 deletions(-)
> > 
> > diff --git a/drivers/staging/media/sunxi/cedrus/cedrus.h
> > b/drivers/staging/media/sunxi/cedrus/cedrus.h index
> > 16c1bdfd243a..fcbbbef65494 100644
> > --- a/drivers/staging/media/sunxi/cedrus/cedrus.h
> > +++ b/drivers/staging/media/sunxi/cedrus/cedrus.h
> > @@ -106,10 +106,6 @@ struct cedrus_ctx {
> > 
> >  	union {
> >  	
> >  		struct {
> > 
> > -			void		*mv_col_buf;
> > -			dma_addr_t	mv_col_buf_dma;
> > -			ssize_t		mv_col_buf_field_size;
> > -			ssize_t		mv_col_buf_size;
> > 
> >  			void		*pic_info_buf;
> >  			dma_addr_t	pic_info_buf_dma;
> >  			void		*neighbor_info_buf;
> > 
> > diff --git a/drivers/staging/media/sunxi/cedrus/cedrus_h264.c
> > b/drivers/staging/media/sunxi/cedrus/cedrus_h264.c index
> > b2290f98d81a..758fd0049e8f 100644
> > --- a/drivers/staging/media/sunxi/cedrus/cedrus_h264.c
> > +++ b/drivers/staging/media/sunxi/cedrus/cedrus_h264.c
> > @@ -54,17 +54,14 @@ static void cedrus_h264_write_sram(struct cedrus_dev
> > *dev,> 
> >  		cedrus_write(dev, VE_AVC_SRAM_PORT_DATA, *buffer++);
> >  
> >  }
> > 
> > -static dma_addr_t cedrus_h264_mv_col_buf_addr(struct cedrus_ctx *ctx,
> > -					      unsigned int 
position,
> > +static dma_addr_t cedrus_h264_mv_col_buf_addr(struct cedrus_buffer *buf,
> > 
> >  					      unsigned int 
field)
> >  
> >  {
> > 
> > -	dma_addr_t addr = ctx->codec.h264.mv_col_buf_dma;
> > -
> > -	/* Adjust for the position */
> > -	addr += position * ctx->codec.h264.mv_col_buf_field_size * 2;
> > +	dma_addr_t addr = buf->extra_buf_dma;
> > 
> >  	/* Adjust for the field */
> > 
> > -	addr += field * ctx->codec.h264.mv_col_buf_field_size;
> > +	if (field)
> > +		addr += buf->extra_buf_size / 2;
> > 
> >  	return addr;
> >  
> >  }
> > 
> > @@ -76,7 +73,6 @@ static void cedrus_fill_ref_pic(struct cedrus_ctx *ctx,
> > 
> >  				struct cedrus_h264_sram_ref_pic 
*pic)
> >  
> >  {
> >  
> >  	struct vb2_buffer *vbuf = &buf->m2m_buf.vb.vb2_buf;
> > 
> > -	unsigned int position = buf->codec.h264.position;
> > 
> >  	pic->top_field_order_cnt = cpu_to_le32(top_field_order_cnt);
> >  	pic->bottom_field_order_cnt = cpu_to_le32(bottom_field_order_cnt);
> > 
> > @@ -84,10 +80,8 @@ static void cedrus_fill_ref_pic(struct cedrus_ctx *ctx,
> > 
> >  	pic->luma_ptr = cpu_to_le32(cedrus_buf_addr(vbuf, &ctx->dst_fmt, 
0));
> >  	pic->chroma_ptr = cpu_to_le32(cedrus_buf_addr(vbuf, &ctx->dst_fmt, 
1));
> > 
> > -	pic->mv_col_top_ptr =
> > -		cpu_to_le32(cedrus_h264_mv_col_buf_addr(ctx, position, 
0));
> > -	pic->mv_col_bot_ptr =
> > -		cpu_to_le32(cedrus_h264_mv_col_buf_addr(ctx, position, 
1));
> > +	pic->mv_col_top_ptr = cpu_to_le32(cedrus_h264_mv_col_buf_addr(buf, 
0));
> > +	pic->mv_col_bot_ptr = cpu_to_le32(cedrus_h264_mv_col_buf_addr(buf, 
1));
> > 
> >  }
> >  
> >  static void cedrus_write_frame_list(struct cedrus_ctx *ctx,
> > 
> > @@ -142,6 +136,28 @@ static void cedrus_write_frame_list(struct cedrus_ctx
> > *ctx,> 
> >  	output_buf = vb2_to_cedrus_buffer(&run->dst->vb2_buf);
> >  	output_buf->codec.h264.position = position;
> > 
> > +	if (!output_buf->extra_buf_size) {
> > +		const struct v4l2_ctrl_h264_sps *sps = run->h264.sps;
> > +		unsigned int field_size;
> > +
> > +		field_size = DIV_ROUND_UP(ctx->src_fmt.width, 16) *
> > +			DIV_ROUND_UP(ctx->src_fmt.height, 16) * 16;
> > +		if (!(sps->flags & 
V4L2_H264_SPS_FLAG_DIRECT_8X8_INFERENCE))
> > +			field_size = field_size * 2;
> > +		if (!(sps->flags & V4L2_H264_SPS_FLAG_FRAME_MBS_ONLY))
> > +			field_size = field_size * 2;
> > +
> > +		output_buf->extra_buf_size = field_size * 2;
> > +		output_buf->extra_buf =
> > +			dma_alloc_coherent(dev->dev,
> > +					   output_buf-
>extra_buf_size,
> > +					   &output_buf-
>extra_buf_dma,
> > +					   GFP_KERNEL);
> > +
> > +		if (!output_buf->extra_buf)
> > +			output_buf->extra_buf_size = 0;
> > +	}
> > +
> 
> That also means that instead of allocating that buffer exactly once,
> you now allocate it for each output buffer?

It's not completely the same. I'm allocating multiple times, yes, but much 
smaller chunks and only if needed.

Still, this slight overhead happens only when buffer is used for the first time. 
When buffer is reused, this MV buffer is also reused.

> 
> I guess that it will cleaned up by your previous patch at
> buffer_cleanup time, so after it's no longer a reference frame?

Yes, it will be deallocated in buffer_cleanup, but only after capture buffers 
are freed, which usually happens when device file descriptor is closed.

Buffers which holds reference frames are later reused, together with it's MV 
buffer, so there's no overhead.

> 
> What is the average memory usage before, and after that change during
> a playback, and what is the runtime cost of doing it multiple times
> instead of once?

As I already said, overhead is present only when buffer is used for the first 
time, which is not ideal, but allows to calculate minimal buffer size needed 
and even doesn't allocate anything if capture buffer is not used at all.

I didn't collect any exact numbers, but with this change I can play H264 and 
HEVC (with similar modification) 4K video samples with 256 MiB CMA pool. 
Without this change, it's not really possible. You can argue "but what if 4K 
video use 16 reference frames", then yes, only solution is to increase CMA 
pool, but why reserve extra memory which will never be used?

I've been using this optimization for past ~6 month with no issues noticed. If 
you feel better, I can change this to be a bit conservative and allocate MV 
buffer when buffer_init is called. This will consume a bit more memory as SPS is 
not available at that time (worst case buffer size estimation), but it still 
won't allocate MV buffers for unallocated capture frames.

Best regards,
Jernej

Paul Kocialkowski June 5, 2019, 9:12 p.m. UTC | #3

Hi,

Le lundi 03 juin 2019 à 18:37 +0200, Jernej Škrabec a écrit :
> Dne ponedeljek, 03. junij 2019 ob 14:23:28 CEST je Maxime Ripard napisal(a):
> > On Thu, May 30, 2019 at 11:15:16PM +0200, Jernej Skrabec wrote:
> > > H264 decoder driver preallocated pretty big worst case mv col buffer
> > > pool. It turns out that pool is most of the time much bigger than it
> > > needs to be.
> > > 
> > > Solution implemented here is to allocate memory only if capture buffer
> > > is actually used and only as much as it is really necessary.
> > > 
> > > This is also preparation for 4K video decoding support, which will be
> > > implemented later.
> > 
> > What is it doing exactly to prepare for 4k?
> 
> Well, with that change 4K videos can be actually watched with 256 MiB CMA 
> pool, but I can drop this statement in next version.
> 
> I concentrated on 256 MiB CMA pool, because it's maximum memory size supported 
> by older VPU versions, but I think they don't support 4K decoding. I don't 
> have them, so I can't test that hypothesis.

I think it's a fair goal to try and optimize the CMA pool usage, maybe
it should be presented as that and I guess it's fine to connect that to
4K decoding if you like :)

Either way, I think we should have per-codec framework callbacks for
these kinds of things.

Cheers,

Paul

> > > Signed-off-by: Jernej Skrabec <jernej.skrabec@siol.net>
> > > ---
> > > 
> > >  drivers/staging/media/sunxi/cedrus/cedrus.h   |  4 -
> > >  .../staging/media/sunxi/cedrus/cedrus_h264.c  | 81 +++++++------------
> > >  2 files changed, 28 insertions(+), 57 deletions(-)
> > > 
> > > diff --git a/drivers/staging/media/sunxi/cedrus/cedrus.h
> > > b/drivers/staging/media/sunxi/cedrus/cedrus.h index
> > > 16c1bdfd243a..fcbbbef65494 100644
> > > --- a/drivers/staging/media/sunxi/cedrus/cedrus.h
> > > +++ b/drivers/staging/media/sunxi/cedrus/cedrus.h
> > > @@ -106,10 +106,6 @@ struct cedrus_ctx {
> > > 
> > >  	union {
> > >  	
> > >  		struct {
> > > 
> > > -			void		*mv_col_buf;
> > > -			dma_addr_t	mv_col_buf_dma;
> > > -			ssize_t		mv_col_buf_field_size;
> > > -			ssize_t		mv_col_buf_size;
> > > 
> > >  			void		*pic_info_buf;
> > >  			dma_addr_t	pic_info_buf_dma;
> > >  			void		*neighbor_info_buf;
> > > 
> > > diff --git a/drivers/staging/media/sunxi/cedrus/cedrus_h264.c
> > > b/drivers/staging/media/sunxi/cedrus/cedrus_h264.c index
> > > b2290f98d81a..758fd0049e8f 100644
> > > --- a/drivers/staging/media/sunxi/cedrus/cedrus_h264.c
> > > +++ b/drivers/staging/media/sunxi/cedrus/cedrus_h264.c
> > > @@ -54,17 +54,14 @@ static void cedrus_h264_write_sram(struct cedrus_dev
> > > *dev,> 
> > >  		cedrus_write(dev, VE_AVC_SRAM_PORT_DATA, *buffer++);
> > >  
> > >  }
> > > 
> > > -static dma_addr_t cedrus_h264_mv_col_buf_addr(struct cedrus_ctx *ctx,
> > > -					      unsigned int 
> position,
> > > +static dma_addr_t cedrus_h264_mv_col_buf_addr(struct cedrus_buffer *buf,
> > > 
> > >  					      unsigned int 
> field)
> > >  
> > >  {
> > > 
> > > -	dma_addr_t addr = ctx->codec.h264.mv_col_buf_dma;
> > > -
> > > -	/* Adjust for the position */
> > > -	addr += position * ctx->codec.h264.mv_col_buf_field_size * 2;
> > > +	dma_addr_t addr = buf->extra_buf_dma;
> > > 
> > >  	/* Adjust for the field */
> > > 
> > > -	addr += field * ctx->codec.h264.mv_col_buf_field_size;
> > > +	if (field)
> > > +		addr += buf->extra_buf_size / 2;
> > > 
> > >  	return addr;
> > >  
> > >  }
> > > 
> > > @@ -76,7 +73,6 @@ static void cedrus_fill_ref_pic(struct cedrus_ctx *ctx,
> > > 
> > >  				struct cedrus_h264_sram_ref_pic 
> *pic)
> > >  
> > >  {
> > >  
> > >  	struct vb2_buffer *vbuf = &buf->m2m_buf.vb.vb2_buf;
> > > 
> > > -	unsigned int position = buf->codec.h264.position;
> > > 
> > >  	pic->top_field_order_cnt = cpu_to_le32(top_field_order_cnt);
> > >  	pic->bottom_field_order_cnt = cpu_to_le32(bottom_field_order_cnt);
> > > 
> > > @@ -84,10 +80,8 @@ static void cedrus_fill_ref_pic(struct cedrus_ctx *ctx,
> > > 
> > >  	pic->luma_ptr = cpu_to_le32(cedrus_buf_addr(vbuf, &ctx->dst_fmt, 
> 0));
> > >  	pic->chroma_ptr = cpu_to_le32(cedrus_buf_addr(vbuf, &ctx->dst_fmt, 
> 1));
> > > -	pic->mv_col_top_ptr =
> > > -		cpu_to_le32(cedrus_h264_mv_col_buf_addr(ctx, position, 
> 0));
> > > -	pic->mv_col_bot_ptr =
> > > -		cpu_to_le32(cedrus_h264_mv_col_buf_addr(ctx, position, 
> 1));
> > > +	pic->mv_col_top_ptr = cpu_to_le32(cedrus_h264_mv_col_buf_addr(buf, 
> 0));
> > > +	pic->mv_col_bot_ptr = cpu_to_le32(cedrus_h264_mv_col_buf_addr(buf, 
> 1));
> > >  }
> > >  
> > >  static void cedrus_write_frame_list(struct cedrus_ctx *ctx,
> > > 
> > > @@ -142,6 +136,28 @@ static void cedrus_write_frame_list(struct cedrus_ctx
> > > *ctx,> 
> > >  	output_buf = vb2_to_cedrus_buffer(&run->dst->vb2_buf);
> > >  	output_buf->codec.h264.position = position;
> > > 
> > > +	if (!output_buf->extra_buf_size) {
> > > +		const struct v4l2_ctrl_h264_sps *sps = run->h264.sps;
> > > +		unsigned int field_size;
> > > +
> > > +		field_size = DIV_ROUND_UP(ctx->src_fmt.width, 16) *
> > > +			DIV_ROUND_UP(ctx->src_fmt.height, 16) * 16;
> > > +		if (!(sps->flags & 
> V4L2_H264_SPS_FLAG_DIRECT_8X8_INFERENCE))
> > > +			field_size = field_size * 2;
> > > +		if (!(sps->flags & V4L2_H264_SPS_FLAG_FRAME_MBS_ONLY))
> > > +			field_size = field_size * 2;
> > > +
> > > +		output_buf->extra_buf_size = field_size * 2;
> > > +		output_buf->extra_buf =
> > > +			dma_alloc_coherent(dev->dev,
> > > +					   output_buf-
> > extra_buf_size,
> > > +					   &output_buf-
> > extra_buf_dma,
> > > +					   GFP_KERNEL);
> > > +
> > > +		if (!output_buf->extra_buf)
> > > +			output_buf->extra_buf_size = 0;
> > > +	}
> > > +
> > 
> > That also means that instead of allocating that buffer exactly once,
> > you now allocate it for each output buffer?
> 
> It's not completely the same. I'm allocating multiple times, yes, but much 
> smaller chunks and only if needed.
> 
> Still, this slight overhead happens only when buffer is used for the first time. 
> When buffer is reused, this MV buffer is also reused.
> 
> > I guess that it will cleaned up by your previous patch at
> > buffer_cleanup time, so after it's no longer a reference frame?
> 
> Yes, it will be deallocated in buffer_cleanup, but only after capture buffers 
> are freed, which usually happens when device file descriptor is closed.
> 
> Buffers which holds reference frames are later reused, together with it's MV 
> buffer, so there's no overhead.
> 
> > What is the average memory usage before, and after that change during
> > a playback, and what is the runtime cost of doing it multiple times
> > instead of once?
> 
> As I already said, overhead is present only when buffer is used for the first 
> time, which is not ideal, but allows to calculate minimal buffer size needed 
> and even doesn't allocate anything if capture buffer is not used at all.
> 
> I didn't collect any exact numbers, but with this change I can play H264 and 
> HEVC (with similar modification) 4K video samples with 256 MiB CMA pool. 
> Without this change, it's not really possible. You can argue "but what if 4K 
> video use 16 reference frames", then yes, only solution is to increase CMA 
> pool, but why reserve extra memory which will never be used?
> 
> I've been using this optimization for past ~6 month with no issues noticed. If 
> you feel better, I can change this to be a bit conservative and allocate MV 
> buffer when buffer_init is called. This will consume a bit more memory as SPS is 
> not available at that time (worst case buffer size estimation), but it still 
> won't allocate MV buffers for unallocated capture frames.
> 
> Best regards,
> Jernej
> 
>

diff --git a/drivers/staging/media/sunxi/cedrus/cedrus.h b/drivers/staging/media/sunxi/cedrus/cedrus.h
index 16c1bdfd243a..fcbbbef65494 100644
--- a/drivers/staging/media/sunxi/cedrus/cedrus.h
+++ b/drivers/staging/media/sunxi/cedrus/cedrus.h
@@ -106,10 +106,6 @@  struct cedrus_ctx {
 
 	union {
 		struct {
-			void		*mv_col_buf;
-			dma_addr_t	mv_col_buf_dma;
-			ssize_t		mv_col_buf_field_size;
-			ssize_t		mv_col_buf_size;
 			void		*pic_info_buf;
 			dma_addr_t	pic_info_buf_dma;
 			void		*neighbor_info_buf;
diff --git a/drivers/staging/media/sunxi/cedrus/cedrus_h264.c b/drivers/staging/media/sunxi/cedrus/cedrus_h264.c
index b2290f98d81a..758fd0049e8f 100644
--- a/drivers/staging/media/sunxi/cedrus/cedrus_h264.c
+++ b/drivers/staging/media/sunxi/cedrus/cedrus_h264.c
@@ -54,17 +54,14 @@  static void cedrus_h264_write_sram(struct cedrus_dev *dev,
 		cedrus_write(dev, VE_AVC_SRAM_PORT_DATA, *buffer++);
 }
 
-static dma_addr_t cedrus_h264_mv_col_buf_addr(struct cedrus_ctx *ctx,
-					      unsigned int position,
+static dma_addr_t cedrus_h264_mv_col_buf_addr(struct cedrus_buffer *buf,
 					      unsigned int field)
 {
-	dma_addr_t addr = ctx->codec.h264.mv_col_buf_dma;
-
-	/* Adjust for the position */
-	addr += position * ctx->codec.h264.mv_col_buf_field_size * 2;
+	dma_addr_t addr = buf->extra_buf_dma;
 
 	/* Adjust for the field */
-	addr += field * ctx->codec.h264.mv_col_buf_field_size;
+	if (field)
+		addr += buf->extra_buf_size / 2;
 
 	return addr;
 }
@@ -76,7 +73,6 @@  static void cedrus_fill_ref_pic(struct cedrus_ctx *ctx,
 				struct cedrus_h264_sram_ref_pic *pic)
 {
 	struct vb2_buffer *vbuf = &buf->m2m_buf.vb.vb2_buf;
-	unsigned int position = buf->codec.h264.position;
 
 	pic->top_field_order_cnt = cpu_to_le32(top_field_order_cnt);
 	pic->bottom_field_order_cnt = cpu_to_le32(bottom_field_order_cnt);
@@ -84,10 +80,8 @@  static void cedrus_fill_ref_pic(struct cedrus_ctx *ctx,
 
 	pic->luma_ptr = cpu_to_le32(cedrus_buf_addr(vbuf, &ctx->dst_fmt, 0));
 	pic->chroma_ptr = cpu_to_le32(cedrus_buf_addr(vbuf, &ctx->dst_fmt, 1));
-	pic->mv_col_top_ptr =
-		cpu_to_le32(cedrus_h264_mv_col_buf_addr(ctx, position, 0));
-	pic->mv_col_bot_ptr =
-		cpu_to_le32(cedrus_h264_mv_col_buf_addr(ctx, position, 1));
+	pic->mv_col_top_ptr = cpu_to_le32(cedrus_h264_mv_col_buf_addr(buf, 0));
+	pic->mv_col_bot_ptr = cpu_to_le32(cedrus_h264_mv_col_buf_addr(buf, 1));
 }
 
 static void cedrus_write_frame_list(struct cedrus_ctx *ctx,
@@ -142,6 +136,28 @@  static void cedrus_write_frame_list(struct cedrus_ctx *ctx,
 	output_buf = vb2_to_cedrus_buffer(&run->dst->vb2_buf);
 	output_buf->codec.h264.position = position;
 
+	if (!output_buf->extra_buf_size) {
+		const struct v4l2_ctrl_h264_sps *sps = run->h264.sps;
+		unsigned int field_size;
+
+		field_size = DIV_ROUND_UP(ctx->src_fmt.width, 16) *
+			DIV_ROUND_UP(ctx->src_fmt.height, 16) * 16;
+		if (!(sps->flags & V4L2_H264_SPS_FLAG_DIRECT_8X8_INFERENCE))
+			field_size = field_size * 2;
+		if (!(sps->flags & V4L2_H264_SPS_FLAG_FRAME_MBS_ONLY))
+			field_size = field_size * 2;
+
+		output_buf->extra_buf_size = field_size * 2;
+		output_buf->extra_buf =
+			dma_alloc_coherent(dev->dev,
+					   output_buf->extra_buf_size,
+					   &output_buf->extra_buf_dma,
+					   GFP_KERNEL);
+
+		if (!output_buf->extra_buf)
+			output_buf->extra_buf_size = 0;
+	}
+
 	if (slice->flags & V4L2_H264_SLICE_FLAG_FIELD_PIC)
 		output_buf->codec.h264.pic_type = CEDRUS_H264_PIC_TYPE_FIELD;
 	else if (sps->flags & V4L2_H264_SPS_FLAG_MB_ADAPTIVE_FRAME_FIELD)
@@ -476,8 +492,6 @@  static void cedrus_h264_setup(struct cedrus_ctx *ctx,
 static int cedrus_h264_start(struct cedrus_ctx *ctx)
 {
 	struct cedrus_dev *dev = ctx->dev;
-	unsigned int field_size;
-	unsigned int mv_col_size;
 	int ret;
 
 	/*
@@ -509,44 +523,8 @@  static int cedrus_h264_start(struct cedrus_ctx *ctx)
 		goto err_pic_buf;
 	}
 
-	field_size = DIV_ROUND_UP(ctx->src_fmt.width, 16) *
-		DIV_ROUND_UP(ctx->src_fmt.height, 16) * 16;
-
-	/*
-	 * FIXME: This is actually conditional to
-	 * V4L2_H264_SPS_FLAG_DIRECT_8X8_INFERENCE not being set, we
-	 * might have to rework this if memory efficiency ever is
-	 * something we need to work on.
-	 */
-	field_size = field_size * 2;
-
-	/*
-	 * FIXME: This is actually conditional to
-	 * V4L2_H264_SPS_FLAG_FRAME_MBS_ONLY not being set, we might
-	 * have to rework this if memory efficiency ever is something
-	 * we need to work on.
-	 */
-	field_size = field_size * 2;
-	ctx->codec.h264.mv_col_buf_field_size = field_size;
-
-	mv_col_size = field_size * 2 * CEDRUS_H264_FRAME_NUM;
-	ctx->codec.h264.mv_col_buf_size = mv_col_size;
-	ctx->codec.h264.mv_col_buf = dma_alloc_coherent(dev->dev,
-							ctx->codec.h264.mv_col_buf_size,
-							&ctx->codec.h264.mv_col_buf_dma,
-							GFP_KERNEL);
-	if (!ctx->codec.h264.mv_col_buf) {
-		ret = -ENOMEM;
-		goto err_neighbor_buf;
-	}
-
 	return 0;
 
-err_neighbor_buf:
-	dma_free_coherent(dev->dev, CEDRUS_NEIGHBOR_INFO_BUF_SIZE,
-			  ctx->codec.h264.neighbor_info_buf,
-			  ctx->codec.h264.neighbor_info_buf_dma);
-
 err_pic_buf:
 	dma_free_coherent(dev->dev, CEDRUS_PIC_INFO_BUF_SIZE,
 			  ctx->codec.h264.pic_info_buf,
@@ -558,9 +536,6 @@  static void cedrus_h264_stop(struct cedrus_ctx *ctx)
 {
 	struct cedrus_dev *dev = ctx->dev;
 
-	dma_free_coherent(dev->dev, ctx->codec.h264.mv_col_buf_size,
-			  ctx->codec.h264.mv_col_buf,
-			  ctx->codec.h264.mv_col_buf_dma);
 	dma_free_coherent(dev->dev, CEDRUS_NEIGHBOR_INFO_BUF_SIZE,
 			  ctx->codec.h264.neighbor_info_buf,
 			  ctx->codec.h264.neighbor_info_buf_dma);

[7/7] media: cedrus: Improve H264 memory efficiency

Commit Message

Comments

Patch