[PATCHv3,RFC,4/4] media: Catch null pipes on pipeline stop

Message ID	1481651984-7687-5-git-send-email-kieran.bingham+renesas@ideasonboard.com (mailing list archive)
State	Not Applicable
Delegated to:	Geert Uytterhoeven
Headers	show Return-Path: <linux-renesas-soc-owner@kernel.org> From: Kieran Bingham <kieran.bingham+renesas@ideasonboard.com> To: laurent.pinchart@ideasonboard.com Cc: linux-renesas-soc@vger.kernel.org, linux-media@vger.kernel.org, Kieran Bingham <kieran.bingham+renesas@ideasonboard.com> Subject: [PATCHv3 RFC 4/4] media: Catch null pipes on pipeline stop Date: Tue, 13 Dec 2016 17:59:44 +0000 Message-Id: <1481651984-7687-5-git-send-email-kieran.bingham+renesas@ideasonboard.com> In-Reply-To: <1481651984-7687-1-git-send-email-kieran.bingham+renesas@ideasonboard.com> References: <1481651984-7687-1-git-send-email-kieran.bingham+renesas@ideasonboard.com> Sender: linux-renesas-soc-owner@vger.kernel.org Precedence: bulk

Kieran Bingham Dec. 13, 2016, 5:59 p.m. UTC

media_entity_pipeline_stop() can be called through error paths with a
NULL entity pipe object. In this instance, stopping is a no-op, so
simply return without any action

Signed-off-by: Kieran Bingham <kieran.bingham+renesas@ideasonboard.com>
---

I've marked this patch as RFC, although if deemed suitable, by all means
integrate it as is.

When testing suspend/resume operations on VSP1, I encountered a segfault on the
WARN_ON(!pipe->streaming_count) line, where 'pipe == NULL'. The simple
protection fix is to return early in this instance, as this patch does however:

A) Does this early return path warrant a WARN() statement itself, to identify
drivers which are incorrectly calling media_entity_pipeline_stop() with an
invalid entity, or would this just be noise ...

and therefore..

B) I also partly assume this patch could simply get NAK'd with a request to go
and dig out the root cause of calling media_entity_pipeline_stop() with an
invalid entity. 

My brief investigation so far here so far shows that it's almost a second order
fault - where the first suspend resume cycle completes but leaves the entity in
an invalid state having followed an error path - and then on a second
suspend/resume - the stop fails with the affected segfault.

If statement A) or B) apply here, please drop this patch from the series, and
don't consider it a blocking issue for the other 3 patches.

Kieran


 drivers/media/media-entity.c | 2 ++
 1 file changed, 2 insertions(+)

Sakari Ailus Dec. 14, 2016, 7:28 a.m. UTC | #1

Hi Kieran,

On Tue, Dec 13, 2016 at 05:59:44PM +0000, Kieran Bingham wrote:
> media_entity_pipeline_stop() can be called through error paths with a
> NULL entity pipe object. In this instance, stopping is a no-op, so
> simply return without any action

The approach of returning silently is wrong; the streaming count is indeed a
count: you have to decrement it the exactly same number of times it's been
increased. Code that attempts to call __media_entity_pipeline_stop() on an
entity that's not streaming is simply buggy.

I've got a patch here that adds a warning to graph traversal on streaming
count being zero. I sent a pull request including that some days ago:

<URL:http://www.spinics.net/lists/linux-media/msg108980.html>
<URL:http://www.spinics.net/lists/linux-media/msg108995.html>

I think the check here could simply be removed as the check is done for
every entity in the pipeline with the above patch.

If there's still a wish to check for the pipe field which should not be
written by drivers, it should be done during pipeline traversal so that it
applies to all entities in the pipeline, not just where traversal starts.

> 
> Signed-off-by: Kieran Bingham <kieran.bingham+renesas@ideasonboard.com>
> ---
> 
> I've marked this patch as RFC, although if deemed suitable, by all means
> integrate it as is.
> 
> When testing suspend/resume operations on VSP1, I encountered a segfault on the
> WARN_ON(!pipe->streaming_count) line, where 'pipe == NULL'. The simple
> protection fix is to return early in this instance, as this patch does however:
> 
> A) Does this early return path warrant a WARN() statement itself, to identify
> drivers which are incorrectly calling media_entity_pipeline_stop() with an
> invalid entity, or would this just be noise ...
> 
> and therefore..
> 
> B) I also partly assume this patch could simply get NAK'd with a request to go
> and dig out the root cause of calling media_entity_pipeline_stop() with an
> invalid entity. 
> 
> My brief investigation so far here so far shows that it's almost a second order
> fault - where the first suspend resume cycle completes but leaves the entity in
> an invalid state having followed an error path - and then on a second
> suspend/resume - the stop fails with the affected segfault.
> 
> If statement A) or B) apply here, please drop this patch from the series, and
> don't consider it a blocking issue for the other 3 patches.
> 
> Kieran
> 
> 
>  drivers/media/media-entity.c | 2 ++
>  1 file changed, 2 insertions(+)
> 
> diff --git a/drivers/media/media-entity.c b/drivers/media/media-entity.c
> index c68239e60487..93c9cbf4bf46 100644
> --- a/drivers/media/media-entity.c
> +++ b/drivers/media/media-entity.c
> @@ -508,6 +508,8 @@ void __media_entity_pipeline_stop(struct media_entity *entity)
>  	struct media_entity_graph *graph = &entity->pipe->graph;
>  	struct media_pipeline *pipe = entity->pipe;
>  
> +	if (!pipe)
> +		return;
>  
>  	WARN_ON(!pipe->streaming_count);
>  	media_entity_graph_walk_start(graph, entity);

Kieran Bingham Dec. 14, 2016, 12:27 p.m. UTC | #2

Hi Sakari,

On 14/12/16 07:28, Sakari Ailus wrote:
> Hi Kieran,
> 
> On Tue, Dec 13, 2016 at 05:59:44PM +0000, Kieran Bingham wrote:
>> media_entity_pipeline_stop() can be called through error paths with a
>> NULL entity pipe object. In this instance, stopping is a no-op, so
>> simply return without any action
> 
> The approach of returning silently is wrong; the streaming count is indeed a
> count: you have to decrement it the exactly same number of times it's been
> increased. Code that attempts to call __media_entity_pipeline_stop() on an
> entity that's not streaming is simply buggy.

Ok, Thanks for the heads up on where to look, as I suspected we are in
section B) of my comments below :D

> I've got a patch here that adds a warning to graph traversal on streaming
> count being zero. I sent a pull request including that some days ago:
> 
> <URL:http://www.spinics.net/lists/linux-media/msg108980.html>
> <URL:http://www.spinics.net/lists/linux-media/msg108995.html>

Excellent, thanks, I've merged your branch into mine, and I'll
investigate where the error is actually coming from.

--
Ok - so I've merged your patches, (and dropped this patch) but they
don't fire any warning before I hit my null-pointer dereference in
__media_pipeline_stop(), on the WARN_ON(!pipe->streaming_count);

So I think this is a case of calling stop on an invalid entity rather
than an unbalanced start/stop somehow:

[  613.830334] vsp1 fea38000.vsp: failed to reset wpf.0
[  613.838137] PM: resume of devices complete after 124.410 msecs
[  613.847390] PM: Finishing wakeup.
[  613.852247] Restarting tasks ... done.
[  615.864801] ravb e6800000.ethernet eth0: Link is Up - 100Mbps/Full -
flow control rx/tx
[  617.011299] vsp1 fea38000.vsp: failed to reset wpf.0
[  617.017777] vsp1 fea38000.vsp: DRM pipeline stop timeout
[  617.024649] Unable to handle kernel NULL pointer dereference at
virtual address 00000000
[  617.034273] pgd = ffffff80098c5000
[  617.039171] [00000000] *pgd=000000073fffe003[  617.043336] ,
*pud=000000073fffe003
, *pmd=0000000000000000[  617.050353]
[  617.053282] Internal error: Oops: 96000005 [#1] PREEMPT SMP
[  617.060309] Modules linked in:
[  617.064793] CPU: 1 PID: 683 Comm: kworker/1:2 Not tainted
4.9.0-00506-g4d2a939ea654 #350
[  617.074320] Hardware name: Renesas Salvator-X board based on r8a7795 (DT)
[  617.082578] Workqueue: events drm_mode_rmfb_work_fn
[  617.088884] task: ffffffc6fabd2b00 task.stack: ffffffc6f9d90000
[  617.096246] PC is at __media_pipeline_stop+0x24/0xe8
[  617.102644] LR is at media_pipeline_stop+0x34/0x48

<regs and stack snipped>

[  617.863386] [<ffffff800853e724>] __media_pipeline_stop+0x24/0xe8
[  617.870417] [<ffffff800853e81c>] media_pipeline_stop+0x34/0x48
[  617.877272] [<ffffff8008568000>] vsp1_du_setup_lif+0x68/0x5a8
[  617.884047] [<ffffff80084de9d4>] rcar_du_vsp_disable+0x2c/0x38
[  617.890898] [<ffffff80084da710>] rcar_du_crtc_stop+0x198/0x1e8
[  617.897749] [<ffffff80084da780>] rcar_du_crtc_disable+0x20/0x38
[  617.904683] [<ffffff80084ac9b0>]
drm_atomic_helper_commit_modeset_disables+0x1b0/0x3d8
[  617.913634] [<ffffff80084db9cc>] rcar_du_atomic_complete+0x34/0xc8
[  617.920828] [<ffffff80084dbd20>] rcar_du_atomic_commit+0x2c0/0x2d0
[  617.928012] [<ffffff80084ceffc>] drm_atomic_commit+0x5c/0x68
[  617.934663] [<ffffff80084ad2e0>] drm_atomic_helper_set_config+0x90/0xd0
[  617.942288] [<ffffff80084c06a0>] drm_mode_set_config_internal+0x70/0x100
[  617.949996] [<ffffff80084c0760>] drm_crtc_force_disable+0x30/0x40
[  617.957084] [<ffffff80084d0b10>] drm_framebuffer_remove+0x100/0x128
[  617.964347] [<ffffff80084d0b80>] drm_mode_rmfb_work_fn+0x48/0x60
[  617.971352] [<ffffff80080e9770>] process_one_work+0x1e0/0x738
[  617.978084] [<ffffff80080e9f24>] worker_thread+0x25c/0x4a0
[  617.984546] [<ffffff80080f11ac>] kthread+0xd4/0xe8
[  617.990305] [<ffffff8008083680>] ret_from_fork+0x10/0x50


So, I'll have to schedule some time to investigate this deeper and find
out where we are calling stop on an invalid entity object.

I agree that this patch should simply be dropped though, it was more to
promote a bit of discussion on the area to help me understand what was
going on, which it has!.

Thankyou Sakari!

--
Regards

Kieran


> I think the check here could simply be removed as the check is done for
> every entity in the pipeline with the above patch.
> 
> If there's still a wish to check for the pipe field which should not be
> written by drivers, it should be done during pipeline traversal so that it
> applies to all entities in the pipeline, not just where traversal starts.
> 
>>
>> Signed-off-by: Kieran Bingham <kieran.bingham+renesas@ideasonboard.com>
>> ---
>>
>> I've marked this patch as RFC, although if deemed suitable, by all means
>> integrate it as is.
>>
>> When testing suspend/resume operations on VSP1, I encountered a segfault on the
>> WARN_ON(!pipe->streaming_count) line, where 'pipe == NULL'. The simple
>> protection fix is to return early in this instance, as this patch does however:
>>
>> A) Does this early return path warrant a WARN() statement itself, to identify
>> drivers which are incorrectly calling media_entity_pipeline_stop() with an
>> invalid entity, or would this just be noise ...
>>
>> and therefore..
>>
>> B) I also partly assume this patch could simply get NAK'd with a request to go
>> and dig out the root cause of calling media_entity_pipeline_stop() with an
>> invalid entity. 
>>
>> My brief investigation so far here so far shows that it's almost a second order
>> fault - where the first suspend resume cycle completes but leaves the entity in
>> an invalid state having followed an error path - and then on a second
>> suspend/resume - the stop fails with the affected segfault.
>>
>> If statement A) or B) apply here, please drop this patch from the series, and
>> don't consider it a blocking issue for the other 3 patches.
>>
>> Kieran
>>
>>
>>  drivers/media/media-entity.c | 2 ++
>>  1 file changed, 2 insertions(+)
>>
>> diff --git a/drivers/media/media-entity.c b/drivers/media/media-entity.c
>> index c68239e60487..93c9cbf4bf46 100644
>> --- a/drivers/media/media-entity.c
>> +++ b/drivers/media/media-entity.c
>> @@ -508,6 +508,8 @@ void __media_entity_pipeline_stop(struct media_entity *entity)
>>  	struct media_entity_graph *graph = &entity->pipe->graph;
>>  	struct media_pipeline *pipe = entity->pipe;
>>  
>> +	if (!pipe)
>> +		return;
>>  
>>  	WARN_ON(!pipe->streaming_count);
>>  	media_entity_graph_walk_start(graph, entity);
>

Sakari Ailus Dec. 14, 2016, 12:43 p.m. UTC | #3

Hi Kieran,

On Wed, Dec 14, 2016 at 12:27:37PM +0000, Kieran Bingham wrote:
> Hi Sakari,
> 
> On 14/12/16 07:28, Sakari Ailus wrote:
> > Hi Kieran,
> > 
> > On Tue, Dec 13, 2016 at 05:59:44PM +0000, Kieran Bingham wrote:
> >> media_entity_pipeline_stop() can be called through error paths with a
> >> NULL entity pipe object. In this instance, stopping is a no-op, so
> >> simply return without any action
> > 
> > The approach of returning silently is wrong; the streaming count is indeed a
> > count: you have to decrement it the exactly same number of times it's been
> > increased. Code that attempts to call __media_entity_pipeline_stop() on an
> > entity that's not streaming is simply buggy.
> 
> Ok, Thanks for the heads up on where to look, as I suspected we are in
> section B) of my comments below :D
> 
> > I've got a patch here that adds a warning to graph traversal on streaming
> > count being zero. I sent a pull request including that some days ago:
> > 
> > <URL:http://www.spinics.net/lists/linux-media/msg108980.html>
> > <URL:http://www.spinics.net/lists/linux-media/msg108995.html>
> 
> Excellent, thanks, I've merged your branch into mine, and I'll
> investigate where the error is actually coming from.
> 
> --
> Ok - so I've merged your patches, (and dropped this patch) but they
> don't fire any warning before I hit my null-pointer dereference in
> __media_pipeline_stop(), on the WARN_ON(!pipe->streaming_count);
> 
> So I think this is a case of calling stop on an invalid entity rather
> than an unbalanced start/stop somehow:

Not necessarily. The pipe is set to NULL if the count goes to zero.

This check should be dropped, it's ill-placed anyway: if streaming count is
zero the pipe is expected to be NULL anyway in which case you get a NULL
pointer exception (that you saw there). With the check removed, you should
get the warning instead.

> 
> [  613.830334] vsp1 fea38000.vsp: failed to reset wpf.0
> [  613.838137] PM: resume of devices complete after 124.410 msecs
> [  613.847390] PM: Finishing wakeup.
> [  613.852247] Restarting tasks ... done.
> [  615.864801] ravb e6800000.ethernet eth0: Link is Up - 100Mbps/Full -
> flow control rx/tx
> [  617.011299] vsp1 fea38000.vsp: failed to reset wpf.0
> [  617.017777] vsp1 fea38000.vsp: DRM pipeline stop timeout
> [  617.024649] Unable to handle kernel NULL pointer dereference at
> virtual address 00000000
> [  617.034273] pgd = ffffff80098c5000
> [  617.039171] [00000000] *pgd=000000073fffe003[  617.043336] ,
> *pud=000000073fffe003
> , *pmd=0000000000000000[  617.050353]
> [  617.053282] Internal error: Oops: 96000005 [#1] PREEMPT SMP
> [  617.060309] Modules linked in:
> [  617.064793] CPU: 1 PID: 683 Comm: kworker/1:2 Not tainted
> 4.9.0-00506-g4d2a939ea654 #350
> [  617.074320] Hardware name: Renesas Salvator-X board based on r8a7795 (DT)
> [  617.082578] Workqueue: events drm_mode_rmfb_work_fn
> [  617.088884] task: ffffffc6fabd2b00 task.stack: ffffffc6f9d90000
> [  617.096246] PC is at __media_pipeline_stop+0x24/0xe8
> [  617.102644] LR is at media_pipeline_stop+0x34/0x48
> 
> <regs and stack snipped>
> 
> [  617.863386] [<ffffff800853e724>] __media_pipeline_stop+0x24/0xe8
> [  617.870417] [<ffffff800853e81c>] media_pipeline_stop+0x34/0x48
> [  617.877272] [<ffffff8008568000>] vsp1_du_setup_lif+0x68/0x5a8
> [  617.884047] [<ffffff80084de9d4>] rcar_du_vsp_disable+0x2c/0x38
> [  617.890898] [<ffffff80084da710>] rcar_du_crtc_stop+0x198/0x1e8
> [  617.897749] [<ffffff80084da780>] rcar_du_crtc_disable+0x20/0x38
> [  617.904683] [<ffffff80084ac9b0>]
> drm_atomic_helper_commit_modeset_disables+0x1b0/0x3d8
> [  617.913634] [<ffffff80084db9cc>] rcar_du_atomic_complete+0x34/0xc8
> [  617.920828] [<ffffff80084dbd20>] rcar_du_atomic_commit+0x2c0/0x2d0
> [  617.928012] [<ffffff80084ceffc>] drm_atomic_commit+0x5c/0x68
> [  617.934663] [<ffffff80084ad2e0>] drm_atomic_helper_set_config+0x90/0xd0
> [  617.942288] [<ffffff80084c06a0>] drm_mode_set_config_internal+0x70/0x100
> [  617.949996] [<ffffff80084c0760>] drm_crtc_force_disable+0x30/0x40
> [  617.957084] [<ffffff80084d0b10>] drm_framebuffer_remove+0x100/0x128
> [  617.964347] [<ffffff80084d0b80>] drm_mode_rmfb_work_fn+0x48/0x60
> [  617.971352] [<ffffff80080e9770>] process_one_work+0x1e0/0x738
> [  617.978084] [<ffffff80080e9f24>] worker_thread+0x25c/0x4a0
> [  617.984546] [<ffffff80080f11ac>] kthread+0xd4/0xe8
> [  617.990305] [<ffffff8008083680>] ret_from_fork+0x10/0x50
> 
> 
> So, I'll have to schedule some time to investigate this deeper and find
> out where we are calling stop on an invalid entity object.
> 
> I agree that this patch should simply be dropped though, it was more to
> promote a bit of discussion on the area to help me understand what was
> going on, which it has!.
> 
> Thankyou Sakari!
> 
> --
> Regards
> 
> Kieran
> 
> 
> > I think the check here could simply be removed as the check is done for
> > every entity in the pipeline with the above patch.
> > 
> > If there's still a wish to check for the pipe field which should not be
> > written by drivers, it should be done during pipeline traversal so that it
> > applies to all entities in the pipeline, not just where traversal starts.
> > 
> >>
> >> Signed-off-by: Kieran Bingham <kieran.bingham+renesas@ideasonboard.com>
> >> ---
> >>
> >> I've marked this patch as RFC, although if deemed suitable, by all means
> >> integrate it as is.
> >>
> >> When testing suspend/resume operations on VSP1, I encountered a segfault on the
> >> WARN_ON(!pipe->streaming_count) line, where 'pipe == NULL'. The simple
> >> protection fix is to return early in this instance, as this patch does however:
> >>
> >> A) Does this early return path warrant a WARN() statement itself, to identify
> >> drivers which are incorrectly calling media_entity_pipeline_stop() with an
> >> invalid entity, or would this just be noise ...
> >>
> >> and therefore..
> >>
> >> B) I also partly assume this patch could simply get NAK'd with a request to go
> >> and dig out the root cause of calling media_entity_pipeline_stop() with an
> >> invalid entity. 
> >>
> >> My brief investigation so far here so far shows that it's almost a second order
> >> fault - where the first suspend resume cycle completes but leaves the entity in
> >> an invalid state having followed an error path - and then on a second
> >> suspend/resume - the stop fails with the affected segfault.
> >>
> >> If statement A) or B) apply here, please drop this patch from the series, and
> >> don't consider it a blocking issue for the other 3 patches.
> >>
> >> Kieran
> >>
> >>
> >>  drivers/media/media-entity.c | 2 ++
> >>  1 file changed, 2 insertions(+)
> >>
> >> diff --git a/drivers/media/media-entity.c b/drivers/media/media-entity.c
> >> index c68239e60487..93c9cbf4bf46 100644
> >> --- a/drivers/media/media-entity.c
> >> +++ b/drivers/media/media-entity.c
> >> @@ -508,6 +508,8 @@ void __media_entity_pipeline_stop(struct media_entity *entity)
> >>  	struct media_entity_graph *graph = &entity->pipe->graph;
> >>  	struct media_pipeline *pipe = entity->pipe;
> >>  
> >> +	if (!pipe)
> >> +		return;
> >>  
> >>  	WARN_ON(!pipe->streaming_count);
> >>  	media_entity_graph_walk_start(graph, entity);
> >

Kieran Bingham Dec. 14, 2016, 5:53 p.m. UTC | #4

Hi Sakari,

On 14/12/16 12:43, Sakari Ailus wrote:
> Hi Kieran,
> 
> On Wed, Dec 14, 2016 at 12:27:37PM +0000, Kieran Bingham wrote:
>> Hi Sakari,
>>
>> On 14/12/16 07:28, Sakari Ailus wrote:
>>> Hi Kieran,
>>>
>>> On Tue, Dec 13, 2016 at 05:59:44PM +0000, Kieran Bingham wrote:
>>>> media_entity_pipeline_stop() can be called through error paths with a
>>>> NULL entity pipe object. In this instance, stopping is a no-op, so
>>>> simply return without any action
>>>
>>> The approach of returning silently is wrong; the streaming count is indeed a
>>> count: you have to decrement it the exactly same number of times it's been
>>> increased. Code that attempts to call __media_entity_pipeline_stop() on an
>>> entity that's not streaming is simply buggy.
>>
>> Ok, Thanks for the heads up on where to look, as I suspected we are in
>> section B) of my comments below :D
>>
>>> I've got a patch here that adds a warning to graph traversal on streaming
>>> count being zero. I sent a pull request including that some days ago:
>>>
>>> <URL:http://www.spinics.net/lists/linux-media/msg108980.html>
>>> <URL:http://www.spinics.net/lists/linux-media/msg108995.html>
>>
>> Excellent, thanks, I've merged your branch into mine, and I'll
>> investigate where the error is actually coming from.
>>
>> --
>> Ok - so I've merged your patches, (and dropped this patch) but they
>> don't fire any warning before I hit my null-pointer dereference in
>> __media_pipeline_stop(), on the WARN_ON(!pipe->streaming_count);
>>
>> So I think this is a case of calling stop on an invalid entity rather
>> than an unbalanced start/stop somehow:
> 
> Not necessarily. The pipe is set to NULL if the count goes to zero.
> 
> This check should be dropped, it's ill-placed anyway: if streaming count is
> zero the pipe is expected to be NULL anyway in which case you get a NULL
> pointer exception (that you saw there). With the check removed, you should
> get the warning instead.

Ahh, yes, I'd missed the part there that was setting it to NULL.

However, it will still segfault ... (though hopefully after the warning)
as the last line of the function states:

	if (!--pipe->streaming_count)
		media_entity_graph_walk_cleanup(graph);

So we will hit a null deref on the pipe->streaming_count there, which
still leaves an unbalanced stop as a kernel panic.

In fact, I've just tested removing that WARN_ON(!pipe->streaming_count);
 - it doesn't even get that far - and segfaults in

 		media_entity_graph_walk_cleanup(graph);

[   80.916558] Unable to handle kernel NULL pointer dereference at
virtual address 00000110
....
[   81.769492] [<ffffff800853e278>] media_graph_walk_start+0x20/0xf0
[   81.776615] [<ffffff800853e73c>] __media_pipeline_stop+0x3c/0xd8
[   81.783644] [<ffffff800853e80c>] media_pipeline_stop+0x34/0x48
[   81.790505] [<ffffff8008567ff0>] vsp1_du_setup_lif+0x68/0x5a8
[   81.797279] [<ffffff80084de9d4>] rcar_du_vsp_disable+0x2c/0x38
[   81.804137] [<ffffff80084da710>] rcar_du_crtc_stop+0x198/0x1e8
[   81.810984] [<ffffff80084da780>] rcar_du_crtc_disable+0x20/0x38

due to the fact that 'graph' is dereferenced through the 'pipe'.

{
	/* Entity->pipe is NULL, therefore 'graph' is invalid */
	struct media_graph *graph = &entity->pipe->graph;
	struct media_pipeline *pipe = entity->pipe;

	media_graph_walk_start(graph, entity);
...
}

So I suspect we do still need a warning or check in there somewhere.
Something to tell the developer that there is an unbalanced stop, would
be much better than a panic/segfault...

(And of course, we still need to look at the actual unbalanced stop in
vsp1_du_setup_lif!)

--
Kieran

Kieran Bingham Dec. 14, 2016, 7:56 p.m. UTC | #5

Hello Me.

Ok, so a bit of investigation into *why* we have an unbalanced
media_pipeline stop call.

After a suspend/resume cycle, the function vsp1_pm_runtime_resume() is
called by the PM framework.

The hardware is unable to reset successfully and the reset call returns
-ETIMEDOUT which gets passed back to the PM framework as a failure and
thus the device is not 'resumed'

This process is commenced, as the DU driver tries to enable the VSP, we
get the following call stack:

rcar_du_crtc_resume()
  rcar_du_vsp_enable()
    vsp1_du_setup_lif() // returns void
      vsp1_device_get() // returns -ETIMEDOUT,

As the vsp1_du_setup_lif() returns void, the fact that the hardware
failed to start is ignored.

Later we get a call to  rcar_du_vsp_disable(), which again calls into
vsp1_du_setup_lif(), this time to disable the pipeline. And it is here,
that the call to media_pipeline_stop() is an unbalanced call, as the
corresponding media_pipeline_start() would only have been called if the
vsp1_device_get() had succeeded, thus we find ourselves with a kernel
panic on a null dereference.

Sorry for the terse notes, they are possibly/probably really for me if I
get tasked to look back at this.
--
Regards

Kieran


On 13/12/16 17:59, Kieran Bingham wrote:
> media_entity_pipeline_stop() can be called through error paths with a
> NULL entity pipe object. In this instance, stopping is a no-op, so
> simply return without any action
> 
> Signed-off-by: Kieran Bingham <kieran.bingham+renesas@ideasonboard.com>
> ---
> 
> I've marked this patch as RFC, although if deemed suitable, by all means
> integrate it as is.
> 
> When testing suspend/resume operations on VSP1, I encountered a segfault on the
> WARN_ON(!pipe->streaming_count) line, where 'pipe == NULL'. The simple
> protection fix is to return early in this instance, as this patch does however:
> 
> A) Does this early return path warrant a WARN() statement itself, to identify
> drivers which are incorrectly calling media_entity_pipeline_stop() with an
> invalid entity, or would this just be noise ...
> 
> and therefore..
> 
> B) I also partly assume this patch could simply get NAK'd with a request to go
> and dig out the root cause of calling media_entity_pipeline_stop() with an
> invalid entity. 
> 
> My brief investigation so far here so far shows that it's almost a second order
> fault - where the first suspend resume cycle completes but leaves the entity in
> an invalid state having followed an error path - and then on a second
> suspend/resume - the stop fails with the affected segfault.
> 
> If statement A) or B) apply here, please drop this patch from the series, and
> don't consider it a blocking issue for the other 3 patches.
> 
> Kieran
> 
> 
>  drivers/media/media-entity.c | 2 ++
>  1 file changed, 2 insertions(+)
> 
> diff --git a/drivers/media/media-entity.c b/drivers/media/media-entity.c
> index c68239e60487..93c9cbf4bf46 100644
> --- a/drivers/media/media-entity.c
> +++ b/drivers/media/media-entity.c
> @@ -508,6 +508,8 @@ void __media_entity_pipeline_stop(struct media_entity *entity)
>  	struct media_entity_graph *graph = &entity->pipe->graph;
>  	struct media_pipeline *pipe = entity->pipe;
>  
> +	if (!pipe)
> +		return;
>  
>  	WARN_ON(!pipe->streaming_count);
>  	media_entity_graph_walk_start(graph, entity);
>

Laurent Pinchart Dec. 14, 2016, 9:16 p.m. UTC | #6

Hi Kieran,

On Wednesday 14 Dec 2016 19:56:51 Kieran Bingham wrote:
> Hello Me.
> 
> Ok, so a bit of investigation into *why* we have an unbalanced
> media_pipeline stop call.
> 
> After a suspend/resume cycle, the function vsp1_pm_runtime_resume() is
> called by the PM framework.
> 
> The hardware is unable to reset successfully and the reset call returns
> -ETIMEDOUT which gets passed back to the PM framework as a failure and
> thus the device is not 'resumed'
> 
> This process is commenced, as the DU driver tries to enable the VSP, we
> get the following call stack:
> 
> rcar_du_crtc_resume()
>   rcar_du_vsp_enable()
>     vsp1_du_setup_lif() // returns void
>       vsp1_device_get() // returns -ETIMEDOUT,

I suspect the call stack to not be entirely accurate as the 
rcar_du_crtc_resume() is never called :-)

> As the vsp1_du_setup_lif() returns void, the fact that the hardware
> failed to start is ignored.
> 
> Later we get a call to  rcar_du_vsp_disable(), which again calls into
> vsp1_du_setup_lif(), this time to disable the pipeline. And it is here,
> that the call to media_pipeline_stop() is an unbalanced call, as the
> corresponding media_pipeline_start() would only have been called if the
> vsp1_device_get() had succeeded, thus we find ourselves with a kernel
> panic on a null dereference.
> 
> Sorry for the terse notes, they are possibly/probably really for me if I
> get tasked to look back at this.
> --
> Regards
> 
> Kieran
> 
> On 13/12/16 17:59, Kieran Bingham wrote:
> > media_entity_pipeline_stop() can be called through error paths with a
> > NULL entity pipe object. In this instance, stopping is a no-op, so
> > simply return without any action
> > 
> > Signed-off-by: Kieran Bingham <kieran.bingham+renesas@ideasonboard.com>
> > ---
> > 
> > I've marked this patch as RFC, although if deemed suitable, by all means
> > integrate it as is.
> > 
> > When testing suspend/resume operations on VSP1, I encountered a segfault
> > on the WARN_ON(!pipe->streaming_count) line, where 'pipe == NULL'. The
> > simple protection fix is to return early in this instance, as this patch
> > does however:
> > 
> > A) Does this early return path warrant a WARN() statement itself, to
> > identify drivers which are incorrectly calling
> > media_entity_pipeline_stop() with an invalid entity, or would this just
> > be noise ...
> > 
> > and therefore..
> > 
> > B) I also partly assume this patch could simply get NAK'd with a request
> > to go and dig out the root cause of calling media_entity_pipeline_stop()
> > with an invalid entity.
> > 
> > My brief investigation so far here so far shows that it's almost a second
> > order fault - where the first suspend resume cycle completes but leaves
> > the entity in an invalid state having followed an error path - and then
> > on a second suspend/resume - the stop fails with the affected segfault.
> > 
> > If statement A) or B) apply here, please drop this patch from the series,
> > and don't consider it a blocking issue for the other 3 patches.
> > 
> > Kieran
> > 
> >  drivers/media/media-entity.c | 2 ++
> >  1 file changed, 2 insertions(+)
> > 
> > diff --git a/drivers/media/media-entity.c b/drivers/media/media-entity.c
> > index c68239e60487..93c9cbf4bf46 100644
> > --- a/drivers/media/media-entity.c
> > +++ b/drivers/media/media-entity.c
> > @@ -508,6 +508,8 @@ void __media_entity_pipeline_stop(struct media_entity
> > *entity)> 
> >  	struct media_entity_graph *graph = &entity->pipe->graph;
> >  	struct media_pipeline *pipe = entity->pipe;
> > 
> > +	if (!pipe)
> > +		return;
> > 
> >  	WARN_ON(!pipe->streaming_count);
> >  	media_entity_graph_walk_start(graph, entity);

Sakari Ailus Dec. 15, 2016, 7:14 a.m. UTC | #7

On Wed, Dec 14, 2016 at 05:53:00PM +0000, Kieran Bingham wrote:
> Hi Sakari,
> 
> On 14/12/16 12:43, Sakari Ailus wrote:
> > Hi Kieran,
> > 
> > On Wed, Dec 14, 2016 at 12:27:37PM +0000, Kieran Bingham wrote:
> >> Hi Sakari,
> >>
> >> On 14/12/16 07:28, Sakari Ailus wrote:
> >>> Hi Kieran,
> >>>
> >>> On Tue, Dec 13, 2016 at 05:59:44PM +0000, Kieran Bingham wrote:
> >>>> media_entity_pipeline_stop() can be called through error paths with a
> >>>> NULL entity pipe object. In this instance, stopping is a no-op, so
> >>>> simply return without any action
> >>>
> >>> The approach of returning silently is wrong; the streaming count is indeed a
> >>> count: you have to decrement it the exactly same number of times it's been
> >>> increased. Code that attempts to call __media_entity_pipeline_stop() on an
> >>> entity that's not streaming is simply buggy.
> >>
> >> Ok, Thanks for the heads up on where to look, as I suspected we are in
> >> section B) of my comments below :D
> >>
> >>> I've got a patch here that adds a warning to graph traversal on streaming
> >>> count being zero. I sent a pull request including that some days ago:
> >>>
> >>> <URL:http://www.spinics.net/lists/linux-media/msg108980.html>
> >>> <URL:http://www.spinics.net/lists/linux-media/msg108995.html>
> >>
> >> Excellent, thanks, I've merged your branch into mine, and I'll
> >> investigate where the error is actually coming from.
> >>
> >> --
> >> Ok - so I've merged your patches, (and dropped this patch) but they
> >> don't fire any warning before I hit my null-pointer dereference in
> >> __media_pipeline_stop(), on the WARN_ON(!pipe->streaming_count);
> >>
> >> So I think this is a case of calling stop on an invalid entity rather
> >> than an unbalanced start/stop somehow:
> > 
> > Not necessarily. The pipe is set to NULL if the count goes to zero.
> > 
> > This check should be dropped, it's ill-placed anyway: if streaming count is
> > zero the pipe is expected to be NULL anyway in which case you get a NULL
> > pointer exception (that you saw there). With the check removed, you should
> > get the warning instead.
> 
> Ahh, yes, I'd missed the part there that was setting it to NULL.
> 
> However, it will still segfault ... (though hopefully after the warning)
> as the last line of the function states:
> 
> 	if (!--pipe->streaming_count)
> 		media_entity_graph_walk_cleanup(graph);
> 
> So we will hit a null deref on the pipe->streaming_count there, which
> still leaves an unbalanced stop as a kernel panic.

Right, indeed. The graph is part of the pipeline, so if pipe was NULL you
don't have graph either. How about changing the check to:

if (WARN_ON(!pipe))
	return;

Instead of removing it. Nothing in that function really makes sense if
there's no pipeline. The driver bug still exists. That's just to make the
framework behaving a little bit better in the presence of that sort of bug.

> 
> In fact, I've just tested removing that WARN_ON(!pipe->streaming_count);
>  - it doesn't even get that far - and segfaults in
> 
>  		media_entity_graph_walk_cleanup(graph);
> 
> [   80.916558] Unable to handle kernel NULL pointer dereference at
> virtual address 00000110
> ....
> [   81.769492] [<ffffff800853e278>] media_graph_walk_start+0x20/0xf0
> [   81.776615] [<ffffff800853e73c>] __media_pipeline_stop+0x3c/0xd8
> [   81.783644] [<ffffff800853e80c>] media_pipeline_stop+0x34/0x48
> [   81.790505] [<ffffff8008567ff0>] vsp1_du_setup_lif+0x68/0x5a8
> [   81.797279] [<ffffff80084de9d4>] rcar_du_vsp_disable+0x2c/0x38
> [   81.804137] [<ffffff80084da710>] rcar_du_crtc_stop+0x198/0x1e8
> [   81.810984] [<ffffff80084da780>] rcar_du_crtc_disable+0x20/0x38
> 
> due to the fact that 'graph' is dereferenced through the 'pipe'.
> 
> {
> 	/* Entity->pipe is NULL, therefore 'graph' is invalid */
> 	struct media_graph *graph = &entity->pipe->graph;
> 	struct media_pipeline *pipe = entity->pipe;
> 
> 	media_graph_walk_start(graph, entity);
> ...
> }
> 
> So I suspect we do still need a warning or check in there somewhere.
> Something to tell the developer that there is an unbalanced stop, would
> be much better than a panic/segfault...
> 
> (And of course, we still need to look at the actual unbalanced stop in
> vsp1_du_setup_lif!)

Indeed.

>

[PATCHv3,RFC,4/4] media: Catch null pipes on pipeline stop

Commit Message

Comments

Patch