[1/5] drm: Kernel Crash in drm_unlock

Message ID	1429798078-18987-2-git-send-email-peter.antoine@intel.com (mailing list archive)
State	New, archived
Headers	show Return-Path: <intel-gfx-bounces@lists.freedesktop.org> From: Peter Antoine <peter.antoine@intel.com> To: intel-gfx@lists.freedesktop.org Date: Thu, 23 Apr 2015 15:07:54 +0100 Message-Id: <1429798078-18987-2-git-send-email-peter.antoine@intel.com> In-Reply-To: <1429798078-18987-1-git-send-email-peter.antoine@intel.com> References: <1429798078-18987-1-git-send-email-peter.antoine@intel.com> Cc: airlied@redhat.com, dri-devel@lists.freedesktop.org, daniel.vetter@ffwll.ch Subject: [Intel-gfx] [PATCH 1/5] drm: Kernel Crash in drm_unlock Precedence: list MIME-Version: 1.0 Content-Type: text/plain; charset="utf-8" Content-Transfer-Encoding: base64 Errors-To: intel-gfx-bounces@lists.freedesktop.org Sender: "Intel-gfx" <intel-gfx-bounces@lists.freedesktop.org>

Peter Antoine April 23, 2015, 2:07 p.m. UTC

This patch fixes a possible kernel crash when drm_unlock (DRM_IOCTL_UNLOCK)
is called by a application that has not had a lock created by it. This
crash can be caused by any application from all users.

Issue: VIZ-5485
Signed-off-by: Peter Antoine <peter.antoine@intel.com>
---
 drivers/gpu/drm/drm_lock.c | 8 ++++++++
 1 file changed, 8 insertions(+)

Chris Wilson April 23, 2015, 2:19 p.m. UTC | #1

On Thu, Apr 23, 2015 at 03:07:54PM +0100, Peter Antoine wrote:
> This patch fixes a possible kernel crash when drm_unlock (DRM_IOCTL_UNLOCK)
> is called by a application that has not had a lock created by it. This
> crash can be caused by any application from all users.

Crashing the application is still a crash...
-Chris

Peter Antoine April 23, 2015, 2:34 p.m. UTC | #2

Before the patch the system required rebooting (driver crash and/or kernel panic).
Now the application gets terminated.

This follows the pattern in the other parts of this code that checks that pointer.

Peter.

-----Original Message-----
From: Chris Wilson [mailto:chris@chris-wilson.co.uk] 
Sent: Thursday, April 23, 2015 3:20 PM
To: Antoine, Peter
Cc: intel-gfx@lists.freedesktop.org; airlied@redhat.com; dri-devel@lists.freedesktop.org; daniel.vetter@ffwll.ch
Subject: Re: [Intel-gfx] [PATCH 1/5] drm: Kernel Crash in drm_unlock

On Thu, Apr 23, 2015 at 03:07:54PM +0100, Peter Antoine wrote:
> This patch fixes a possible kernel crash when drm_unlock 
> (DRM_IOCTL_UNLOCK) is called by a application that has not had a lock 
> created by it. This crash can be caused by any application from all users.

Crashing the application is still a crash...
-Chris

--
Chris Wilson, Intel Open Source Technology Centre

Chris Wilson April 23, 2015, 2:39 p.m. UTC | #3

On Thu, Apr 23, 2015 at 02:34:24PM +0000, Antoine, Peter wrote:
> Before the patch the system required rebooting (driver crash and/or kernel panic).
> Now the application gets terminated.

It should have an GPF which should not have required a reboot, but
terminated the application. Unless you set it to automatically reboot.
-Chris

Peter Antoine April 24, 2015, 5:52 a.m. UTC | #4

I picked up this work due to the following Jira ticket created by the
security team (on Android) and was asked to give it a second look and
found a few more issues with the hw lock code.

https://jira01.devtools.intel.com/browse/GMINL-5388
I/O control on /dev/dri/card0 crashes the kernel (0x4008642b)

It also stops Linux as it kills the driver, I guess it might be possible
to reload the gfx driver. On a unpatched system the test that is
included in the issue or the igt test that has been posted for the issue
will show the problem.

I ran the test on an unpatched system here and the gui stopped and the
keyboard stopped responding, so I rebooted. With the patched system I
did not need to reboot.

Should I change the SIGTERM to SIGSEGV, not quite the same thing but
tooling is better at handling a segfault than a SIGTERM and the
application that calls this IOCTL is using an uninitialised hw lock so
it is kind of the same as differencing an uninitialised pointer (kind
of). Or, I could just remove it, but the bug has been in the code for at
least two years (and known about), and I would guess that any code that
is calling this is fuzzing the IOCTLs (as this is how the security team
found it) and we should reward them with a application exit.

Peter. 

On Thu, 2015-04-23 at 15:39 +0100, Chris Wilson wrote:
> On Thu, Apr 23, 2015 at 02:34:24PM +0000, Antoine, Peter wrote:

> > Before the patch the system required rebooting (driver crash and/or kernel panic).

> > Now the application gets terminated.

> 

> It should have an GPF which should not have required a reboot, but

> terminated the application. Unless you set it to automatically reboot.

> -Chris

>

Dave Gordon April 28, 2015, 9:21 a.m. UTC | #5

On 24/04/15 06:52, Antoine, Peter wrote:
> I picked up this work due to the following Jira ticket created by the
> security team (on Android) and was asked to give it a second look and
> found a few more issues with the hw lock code.
> 
> https://jira01.devtools.intel.com/browse/GMINL-5388
> I/O control on /dev/dri/card0 crashes the kernel (0x4008642b)
> 
> It also stops Linux as it kills the driver, I guess it might be possible
> to reload the gfx driver. On a unpatched system the test that is
> included in the issue or the igt test that has been posted for the issue
> will show the problem.
> 
> I ran the test on an unpatched system here and the gui stopped and the
> keyboard stopped responding, so I rebooted. With the patched system I
> did not need to reboot.
> 
> Should I change the SIGTERM to SIGSEGV, not quite the same thing but
> tooling is better at handling a segfault than a SIGTERM and the
> application that calls this IOCTL is using an uninitialised hw lock so
> it is kind of the same as differencing an uninitialised pointer (kind
> of). Or, I could just remove it, but the bug has been in the code for at
> least two years (and known about), and I would guess that any code that
> is calling this is fuzzing the IOCTLs (as this is how the security team
> found it) and we should reward them with a application exit.
> 
> Peter. 

SIGSEGV would be a better choice.

SIGTERM is normally sent by a user -- it's the default signal sent by
kill(1). It's also commonly used to tell a long-running daemon process
to tidy up and exit cleanly.

SIGSEGV commonly means "you accessed something that doesn't exist/isn't
mapped/you don't have permissions for". There are specific subcases that
can be indicated via the siginfo data; this is from the sigaction(1)
manpage:

    The following values can be placed in si_code for a SIGSEGV signal:

        SEGV_MAPERR    address not mapped to object

        SEGV_ACCERR    invalid permissions for mapped object

SIGBUS would also be a possibility but that's generally taken to mean
that an access got all the way to some physical bus and then faulted,
whereas SIGSEGV suggests the access was rejected during the
virtual-to-physical mapping process.

.Dave.

Chris Wilson April 28, 2015, 9:52 a.m. UTC | #6

On Tue, Apr 28, 2015 at 10:21:49AM +0100, Dave Gordon wrote:
> On 24/04/15 06:52, Antoine, Peter wrote:
> > I picked up this work due to the following Jira ticket created by the
> > security team (on Android) and was asked to give it a second look and
> > found a few more issues with the hw lock code.
> > 
> > https://jira01.devtools.intel.com/browse/GMINL-5388
> > I/O control on /dev/dri/card0 crashes the kernel (0x4008642b)
> > 
> > It also stops Linux as it kills the driver, I guess it might be possible
> > to reload the gfx driver. On a unpatched system the test that is
> > included in the issue or the igt test that has been posted for the issue
> > will show the problem.
> > 
> > I ran the test on an unpatched system here and the gui stopped and the
> > keyboard stopped responding, so I rebooted. With the patched system I
> > did not need to reboot.
> > 
> > Should I change the SIGTERM to SIGSEGV, not quite the same thing but
> > tooling is better at handling a segfault than a SIGTERM and the
> > application that calls this IOCTL is using an uninitialised hw lock so
> > it is kind of the same as differencing an uninitialised pointer (kind
> > of). Or, I could just remove it, but the bug has been in the code for at
> > least two years (and known about), and I would guess that any code that
> > is calling this is fuzzing the IOCTLs (as this is how the security team
> > found it) and we should reward them with a application exit.
> > 
> > Peter. 
> 
> SIGSEGV would be a better choice.
> 
> SIGTERM is normally sent by a user -- it's the default signal sent by
> kill(1). It's also commonly used to tell a long-running daemon process
> to tidy up and exit cleanly.
> 
> SIGSEGV commonly means "you accessed something that doesn't exist/isn't
> mapped/you don't have permissions for". There are specific subcases that
> can be indicated via the siginfo data; this is from the sigaction(1)
> manpage:
> 
>     The following values can be placed in si_code for a SIGSEGV signal:
> 
>         SEGV_MAPERR    address not mapped to object
> 
>         SEGV_ACCERR    invalid permissions for mapped object
> 
> SIGBUS would also be a possibility but that's generally taken to mean
> that an access got all the way to some physical bus and then faulted,
> whereas SIGSEGV suggests the access was rejected during the
> virtual-to-physical mapping process.

None of the above. Just return -EINVAL, -EPERM, -EACCESS as appropriate.
-Chris

Dave Gordon April 28, 2015, 2:56 p.m. UTC | #7

On 28/04/15 10:21, Dave Gordon wrote:
> On 24/04/15 06:52, Antoine, Peter wrote:
>> I picked up this work due to the following Jira ticket created by the
>> security team (on Android) and was asked to give it a second look and
>> found a few more issues with the hw lock code.
>>
>> https://jira01.devtools.intel.com/browse/GMINL-5388
>> I/O control on /dev/dri/card0 crashes the kernel (0x4008642b)
>>
>> It also stops Linux as it kills the driver, I guess it might be possible
>> to reload the gfx driver. On a unpatched system the test that is
>> included in the issue or the igt test that has been posted for the issue
>> will show the problem.
>>
>> I ran the test on an unpatched system here and the gui stopped and the
>> keyboard stopped responding, so I rebooted. With the patched system I
>> did not need to reboot.
>>
>> Should I change the SIGTERM to SIGSEGV, not quite the same thing but
>> tooling is better at handling a segfault than a SIGTERM and the
>> application that calls this IOCTL is using an uninitialised hw lock so
>> it is kind of the same as differencing an uninitialised pointer (kind
>> of). Or, I could just remove it, but the bug has been in the code for at
>> least two years (and known about), and I would guess that any code that
>> is calling this is fuzzing the IOCTLs (as this is how the security team
>> found it) and we should reward them with a application exit.
>>
>> Peter. 
> 
> SIGSEGV would be a better choice.

[snip]

Nope, I've changed my mind about this. I thought the problematic case
was just a process releasing the lock without having acquired it, but on
further examination it's really that the DRM master process has gone
away or otherwise deleted (or never created?) the lock. And THEN the
(non-master?) process tries to release the (now) non-existent lock.

But more importantly, there's existing code in the acquire-lock path
that sends SIGTERM for this case, so obviously the release-lock code
should be as similar as possible.

.Dave.

Daniel Vetter May 4, 2015, 1:52 p.m. UTC | #8

On Tue, Apr 28, 2015 at 10:52:32AM +0100, chris@chris-wilson.co.uk wrote:
> On Tue, Apr 28, 2015 at 10:21:49AM +0100, Dave Gordon wrote:
> > On 24/04/15 06:52, Antoine, Peter wrote:
> > > I picked up this work due to the following Jira ticket created by the
> > > security team (on Android) and was asked to give it a second look and
> > > found a few more issues with the hw lock code.
> > > 
> > > https://jira01.devtools.intel.com/browse/GMINL-5388
> > > I/O control on /dev/dri/card0 crashes the kernel (0x4008642b)
> > > 
> > > It also stops Linux as it kills the driver, I guess it might be possible
> > > to reload the gfx driver. On a unpatched system the test that is
> > > included in the issue or the igt test that has been posted for the issue
> > > will show the problem.
> > > 
> > > I ran the test on an unpatched system here and the gui stopped and the
> > > keyboard stopped responding, so I rebooted. With the patched system I
> > > did not need to reboot.
> > > 
> > > Should I change the SIGTERM to SIGSEGV, not quite the same thing but
> > > tooling is better at handling a segfault than a SIGTERM and the
> > > application that calls this IOCTL is using an uninitialised hw lock so
> > > it is kind of the same as differencing an uninitialised pointer (kind
> > > of). Or, I could just remove it, but the bug has been in the code for at
> > > least two years (and known about), and I would guess that any code that
> > > is calling this is fuzzing the IOCTLs (as this is how the security team
> > > found it) and we should reward them with a application exit.
> > > 
> > > Peter. 
> > 
> > SIGSEGV would be a better choice.
> > 
> > SIGTERM is normally sent by a user -- it's the default signal sent by
> > kill(1). It's also commonly used to tell a long-running daemon process
> > to tidy up and exit cleanly.
> > 
> > SIGSEGV commonly means "you accessed something that doesn't exist/isn't
> > mapped/you don't have permissions for". There are specific subcases that
> > can be indicated via the siginfo data; this is from the sigaction(1)
> > manpage:
> > 
> >     The following values can be placed in si_code for a SIGSEGV signal:
> > 
> >         SEGV_MAPERR    address not mapped to object
> > 
> >         SEGV_ACCERR    invalid permissions for mapped object
> > 
> > SIGBUS would also be a possibility but that's generally taken to mean
> > that an access got all the way to some physical bus and then faulted,
> > whereas SIGSEGV suggests the access was rejected during the
> > virtual-to-physical mapping process.
> 
> None of the above. Just return -EINVAL, -EPERM, -EACCESS as appropriate.

Seconded, we really don't want to be in the business of fixing up the drm
design mistakes of the past 15 years. As long as we can fully lock out
this particular dragon when running i915 we're imo good enough. The dri1
design of a kernel shim driver cooperating with the ums driver for hw
ownership is fundamentally unfixable.

Also we can't change any of it for drivers actually using it since it'll
break them, which is a big no-go.
-Daniel

Peter Antoine May 5, 2015, 6:37 a.m. UTC | #9

On Mon, 2015-05-04 at 15:52 +0200, Daniel Vetter wrote:
> On Tue, Apr 28, 2015 at 10:52:32AM +0100, chris@chris-wilson.co.uk wrote:

> > On Tue, Apr 28, 2015 at 10:21:49AM +0100, Dave Gordon wrote:

> > > On 24/04/15 06:52, Antoine, Peter wrote:

> > > > I picked up this work due to the following Jira ticket created by the

> > > > security team (on Android) and was asked to give it a second look and

> > > > found a few more issues with the hw lock code.

> > > > 

> > > > https://jira01.devtools.intel.com/browse/GMINL-5388

> > > > I/O control on /dev/dri/card0 crashes the kernel (0x4008642b)

> > > > 

> > > > It also stops Linux as it kills the driver, I guess it might be possible

> > > > to reload the gfx driver. On a unpatched system the test that is

> > > > included in the issue or the igt test that has been posted for the issue

> > > > will show the problem.

> > > > 

> > > > I ran the test on an unpatched system here and the gui stopped and the

> > > > keyboard stopped responding, so I rebooted. With the patched system I

> > > > did not need to reboot.

> > > > 

> > > > Should I change the SIGTERM to SIGSEGV, not quite the same thing but

> > > > tooling is better at handling a segfault than a SIGTERM and the

> > > > application that calls this IOCTL is using an uninitialised hw lock so

> > > > it is kind of the same as differencing an uninitialised pointer (kind

> > > > of). Or, I could just remove it, but the bug has been in the code for at

> > > > least two years (and known about), and I would guess that any code that

> > > > is calling this is fuzzing the IOCTLs (as this is how the security team

> > > > found it) and we should reward them with a application exit.

> > > > 

> > > > Peter. 

> > > 

> > > SIGSEGV would be a better choice.

> > > 

> > > SIGTERM is normally sent by a user -- it's the default signal sent by

> > > kill(1). It's also commonly used to tell a long-running daemon process

> > > to tidy up and exit cleanly.

> > > 

> > > SIGSEGV commonly means "you accessed something that doesn't exist/isn't

> > > mapped/you don't have permissions for". There are specific subcases that

> > > can be indicated via the siginfo data; this is from the sigaction(1)

> > > manpage:

> > > 

> > >     The following values can be placed in si_code for a SIGSEGV signal:

> > > 

> > >         SEGV_MAPERR    address not mapped to object

> > > 

> > >         SEGV_ACCERR    invalid permissions for mapped object

> > > 

> > > SIGBUS would also be a possibility but that's generally taken to mean

> > > that an access got all the way to some physical bus and then faulted,

> > > whereas SIGSEGV suggests the access was rejected during the

> > > virtual-to-physical mapping process.

> > 

> > None of the above. Just return -EINVAL, -EPERM, -EACCESS as appropriate.

> 

> Seconded, we really don't want to be in the business of fixing up the drm

> design mistakes of the past 15 years. As long as we can fully lock out

> this particular dragon when running i915 we're imo good enough. The dri1

> design of a kernel shim driver cooperating with the ums driver for hw

> ownership is fundamentally unfixable.

> 

> Also we can't change any of it for drivers actually using it since it'll

> break them, which is a big no-go.

> -Daniel


I will remove it. But, If you are using this code path the driver/kernel
will have crashed. It covers a NULL pointer deference, so we are not
changing the API that anyone is actually using.

Peter.

Daniel Vetter May 5, 2015, 7:20 a.m. UTC | #10

On Tue, May 05, 2015 at 06:37:51AM +0000, Antoine, Peter wrote:
> On Mon, 2015-05-04 at 15:52 +0200, Daniel Vetter wrote:
> > On Tue, Apr 28, 2015 at 10:52:32AM +0100, chris@chris-wilson.co.uk wrote:
> > > On Tue, Apr 28, 2015 at 10:21:49AM +0100, Dave Gordon wrote:
> > > > On 24/04/15 06:52, Antoine, Peter wrote:
> > > > > I picked up this work due to the following Jira ticket created by the
> > > > > security team (on Android) and was asked to give it a second look and
> > > > > found a few more issues with the hw lock code.
> > > > > 
> > > > > https://jira01.devtools.intel.com/browse/GMINL-5388
> > > > > I/O control on /dev/dri/card0 crashes the kernel (0x4008642b)
> > > > > 
> > > > > It also stops Linux as it kills the driver, I guess it might be possible
> > > > > to reload the gfx driver. On a unpatched system the test that is
> > > > > included in the issue or the igt test that has been posted for the issue
> > > > > will show the problem.
> > > > > 
> > > > > I ran the test on an unpatched system here and the gui stopped and the
> > > > > keyboard stopped responding, so I rebooted. With the patched system I
> > > > > did not need to reboot.
> > > > > 
> > > > > Should I change the SIGTERM to SIGSEGV, not quite the same thing but
> > > > > tooling is better at handling a segfault than a SIGTERM and the
> > > > > application that calls this IOCTL is using an uninitialised hw lock so
> > > > > it is kind of the same as differencing an uninitialised pointer (kind
> > > > > of). Or, I could just remove it, but the bug has been in the code for at
> > > > > least two years (and known about), and I would guess that any code that
> > > > > is calling this is fuzzing the IOCTLs (as this is how the security team
> > > > > found it) and we should reward them with a application exit.
> > > > > 
> > > > > Peter. 
> > > > 
> > > > SIGSEGV would be a better choice.
> > > > 
> > > > SIGTERM is normally sent by a user -- it's the default signal sent by
> > > > kill(1). It's also commonly used to tell a long-running daemon process
> > > > to tidy up and exit cleanly.
> > > > 
> > > > SIGSEGV commonly means "you accessed something that doesn't exist/isn't
> > > > mapped/you don't have permissions for". There are specific subcases that
> > > > can be indicated via the siginfo data; this is from the sigaction(1)
> > > > manpage:
> > > > 
> > > >     The following values can be placed in si_code for a SIGSEGV signal:
> > > > 
> > > >         SEGV_MAPERR    address not mapped to object
> > > > 
> > > >         SEGV_ACCERR    invalid permissions for mapped object
> > > > 
> > > > SIGBUS would also be a possibility but that's generally taken to mean
> > > > that an access got all the way to some physical bus and then faulted,
> > > > whereas SIGSEGV suggests the access was rejected during the
> > > > virtual-to-physical mapping process.
> > > 
> > > None of the above. Just return -EINVAL, -EPERM, -EACCESS as appropriate.
> > 
> > Seconded, we really don't want to be in the business of fixing up the drm
> > design mistakes of the past 15 years. As long as we can fully lock out
> > this particular dragon when running i915 we're imo good enough. The dri1
> > design of a kernel shim driver cooperating with the ums driver for hw
> > ownership is fundamentally unfixable.
> > 
> > Also we can't change any of it for drivers actually using it since it'll
> > break them, which is a big no-go.
> > -Daniel
> 
> I will remove it. But, If you are using this code path the driver/kernel
> will have crashed. It covers a NULL pointer deference, so we are not
> changing the API that anyone is actually using.

With ums/dri1 userspace can crash the kernel. That's pretty much by
design, and it's pretty much unfixable. Trying to fix up the obvious ones
like here is just cargo-culting. The only real option is to make
absolutely sure we can't ever reach these codepaths at all from kms
drivers like i915. Which tends to be a lot more work for auditing, but
will be much more useful.
-Daniel

[1/5] drm: Kernel Crash in drm_unlock

Commit Message

Comments

Patch