diff mbox

KVM: release anon file in failure path of vm creation

Message ID 20160714164647.GD14480@ZenIV.linux.org.uk (mailing list archive)
State New, archived
Headers show

Commit Message

Al Viro July 14, 2016, 4:46 p.m. UTC
On Tue, Jul 12, 2016 at 12:24:39PM +0200, Paolo Bonzini wrote:

> On 12/07/2016 11:38, Liu Shuo wrote:
> > The failure of create debugfs of VM will return directly without release
> > the anon file. It will leak memory and file descriptors, even through
> > be not serious.
> > 
> > Signed-off-by: Liu Shuo <shuo.a.liu@intel.com>

This is broken.

> > @@ -3067,6 +3068,7 @@ static int kvm_dev_ioctl_create_vm(unsigned long type)
> >  
> >  	if (kvm_create_vm_debugfs(kvm, r) < 0) {
> >  		kvm_put_kvm(kvm);
> > +		sys_close(r);

You have no warranty whatsoever that descriptor table has not been changed
by that point.  You should *NEVER* use sys_close() on failure exit paths
like that.  Moreover, this kvm_put_kvm() becomes a double-put, since
closing the damn file will drop that reference to kvm.

> >  		return -ENOMEM;
> >  	}
> >  
> > 
> 
> Thanks, applied to kvm/master.

Please, revert.  anon_inode_getfd() should be used only when there's no
possible failures past its call.  What you need in this case (due to fucking
ugly API - decriptor number used as part of debugfs pathname) is something
like this (against mainline):

[kvm] don't use anon_inode_getfd() before possible failures

Once anon_inode_getfd() has succeeded, it's impossible to undo in a clean
way and no, sys_close() is not usable in such cases.  Use anon_inode_getfile()
and get_unused_fd_flags() to get struct file and descriptor and do *not*
install the file into descriptor table until after the last possible failure
exit.

Signed-off-by: Al Viro <viro@zeniv.linux.org.uk>
---
--
To unsubscribe from this list: send the line "unsubscribe kvm" in
the body of a message to majordomo@vger.kernel.org
More majordomo info at  http://vger.kernel.org/majordomo-info.html

Comments

Paolo Bonzini July 14, 2016, 4:56 p.m. UTC | #1
On 14/07/2016 18:46, Al Viro wrote:
> On Tue, Jul 12, 2016 at 12:24:39PM +0200, Paolo Bonzini wrote:
> 
>> On 12/07/2016 11:38, Liu Shuo wrote:
>>> The failure of create debugfs of VM will return directly without release
>>> the anon file. It will leak memory and file descriptors, even through
>>> be not serious.
>>>
>>> Signed-off-by: Liu Shuo <shuo.a.liu@intel.com>
> 
> This is broken.
> 
>>> @@ -3067,6 +3068,7 @@ static int kvm_dev_ioctl_create_vm(unsigned long type)
>>>  
>>>  	if (kvm_create_vm_debugfs(kvm, r) < 0) {
>>>  		kvm_put_kvm(kvm);
>>> +		sys_close(r);
> 
> You have no warranty whatsoever that descriptor table has not been changed
> by that point.  You should *NEVER* use sys_close() on failure exit paths
> like that.  Moreover, this kvm_put_kvm() becomes a double-put, since
> closing the damn file will drop that reference to kvm.
> 
>>>  		return -ENOMEM;
>>>  	}
>>>  
>>>
>>
>> Thanks, applied to kvm/master.
> 
> Please, revert.  anon_inode_getfd() should be used only when there's no
> possible failures past its call.  What you need in this case (due to fucking
> ugly API - decriptor number used as part of debugfs pathname) is something
> like this (against mainline):

Al,

thanks very much for the explanation and the patch.  I've reverted the
original patch and installed yours instead.

Paolo

> [kvm] don't use anon_inode_getfd() before possible failures
> 
> Once anon_inode_getfd() has succeeded, it's impossible to undo in a clean
> way and no, sys_close() is not usable in such cases.  Use anon_inode_getfile()
> and get_unused_fd_flags() to get struct file and descriptor and do *not*
> install the file into descriptor table until after the last possible failure
> exit.
> 
> Signed-off-by: Al Viro <viro@zeniv.linux.org.uk>
> ---
> diff --git a/virt/kvm/kvm_main.c b/virt/kvm/kvm_main.c
> index 48bd520..7f82c6c 100644
> --- a/virt/kvm/kvm_main.c
> +++ b/virt/kvm/kvm_main.c
> @@ -3048,6 +3048,7 @@ static int kvm_dev_ioctl_create_vm(unsigned long type)
>  {
>  	int r;
>  	struct kvm *kvm;
> +	struct file *file;
>  
>  	kvm = kvm_create_vm(type);
>  	if (IS_ERR(kvm))
> @@ -3059,17 +3060,25 @@ static int kvm_dev_ioctl_create_vm(unsigned long type)
>  		return r;
>  	}
>  #endif
> -	r = anon_inode_getfd("kvm-vm", &kvm_vm_fops, kvm, O_RDWR | O_CLOEXEC);
> +	r = get_unused_fd_flags(O_CLOEXEC);
>  	if (r < 0) {
>  		kvm_put_kvm(kvm);
>  		return r;
>  	}
> +	file = anon_inode_getfile("kvm-vm", &kvm_vm_fops, kvm, O_RDWR);
> +	if (IS_ERR(file)) {
> +		put_unused_fd(r);
> +		kvm_put_kvm(kvm);
> +		return PTR_ERR(file);
> +	}
>  
>  	if (kvm_create_vm_debugfs(kvm, r) < 0) {
> -		kvm_put_kvm(kvm);
> +		put_unused_fd(r);
> +		fput(file);
>  		return -ENOMEM;
>  	}
>  
> +	fd_install(r, file);
>  	return r;git
>  }
>  
> 
--
To unsubscribe from this list: send the line "unsubscribe kvm" in
the body of a message to majordomo@vger.kernel.org
More majordomo info at  http://vger.kernel.org/majordomo-info.html
Liu Shuo July 15, 2016, 2:22 a.m. UTC | #2
On Thu 14.Jul'16 at 17:46:47 +0100, Al Viro wrote:
>On Tue, Jul 12, 2016 at 12:24:39PM +0200, Paolo Bonzini wrote:
>
>> On 12/07/2016 11:38, Liu Shuo wrote:
>> > The failure of create debugfs of VM will return directly without release
>> > the anon file. It will leak memory and file descriptors, even through
>> > be not serious.
>> >
>> > Signed-off-by: Liu Shuo <shuo.a.liu@intel.com>
>
>This is broken.
>
>> > @@ -3067,6 +3068,7 @@ static int kvm_dev_ioctl_create_vm(unsigned long type)
>> >
>> >  	if (kvm_create_vm_debugfs(kvm, r) < 0) {
>> >  		kvm_put_kvm(kvm);
>> > +		sys_close(r);
>
>You have no warranty whatsoever that descriptor table has not been changed
>by that point.  You should *NEVER* use sys_close() on failure exit paths
Could you please elaborate why we're not sure descriptor table's changing at the point?
>like that.  Moreover, this kvm_put_kvm() becomes a double-put, since
>closing the damn file will drop that reference to kvm.
You're right. Thanks for the correction on this point.
>
>> >  		return -ENOMEM;
>> >  	}
>> >
>> >
>>
>> Thanks, applied to kvm/master.
>
>Please, revert.  anon_inode_getfd() should be used only when there's no
>possible failures past its call.  What you need in this case (due to fucking
>ugly API - decriptor number used as part of debugfs pathname) is something
>like this (against mainline):
>
>[kvm] don't use anon_inode_getfd() before possible failures
>
>Once anon_inode_getfd() has succeeded, it's impossible to undo in a clean
>way and no, sys_close() is not usable in such cases.  Use anon_inode_getfile()
>and get_unused_fd_flags() to get struct file and descriptor and do *not*
>install the file into descriptor table until after the last possible failure
>exit.
>
>Signed-off-by: Al Viro <viro@zeniv.linux.org.uk>
Thanks. It's better.
>---
>diff --git a/virt/kvm/kvm_main.c b/virt/kvm/kvm_main.c
>index 48bd520..7f82c6c 100644
>--- a/virt/kvm/kvm_main.c
>+++ b/virt/kvm/kvm_main.c
>@@ -3048,6 +3048,7 @@ static int kvm_dev_ioctl_create_vm(unsigned long type)
> {
> 	int r;
> 	struct kvm *kvm;
>+	struct file *file;
>
> 	kvm = kvm_create_vm(type);
> 	if (IS_ERR(kvm))
>@@ -3059,17 +3060,25 @@ static int kvm_dev_ioctl_create_vm(unsigned long type)
> 		return r;
> 	}
> #endif
>-	r = anon_inode_getfd("kvm-vm", &kvm_vm_fops, kvm, O_RDWR | O_CLOEXEC);
>+	r = get_unused_fd_flags(O_CLOEXEC);
> 	if (r < 0) {
> 		kvm_put_kvm(kvm);
> 		return r;
> 	}
>+	file = anon_inode_getfile("kvm-vm", &kvm_vm_fops, kvm, O_RDWR);
>+	if (IS_ERR(file)) {
>+		put_unused_fd(r);
>+		kvm_put_kvm(kvm);
>+		return PTR_ERR(file);
>+	}
>
> 	if (kvm_create_vm_debugfs(kvm, r) < 0) {
>-		kvm_put_kvm(kvm);
>+		put_unused_fd(r);
>+		fput(file);
> 		return -ENOMEM;
> 	}
>
>+	fd_install(r, file);
> 	return r;
> }
>
--
To unsubscribe from this list: send the line "unsubscribe kvm" in
the body of a message to majordomo@vger.kernel.org
More majordomo info at  http://vger.kernel.org/majordomo-info.html
Al Viro July 15, 2016, 2:26 a.m. UTC | #3
On Fri, Jul 15, 2016 at 10:22:04AM +0800, Liu Shuo wrote:
> > You have no warranty whatsoever that descriptor table has not been changed
> > by that point.  You should *NEVER* use sys_close() on failure exit paths
> Could you please elaborate why we're not sure descriptor table's changing at the point?

Because that could be called by one thread while another (having guessed the
descriptor you are about to get) does close()/dup2()/etc.
--
To unsubscribe from this list: send the line "unsubscribe kvm" in
the body of a message to majordomo@vger.kernel.org
More majordomo info at  http://vger.kernel.org/majordomo-info.html
Liu Shuo July 15, 2016, 3:18 a.m. UTC | #4
On Fri 15.Jul'16 at  3:26:03 +0100, Al Viro wrote:
>On Fri, Jul 15, 2016 at 10:22:04AM +0800, Liu Shuo wrote:
>> > You have no warranty whatsoever that descriptor table has not been changed
>> > by that point.  You should *NEVER* use sys_close() on failure exit paths
>> Could you please elaborate why we're not sure descriptor table's changing at the point?
>
>Because that could be called by one thread while another (having guessed the
>descriptor you are about to get) does close()/dup2()/etc.
If there is no such thread (who operates the descriptor based on
guessing), i can think the changing is safe at the point. As the fd has
not been delivered to userspace. Am i right?
--
To unsubscribe from this list: send the line "unsubscribe kvm" in
the body of a message to majordomo@vger.kernel.org
More majordomo info at  http://vger.kernel.org/majordomo-info.html
Al Viro July 15, 2016, 5:09 a.m. UTC | #5
On Fri, Jul 15, 2016 at 11:18:41AM +0800, Liu Shuo wrote:

> If there is no such thread (who operates the descriptor based on
> guessing), i can think the changing is safe at the point. As the fd has
> not been delivered to userspace. Am i right?

<wry>
Expecting nice behaviour from userland code is something best avoided, really.
</wry>

All jokes aside, this other thread doesn't have to be malicious - just being
buggy would suffice.  Besides, you never know if something like userns won't
be dumped into the kernel, making your ioctl accessible to genuinely
malicious code.

The only sane approach is to treat descriptor tables as shared data structures
and postpone the insertion of struct file reference into descriptor table
until you are past all failure exits.  Including the ones related to copying
to userland - e.g. pipe(2) creates a pipe, sets up two struct file associated
with it, reserves two descriptors, copies them into userland array and only if
everything has succeeded proceeds to fd_install().  In your case passing the
descriptor to userland is not an issue (return value of ioctl(2) goes via
register and that can't fail), so the last failure exit is that after failed
attempt to create debugfs stuff.  We have to reserve the descriptor before
that (it's used as a part of debugfs directory name), so anon_inode_getfd()
is not an option - it combines reserving descriptor with fd_install().
Such situations are exactly the reason why anon_inode_getfile() is there;
anon_inode_getfd() is usable only when it is the very last thing we do
before returning the descriptor to userland.

FWIW, original code was not unreasonable - it simply treated debugfs stuff as
optional and ignored those failures.  That way anon_inode_getfd() is fine -
there's no failure exits after it.  If we want to fail when debugfs had
been enabled and we'd failed to populate it, we need to use the real primitives
behind anon_inode_getfd(), though.
--
To unsubscribe from this list: send the line "unsubscribe kvm" in
the body of a message to majordomo@vger.kernel.org
More majordomo info at  http://vger.kernel.org/majordomo-info.html
Liu Shuo July 15, 2016, 7:04 a.m. UTC | #6
On Fri 15.Jul'16 at  6:09:59 +0100, Al Viro wrote:
>On Fri, Jul 15, 2016 at 11:18:41AM +0800, Liu Shuo wrote:
>
>> If there is no such thread (who operates the descriptor based on
>> guessing), i can think the changing is safe at the point. As the fd has
>> not been delivered to userspace. Am i right?
>
><wry>
>Expecting nice behaviour from userland code is something best avoided, really.
></wry>
Got it! :)
>
>All jokes aside, this other thread doesn't have to be malicious - just being
>buggy would suffice.  Besides, you never know if something like userns won't
>be dumped into the kernel, making your ioctl accessible to genuinely
>malicious code.
>
>The only sane approach is to treat descriptor tables as shared data structures
>and postpone the insertion of struct file reference into descriptor table
>until you are past all failure exits.  Including the ones related to copying
>to userland - e.g. pipe(2) creates a pipe, sets up two struct file associated
>with it, reserves two descriptors, copies them into userland array and only if
>everything has succeeded proceeds to fd_install().  In your case passing the
>descriptor to userland is not an issue (return value of ioctl(2) goes via
>register and that can't fail), so the last failure exit is that after failed
>attempt to create debugfs stuff.  We have to reserve the descriptor before
>that (it's used as a part of debugfs directory name), so anon_inode_getfd()
>is not an option - it combines reserving descriptor with fd_install().
>Such situations are exactly the reason why anon_inode_getfile() is there;
>anon_inode_getfd() is usable only when it is the very last thing we do
>before returning the descriptor to userland.
Really appreciate your exhaustive explanations. Thanks.
>
>FWIW, original code was not unreasonable - it simply treated debugfs stuff as
>optional and ignored those failures.  That way anon_inode_getfd() is fine -
>there's no failure exits after it.  If we want to fail when debugfs had
>been enabled and we'd failed to populate it, we need to use the real primitives
>behind anon_inode_getfd(), though.
So, my firstly immature idea returns back.
Is the fd used by debugfs name very useful? Do we really need it here?
Alternative approach is putting the fd into the debugfs, then the code
will be clean here, as we can put anon_inode_getfd at very last of the
ioctl.
--
To unsubscribe from this list: send the line "unsubscribe kvm" in
the body of a message to majordomo@vger.kernel.org
More majordomo info at  http://vger.kernel.org/majordomo-info.html
diff mbox

Patch

diff --git a/virt/kvm/kvm_main.c b/virt/kvm/kvm_main.c
index 48bd520..7f82c6c 100644
--- a/virt/kvm/kvm_main.c
+++ b/virt/kvm/kvm_main.c
@@ -3048,6 +3048,7 @@  static int kvm_dev_ioctl_create_vm(unsigned long type)
 {
 	int r;
 	struct kvm *kvm;
+	struct file *file;
 
 	kvm = kvm_create_vm(type);
 	if (IS_ERR(kvm))
@@ -3059,17 +3060,25 @@  static int kvm_dev_ioctl_create_vm(unsigned long type)
 		return r;
 	}
 #endif
-	r = anon_inode_getfd("kvm-vm", &kvm_vm_fops, kvm, O_RDWR | O_CLOEXEC);
+	r = get_unused_fd_flags(O_CLOEXEC);
 	if (r < 0) {
 		kvm_put_kvm(kvm);
 		return r;
 	}
+	file = anon_inode_getfile("kvm-vm", &kvm_vm_fops, kvm, O_RDWR);
+	if (IS_ERR(file)) {
+		put_unused_fd(r);
+		kvm_put_kvm(kvm);
+		return PTR_ERR(file);
+	}
 
 	if (kvm_create_vm_debugfs(kvm, r) < 0) {
-		kvm_put_kvm(kvm);
+		put_unused_fd(r);
+		fput(file);
 		return -ENOMEM;
 	}
 
+	fd_install(r, file);
 	return r;
 }