diff mbox series

block: fix error handling for device_add_disk

Message ID c614deb3-ce75-635e-a311-4f4fc7aa26e3@i-love.sakura.ne.jp (mailing list archive)
State New, archived
Headers show
Series block: fix error handling for device_add_disk | expand

Commit Message

Tetsuo Handa Dec. 16, 2021, 4 p.m. UTC
syzbot is reporting double kfree() bug in disk_release_events() [1], for
commit 9be68dd7ac0e13be ("md: add error handling support for add_disk()")
is calling blk_cleanup_disk() which will call disk_release_events() from
regular kobject_release() path when device_add_disk() from add_disk()
failed.

Since kobject_release() will be always called regardless of whether
device_add_disk() from add_disk() succeeds, we should leave
disk_release_events() to regular kobject_release() path.

Link: https://syzkaller.appspot.com/bug?extid=28a66a9fbc621c939000 [1]
Reported-by: syzbot <syzbot+28a66a9fbc621c939000@syzkaller.appspotmail.com>
Tested-by: syzbot <syzbot+28a66a9fbc621c939000@syzkaller.appspotmail.com>
Fixes: 83cbce9574462c6b ("block: add error handling for device_add_disk / add_disk")
Signed-off-by: Tetsuo Handa <penguin-kernel@I-love.SAKURA.ne.jp>
---
 block/genhd.c | 2 +-
 1 file changed, 1 insertion(+), 1 deletion(-)

Comments

Christoph Hellwig Dec. 16, 2021, 4:18 p.m. UTC | #1
On Fri, Dec 17, 2021 at 01:00:00AM +0900, Tetsuo Handa wrote:
> syzbot is reporting double kfree() bug in disk_release_events() [1], for
> commit 9be68dd7ac0e13be ("md: add error handling support for add_disk()")
> is calling blk_cleanup_disk() which will call disk_release_events() from
> regular kobject_release() path when device_add_disk() from add_disk()
> failed.
> 
> Since kobject_release() will be always called regardless of whether
> device_add_disk() from add_disk() succeeds, we should leave
> disk_release_events() to regular kobject_release() path.
> 
> Link: https://syzkaller.appspot.com/bug?extid=28a66a9fbc621c939000 [1]
> Reported-by: syzbot <syzbot+28a66a9fbc621c939000@syzkaller.appspotmail.com>
> Tested-by: syzbot <syzbot+28a66a9fbc621c939000@syzkaller.appspotmail.com>
> Fixes: 83cbce9574462c6b ("block: add error handling for device_add_disk / add_disk")
> Signed-off-by: Tetsuo Handa <penguin-kernel@I-love.SAKURA.ne.jp>
> ---
>  block/genhd.c | 2 +-
>  1 file changed, 1 insertion(+), 1 deletion(-)
> 
> diff --git a/block/genhd.c b/block/genhd.c
> index 30362aeacac4..47bb34ab967b 100644
> --- a/block/genhd.c
> +++ b/block/genhd.c
> @@ -540,7 +540,7 @@ int __must_check device_add_disk(struct device *parent, struct gendisk *disk,
>  out_device_del:
>  	device_del(ddev);
>  out_disk_release_events:
> -	disk_release_events(disk);
> +	/* disk_release() will call disk_release_events(). */
>  out_free_ext_minor:
>  	if (disk->major == BLOCK_EXT_MAJOR)
>  		blk_free_ext_minor(disk->first_minor);

.. actually while you're at it - blk_free_ext_minor is also done
by bdev_free_inode called from disk_release.

So we can just remove the out_disk_release_events and out_free_ext_minor
labels entirely. 

> -- 
> 2.32.0
---end quoted text---
Christoph Hellwig Dec. 16, 2021, 4:19 p.m. UTC | #2
On Thu, Dec 16, 2021 at 05:18:06PM +0100, Christoph Hellwig wrote:
> >  out_disk_release_events:
> > -	disk_release_events(disk);
> > +	/* disk_release() will call disk_release_events(). */
> >  out_free_ext_minor:
> >  	if (disk->major == BLOCK_EXT_MAJOR)
> >  		blk_free_ext_minor(disk->first_minor);
> 
> .. actually while you're at it - blk_free_ext_minor is also done
> by bdev_free_inode called from disk_release.
> 
> So we can just remove the out_disk_release_events and out_free_ext_minor
> labels entirely. 

... of course we can't.  But we should return after the
device_del instead of falling through to the other resource cleanups.
Tetsuo Handa Dec. 17, 2021, 10:37 a.m. UTC | #3
On 2021/12/17 1:19, Christoph Hellwig wrote:
> On Thu, Dec 16, 2021 at 05:18:06PM +0100, Christoph Hellwig wrote:
>>>  out_disk_release_events:
>>> -	disk_release_events(disk);
>>> +	/* disk_release() will call disk_release_events(). */
>>>  out_free_ext_minor:
>>>  	if (disk->major == BLOCK_EXT_MAJOR)
>>>  		blk_free_ext_minor(disk->first_minor);
>>
>> .. actually while you're at it - blk_free_ext_minor is also done
>> by bdev_free_inode called from disk_release.
>>
>> So we can just remove the out_disk_release_events and out_free_ext_minor
>> labels entirely. 
> 
> ... of course we can't.  But we should return after the
> device_del instead of falling through to the other resource cleanups.

Well, I don't think that we can remove this blk_free_ext_minor() call, for
this call is releasing disk->first_minor rather than MINOR(bdev->bd_dev).

Since bdev_add(disk->part0, MKDEV(disk->major, disk->first_minor)) is not
called when reaching the out_free_ext_minor label,

	if (MAJOR(bdev->bd_dev) == BLOCK_EXT_MAJOR)
		blk_free_ext_minor(MINOR(bdev->bd_dev));

in bdev_free_inode() will not be called because MAJOR(bdev->bd_dev) == 0
because bdev->bd_dev == 0.

I think we can apply this patch as-is...
Luis R. Rodriguez Dec. 19, 2021, 8 p.m. UTC | #4
On Fri, Dec 17, 2021 at 07:37:43PM +0900, Tetsuo Handa wrote:
> I think we can apply this patch as-is...

Unfortunately I don't think so, don't we end up still with a race
in between the first part of device_add() and the kobject_add()
which adds the kobject to generic layer and in return enables the
disk_release() call for the disk? I count 5 error paths in between
including kobject_add() which can fail as well.

If correct then something like the following may be needed:

diff --git a/block/genhd.c b/block/genhd.c
index 3c139a1b6f04..08ab7ce63e57 100644
--- a/block/genhd.c
+++ b/block/genhd.c
@@ -539,9 +539,10 @@ int __must_check device_add_disk(struct device *parent, struct gendisk *disk,
 out_device_del:
 	device_del(ddev);
 out_disk_release_events:
-	disk_release_events(disk);
+	if (!kobject_alive(&ddev->kobj))
+		disk_release_events(disk);
 out_free_ext_minor:
-	if (disk->major == BLOCK_EXT_MAJOR)
+	if (!kobject_alive(&ddev->kobj) && disk->major == BLOCK_EXT_MAJOR)
 		blk_free_ext_minor(disk->first_minor);
 	return ret;
 }
diff --git a/include/linux/kobject.h b/include/linux/kobject.h
index c740062b4b1a..4884aedbd4e0 100644
--- a/include/linux/kobject.h
+++ b/include/linux/kobject.h
@@ -117,6 +117,23 @@ extern void kobject_get_ownership(struct kobject *kobj,
 				  kuid_t *uid, kgid_t *gid);
 extern char *kobject_get_path(struct kobject *kobj, gfp_t flag);
 
+/**
+ * kobject_alive - Returns whether a kobject_add() has succeeded
+ * @kobj: the object to test
+ *
+ * This will return whether a kobject has been successfully added already with
+ * kobject_add(). It is useful for subsystems which have a kobj_type with its
+ * own kobj_type release() routine and want to verify that it will be called
+ * as otherwise if kobject_add() failed the subsystem is in charge of doing that
+ * cleanup.
+ */
+static inline bool kobject_alive(struct kobject *kobj)
+{
+	if (!kobj || kref_read(&kobj->kref) == 0)
+		return false;
+	return true;
+}
+
 /**
  * kobject_has_children - Returns whether a kobject has children.
  * @kobj: the object to test
Tetsuo Handa Dec. 20, 2021, 8:23 a.m. UTC | #5
On 2021/12/20 5:00, Luis Chamberlain wrote:
> On Fri, Dec 17, 2021 at 07:37:43PM +0900, Tetsuo Handa wrote:
>> I think we can apply this patch as-is...
> 
> Unfortunately I don't think so, don't we end up still with a race
> in between the first part of device_add() and the kobject_add()
> which adds the kobject to generic layer and in return enables the
> disk_release() call for the disk? I count 5 error paths in between
> including kobject_add() which can fail as well.

I can't catch which path you are talking about.
Will you explain more details using call trace (or line numbers in
https://elixir.bootlin.com/linux/v5.16-rc6/source/block/genhd.c#L397 ) ?

> 
> If correct then something like the following may be needed:
> 
> diff --git a/block/genhd.c b/block/genhd.c
> index 3c139a1b6f04..08ab7ce63e57 100644
> --- a/block/genhd.c
> +++ b/block/genhd.c
> @@ -539,9 +539,10 @@ int __must_check device_add_disk(struct device *parent, struct gendisk *disk,
>  out_device_del:
>  	device_del(ddev);
>  out_disk_release_events:
> -	disk_release_events(disk);
> +	if (!kobject_alive(&ddev->kobj))
> +		disk_release_events(disk);
>  out_free_ext_minor:
> -	if (disk->major == BLOCK_EXT_MAJOR)
> +	if (!kobject_alive(&ddev->kobj) && disk->major == BLOCK_EXT_MAJOR)

How can kobject_alive() matter?

The minor id which was stored into disk->first_minor is allocated by
https://elixir.bootlin.com/linux/v5.16-rc6/source/block/genhd.c#L432 ,
and that minor id is copied to bdev->bd_dev (which
https://elixir.bootlin.com/linux/v5.16-rc6/source/block/bdev.c#L415 will
release) by
https://elixir.bootlin.com/linux/v5.16-rc6/source/block/genhd.c#L511 ,
doesn't it?

Unless there is a location (except genhd.c#L511) which copies that minor id
to bdev->bd_dev (which bdev.c#L415 will release), it is correct to
unconditionally undo allocation by genhd.c#L432 at genhd.c#L546 .

Will you explain what path you are talking about?
Luis R. Rodriguez Dec. 20, 2021, 7:16 p.m. UTC | #6
On Mon, Dec 20, 2021 at 05:23:08PM +0900, Tetsuo Handa wrote:
> On 2021/12/20 5:00, Luis Chamberlain wrote:
> > On Fri, Dec 17, 2021 at 07:37:43PM +0900, Tetsuo Handa wrote:
> >> I think we can apply this patch as-is...
> > 
> > Unfortunately I don't think so, don't we end up still with a race
> > in between the first part of device_add() and the kobject_add()
> > which adds the kobject to generic layer and in return enables the
> > disk_release() call for the disk? I count 5 error paths in between
> > including kobject_add() which can fail as well.
> 
> I can't catch which path you are talking about.
> Will you explain more details using call trace (or line numbers in
> https://elixir.bootlin.com/linux/v5.16-rc6/source/block/genhd.c#L397 ) ?

I mean right after disk_alloc_events():

https://elixir.bootlin.com/linux/v5.16-rc6/source/block/genhd.c#L440

And right inside device_add()

https://elixir.bootlin.com/linux/v5.16-rc6/source/block/genhd.c#L452

Within device_add(), there are about 5 things which can
from the beginning of device_add() on line 3275 up to where
kobject_add() completes successfully in line 3329:

https://elixir.bootlin.com/linux/v5.16-rc6/source/drivers/base/core.c#L3275
https://elixir.bootlin.com/linux/v5.16-rc6/source/drivers/base/core.c#L3329

> > If correct then something like the following may be needed:
> > 
> > diff --git a/block/genhd.c b/block/genhd.c
> > index 3c139a1b6f04..08ab7ce63e57 100644
> > --- a/block/genhd.c
> > +++ b/block/genhd.c
> > @@ -539,9 +539,10 @@ int __must_check device_add_disk(struct device *parent, struct gendisk *disk,
> >  out_device_del:
> >  	device_del(ddev);
> >  out_disk_release_events:
> > -	disk_release_events(disk);
> > +	if (!kobject_alive(&ddev->kobj))
> > +		disk_release_events(disk);
> >  out_free_ext_minor:
> > -	if (disk->major == BLOCK_EXT_MAJOR)
> > +	if (!kobject_alive(&ddev->kobj) && disk->major == BLOCK_EXT_MAJOR)
> 
> How can kobject_alive() matter?

There are two hunks here. The first one I hope the above explains it.

As for the second one, the assumption is that if device_add() succeeded
the free_inode super op would do the respective blk_free_ext_minor()
however now I am not sure if this is true.

The kobject_alive() tells us if at least the device_add() had the
kobject_add() complete.

Since we have two hunks to consider I think we should be clear about
differentiating between both.

  Luis
Christoph Hellwig Dec. 21, 2021, 10:08 a.m. UTC | #7
On Fri, Dec 17, 2021 at 07:37:43PM +0900, Tetsuo Handa wrote:
> Well, I don't think that we can remove this blk_free_ext_minor() call, for
> this call is releasing disk->first_minor rather than MINOR(bdev->bd_dev).
> 
> Since bdev_add(disk->part0, MKDEV(disk->major, disk->first_minor)) is not
> called when reaching the out_free_ext_minor label,
> 
> 	if (MAJOR(bdev->bd_dev) == BLOCK_EXT_MAJOR)
> 		blk_free_ext_minor(MINOR(bdev->bd_dev));
> 
> in bdev_free_inode() will not be called because MAJOR(bdev->bd_dev) == 0
> because bdev->bd_dev == 0.
> 
> I think we can apply this patch as-is...

With the patch as-is we'll still leak disk->ev if device_add fails.
Something like the patch below should solve that by moving the disk->ev
allocation later and always cleaning it up through disk->release:


diff --git a/block/genhd.c b/block/genhd.c
index 3c139a1b6f049..3e4bbfa3e1c24 100644
--- a/block/genhd.c
+++ b/block/genhd.c
@@ -442,10 +442,6 @@ int __must_check device_add_disk(struct device *parent, struct gendisk *disk,
 		disk->first_minor = ret;
 	}
 
-	ret = disk_alloc_events(disk);
-	if (ret)
-		goto out_free_ext_minor;
-
 	/* delay uevents, until we scanned partition table */
 	dev_set_uevent_suppress(ddev, 1);
 
@@ -456,7 +452,12 @@ int __must_check device_add_disk(struct device *parent, struct gendisk *disk,
 		ddev->devt = MKDEV(disk->major, disk->first_minor);
 	ret = device_add(ddev);
 	if (ret)
-		goto out_disk_release_events;
+		goto out_free_ext_minor;
+
+	ret = disk_alloc_events(disk);
+	if (ret)
+		goto out_device_del;
+
 	if (!sysfs_deprecated) {
 		ret = sysfs_create_link(block_depr, &ddev->kobj,
 					kobject_name(&ddev->kobj));
@@ -538,8 +539,7 @@ int __must_check device_add_disk(struct device *parent, struct gendisk *disk,
 		sysfs_remove_link(block_depr, dev_name(ddev));
 out_device_del:
 	device_del(ddev);
-out_disk_release_events:
-	disk_release_events(disk);
+	return ret;
 out_free_ext_minor:
 	if (disk->major == BLOCK_EXT_MAJOR)
 		blk_free_ext_minor(disk->first_minor);
Christoph Hellwig Dec. 21, 2021, 10:15 a.m. UTC | #8
On Tue, Dec 21, 2021 at 11:08:11AM +0100, Christoph Hellwig wrote:
> On Fri, Dec 17, 2021 at 07:37:43PM +0900, Tetsuo Handa wrote:
> > Well, I don't think that we can remove this blk_free_ext_minor() call, for
> > this call is releasing disk->first_minor rather than MINOR(bdev->bd_dev).
> > 
> > Since bdev_add(disk->part0, MKDEV(disk->major, disk->first_minor)) is not
> > called when reaching the out_free_ext_minor label,
> > 
> > 	if (MAJOR(bdev->bd_dev) == BLOCK_EXT_MAJOR)
> > 		blk_free_ext_minor(MINOR(bdev->bd_dev));
> > 
> > in bdev_free_inode() will not be called because MAJOR(bdev->bd_dev) == 0
> > because bdev->bd_dev == 0.
> > 
> > I think we can apply this patch as-is...
> 
> With the patch as-is we'll still leak disk->ev if device_add fails.
> Something like the patch below should solve that by moving the disk->ev
> allocation later and always cleaning it up through disk->release:

Sorry, this still had the extra return.

diff --git a/block/genhd.c b/block/genhd.c
index 3c139a1b6f049..603db5d6f10c0 100644
--- a/block/genhd.c
+++ b/block/genhd.c
@@ -442,10 +442,6 @@ int __must_check device_add_disk(struct device *parent, struct gendisk *disk,
 		disk->first_minor = ret;
 	}
 
-	ret = disk_alloc_events(disk);
-	if (ret)
-		goto out_free_ext_minor;
-
 	/* delay uevents, until we scanned partition table */
 	dev_set_uevent_suppress(ddev, 1);
 
@@ -456,7 +452,12 @@ int __must_check device_add_disk(struct device *parent, struct gendisk *disk,
 		ddev->devt = MKDEV(disk->major, disk->first_minor);
 	ret = device_add(ddev);
 	if (ret)
-		goto out_disk_release_events;
+		goto out_free_ext_minor;
+
+	ret = disk_alloc_events(disk);
+	if (ret)
+		goto out_device_del;
+
 	if (!sysfs_deprecated) {
 		ret = sysfs_create_link(block_depr, &ddev->kobj,
 					kobject_name(&ddev->kobj));
@@ -538,8 +539,6 @@ int __must_check device_add_disk(struct device *parent, struct gendisk *disk,
 		sysfs_remove_link(block_depr, dev_name(ddev));
 out_device_del:
 	device_del(ddev);
-out_disk_release_events:
-	disk_release_events(disk);
 out_free_ext_minor:
 	if (disk->major == BLOCK_EXT_MAJOR)
 		blk_free_ext_minor(disk->first_minor);
Tetsuo Handa Dec. 21, 2021, 10:21 a.m. UTC | #9
On 2021/12/21 19:15, Christoph Hellwig wrote:
> On Tue, Dec 21, 2021 at 11:08:11AM +0100, Christoph Hellwig wrote:
>> On Fri, Dec 17, 2021 at 07:37:43PM +0900, Tetsuo Handa wrote:
>>> Well, I don't think that we can remove this blk_free_ext_minor() call, for
>>> this call is releasing disk->first_minor rather than MINOR(bdev->bd_dev).
>>>
>>> Since bdev_add(disk->part0, MKDEV(disk->major, disk->first_minor)) is not
>>> called when reaching the out_free_ext_minor label,
>>>
>>> 	if (MAJOR(bdev->bd_dev) == BLOCK_EXT_MAJOR)
>>> 		blk_free_ext_minor(MINOR(bdev->bd_dev));
>>>
>>> in bdev_free_inode() will not be called because MAJOR(bdev->bd_dev) == 0
>>> because bdev->bd_dev == 0.
>>>
>>> I think we can apply this patch as-is...
>>
>> With the patch as-is we'll still leak disk->ev if device_add fails.
>> Something like the patch below should solve that by moving the disk->ev
>> allocation later and always cleaning it up through disk->release:

Then what about this simple fix?

diff --git a/block/disk-events.c b/block/disk-events.c
index 8d5496e7592a..05b1249650ab 100644
--- a/block/disk-events.c
+++ b/block/disk-events.c
@@ -501,4 +501,5 @@ void disk_release_events(struct gendisk *disk)
 	/* the block count should be 1 from disk_del_events() */
 	WARN_ON_ONCE(disk->ev && disk->ev->block != 1);
 	kfree(disk->ev);
+	disk->ev = NULL;
 }
Tetsuo Handa Dec. 21, 2021, 11:41 a.m. UTC | #10
On 2021/12/21 4:16, Luis Chamberlain wrote:
> The kobject_alive() tells us if at least the device_add() had the
> kobject_add() complete.


Testing with error injection

@@ -3284,6 +3284,9 @@ int device_add(struct device *dev)
        if (!dev)
                goto done;

+       if (!strcmp(current->comm, "a.out"))
+               goto done;
+
        if (!dev->p) {
                error = device_private_init(dev);
                if (error)

told me that kref count is 1 when reaching the out_disk_release_events label.
Thus,

	if (!kobject_alive(&ddev->kobj))

seems wrong.

Christoph proposed deferring disk_alloc_events(). If it is safe to defer
disk_alloc_events(), that can be a fix. But I don't know if there is a path
which can invoke disk event functions (e.g. disk_block_events() from
blkdev_get_by_dev()) between device_add() and disk_alloc_events().
I didn't find a path, but at least the device file and some of sysfs files
are already visible before disk_alloc_events() is called...
Tetsuo Handa Dec. 21, 2021, 1:46 p.m. UTC | #11
On 2021/12/21 19:15, Christoph Hellwig wrote:
>>> I think we can apply this patch as-is...
>>
>> With the patch as-is we'll still leak disk->ev if device_add fails.
>> Something like the patch below should solve that by moving the disk->ev
>> allocation later and always cleaning it up through disk->release:
> 
> Sorry, this still had the extra return.
> 

OK. Since blkdev_get_no_open() from blkdev_get_by_dev() returns NULL until
disk_alloc_events() after device_add() completes, there is no race window for
unbalanced disk_block_events(disk)/disk_unblock_events(disk) pair.

Your patch seems to work. Please propose as a patch.
Luis R. Rodriguez Dec. 21, 2021, 9:50 p.m. UTC | #12
On Tue, Dec 21, 2021 at 08:41:18PM +0900, Tetsuo Handa wrote:
> On 2021/12/21 4:16, Luis Chamberlain wrote:
> > The kobject_alive() tells us if at least the device_add() had the
> > kobject_add() complete.
> 
> 
> Testing with error injection
> 
> @@ -3284,6 +3284,9 @@ int device_add(struct device *dev)
>         if (!dev)
>                 goto done;
> 
> +       if (!strcmp(current->comm, "a.out"))
> +               goto done;
> +
>         if (!dev->p) {
>                 error = device_private_init(dev);
>                 if (error)
> 
> told me that kref count is 1 when reaching the out_disk_release_events label.
> Thus,
> 
> 	if (!kobject_alive(&ddev->kobj))
> 
> seems wrong.

Hrm.... quite unexpected.

> Christoph proposed deferring disk_alloc_events(). If it is safe to defer
> disk_alloc_events(), that can be a fix. 

*If safe*, yes, agreed.

  Luis
diff mbox series

Patch

diff --git a/block/genhd.c b/block/genhd.c
index 30362aeacac4..47bb34ab967b 100644
--- a/block/genhd.c
+++ b/block/genhd.c
@@ -540,7 +540,7 @@  int __must_check device_add_disk(struct device *parent, struct gendisk *disk,
 out_device_del:
 	device_del(ddev);
 out_disk_release_events:
-	disk_release_events(disk);
+	/* disk_release() will call disk_release_events(). */
 out_free_ext_minor:
 	if (disk->major == BLOCK_EXT_MAJOR)
 		blk_free_ext_minor(disk->first_minor);