Message ID | c614deb3-ce75-635e-a311-4f4fc7aa26e3@i-love.sakura.ne.jp (mailing list archive) |
---|---|
State | New, archived |
Headers | show |
Series | block: fix error handling for device_add_disk | expand |
On Fri, Dec 17, 2021 at 01:00:00AM +0900, Tetsuo Handa wrote: > syzbot is reporting double kfree() bug in disk_release_events() [1], for > commit 9be68dd7ac0e13be ("md: add error handling support for add_disk()") > is calling blk_cleanup_disk() which will call disk_release_events() from > regular kobject_release() path when device_add_disk() from add_disk() > failed. > > Since kobject_release() will be always called regardless of whether > device_add_disk() from add_disk() succeeds, we should leave > disk_release_events() to regular kobject_release() path. > > Link: https://syzkaller.appspot.com/bug?extid=28a66a9fbc621c939000 [1] > Reported-by: syzbot <syzbot+28a66a9fbc621c939000@syzkaller.appspotmail.com> > Tested-by: syzbot <syzbot+28a66a9fbc621c939000@syzkaller.appspotmail.com> > Fixes: 83cbce9574462c6b ("block: add error handling for device_add_disk / add_disk") > Signed-off-by: Tetsuo Handa <penguin-kernel@I-love.SAKURA.ne.jp> > --- > block/genhd.c | 2 +- > 1 file changed, 1 insertion(+), 1 deletion(-) > > diff --git a/block/genhd.c b/block/genhd.c > index 30362aeacac4..47bb34ab967b 100644 > --- a/block/genhd.c > +++ b/block/genhd.c > @@ -540,7 +540,7 @@ int __must_check device_add_disk(struct device *parent, struct gendisk *disk, > out_device_del: > device_del(ddev); > out_disk_release_events: > - disk_release_events(disk); > + /* disk_release() will call disk_release_events(). */ > out_free_ext_minor: > if (disk->major == BLOCK_EXT_MAJOR) > blk_free_ext_minor(disk->first_minor); .. actually while you're at it - blk_free_ext_minor is also done by bdev_free_inode called from disk_release. So we can just remove the out_disk_release_events and out_free_ext_minor labels entirely. > -- > 2.32.0 ---end quoted text---
On Thu, Dec 16, 2021 at 05:18:06PM +0100, Christoph Hellwig wrote: > > out_disk_release_events: > > - disk_release_events(disk); > > + /* disk_release() will call disk_release_events(). */ > > out_free_ext_minor: > > if (disk->major == BLOCK_EXT_MAJOR) > > blk_free_ext_minor(disk->first_minor); > > .. actually while you're at it - blk_free_ext_minor is also done > by bdev_free_inode called from disk_release. > > So we can just remove the out_disk_release_events and out_free_ext_minor > labels entirely. ... of course we can't. But we should return after the device_del instead of falling through to the other resource cleanups.
On 2021/12/17 1:19, Christoph Hellwig wrote: > On Thu, Dec 16, 2021 at 05:18:06PM +0100, Christoph Hellwig wrote: >>> out_disk_release_events: >>> - disk_release_events(disk); >>> + /* disk_release() will call disk_release_events(). */ >>> out_free_ext_minor: >>> if (disk->major == BLOCK_EXT_MAJOR) >>> blk_free_ext_minor(disk->first_minor); >> >> .. actually while you're at it - blk_free_ext_minor is also done >> by bdev_free_inode called from disk_release. >> >> So we can just remove the out_disk_release_events and out_free_ext_minor >> labels entirely. > > ... of course we can't. But we should return after the > device_del instead of falling through to the other resource cleanups. Well, I don't think that we can remove this blk_free_ext_minor() call, for this call is releasing disk->first_minor rather than MINOR(bdev->bd_dev). Since bdev_add(disk->part0, MKDEV(disk->major, disk->first_minor)) is not called when reaching the out_free_ext_minor label, if (MAJOR(bdev->bd_dev) == BLOCK_EXT_MAJOR) blk_free_ext_minor(MINOR(bdev->bd_dev)); in bdev_free_inode() will not be called because MAJOR(bdev->bd_dev) == 0 because bdev->bd_dev == 0. I think we can apply this patch as-is...
On Fri, Dec 17, 2021 at 07:37:43PM +0900, Tetsuo Handa wrote:
> I think we can apply this patch as-is...
Unfortunately I don't think so, don't we end up still with a race
in between the first part of device_add() and the kobject_add()
which adds the kobject to generic layer and in return enables the
disk_release() call for the disk? I count 5 error paths in between
including kobject_add() which can fail as well.
If correct then something like the following may be needed:
diff --git a/block/genhd.c b/block/genhd.c
index 3c139a1b6f04..08ab7ce63e57 100644
--- a/block/genhd.c
+++ b/block/genhd.c
@@ -539,9 +539,10 @@ int __must_check device_add_disk(struct device *parent, struct gendisk *disk,
out_device_del:
device_del(ddev);
out_disk_release_events:
- disk_release_events(disk);
+ if (!kobject_alive(&ddev->kobj))
+ disk_release_events(disk);
out_free_ext_minor:
- if (disk->major == BLOCK_EXT_MAJOR)
+ if (!kobject_alive(&ddev->kobj) && disk->major == BLOCK_EXT_MAJOR)
blk_free_ext_minor(disk->first_minor);
return ret;
}
diff --git a/include/linux/kobject.h b/include/linux/kobject.h
index c740062b4b1a..4884aedbd4e0 100644
--- a/include/linux/kobject.h
+++ b/include/linux/kobject.h
@@ -117,6 +117,23 @@ extern void kobject_get_ownership(struct kobject *kobj,
kuid_t *uid, kgid_t *gid);
extern char *kobject_get_path(struct kobject *kobj, gfp_t flag);
+/**
+ * kobject_alive - Returns whether a kobject_add() has succeeded
+ * @kobj: the object to test
+ *
+ * This will return whether a kobject has been successfully added already with
+ * kobject_add(). It is useful for subsystems which have a kobj_type with its
+ * own kobj_type release() routine and want to verify that it will be called
+ * as otherwise if kobject_add() failed the subsystem is in charge of doing that
+ * cleanup.
+ */
+static inline bool kobject_alive(struct kobject *kobj)
+{
+ if (!kobj || kref_read(&kobj->kref) == 0)
+ return false;
+ return true;
+}
+
/**
* kobject_has_children - Returns whether a kobject has children.
* @kobj: the object to test
On 2021/12/20 5:00, Luis Chamberlain wrote: > On Fri, Dec 17, 2021 at 07:37:43PM +0900, Tetsuo Handa wrote: >> I think we can apply this patch as-is... > > Unfortunately I don't think so, don't we end up still with a race > in between the first part of device_add() and the kobject_add() > which adds the kobject to generic layer and in return enables the > disk_release() call for the disk? I count 5 error paths in between > including kobject_add() which can fail as well. I can't catch which path you are talking about. Will you explain more details using call trace (or line numbers in https://elixir.bootlin.com/linux/v5.16-rc6/source/block/genhd.c#L397 ) ? > > If correct then something like the following may be needed: > > diff --git a/block/genhd.c b/block/genhd.c > index 3c139a1b6f04..08ab7ce63e57 100644 > --- a/block/genhd.c > +++ b/block/genhd.c > @@ -539,9 +539,10 @@ int __must_check device_add_disk(struct device *parent, struct gendisk *disk, > out_device_del: > device_del(ddev); > out_disk_release_events: > - disk_release_events(disk); > + if (!kobject_alive(&ddev->kobj)) > + disk_release_events(disk); > out_free_ext_minor: > - if (disk->major == BLOCK_EXT_MAJOR) > + if (!kobject_alive(&ddev->kobj) && disk->major == BLOCK_EXT_MAJOR) How can kobject_alive() matter? The minor id which was stored into disk->first_minor is allocated by https://elixir.bootlin.com/linux/v5.16-rc6/source/block/genhd.c#L432 , and that minor id is copied to bdev->bd_dev (which https://elixir.bootlin.com/linux/v5.16-rc6/source/block/bdev.c#L415 will release) by https://elixir.bootlin.com/linux/v5.16-rc6/source/block/genhd.c#L511 , doesn't it? Unless there is a location (except genhd.c#L511) which copies that minor id to bdev->bd_dev (which bdev.c#L415 will release), it is correct to unconditionally undo allocation by genhd.c#L432 at genhd.c#L546 . Will you explain what path you are talking about?
On Mon, Dec 20, 2021 at 05:23:08PM +0900, Tetsuo Handa wrote: > On 2021/12/20 5:00, Luis Chamberlain wrote: > > On Fri, Dec 17, 2021 at 07:37:43PM +0900, Tetsuo Handa wrote: > >> I think we can apply this patch as-is... > > > > Unfortunately I don't think so, don't we end up still with a race > > in between the first part of device_add() and the kobject_add() > > which adds the kobject to generic layer and in return enables the > > disk_release() call for the disk? I count 5 error paths in between > > including kobject_add() which can fail as well. > > I can't catch which path you are talking about. > Will you explain more details using call trace (or line numbers in > https://elixir.bootlin.com/linux/v5.16-rc6/source/block/genhd.c#L397 ) ? I mean right after disk_alloc_events(): https://elixir.bootlin.com/linux/v5.16-rc6/source/block/genhd.c#L440 And right inside device_add() https://elixir.bootlin.com/linux/v5.16-rc6/source/block/genhd.c#L452 Within device_add(), there are about 5 things which can from the beginning of device_add() on line 3275 up to where kobject_add() completes successfully in line 3329: https://elixir.bootlin.com/linux/v5.16-rc6/source/drivers/base/core.c#L3275 https://elixir.bootlin.com/linux/v5.16-rc6/source/drivers/base/core.c#L3329 > > If correct then something like the following may be needed: > > > > diff --git a/block/genhd.c b/block/genhd.c > > index 3c139a1b6f04..08ab7ce63e57 100644 > > --- a/block/genhd.c > > +++ b/block/genhd.c > > @@ -539,9 +539,10 @@ int __must_check device_add_disk(struct device *parent, struct gendisk *disk, > > out_device_del: > > device_del(ddev); > > out_disk_release_events: > > - disk_release_events(disk); > > + if (!kobject_alive(&ddev->kobj)) > > + disk_release_events(disk); > > out_free_ext_minor: > > - if (disk->major == BLOCK_EXT_MAJOR) > > + if (!kobject_alive(&ddev->kobj) && disk->major == BLOCK_EXT_MAJOR) > > How can kobject_alive() matter? There are two hunks here. The first one I hope the above explains it. As for the second one, the assumption is that if device_add() succeeded the free_inode super op would do the respective blk_free_ext_minor() however now I am not sure if this is true. The kobject_alive() tells us if at least the device_add() had the kobject_add() complete. Since we have two hunks to consider I think we should be clear about differentiating between both. Luis
On Fri, Dec 17, 2021 at 07:37:43PM +0900, Tetsuo Handa wrote: > Well, I don't think that we can remove this blk_free_ext_minor() call, for > this call is releasing disk->first_minor rather than MINOR(bdev->bd_dev). > > Since bdev_add(disk->part0, MKDEV(disk->major, disk->first_minor)) is not > called when reaching the out_free_ext_minor label, > > if (MAJOR(bdev->bd_dev) == BLOCK_EXT_MAJOR) > blk_free_ext_minor(MINOR(bdev->bd_dev)); > > in bdev_free_inode() will not be called because MAJOR(bdev->bd_dev) == 0 > because bdev->bd_dev == 0. > > I think we can apply this patch as-is... With the patch as-is we'll still leak disk->ev if device_add fails. Something like the patch below should solve that by moving the disk->ev allocation later and always cleaning it up through disk->release: diff --git a/block/genhd.c b/block/genhd.c index 3c139a1b6f049..3e4bbfa3e1c24 100644 --- a/block/genhd.c +++ b/block/genhd.c @@ -442,10 +442,6 @@ int __must_check device_add_disk(struct device *parent, struct gendisk *disk, disk->first_minor = ret; } - ret = disk_alloc_events(disk); - if (ret) - goto out_free_ext_minor; - /* delay uevents, until we scanned partition table */ dev_set_uevent_suppress(ddev, 1); @@ -456,7 +452,12 @@ int __must_check device_add_disk(struct device *parent, struct gendisk *disk, ddev->devt = MKDEV(disk->major, disk->first_minor); ret = device_add(ddev); if (ret) - goto out_disk_release_events; + goto out_free_ext_minor; + + ret = disk_alloc_events(disk); + if (ret) + goto out_device_del; + if (!sysfs_deprecated) { ret = sysfs_create_link(block_depr, &ddev->kobj, kobject_name(&ddev->kobj)); @@ -538,8 +539,7 @@ int __must_check device_add_disk(struct device *parent, struct gendisk *disk, sysfs_remove_link(block_depr, dev_name(ddev)); out_device_del: device_del(ddev); -out_disk_release_events: - disk_release_events(disk); + return ret; out_free_ext_minor: if (disk->major == BLOCK_EXT_MAJOR) blk_free_ext_minor(disk->first_minor);
On Tue, Dec 21, 2021 at 11:08:11AM +0100, Christoph Hellwig wrote: > On Fri, Dec 17, 2021 at 07:37:43PM +0900, Tetsuo Handa wrote: > > Well, I don't think that we can remove this blk_free_ext_minor() call, for > > this call is releasing disk->first_minor rather than MINOR(bdev->bd_dev). > > > > Since bdev_add(disk->part0, MKDEV(disk->major, disk->first_minor)) is not > > called when reaching the out_free_ext_minor label, > > > > if (MAJOR(bdev->bd_dev) == BLOCK_EXT_MAJOR) > > blk_free_ext_minor(MINOR(bdev->bd_dev)); > > > > in bdev_free_inode() will not be called because MAJOR(bdev->bd_dev) == 0 > > because bdev->bd_dev == 0. > > > > I think we can apply this patch as-is... > > With the patch as-is we'll still leak disk->ev if device_add fails. > Something like the patch below should solve that by moving the disk->ev > allocation later and always cleaning it up through disk->release: Sorry, this still had the extra return. diff --git a/block/genhd.c b/block/genhd.c index 3c139a1b6f049..603db5d6f10c0 100644 --- a/block/genhd.c +++ b/block/genhd.c @@ -442,10 +442,6 @@ int __must_check device_add_disk(struct device *parent, struct gendisk *disk, disk->first_minor = ret; } - ret = disk_alloc_events(disk); - if (ret) - goto out_free_ext_minor; - /* delay uevents, until we scanned partition table */ dev_set_uevent_suppress(ddev, 1); @@ -456,7 +452,12 @@ int __must_check device_add_disk(struct device *parent, struct gendisk *disk, ddev->devt = MKDEV(disk->major, disk->first_minor); ret = device_add(ddev); if (ret) - goto out_disk_release_events; + goto out_free_ext_minor; + + ret = disk_alloc_events(disk); + if (ret) + goto out_device_del; + if (!sysfs_deprecated) { ret = sysfs_create_link(block_depr, &ddev->kobj, kobject_name(&ddev->kobj)); @@ -538,8 +539,6 @@ int __must_check device_add_disk(struct device *parent, struct gendisk *disk, sysfs_remove_link(block_depr, dev_name(ddev)); out_device_del: device_del(ddev); -out_disk_release_events: - disk_release_events(disk); out_free_ext_minor: if (disk->major == BLOCK_EXT_MAJOR) blk_free_ext_minor(disk->first_minor);
On 2021/12/21 19:15, Christoph Hellwig wrote: > On Tue, Dec 21, 2021 at 11:08:11AM +0100, Christoph Hellwig wrote: >> On Fri, Dec 17, 2021 at 07:37:43PM +0900, Tetsuo Handa wrote: >>> Well, I don't think that we can remove this blk_free_ext_minor() call, for >>> this call is releasing disk->first_minor rather than MINOR(bdev->bd_dev). >>> >>> Since bdev_add(disk->part0, MKDEV(disk->major, disk->first_minor)) is not >>> called when reaching the out_free_ext_minor label, >>> >>> if (MAJOR(bdev->bd_dev) == BLOCK_EXT_MAJOR) >>> blk_free_ext_minor(MINOR(bdev->bd_dev)); >>> >>> in bdev_free_inode() will not be called because MAJOR(bdev->bd_dev) == 0 >>> because bdev->bd_dev == 0. >>> >>> I think we can apply this patch as-is... >> >> With the patch as-is we'll still leak disk->ev if device_add fails. >> Something like the patch below should solve that by moving the disk->ev >> allocation later and always cleaning it up through disk->release: Then what about this simple fix? diff --git a/block/disk-events.c b/block/disk-events.c index 8d5496e7592a..05b1249650ab 100644 --- a/block/disk-events.c +++ b/block/disk-events.c @@ -501,4 +501,5 @@ void disk_release_events(struct gendisk *disk) /* the block count should be 1 from disk_del_events() */ WARN_ON_ONCE(disk->ev && disk->ev->block != 1); kfree(disk->ev); + disk->ev = NULL; }
On 2021/12/21 4:16, Luis Chamberlain wrote: > The kobject_alive() tells us if at least the device_add() had the > kobject_add() complete. Testing with error injection @@ -3284,6 +3284,9 @@ int device_add(struct device *dev) if (!dev) goto done; + if (!strcmp(current->comm, "a.out")) + goto done; + if (!dev->p) { error = device_private_init(dev); if (error) told me that kref count is 1 when reaching the out_disk_release_events label. Thus, if (!kobject_alive(&ddev->kobj)) seems wrong. Christoph proposed deferring disk_alloc_events(). If it is safe to defer disk_alloc_events(), that can be a fix. But I don't know if there is a path which can invoke disk event functions (e.g. disk_block_events() from blkdev_get_by_dev()) between device_add() and disk_alloc_events(). I didn't find a path, but at least the device file and some of sysfs files are already visible before disk_alloc_events() is called...
On 2021/12/21 19:15, Christoph Hellwig wrote: >>> I think we can apply this patch as-is... >> >> With the patch as-is we'll still leak disk->ev if device_add fails. >> Something like the patch below should solve that by moving the disk->ev >> allocation later and always cleaning it up through disk->release: > > Sorry, this still had the extra return. > OK. Since blkdev_get_no_open() from blkdev_get_by_dev() returns NULL until disk_alloc_events() after device_add() completes, there is no race window for unbalanced disk_block_events(disk)/disk_unblock_events(disk) pair. Your patch seems to work. Please propose as a patch.
On Tue, Dec 21, 2021 at 08:41:18PM +0900, Tetsuo Handa wrote: > On 2021/12/21 4:16, Luis Chamberlain wrote: > > The kobject_alive() tells us if at least the device_add() had the > > kobject_add() complete. > > > Testing with error injection > > @@ -3284,6 +3284,9 @@ int device_add(struct device *dev) > if (!dev) > goto done; > > + if (!strcmp(current->comm, "a.out")) > + goto done; > + > if (!dev->p) { > error = device_private_init(dev); > if (error) > > told me that kref count is 1 when reaching the out_disk_release_events label. > Thus, > > if (!kobject_alive(&ddev->kobj)) > > seems wrong. Hrm.... quite unexpected. > Christoph proposed deferring disk_alloc_events(). If it is safe to defer > disk_alloc_events(), that can be a fix. *If safe*, yes, agreed. Luis
diff --git a/block/genhd.c b/block/genhd.c index 30362aeacac4..47bb34ab967b 100644 --- a/block/genhd.c +++ b/block/genhd.c @@ -540,7 +540,7 @@ int __must_check device_add_disk(struct device *parent, struct gendisk *disk, out_device_del: device_del(ddev); out_disk_release_events: - disk_release_events(disk); + /* disk_release() will call disk_release_events(). */ out_free_ext_minor: if (disk->major == BLOCK_EXT_MAJOR) blk_free_ext_minor(disk->first_minor);