blkdev: Fix blkdev_open to release the bdev on error

Message ID 20151208072508.GM20997@ZenIV.linux.org.uk (mailing list archive)
State New, archived

Commit Message

Al Viro Dec. 8, 2015, 7:25 a.m. UTC
On Mon, Dec 07, 2015 at 06:05:03PM +0000, Suzuki K. Poulose wrote:
> blkdev_open() doesn't release the bdev it attached to a given
> inode if blkdev_get() fails (e.g., due to absence of a device).
> This can cause kernel crashes when the original filesystem
> tries to flush the data during evict_inode.
> 
> This can be triggered easily with virtio-9p fs using the following
> simple steps.

???
How can filesystem type affect the behaviour of block devices?

Having mknod /tmp/splat b 8 1; rm /tmp/splat try to evict the pagecache
of /dev/sda1 is simply wrong, no matter what type /tmp happens to have.
And they must share pagecache, or you'll get one hell of cache coherency
problems.  As it is, that pagecache belongs to inode on bdevfs (see
fs/block_dev.c; not mountable anywhere visible, the one and only mount is
internal).  That inode is tied to struct bdev, ditto for its lifetime.

Block device inodes on anything else have their ->i_mapping pointing to
the corresponding (unique for given major/minor) inode on bdevfs; that
gives us the coherency, but that also means that their *own* pagecache
(->i_data) is empty.  Which is just fine, since inode eviction should
get rid of everything in its embedded struct address_space.  In case of
block device inodes on ext2, 9p, etc. that amounts to no pages at all.
In case of bdevfs, it contains the page cache of block device.
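
The aliasing described above can be modelled in userspace. The following is only a sketch: the struct layouts are toy stand-ins that keep just the two fields discussed here, and toy_inode_init(), toy_alias_open() and toy_write() are hypothetical helpers, not kernel functions.

```c
#include <assert.h>
#include <stddef.h>

/* Toy stand-ins for the kernel structures; NOT the real definitions. */
struct address_space {
	size_t nrpages;				/* number of cached pages */
};

struct inode {
	struct address_space *i_mapping;	/* where I/O on this inode goes */
	struct address_space i_data;		/* the inode's own embedded pagecache */
};

/* Hypothetical helper: a fresh inode starts out using its own pagecache. */
static void toy_inode_init(struct inode *inode)
{
	inode->i_data.nrpages = 0;
	inode->i_mapping = &inode->i_data;
}

/* Hypothetical helper: opening a device node redirects its I/O to bdevfs. */
static void toy_alias_open(struct inode *alias, struct inode *bdev_inode)
{
	alias->i_mapping = bdev_inode->i_mapping;
}

/* Hypothetical helper: a buffered write of 'pages' pages through an inode. */
static void toy_write(struct inode *inode, size_t pages)
{
	inode->i_mapping->nrpages += pages;
}
```

In this model, writes through two device nodes on different filesystems both land in the bdevfs inode's ->i_data, while each alias's own ->i_data stays empty.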

<looks> 
Aha...
        truncate_inode_pages_final(inode->i_mapping);
        clear_inode(inode);
        filemap_fdatawrite(inode->i_mapping);

in there is obviously wrong - it should be

        truncate_inode_pages_final(&inode->i_data);
        clear_inode(inode);
        filemap_fdatawrite(&inode->i_data);

and if you check other filesystems' ->evict_inode() you'll see the same thing
there.
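
To make the difference concrete, here is a userspace sketch of the two eviction variants. The types are toy stand-ins, and toy_truncate(), toy_evict_buggy() and toy_evict_fixed() are hypothetical names that only imitate the effect of truncate_inode_pages_final() on a page count.

```c
#include <assert.h>
#include <stddef.h>

/* Toy stand-ins for the kernel structures; NOT the real definitions. */
struct address_space {
	size_t nrpages;				/* number of cached pages */
};

struct inode {
	struct address_space *i_mapping;	/* may point into another inode */
	struct address_space i_data;		/* embedded in this inode */
};

/* Toy imitation of truncate_inode_pages_final(): drop every page. */
static void toy_truncate(struct address_space *mapping)
{
	mapping->nrpages = 0;
}

/* Eviction that follows ->i_mapping, as the pre-patch 9p code did. */
static void toy_evict_buggy(struct inode *inode)
{
	toy_truncate(inode->i_mapping);
}

/* Eviction that touches only the embedded ->i_data, as in the patch. */
static void toy_evict_fixed(struct inode *inode)
{
	toy_truncate(&inode->i_data);
}
```

If an alias inode's ->i_mapping points at the bdevfs inode's ->i_data, toy_evict_fixed() leaves the device's pages alone, while toy_evict_buggy() throws them away. In the real kernel the foreign object may also be gone by then, which this model does not attempt to show.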

We should not do bd_forget() upon failing open() - what for?  As long as
->i_rdev remains the same, the pointer to struct bdev is valid.  It
doesn't pin bdev down; having it (or any other alias) opened does.  When
we decide to evict bdev, *all* aliasing inodes are dissociated from it;
none of them is open at that point, so we are OK.  When an aliasing inode
gets evicted, we have it dissociated from its ->i_bdev (if any).  Since we
only access the ->i_mapping of aliasing inode while it's open, those places
are fine and anything that wants ->i_data of alias will simply find it empty.

AFAICS, the cause of your oopsen is that 9p evict_inode is accessing the
object it has no business to touch.

Could you confirm that the patch below fixes your problem?

--
To unsubscribe from this list: send the line "unsubscribe linux-fsdevel" in
the body of a message to majordomo@vger.kernel.org
More majordomo info at  http://vger.kernel.org/majordomo-info.html

Comments

Suzuki K Poulose Dec. 8, 2015, 10:07 a.m. UTC | #1
On 08/12/15 07:25, Al Viro wrote:
> On Mon, Dec 07, 2015 at 06:05:03PM +0000, Suzuki K. Poulose wrote:
>> blkdev_open() doesn't release the bdev it attached to a given
>> inode if blkdev_get() fails (e.g., due to absence of a device).
>> This can cause kernel crashes when the original filesystem
>> tries to flush the data during evict_inode.
>>
>> This can be triggered easily with virtio-9p fs using the following
>> simple steps.
>
> ???
> How can filesystem type affect the behaviour of block devices?
>

...

>
> We should not do bd_forget() upon failing open() - what for?  As long as
> ->i_rdev remains the same, the pointer to struct bdev is valid.  It
> doesn't pin bdev down; having it (or any other alias) opened does.  When
> we decide to evict bdev, *all* aliasing inodes are dissociated from it;
> none of them is open at that point, so we are OK.  When an aliasing inode
> gets evicted, we have it dissociated from its ->i_bdev (if any).  Since we
> only access the ->i_mapping of aliasing inode while it's open, those places
> are fine and anything that wants ->i_data of alias will simply find it empty.

Thanks for the detailed explanation. My patch was clearly not based on a
full understanding of bdevfs. Things are much clearer now.

> Could you confirm that the patch below fixes your problem?


Yes, it does solve the issue.

Thanks
Suzuki


Patch

diff --git a/fs/9p/vfs_inode.c b/fs/9p/vfs_inode.c
index 699941e..5110785 100644
--- a/fs/9p/vfs_inode.c
+++ b/fs/9p/vfs_inode.c
@@ -451,9 +451,9 @@  void v9fs_evict_inode(struct inode *inode)
 {
 	struct v9fs_inode *v9inode = V9FS_I(inode);
 
-	truncate_inode_pages_final(inode->i_mapping);
+	truncate_inode_pages_final(&inode->i_data);
 	clear_inode(inode);
-	filemap_fdatawrite(inode->i_mapping);
+	filemap_fdatawrite(&inode->i_data);
 
 	v9fs_cache_inode_put_cookie(inode);
 	/* clunk the fid stashed in writeback_fid */