[08/10] block: Fix oops in locked_inode_to_wb_and_lock_list()

Message ID 20170209124433.2626-9-jack@suse.cz (mailing list archive)
State New, archived

Commit Message

Jan Kara Feb. 9, 2017, 12:44 p.m. UTC
When block device is closed, we call inode_detach_wb() in __blkdev_put()
which sets inode->i_wb to NULL. That is contrary to expectations that
inode->i_wb stays valid once set during the whole inode's lifetime and
leads to oops in wb_get() in locked_inode_to_wb_and_lock_list() because
inode_to_wb() returned NULL.

The reason why we called inode_detach_wb() is not valid anymore though.
BDI is guaranteed to stay around until we call bdi_put() from
bdev_evict_inode() so we can postpone calling inode_detach_wb() to that
moment. A complication is that i_wb can point to non-root wb_writeback
structure and in that case we do need to clean it up as bdi_unregister()
blocks waiting for all non-root wb_writeback references to get dropped.
Thus this i_wb reference could block device removal e.g. from
__scsi_remove_device() (which indirectly ends up calling
bdi_unregister()). We cannot rely on block device inode to go away soon
(and thus i_wb reference to get dropped) as the device may get
hot-removed e.g. under a mounted filesystem. We deal with these issues
by switching block device inode from non-root wb_writeback structure to
bdi->wb when needed.  Since this is rather expensive (requires
synchronize_rcu()) we do the switching only in del_gendisk() when we
know the device is going away.

Also add a warning to catch if someone uses inode_detach_wb() in a
dangerous way.

Reported-by: Thiago Jung Bauermann <bauerman@linux.vnet.ibm.com>
Signed-off-by: Jan Kara <jack@suse.cz>
---
 block/genhd.c             |  4 ++--
 fs/block_dev.c            | 11 ++++-------
 include/linux/fs.h        |  2 +-
 include/linux/writeback.h |  1 +
 4 files changed, 8 insertions(+), 10 deletions(-)
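
The lifetime problem described in the commit message can be illustrated with a small user-space model. This is only a sketch under stated assumptions: every `model_*` name is hypothetical, the structures are simplified stand-ins rather than the kernel's definitions, and the `being_evicted` flag merely plays the role that `I_CLEAR` plays in the patch. Where the kernel would oops or warn, the model returns `false` instead.

```c
#include <assert.h>
#include <stdbool.h>
#include <stddef.h>

/* Hypothetical stand-ins for the kernel structures (not the real layouts). */
struct model_wb {
	int refcnt;
};

struct model_inode {
	struct model_wb *i_wb;
	bool being_evicted;	/* plays the role of I_CLEAR in i_state */
};

static void model_wb_get(struct model_wb *wb)
{
	wb->refcnt++;
}

static void model_wb_put(struct model_wb *wb)
{
	wb->refcnt--;
}

/* Attach once; the inode holds a reference on its wb from then on. */
static void model_inode_attach_wb(struct model_inode *inode,
				  struct model_wb *wb)
{
	if (!inode->i_wb) {
		model_wb_get(wb);
		inode->i_wb = wb;
	}
}

/*
 * Drop the inode's wb reference. Returns false when called on an inode
 * that is not being evicted, which is the case the patch's
 * WARN_ON_ONCE() is meant to flag. The detach still happens.
 */
static bool model_inode_detach_wb(struct model_inode *inode)
{
	bool safe = true;

	if (inode->i_wb) {
		safe = inode->being_evicted;
		model_wb_put(inode->i_wb);
		inode->i_wb = NULL;
	}
	return safe;
}

/*
 * Models a writeback path that takes a temporary reference on the
 * inode's wb. Returns false where the kernel would dereference NULL.
 */
static bool model_lock_wb(struct model_inode *inode)
{
	struct model_wb *wb = inode->i_wb;

	if (!wb)
		return false;
	model_wb_get(wb);
	model_wb_put(wb);
	return true;
}
```

In this model, detaching at close time (with `being_evicted` unset) makes a later `model_lock_wb()` fail, while detaching only once `being_evicted` is set keeps `i_wb` valid for every earlier caller.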

Comments

Tejun Heo Feb. 12, 2017, 4:40 a.m. UTC | #1
Hello, Jan.

On Thu, Feb 09, 2017 at 01:44:31PM +0100, Jan Kara wrote:
> When block device is closed, we call inode_detach_wb() in __blkdev_put()
> which sets inode->i_wb to NULL. That is contrary to expectations that
> inode->i_wb stays valid once set during the whole inode's lifetime and
> leads to oops in wb_get() in locked_inode_to_wb_and_lock_list() because
> inode_to_wb() returned NULL.
> 
> The reason why we called inode_detach_wb() is not valid anymore though.
> BDI is guaranteed to stay around until we call bdi_put() from
> bdev_evict_inode() so we can postpone calling inode_detach_wb() to that
> moment. A complication is that i_wb can point to non-root wb_writeback
> structure and in that case we do need to clean it up as bdi_unregister()
> blocks waiting for all non-root wb_writeback references to get dropped.
> Thus this i_wb reference could block device removal e.g. from
> __scsi_remove_device() (which indirectly ends up calling
> bdi_unregister()). We cannot rely on block device inode to go away soon
> (and thus i_wb reference to get dropped) as the device may get
> hot-removed e.g. under a mounted filesystem. We deal with these issues
> by switching block device inode from non-root wb_writeback structure to
> bdi->wb when needed.  Since this is rather expensive (requires
> synchronize_rcu()) we do the switching only in del_gendisk() when we
> know the device is going away.

So, the only reason cgwb_bdi_destroy() is synchronous is because bdi
destruction was synchronous.  Now that bdi is properly reference
counted and can be decoupled from gendisk / q destruction, I can't
think of a reason to keep cgwb destruction synchronous.  Switching
wb's on destruction is kinda clumsy and it almost always hurts to
expose synchronize_rcu() in userland visible paths.

Wouldn't something like the following work?

* Remove bdi->usage_cnt and the synchronous waiting in
  cgwb_bdi_destroy().

* Instead, make cgwb's hold bdi->refcnt and put it from
  cgwb_release_workfn().

Then, we don't have to switch during shutdown and can just let things
drain.

Thanks.

Jan Kara Feb. 20, 2017, 4:58 p.m. UTC | #2
On Sun 12-02-17 13:40:27, Tejun Heo wrote:
> Hello, Jan.
> 
> On Thu, Feb 09, 2017 at 01:44:31PM +0100, Jan Kara wrote:
> > When block device is closed, we call inode_detach_wb() in __blkdev_put()
> > which sets inode->i_wb to NULL. That is contrary to expectations that
> > inode->i_wb stays valid once set during the whole inode's lifetime and
> > leads to oops in wb_get() in locked_inode_to_wb_and_lock_list() because
> > inode_to_wb() returned NULL.
> > 
> > The reason why we called inode_detach_wb() is not valid anymore though.
> > BDI is guaranteed to stay around until we call bdi_put() from
> > bdev_evict_inode() so we can postpone calling inode_detach_wb() to that
> > moment. A complication is that i_wb can point to non-root wb_writeback
> > structure and in that case we do need to clean it up as bdi_unregister()
> > blocks waiting for all non-root wb_writeback references to get dropped.
> > Thus this i_wb reference could block device removal e.g. from
> > __scsi_remove_device() (which indirectly ends up calling
> > bdi_unregister()). We cannot rely on block device inode to go away soon
> > (and thus i_wb reference to get dropped) as the device may get
> > hot-removed e.g. under a mounted filesystem. We deal with these issues
> > by switching block device inode from non-root wb_writeback structure to
> > bdi->wb when needed.  Since this is rather expensive (requires
> > synchronize_rcu()) we do the switching only in del_gendisk() when we
> > know the device is going away.
> 
> So, the only reason cgwb_bdi_destroy() is synchronous is because bdi
> destruction was synchronous.  Now that bdi is properly reference
> counted and can be decoupled from gendisk / q destruction, I can't
> think of a reason to keep cgwb destruction synchronous.  Switching
> wb's on destruction is kinda clumsy and it almost always hurts to
> expose synchronize_rcu() in userland visible paths.
> 
> Wouldn't something like the following work?
> 
> * Remove bdi->usage_cnt and the synchronous waiting in
>   cgwb_bdi_destroy().
> 
> * Instead, make cgwb's hold bdi->refcnt and put it from
>   cgwb_release_workfn().
> 
> Then, we don't have to switch during shutdown and can just let things
> drain.

At first sight this looks workable and would mean less special code so I
like it. I'll experiment with it and see how it works out.

								Honza
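
The alternative raised in comment #1 (each cgwb pins the bdi with a reference and drops it from its release path, so unregistration no longer has to wait) can be sketched as a user-space reference-counting model. This is an illustration under assumptions, not the kernel implementation: all `model_*` names and fields are hypothetical, and `model_cgwb_release()` only stands in for what the thread calls cgwb_release_workfn().

```c
#include <assert.h>
#include <stdbool.h>
#include <stddef.h>

/* Hypothetical stand-in for a reference-counted bdi. */
struct model_bdi {
	int refcnt;
	bool registered;
	bool freed;
};

/* Hypothetical stand-in for a non-root (cgroup) wb. */
struct model_cgwb {
	struct model_bdi *bdi;
	int refcnt;
	bool released;
};

/* Drop one bdi reference; the bdi is freed only on the last put. */
static void model_bdi_put(struct model_bdi *bdi)
{
	if (--bdi->refcnt == 0)
		bdi->freed = true;
}

/* Creating a cgwb takes a bdi reference, pinning the bdi. */
static void model_cgwb_init(struct model_cgwb *wb, struct model_bdi *bdi)
{
	bdi->refcnt++;
	wb->bdi = bdi;
	wb->refcnt = 1;
	wb->released = false;
}

/* Release path: this is where the cgwb gives its bdi reference back. */
static void model_cgwb_release(struct model_cgwb *wb)
{
	wb->released = true;
	model_bdi_put(wb->bdi);
}

static void model_cgwb_put(struct model_cgwb *wb)
{
	if (--wb->refcnt == 0)
		model_cgwb_release(wb);
}

/* Unregister only marks the bdi; it does not wait for cgwbs to go away. */
static void model_bdi_unregister(struct model_bdi *bdi)
{
	bdi->registered = false;
}
```

In this model the device owner can unregister and drop its own reference immediately; the bdi is freed later, when the last cgwb drains and its release path runs.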

Patch

diff --git a/block/genhd.c b/block/genhd.c
index 68c613edb93a..721921a140cc 100644
--- a/block/genhd.c
+++ b/block/genhd.c
@@ -649,13 +649,13 @@  void del_gendisk(struct gendisk *disk)
 			     DISK_PITER_INCL_EMPTY | DISK_PITER_REVERSE);
 	while ((part = disk_part_iter_next(&piter))) {
 		invalidate_partition(disk, part->partno);
-		bdev_unhash_inode(part_devt(part));
+		bdev_cleanup_inode(part_devt(part));
 		delete_partition(disk, part->partno);
 	}
 	disk_part_iter_exit(&piter);
 
 	invalidate_partition(disk, 0);
-	bdev_unhash_inode(disk_devt(disk));
+	bdev_cleanup_inode(disk_devt(disk));
 	set_capacity(disk, 0);
 	disk->flags &= ~GENHD_FL_UP;
 
diff --git a/fs/block_dev.c b/fs/block_dev.c
index 360439373a66..65ac3a60ac8e 100644
--- a/fs/block_dev.c
+++ b/fs/block_dev.c
@@ -884,6 +884,8 @@  static void bdev_evict_inode(struct inode *inode)
 	spin_lock(&bdev_lock);
 	list_del_init(&bdev->bd_list);
 	spin_unlock(&bdev_lock);
+	/* Detach inode from wb early as bdi_put() may free bdi->wb */
+	inode_detach_wb(inode);
 	if (bdev->bd_bdi != &noop_backing_dev_info)
 		bdi_put(bdev->bd_bdi);
 }
@@ -960,13 +962,14 @@  static LIST_HEAD(all_bdevs);
  * If there is a bdev inode for this device, unhash it so that it gets evicted
  * as soon as last inode reference is dropped.
  */
-void bdev_unhash_inode(dev_t dev)
+void bdev_cleanup_inode(dev_t dev)
 {
 	struct inode *inode;
 
 	inode = ilookup5(blockdev_superblock, hash(dev), bdev_test, &dev);
 	if (inode) {
 		remove_inode_hash(inode);
+		inode_switch_to_default_wb_sync(inode);
 		iput(inode);
 	}
 }
@@ -1874,12 +1877,6 @@  static void __blkdev_put(struct block_device *bdev, fmode_t mode, int for_part)
 		kill_bdev(bdev);
 
 		bdev_write_inode(bdev);
-		/*
-		 * Detaching bdev inode from its wb in __destroy_inode()
-		 * is too late: the queue which embeds its bdi (along with
-		 * root wb) can be gone as soon as we put_disk() below.
-		 */
-		inode_detach_wb(bdev->bd_inode);
 	}
 	if (bdev->bd_contains == bdev) {
 		if (disk->fops->release)
diff --git a/include/linux/fs.h b/include/linux/fs.h
index 319fb76f9081..f8c86b9c31d5 100644
--- a/include/linux/fs.h
+++ b/include/linux/fs.h
@@ -2344,7 +2344,7 @@  extern struct kmem_cache *names_cachep;
 #ifdef CONFIG_BLOCK
 extern int register_blkdev(unsigned int, const char *);
 extern void unregister_blkdev(unsigned int, const char *);
-extern void bdev_unhash_inode(dev_t dev);
+extern void bdev_cleanup_inode(dev_t dev);
 extern struct block_device *bdget(dev_t);
 extern struct block_device *bdgrab(struct block_device *bdev);
 extern void bd_set_size(struct block_device *, loff_t size);
diff --git a/include/linux/writeback.h b/include/linux/writeback.h
index 0d3ba83a0f7f..6d27b78c9a79 100644
--- a/include/linux/writeback.h
+++ b/include/linux/writeback.h
@@ -237,6 +237,7 @@  static inline void inode_attach_wb(struct inode *inode, struct page *page)
 static inline void inode_detach_wb(struct inode *inode)
 {
 	if (inode->i_wb) {
+		WARN_ON_ONCE(!(inode->i_state & I_CLEAR));
 		wb_put(inode->i_wb);
 		inode->i_wb = NULL;
 	}