diff mbox

[4/8] mm, dax: truncate dax mappings at bdev or fs shutdown

Message ID 1447977911.20885.10.camel@intel.com (mailing list archive)
State Superseded
Headers show

Commit Message

Dan Williams Nov. 20, 2015, 12:05 a.m. UTC
On Fri, 2015-11-20 at 10:17 +1100, Dave Chinner wrote:
> Actually, I think we need to trigger a filesystem shutdown before
> doing anything else (e.g. before unmapping the inodes). That way the
> filesystem will error out new calls to get_blocks() and prevent any
> new IO submission while the block device teardown and inode
> unmapping is done. i.e. solving the problem at the block device
> level is hard, but we already have all the necessary infrastructure
> to shut off IO and new block mappings at the filesystem level....


Shutting down the filesystem on block_device removal seems a more
invasive behavior change than what we've historically done, i.e. a
best-effort "let the fs struggle on to service whatever it can that is
not dependent on new disk I/O".

Solving it at the block layer isn't that hard now that we have
blk_queue_enter() to cheaply test if the block_device is dying or dead.

Below is an updated attempt that borrows a common inode walking
pattern and simply reorders the operations done by del_gendisk() and
blk_cleanup_queue().

Note that this depends on dax_map_atomic() from "[PATCH 8/8] dax: fix
lifetime of in-kernel dax mappings with dax_map_atomic()"
(https://lists.01.org/pipermail/linux-nvdimm/2015-November/002882.html),
where dax_map_atomic() checks that the queue is still live before
allowing new calls to bdev_direct_access().

8<-----
Subject: mm, dax: unmap dax mappings at bdev shutdown

From: Dan Williams <dan.j.williams@intel.com>


Currently dax mappings leak past / survive block_device shutdown.  While
page cache pages are permitted to be read/written after the block_device
is torn down, this is not acceptable in the dax case as all media access
must end when the device is disabled.  The pfn backing a dax mapping is
permitted to be invalidated after bdev shutdown, and this is indeed the
case with brd.

When a dax-capable block_device driver calls del_gendisk_queue() in its
shutdown path it needs to ensure that all DAX pfns are unmapped, and
that no new mappings can be established.  This is different from the
page-cache-backed case, where the disk is protected by the queue being
torn down, which ends I/O to the device.  Since dax bypasses the page
cache we need to unconditionally unmap the inode.

Cc: Jan Kara <jack@suse.com>
Cc: Dave Chinner <david@fromorbit.com>
Cc: Matthew Wilcox <willy@linux.intel.com>
Cc: Ross Zwisler <ross.zwisler@linux.intel.com>
[honza: drop changes to truncate_inode_pages_final]
[honza: ensure mappings can't be re-established]
Signed-off-by: Dan Williams <dan.j.williams@intel.com>

---
 block/genhd.c                |   88 +++++++++++++++++++++++++++++++++++-------
 drivers/block/brd.c          |    3 -
 drivers/nvdimm/pmem.c        |    3 -
 drivers/s390/block/dcssblk.c |    6 +--
 fs/block_dev.c               |   41 ++++++++++++++++++++
 include/linux/fs.h           |    1 
 include/linux/genhd.h        |    1 
 7 files changed, 121 insertions(+), 22 deletions(-)

Comments

Dave Chinner Nov. 20, 2015, 4:06 a.m. UTC | #1
On Fri, Nov 20, 2015 at 12:05:11AM +0000, Williams, Dan J wrote:
> On Fri, 2015-11-20 at 10:17 +1100, Dave Chinner wrote:
> > Actually, I think we need to trigger a filesystem shutdown before
> > doing anything else (e.g. before unmapping the inodes). That way the
> > filesystem will error out new calls to get_blocks() and prevent any
> > new IO submission while the block device teardown and inode
> > unmapping is done. i.e. solving the problem at the block device
> > level is hard, but we already have all the necessary infrastructure
> > to shut off IO and new block mappings at the filesystem level....
> > 
> 
> Shutting down the filesystem on block_device remove seems a more
> invasive behavior change from what we've historically done.

I've heard that so many times I figured that would be your answer.
Yet we've got a clear situation where we have a race between
file-level access and block device operations that is clearly solved
by doing an upfront filesystem shutdown on unplug, but still the
answer is "ignore the filesystem, we need to do everything in the
block layer, no matter how hard or complex it is to solve"...

>  I.e. a
> best effort "let the fs struggle on to service whatever it can that is
> not dependent on new disk I/O".

And so we still have this limbo fs state that is an utter nightmare
to handle sanely. We don't know what the cause of the IO errors is,
and so we have to try to handle them as though we can recover in
some way from the error. Only when we get an error we can't possibly
recover from do we shut the filesystem down and then stop all
attempts at issuing IO, mapping page faults, etc.

However, if the device has been unplugged then we *can never
recover*, and so continuing on with our eyes closed and fingers in
our ears shouting "lalalalalalala" as loud as we can won't change
the fact that we are going to shut down the filesystem in the near
future.

-Dave.
Dan Williams Nov. 20, 2015, 4:25 a.m. UTC | #2
On Thu, Nov 19, 2015 at 8:06 PM, Dave Chinner <david@fromorbit.com> wrote:
> On Fri, Nov 20, 2015 at 12:05:11AM +0000, Williams, Dan J wrote:
>> On Fri, 2015-11-20 at 10:17 +1100, Dave Chinner wrote:
>> > Actually, I think we need to trigger a filesystem shutdown before
>> > doing anything else (e.g. before unmapping the inodes). That way the
>> > filesystem will error out new calls to get_blocks() and prevent any
>> > new IO submission while the block device teardown and inode
>> > unmapping is done. i.e. solving the problem at the block device
>> > level is hard, but we already have all the necessary infrastructure
>> > to shut off IO and new block mappings at the filesystem level....
>> >
>>
>> Shutting down the filesystem on block_device remove seems a more
>> invasive behavior change from what we've historically done.
>
> I've heard that so many times I figured that would be your answer.
> Yet we've got a clear situation where we have a race between
> file-level access and block device operations that is clearly solved
> by doing an upfront filesystem shutdown on unplug, but still the
> answer is "ignore the filesystem, we need to do everything in the
> block layer, no matter how hard or complex it is to solve"...

I was only disagreeing with your assertion that solving this
particular problem in the block layer was hard, and not asserting that
this needs to be solved in the block layer at all costs.

>>  I.e. a
>> best effort "let the fs struggle on to service whatever it can that is
>> not dependent on new disk I/O".
>
> And so we still have this limbo fs state that is an utter nightmare
> to handle sanely. We don't know what the cause of the IO errors is,
> and so we have to try to handle them as though we can recover in
> some way from the error. Only when we get an error we can't possibly
> recover from do we shut the filesystem down and then stop all
> attempts at issuing IO, mapping page faults, etc.
>
> However, if the device has been unplugged then we *can never
> recover*, and so continuing on with our eyes closed and fingers in
> our ears shouting "lalalalalalala" as loud as we can won't change
> the fact that we are going to shut down the filesystem in the near
> future.
>

Can't argue with that, and XFS takes a lot longer to mourn the loss of
the block device than ext4.

What would be the recommended interface to tell XFS to sync if it can,
but give up quickly if it hits an error and then shutdown permanently?
Dan Williams Nov. 20, 2015, 5:08 p.m. UTC | #3
On Thu, Nov 19, 2015 at 8:25 PM, Dan Williams <dan.j.williams@intel.com> wrote:
> On Thu, Nov 19, 2015 at 8:06 PM, Dave Chinner <david@fromorbit.com> wrote:
[..]
> What would be the recommended interface to tell XFS to sync if it can,
> but give up quickly if it hits an error and then shutdown permanently?

Thinking about this a bit further, I think a "give up" notification to
a filesystem is an incremental addition on top of the common vfs-layer
invalidate_partition() and the new unmap_partition().  "Sync if you
can, but give up quickly" is already handled by fsync_bdev() and
generic_make_request() returning immediate errors; it's only the
filesystem-triggered retries that we need to turn off.

Patch

diff --git a/block/genhd.c b/block/genhd.c
index e5cafa51567c..37ab780c0937 100644
--- a/block/genhd.c
+++ b/block/genhd.c
@@ -634,24 +634,14 @@  void add_disk(struct gendisk *disk)
 }
 EXPORT_SYMBOL(add_disk);
 
-void del_gendisk(struct gendisk *disk)
+static void del_gendisk_start(struct gendisk *disk)
 {
-	struct disk_part_iter piter;
-	struct hd_struct *part;
-
 	blk_integrity_del(disk);
 	disk_del_events(disk);
+}
 
-	/* invalidate stuff */
-	disk_part_iter_init(&piter, disk,
-			     DISK_PITER_INCL_EMPTY | DISK_PITER_REVERSE);
-	while ((part = disk_part_iter_next(&piter))) {
-		invalidate_partition(disk, part->partno);
-		delete_partition(disk, part->partno);
-	}
-	disk_part_iter_exit(&piter);
-
-	invalidate_partition(disk, 0);
+static void del_gendisk_end(struct gendisk *disk)
+{
 	set_capacity(disk, 0);
 	disk->flags &= ~GENHD_FL_UP;
 
@@ -670,9 +660,79 @@  void del_gendisk(struct gendisk *disk)
 	pm_runtime_set_memalloc_noio(disk_to_dev(disk), false);
 	device_del(disk_to_dev(disk));
 }
+
+#define for_each_part(part, piter) \
+	for (part = disk_part_iter_next(piter); part; \
+			part = disk_part_iter_next(piter))
+void del_gendisk(struct gendisk *disk)
+{
+	struct disk_part_iter piter;
+	struct hd_struct *part;
+
+	del_gendisk_start(disk);
+
+	/* invalidate stuff */
+	disk_part_iter_init(&piter, disk,
+			     DISK_PITER_INCL_EMPTY | DISK_PITER_REVERSE);
+	for_each_part(part, &piter) {
+		invalidate_partition(disk, part->partno);
+		delete_partition(disk, part->partno);
+	}
+	disk_part_iter_exit(&piter);
+	invalidate_partition(disk, 0);
+
+	del_gendisk_end(disk);
+}
 EXPORT_SYMBOL(del_gendisk);
 
 /**
+ * del_gendisk_queue - combined del_gendisk + blk_cleanup_queue
+ * @disk: disk to delete, invalidate, unmap, and end i/o
+ *
+ * This is an alternative to open-coded calls to:
+ *     del_gendisk()
+ *     blk_cleanup_queue()
+ * ...is for ->direct_access() (DAX) capable devices.  DAX bypasses page
+ * cache and mappings go directly to storage media.  When such a disk is
+ * removed the pfn backing a mapping may be invalid or removed from the
+ * system.  Upon return accessing DAX mappings of this disk will trigger
+ * SIGBUS.
+ */
+void del_gendisk_queue(struct gendisk *disk)
+{
+	struct disk_part_iter piter;
+	struct hd_struct *part;
+
+	del_gendisk_start(disk);
+
+	/* pass1 sync fs + evict idle inodes */
+	disk_part_iter_init(&piter, disk,
+			     DISK_PITER_INCL_EMPTY | DISK_PITER_REVERSE);
+	for_each_part(part, &piter)
+		invalidate_partition(disk, part->partno);
+	disk_part_iter_exit(&piter);
+	invalidate_partition(disk, 0);
+
+	blk_cleanup_queue(disk->queue);
+
+	/*
+	 * pass2 now that the queue is dead unmap DAX inodes, and delete
+	 * partitions
+	 */
+	disk_part_iter_init(&piter, disk,
+			     DISK_PITER_INCL_EMPTY | DISK_PITER_REVERSE);
+	for_each_part(part, &piter) {
+		unmap_partition(disk, part->partno);
+		delete_partition(disk, part->partno);
+	}
+	disk_part_iter_exit(&piter);
+	unmap_partition(disk, 0);
+
+	del_gendisk_end(disk);
+}
+EXPORT_SYMBOL(del_gendisk_queue);
+
+/**
  * get_gendisk - get partitioning information for a given device
  * @devt: device to get partitioning information for
  * @partno: returned partition index
diff --git a/drivers/block/brd.c b/drivers/block/brd.c
index a5880f4ab40e..f149dd57bba3 100644
--- a/drivers/block/brd.c
+++ b/drivers/block/brd.c
@@ -532,7 +532,6 @@  out:
 static void brd_free(struct brd_device *brd)
 {
 	put_disk(brd->brd_disk);
-	blk_cleanup_queue(brd->brd_queue);
 	brd_free_pages(brd);
 	kfree(brd);
 }
@@ -560,7 +559,7 @@  out:
 static void brd_del_one(struct brd_device *brd)
 {
 	list_del(&brd->brd_list);
-	del_gendisk(brd->brd_disk);
+	del_gendisk_queue(brd->brd_disk);
 	brd_free(brd);
 }
 
diff --git a/drivers/nvdimm/pmem.c b/drivers/nvdimm/pmem.c
index 8ee79893d2f5..6dd06e9d34b0 100644
--- a/drivers/nvdimm/pmem.c
+++ b/drivers/nvdimm/pmem.c
@@ -158,9 +158,8 @@  static void pmem_detach_disk(struct pmem_device *pmem)
 	if (!pmem->pmem_disk)
 		return;
 
-	del_gendisk(pmem->pmem_disk);
+	del_gendisk_queue(pmem->pmem_disk);
 	put_disk(pmem->pmem_disk);
-	blk_cleanup_queue(pmem->pmem_queue);
 }
 
 static int pmem_attach_disk(struct device *dev,
diff --git a/drivers/s390/block/dcssblk.c b/drivers/s390/block/dcssblk.c
index 94a8f4ab57bc..0c3c968b57d9 100644
--- a/drivers/s390/block/dcssblk.c
+++ b/drivers/s390/block/dcssblk.c
@@ -388,8 +388,7 @@  removeseg:
 	}
 	list_del(&dev_info->lh);
 
-	del_gendisk(dev_info->gd);
-	blk_cleanup_queue(dev_info->dcssblk_queue);
+	del_gendisk_queue(dev_info->gd);
 	dev_info->gd->queue = NULL;
 	put_disk(dev_info->gd);
 	up_write(&dcssblk_devices_sem);
@@ -751,8 +750,7 @@  dcssblk_remove_store(struct device *dev, struct device_attribute *attr, const ch
 	}
 
 	list_del(&dev_info->lh);
-	del_gendisk(dev_info->gd);
-	blk_cleanup_queue(dev_info->dcssblk_queue);
+	del_gendisk_queue(dev_info->gd);
 	dev_info->gd->queue = NULL;
 	put_disk(dev_info->gd);
 	device_unregister(&dev_info->dev);
diff --git a/fs/block_dev.c b/fs/block_dev.c
index 4c14d4467bbf..975d32b5623b 100644
--- a/fs/block_dev.c
+++ b/fs/block_dev.c
@@ -1795,6 +1795,48 @@  int __invalidate_device(struct block_device *bdev, bool kill_dirty)
 }
 EXPORT_SYMBOL(__invalidate_device);
 
+void unmap_partition(struct gendisk *disk, int partno)
+{
+	bool dax = !!disk->fops->direct_access;
+	struct block_device *bdev = dax ? bdget_disk(disk, partno) : NULL;
+	struct super_block *sb;
+	struct inode *inode, *_inode = NULL;
+
+	if (!bdev)
+		return;
+
+	sb = get_super(bdev);
+	if (!sb) {
+		bdput(bdev);
+		return;
+	}
+
+	spin_lock(&sb->s_inode_list_lock);
+	list_for_each_entry(inode, &sb->s_inodes, i_sb_list) {
+		spin_lock(&inode->i_lock);
+		if ((inode->i_state & (I_FREEING|I_WILL_FREE|I_NEW))
+				|| !IS_DAX(inode)) {
+			spin_unlock(&inode->i_lock);
+			continue;
+		}
+		__iget(inode);
+		spin_unlock(&inode->i_lock);
+		spin_unlock(&sb->s_inode_list_lock);
+
+		unmap_mapping_range(inode->i_mapping, 0, 0, 1);
+		iput(_inode);
+		_inode = inode;
+		cond_resched();
+
+		spin_lock(&sb->s_inode_list_lock);
+	}
+	spin_unlock(&sb->s_inode_list_lock);
+	iput(_inode);
+
+	drop_super(sb);
+	bdput(bdev);
+}
+
 void iterate_bdevs(void (*func)(struct block_device *, void *), void *arg)
 {
 	struct inode *inode, *old_inode = NULL;
diff --git a/include/linux/fs.h b/include/linux/fs.h
index 3aa514254161..d2dda249abf8 100644
--- a/include/linux/fs.h
+++ b/include/linux/fs.h
@@ -2390,6 +2390,7 @@  extern int revalidate_disk(struct gendisk *);
 extern int check_disk_change(struct block_device *);
 extern int __invalidate_device(struct block_device *, bool);
 extern int invalidate_partition(struct gendisk *, int);
+extern void unmap_partition(struct gendisk *, int);
 #endif
 unsigned long invalidate_mapping_pages(struct address_space *mapping,
 					pgoff_t start, pgoff_t end);
diff --git a/include/linux/genhd.h b/include/linux/genhd.h
index 847cc1d91634..028cf15a8a57 100644
--- a/include/linux/genhd.h
+++ b/include/linux/genhd.h
@@ -431,6 +431,7 @@  extern void part_round_stats(int cpu, struct hd_struct *part);
 /* block/genhd.c */
 extern void add_disk(struct gendisk *disk);
 extern void del_gendisk(struct gendisk *gp);
+extern void del_gendisk_queue(struct gendisk *disk);
 extern struct gendisk *get_gendisk(dev_t dev, int *partno);
 extern struct block_device *bdget_disk(struct gendisk *disk, int partno);