Message ID | 1447977911.20885.10.camel@intel.com (mailing list archive) |
---|---|
State | Superseded |
Headers | show |
On Fri, Nov 20, 2015 at 12:05:11AM +0000, Williams, Dan J wrote: > On Fri, 2015-11-20 at 10:17 +1100, Dave Chinner wrote: > > Actually, I think we need to trigger a filesystem shutdown before > > doing anything else (e.g. before unmapping the inodes).That way the > > filesystem will error out new calls to get_blocks() and prevent any > > new IO submission while the block device teardown and inode > > unmapping is done. i.e. solving the problem at the block device > > level is hard, but we already have all the necessary infrastructure > > to shut off IO and new block mappings at the filesystem level.... > > > > Shutting down the filesystem on block_device remove seems a more > invasive behavior change from what we've historically done. I've heard that so many times I figured that would be your answer. yet we've got a clear situation where we have a race between file level access and block device operations that is clearly solved by doing an upfront filesystem shutdown on unplug, but still the answer is "ignore the filesystem, we need to do everything in the block layer, no matter how hard or complex it is to solve"... > I.e. a > best effort "let the fs struggle on to service whatever it can that is > not dependent on new disk I/O". And so we still have this limbo fs state that is an utter nightmare to handle sanely. We don't know what the cause of the IO error are and so we have to try to handle them as though we can recover in some way from the error. Only when we get an error we can't possibly recover from do we shut the fileystem down and then stop all attempts at issuing IO, mapping page faults, etc. However, if the device has been unplugged then we *can never recover* and so continuing on with out eyes closed and fingers in our eyes shouting 'lalalalalalala" as loud as we can won't change the fact that we are going to shut down the filesystem in the near future. -Dave.
On Thu, Nov 19, 2015 at 8:06 PM, Dave Chinner <david@fromorbit.com> wrote: > On Fri, Nov 20, 2015 at 12:05:11AM +0000, Williams, Dan J wrote: >> On Fri, 2015-11-20 at 10:17 +1100, Dave Chinner wrote: >> > Actually, I think we need to trigger a filesystem shutdown before >> > doing anything else (e.g. before unmapping the inodes).That way the >> > filesystem will error out new calls to get_blocks() and prevent any >> > new IO submission while the block device teardown and inode >> > unmapping is done. i.e. solving the problem at the block device >> > level is hard, but we already have all the necessary infrastructure >> > to shut off IO and new block mappings at the filesystem level.... >> > >> >> Shutting down the filesystem on block_device remove seems a more >> invasive behavior change from what we've historically done. > > I've heard that so many times I figured that would be your answer. > yet we've got a clear situation where we have a race between file > level access and block device operations that is clearly solved by > doing an upfront filesystem shutdown on unplug, but still the answer > is "ignore the filesystem, we need to do everything in the block > layer, no matter how hard or complex it is to solve"... I was only disagreeing with your assertion that solving this particular problem in the block layer was hard, and not asserting that this needs to be solved in the block layer at all costs. >> I.e. a >> best effort "let the fs struggle on to service whatever it can that is >> not dependent on new disk I/O". > > And so we still have this limbo fs state that is an utter nightmare > to handle sanely. We don't know what the cause of the IO error are > and so we have to try to handle them as though we can recover in > some way from the error. Only when we get an error we can't possibly > recover from do we shut the fileystem down and then stop all > attempts at issuing IO, mapping page faults, etc. > > However, if the device has been unplugged then we *can never > recover* and so continuing on with out eyes closed and fingers in > our eyes shouting 'lalalalalalala" as loud as we can won't change > the fact that we are going to shut down the filesystem in the near > future. > Can't argue with that, and XFS takes a lot longer to mourn the loss of the block device than ext4. What would be the recommended interface to tell XFS to sync if it can, but give up quickly if it hits an error and then shutdown permanently?
On Thu, Nov 19, 2015 at 8:25 PM, Dan Williams <dan.j.williams@intel.com> wrote: > On Thu, Nov 19, 2015 at 8:06 PM, Dave Chinner <david@fromorbit.com> wrote: [..] > What would be the recommended interface to tell XFS to sync if it can, > but give up quickly if it hits an error and then shutdown permanently? Thinking about this a bit further I think a "give up" notification to a filesystem is an incremental addition to the common vfs layer invalidate_partition() and now unmap_partition(). Because "sync if you can, but give up quickly" is handled by fsync_bdev() and generic_make_request() returning immediate errors. It's only the file system triggered retries that we need to turn off.
diff --git a/block/genhd.c b/block/genhd.c index e5cafa51567c..37ab780c0937 100644 --- a/block/genhd.c +++ b/block/genhd.c @@ -634,24 +634,14 @@ void add_disk(struct gendisk *disk) } EXPORT_SYMBOL(add_disk); -void del_gendisk(struct gendisk *disk) +static void del_gendisk_start(struct gendisk *disk) { - struct disk_part_iter piter; - struct hd_struct *part; - blk_integrity_del(disk); disk_del_events(disk); +} - /* invalidate stuff */ - disk_part_iter_init(&piter, disk, - DISK_PITER_INCL_EMPTY | DISK_PITER_REVERSE); - while ((part = disk_part_iter_next(&piter))) { - invalidate_partition(disk, part->partno); - delete_partition(disk, part->partno); - } - disk_part_iter_exit(&piter); - - invalidate_partition(disk, 0); +static void del_gendisk_end(struct gendisk *disk) +{ set_capacity(disk, 0); disk->flags &= ~GENHD_FL_UP; @@ -670,9 +660,79 @@ void del_gendisk(struct gendisk *disk) pm_runtime_set_memalloc_noio(disk_to_dev(disk), false); device_del(disk_to_dev(disk)); } + +#define for_each_part(part, piter) \ + for (part = disk_part_iter_next(piter); part; \ + part = disk_part_iter_next(piter)) +void del_gendisk(struct gendisk *disk) +{ + struct disk_part_iter piter; + struct hd_struct *part; + + del_gendisk_start(disk); + + /* invalidate stuff */ + disk_part_iter_init(&piter, disk, + DISK_PITER_INCL_EMPTY | DISK_PITER_REVERSE); + for_each_part(part, &piter) { + invalidate_partition(disk, part->partno); + delete_partition(disk, part->partno); + } + disk_part_iter_exit(&piter); + invalidate_partition(disk, 0); + + del_gendisk_end(disk); +} EXPORT_SYMBOL(del_gendisk); /** + * del_gendisk_queue - combined del_gendisk + blk_cleanup_queue + * @disk: disk to delete, invalidate, unmap, and end i/o + * + * This alternative for open coded calls to: + * del_gendisk() + * blk_cleanup_queue() + * ...is for ->direct_access() (DAX) capable devices. DAX bypasses page + * cache and mappings go directly to storage media. When such a disk is + * removed the pfn backing a mapping may be invalid or removed from the + * system. Upon return accessing DAX mappings of this disk will trigger + * SIGBUS. + */ +void del_gendisk_queue(struct gendisk *disk) +{ + struct disk_part_iter piter; + struct hd_struct *part; + + del_gendisk_start(disk); + + /* pass1 sync fs + evict idle inodes */ + disk_part_iter_init(&piter, disk, + DISK_PITER_INCL_EMPTY | DISK_PITER_REVERSE); + for_each_part(part, &piter) + invalidate_partition(disk, part->partno); + disk_part_iter_exit(&piter); + invalidate_partition(disk, 0); + + blk_cleanup_queue(disk->queue); + + /* + * pass2 now that the queue is dead unmap DAX inodes, and delete + * partitions + */ + disk_part_iter_init(&piter, disk, + DISK_PITER_INCL_EMPTY | DISK_PITER_REVERSE); + for_each_part(part, &piter) { + unmap_partition(disk, part->partno); + delete_partition(disk, part->partno); + } + disk_part_iter_exit(&piter); + unmap_partition(disk, 0); + + del_gendisk_end(disk); +} +EXPORT_SYMBOL(del_gendisk_queue); + +/** * get_gendisk - get partitioning information for a given device * @devt: device to get partitioning information for * @partno: returned partition index diff --git a/drivers/block/brd.c b/drivers/block/brd.c index a5880f4ab40e..f149dd57bba3 100644 --- a/drivers/block/brd.c +++ b/drivers/block/brd.c @@ -532,7 +532,6 @@ out: static void brd_free(struct brd_device *brd) { put_disk(brd->brd_disk); - blk_cleanup_queue(brd->brd_queue); brd_free_pages(brd); kfree(brd); } @@ -560,7 +559,7 @@ out: static void brd_del_one(struct brd_device *brd) { list_del(&brd->brd_list); - del_gendisk(brd->brd_disk); + del_gendisk_queue(brd->brd_disk); brd_free(brd); } diff --git a/drivers/nvdimm/pmem.c b/drivers/nvdimm/pmem.c index 8ee79893d2f5..6dd06e9d34b0 100644 --- a/drivers/nvdimm/pmem.c +++ b/drivers/nvdimm/pmem.c @@ -158,9 +158,8 @@ static void pmem_detach_disk(struct pmem_device *pmem) if (!pmem->pmem_disk) return; - del_gendisk(pmem->pmem_disk); + del_gendisk_queue(pmem->pmem_disk); put_disk(pmem->pmem_disk); - blk_cleanup_queue(pmem->pmem_queue); } static int pmem_attach_disk(struct device *dev, diff --git a/drivers/s390/block/dcssblk.c b/drivers/s390/block/dcssblk.c index 94a8f4ab57bc..0c3c968b57d9 100644 --- a/drivers/s390/block/dcssblk.c +++ b/drivers/s390/block/dcssblk.c @@ -388,8 +388,7 @@ removeseg: } list_del(&dev_info->lh); - del_gendisk(dev_info->gd); - blk_cleanup_queue(dev_info->dcssblk_queue); + del_gendisk_queue(dev_info->gd); dev_info->gd->queue = NULL; put_disk(dev_info->gd); up_write(&dcssblk_devices_sem); @@ -751,8 +750,7 @@ dcssblk_remove_store(struct device *dev, struct device_attribute *attr, const ch } list_del(&dev_info->lh); - del_gendisk(dev_info->gd); - blk_cleanup_queue(dev_info->dcssblk_queue); + del_gendisk_queue(dev_info->gd); dev_info->gd->queue = NULL; put_disk(dev_info->gd); device_unregister(&dev_info->dev); diff --git a/fs/block_dev.c b/fs/block_dev.c index 4c14d4467bbf..975d32b5623b 100644 --- a/fs/block_dev.c +++ b/fs/block_dev.c @@ -1795,6 +1795,47 @@ int __invalidate_device(struct block_device *bdev, bool kill_dirty) } EXPORT_SYMBOL(__invalidate_device); +void unmap_partition(struct gendisk *disk, int partno) +{ + bool dax = !!disk->fops->direct_access; + struct block_device *bdev = dax ? bdget_disk(disk, partno) : NULL; + struct super_block *sb = get_super(bdev); + struct inode *inode, *_inode = NULL; + + if (!bdev) + return; + + if (!sb) { + bdput(bdev); + return; + } + + spin_lock(&sb->s_inode_list_lock); + list_for_each_entry(inode, &sb->s_inodes, i_sb_list) { + spin_lock(&inode->i_lock); + if ((inode->i_state & (I_FREEING|I_WILL_FREE|I_NEW)) + || !IS_DAX(inode)) { + spin_unlock(&inode->i_lock); + continue; + } + __iget(inode); + spin_unlock(&inode->i_lock); + spin_unlock(&sb->s_inode_list_lock); + + unmap_mapping_range(inode->i_mapping, 0, 0, 1); + iput(_inode); + _inode = inode; + cond_resched(); + + spin_lock(&sb->s_inode_list_lock); + } + spin_unlock(&sb->s_inode_list_lock); + iput(_inode); + + drop_super(sb); + bdput(bdev); +} + void iterate_bdevs(void (*func)(struct block_device *, void *), void *arg) { struct inode *inode, *old_inode = NULL; diff --git a/include/linux/fs.h b/include/linux/fs.h index 3aa514254161..d2dda249abf8 100644 --- a/include/linux/fs.h +++ b/include/linux/fs.h @@ -2390,6 +2390,7 @@ extern int revalidate_disk(struct gendisk *); extern int check_disk_change(struct block_device *); extern int __invalidate_device(struct block_device *, bool); extern int invalidate_partition(struct gendisk *, int); +extern void unmap_partition(struct gendisk *, int); #endif unsigned long invalidate_mapping_pages(struct address_space *mapping, pgoff_t start, pgoff_t end); diff --git a/include/linux/genhd.h b/include/linux/genhd.h index 847cc1d91634..028cf15a8a57 100644 --- a/include/linux/genhd.h +++ b/include/linux/genhd.h @@ -431,6 +431,7 @@ extern void part_round_stats(int cpu, struct hd_struct *part); /* block/genhd.c */ extern void add_disk(struct gendisk *disk); extern void del_gendisk(struct gendisk *gp); +extern void del_gendisk_queue(struct gendisk *disk); extern struct gendisk *get_gendisk(dev_t dev, int *partno); extern struct block_device *bdget_disk(struct gendisk *disk, int partno);