Message ID | 20151125183724.12508.73971.stgit@dwillia2-desk3.amr.corp.intel.com (mailing list archive) |
---|---|
State | Superseded |
Headers | show |
On Wed, Nov 25, 2015 at 10:37:24AM -0800, Dan Williams wrote: > Set SB_I_BDIDEAD when a block device is no longer available to service > requests. This will be used in several places where an fs should give > up early because the block device is gone. Letting the fs continue on > as if the block device is still present can lead to long latencies > waiting for an fs to detect the loss of its backing device, trigger > crashes, or generate misleasing warnings. > > Cc: Jan Kara <jack@suse.com> > Cc: Jens Axboe <axboe@fb.com> > Suggested-by: Dave Chinner <david@fromorbit.com> This isn't what I suggested. :/ ..... > diff --git a/fs/block_dev.c b/fs/block_dev.c > index 1dd416bf72f7..d0233d643d33 100644 > --- a/fs/block_dev.c > +++ b/fs/block_dev.c > @@ -1795,6 +1795,23 @@ int __invalidate_device(struct block_device *bdev, bool kill_dirty) > } > EXPORT_SYMBOL(__invalidate_device); > > +void kill_bdev_super(struct gendisk *disk, int partno) > +{ > + struct block_device *bdev = bdget_disk(disk, partno); > + struct super_block *sb; > + > + if (!bdev) > + return; > + sb = get_super(bdev); > + if (!sb) > + goto out; > + > + sb->s_iflags |= SB_I_BDI_DEAD; > + drop_super(sb); > + out: > + bdput(bdev); > +} That's not a notification to the filesystem - that's a status flag the filesystem has to explicitly check for *on every operation*. We already have checks like these throughout filesystems, but they are filesystem specific and need to propagate into fs-specific subsystems that have knowledge of VFS level superblocks. To that end, what I actually suggested what a callback - something like a function in the super operations structure so that the filesystem can take *immediate action* when the block device is dying. i.e. void kill_bdev_super(struct gendisk *disk, int partno) { struct block_device *bdev = bdget_disk(disk, partno); struct super_block *sb; if (!bdev) return; sb = get_super(bdev); if (!sb) goto out; if (sb->s_ops->shutdown) sb->s_ops->shutdown(sb); drop_super(sb); out: bdput(bdev); } and then we implement ->shutdown somthing like this in XFS: xfs_fs_shutdown(struct superblock *sb) { xfs_force_shutdown(XFS_M(sb), SHUTDOWN_DEVICE_REQ); } and so we immediately notify the entire filesystem that a shutdown state has been entered and the appropriate actions are immediately taken. Cheers, Dave.
On Wed, Nov 25, 2015 at 1:50 PM, Dave Chinner <david@fromorbit.com> wrote: > On Wed, Nov 25, 2015 at 10:37:24AM -0800, Dan Williams wrote: >> Set SB_I_BDIDEAD when a block device is no longer available to service >> requests. This will be used in several places where an fs should give >> up early because the block device is gone. Letting the fs continue on >> as if the block device is still present can lead to long latencies >> waiting for an fs to detect the loss of its backing device, trigger >> crashes, or generate misleasing warnings. >> >> Cc: Jan Kara <jack@suse.com> >> Cc: Jens Axboe <axboe@fb.com> >> Suggested-by: Dave Chinner <david@fromorbit.com> > > This isn't what I suggested. :/ > > ..... > >> diff --git a/fs/block_dev.c b/fs/block_dev.c >> index 1dd416bf72f7..d0233d643d33 100644 >> --- a/fs/block_dev.c >> +++ b/fs/block_dev.c >> @@ -1795,6 +1795,23 @@ int __invalidate_device(struct block_device *bdev, bool kill_dirty) >> } >> EXPORT_SYMBOL(__invalidate_device); >> >> +void kill_bdev_super(struct gendisk *disk, int partno) >> +{ >> + struct block_device *bdev = bdget_disk(disk, partno); >> + struct super_block *sb; >> + >> + if (!bdev) >> + return; >> + sb = get_super(bdev); >> + if (!sb) >> + goto out; >> + >> + sb->s_iflags |= SB_I_BDI_DEAD; >> + drop_super(sb); >> + out: >> + bdput(bdev); >> +} > > That's not a notification to the filesystem - that's a status flag > the filesystem has to explicitly check for *on every operation*. We > already have checks like these throughout filesystems, but they are > filesystem specific and need to propagate into fs-specific > subsystems that have knowledge of VFS level superblocks. > > To that end, what I actually suggested what a callback - something > like a function in the super operations structure so that the > filesystem can take *immediate action* when the block device is > dying. i.e. > > void kill_bdev_super(struct gendisk *disk, int partno) > { > struct block_device *bdev = bdget_disk(disk, partno); > struct super_block *sb; > > if (!bdev) > return; > sb = get_super(bdev); > if (!sb) > goto out; > > if (sb->s_ops->shutdown) > sb->s_ops->shutdown(sb); > > drop_super(sb); > out: > bdput(bdev); > } > > and then we implement ->shutdown somthing like this in XFS: > > xfs_fs_shutdown(struct superblock *sb) > { > xfs_force_shutdown(XFS_M(sb), SHUTDOWN_DEVICE_REQ); > } > > and so we immediately notify the entire filesystem that a shutdown > state has been entered and the appropriate actions are immediately > taken. > That sounds good in theory. However, in the case of xfs it is already calling xfs_force_shutdown(), but that does not prevent it from continuing to wait indefinitely at umount. For the ext4 the mark_inode_dirty() warning we're triggering the error in generic code. None of this fixes the problem of dax mappings leaking past block device remove. That can be done generically without needing per-fs code. Solving the "zombie filesystem after block device down" problem is incremental to this patch set.
On Wed, Nov 25, 2015 at 02:09:34PM -0800, Dan Williams wrote: > On Wed, Nov 25, 2015 at 1:50 PM, Dave Chinner <david@fromorbit.com> wrote: > > On Wed, Nov 25, 2015 at 10:37:24AM -0800, Dan Williams wrote: > >> Set SB_I_BDIDEAD when a block device is no longer available to service > >> requests. This will be used in several places where an fs should give > >> up early because the block device is gone. Letting the fs continue on > >> as if the block device is still present can lead to long latencies > >> waiting for an fs to detect the loss of its backing device, trigger > >> crashes, or generate misleasing warnings. > >> > >> Cc: Jan Kara <jack@suse.com> > >> Cc: Jens Axboe <axboe@fb.com> > >> Suggested-by: Dave Chinner <david@fromorbit.com> > > > > This isn't what I suggested. :/ > > > > ..... > > > >> diff --git a/fs/block_dev.c b/fs/block_dev.c > >> index 1dd416bf72f7..d0233d643d33 100644 > >> --- a/fs/block_dev.c > >> +++ b/fs/block_dev.c > >> @@ -1795,6 +1795,23 @@ int __invalidate_device(struct block_device *bdev, bool kill_dirty) > >> } > >> EXPORT_SYMBOL(__invalidate_device); > >> > >> +void kill_bdev_super(struct gendisk *disk, int partno) > >> +{ > >> + struct block_device *bdev = bdget_disk(disk, partno); > >> + struct super_block *sb; > >> + > >> + if (!bdev) > >> + return; > >> + sb = get_super(bdev); > >> + if (!sb) > >> + goto out; > >> + > >> + sb->s_iflags |= SB_I_BDI_DEAD; > >> + drop_super(sb); > >> + out: > >> + bdput(bdev); > >> +} > > > > That's not a notification to the filesystem - that's a status flag > > the filesystem has to explicitly check for *on every operation*. We > > already have checks like these throughout filesystems, but they are > > filesystem specific and need to propagate into fs-specific > > subsystems that have knowledge of VFS level superblocks. > > > > To that end, what I actually suggested what a callback - something > > like a function in the super operations structure so that the > > filesystem can take *immediate action* when the block device is > > dying. i.e. > > > > void kill_bdev_super(struct gendisk *disk, int partno) > > { > > struct block_device *bdev = bdget_disk(disk, partno); > > struct super_block *sb; > > > > if (!bdev) > > return; > > sb = get_super(bdev); > > if (!sb) > > goto out; > > > > if (sb->s_ops->shutdown) > > sb->s_ops->shutdown(sb); > > > > drop_super(sb); > > out: > > bdput(bdev); > > } > > > > and then we implement ->shutdown somthing like this in XFS: > > > > xfs_fs_shutdown(struct superblock *sb) > > { > > xfs_force_shutdown(XFS_M(sb), SHUTDOWN_DEVICE_REQ); > > } > > > > and so we immediately notify the entire filesystem that a shutdown > > state has been entered and the appropriate actions are immediately > > taken. > > > > That sounds good in theory. However, in the case of xfs it is already > calling xfs_force_shutdown(), Where? If XFS does not do any metadata IO, then it won't shut the filesystem down. We almost always allocate/map blocks without doing any IO, which means we cannot guarantee erroring out page faults during get_blocks until a shutdown has be triggered by other means.... > but that does not prevent it from > continuing to wait indefinitely at umount. Which is a bug we need to fix - I don't see how a shutdown implementation problem is at all relevant to triggering shutdowns in a prompt manner... > For the ext4 the > mark_inode_dirty() warning we're triggering the error in generic code. So? XFS doesn't use that generic code, but we have filesystem specific issues that we need to handle sanely. > None of this fixes the problem of dax mappings leaking past block > device remove. That can be done generically without needing per-fs > code. No, the shutdown is intended to close the race between the device being removed, the mappings being invalidated and the filesytem racing creating new mappings during page faults because it doesn't know the device has been unplugged until it does IO that gets some error in an unrecoverable path... Cheers, Dave.
On Wed, Nov 25, 2015 at 10:27 PM, Dave Chinner <david@fromorbit.com> wrote: > On Wed, Nov 25, 2015 at 02:09:34PM -0800, Dan Williams wrote: >> On Wed, Nov 25, 2015 at 1:50 PM, Dave Chinner <david@fromorbit.com> wrote: >> > On Wed, Nov 25, 2015 at 10:37:24AM -0800, Dan Williams wrote: [..] >> That sounds good in theory. However, in the case of xfs it is already >> calling xfs_force_shutdown(), > > Where? In the trace I included in the cover letter. Testing this patch set reveals that xfs needs more XFS_FORCED_SHUTDOWN checks, especially in the unmount path. Currently we deadlock here on umount after block device removal: INFO: task umount:2187 blocked for more than 120 seconds. Tainted: G O 4.4.0-rc2+ #1953 "echo 0 > /proc/sys/kernel/hung_task_timeout_secs" disables this message. umount D ffff8800d2fbfd70 0 2187 2095 0x00000080 ffff8800d2fbfd70 ffffffff81f94f98 ffff88031fc97bd8 ffff88030af5ad80 ffff8800db71db00 ffff8800d2fc0000 ffff8800db8dbde0 ffff8800d93b6708 ffff8800d93b6760 ffff8800d93b66d8 ffff8800d2fbfd88 ffffffff818f0695 Call Trace: [<ffffffff818f0695>] schedule+0x35/0x80 [<ffffffffa01e134e>] xfs_ail_push_all_sync+0xbe/0x110 [xfs] [<ffffffff810ecc30>] ? wait_woken+0x80/0x80 [<ffffffffa01c8d91>] xfs_unmountfs+0x81/0x1b0 [xfs] [<ffffffffa01c991b>] ? xfs_mru_cache_destroy+0x6b/0x90 [xfs] [<ffffffffa01cbf30>] xfs_fs_put_super+0x30/0x90 [xfs] [<ffffffff81247eca>] generic_shutdown_super+0x6a/0xf0 Earlier in this trace xfs has already performed: XFS (pmem0m): xfs_do_force_shutdown(0x2) called from line 1197 of file fs/xfs/xfs_log.c. ...but xfs_log_work_queue() continues to run periodically. > If XFS does not do any metadata IO, then it won't shut the > filesystem down. We almost always allocate/map blocks without doing > any IO, which means we cannot guarantee erroring out page faults > during get_blocks until a shutdown has be triggered by other > means.... > >> but that does not prevent it from >> continuing to wait indefinitely at umount. > > Which is a bug we need to fix - I don't see how a shutdown > implementation problem is at all relevant to triggering shutdowns in > a prompt manner... > >> For the ext4 the >> mark_inode_dirty() warning we're triggering the error in generic code. > > So? XFS doesn't use that generic code, but we have filesystem > specific issues that we need to handle sanely. > >> None of this fixes the problem of dax mappings leaking past block >> device remove. That can be done generically without needing per-fs >> code. > > No, the shutdown is intended to close the race between the device > being removed, the mappings being invalidated and the filesytem > racing creating new mappings during page faults because it doesn't > know the device has been unplugged until it does IO that gets some > error in an unrecoverable path... > So you want me to grow a ->sb_shutdown() method for each fs that supports dax only to call unmap_mapping_range on each dax inode when common code could have already walked the inode list. Also separately teach each fs to start returning errors on get_blocks() after shutdown even though fs/dax.c can figure it out from the return value from blk_queue_enter? I'm failing to see the point. That is of course separate from the real need for an ->sb_shutdown() to tell the fs that the device is gone for other filesystem specific operations.
On Wed, Nov 25, 2015 at 11:11:00PM -0800, Dan Williams wrote: > On Wed, Nov 25, 2015 at 10:27 PM, Dave Chinner <david@fromorbit.com> wrote: > > On Wed, Nov 25, 2015 at 02:09:34PM -0800, Dan Williams wrote: > >> On Wed, Nov 25, 2015 at 1:50 PM, Dave Chinner <david@fromorbit.com> wrote: > >> > On Wed, Nov 25, 2015 at 10:37:24AM -0800, Dan Williams wrote: > [..] > >> That sounds good in theory. However, in the case of xfs it is already > >> calling xfs_force_shutdown(), > > > > Where? > > In the trace I included in the cover letter. > > Testing this patch set reveals that xfs needs more XFS_FORCED_SHUTDOWN > checks, especially in the unmount path. Currently we deadlock here on > umount after block device removal: What's the test case? From what you've said, it appears to be another manifestation where we trying to recovery from non-fatal IO errors that are being reported by the block device, but that is short-cutting the path that normally does shutdown detection on metadata buffer IO completion. Filesystem error handling coul dbe a lot more simple if we didnt' have to try to guess what EIO from the block device actually means. If device unplug triggered a shutdown, we wouldn't have to care about retrying in many cases where we do right now because shutdown then has clear priority of all other errors we need to handle. RIgh tnow we just have ot guess, and there's a long history of corner cases where we have guessed wrong.... I have patches to fix this particular manifestation, but until we have have notification that devices hav ebeen unplugged and are not coming back we're going to continue to struggle to get this right and hence have re-occurring bugs when users pull USB drives out from under active filesystems..... > >> None of this fixes the problem of dax mappings leaking past block > >> device remove. That can be done generically without needing per-fs > >> code. > > > > No, the shutdown is intended to close the race between the device > > being removed, the mappings being invalidated and the filesytem > > racing creating new mappings during page faults because it doesn't > > know the device has been unplugged until it does IO that gets some > > error in an unrecoverable path... > > So you want me to grow a ->sb_shutdown() method for each fs that > supports dax only to call unmap_mapping_range on each dax inode when > common code could have already walked the inode list. No, I don't want you to implement some whacky, dax only ->sb_shutdown method. I want a notification method to be implemented so that each filesystem can take the necessary action it requires when the underlying block device goes away. Yes, that *filesystem specific method* might involve invalidating all the dirty inodes in the filesystem, and if there are DAX mappings in the filesystem then invalidating them, too, but that's something the filesystem needs to take care of because *all filesystem shutdowns must do this*. Do you see the distinction there? This isn't just about device unplug - there are lots of different triggers for a filesystem shutdown. However, regardless of the shutdown trigger, we have to take the same action. That is, we have to prevent all pending and future IO from being issued to the block device, *even if the block device is still OK*. For example, if we detect a free space corruption during allocation, it is not safe to trust *any active mapping* because we can't trust that we having handed out the same block to multiple owners. Hence on such a filesystem shutdown, we have to prevent any new DAX mapping from occurring and invalidate all existing mappings as we cannot allow userspace to modify any data or metadata until we've resolved the corruption situation. The simple fact is that a /filesystem/ shutdown needs to do DAX mapping invalidation regardless of whether the block device has been unplugged or not. This is not a case of "this only happens when we unplug the device", this is a user data protection mechanism that we use to prevent corruption propagation once it has been detected. A device unplug is just one type of "corruption" that can occur. Hence putting all this special invalidation stuff into the block devices is simply duplicating functionality we need to implement in the filesystem layers. Regardless of this, it should be done by the filesystem layers because that is where all the information necesary to determine what needs invalidation is stored. So, from the block dvice perspective, the *only thing it needs to do* is notify the filesystem that it needs to shut down, and the filesystem should then take care of everything else. Device unplug is not some special snowflake that needs to be treated differently from all other types of "it's dead, Jim" filesystem errors - from the FS perspective it's the dead simple case because there are no serious consequences if we screw up. i.e. if the FS gets it wrong and IO is still issued after the shutdown, the IO is going to be errored out rather than corrupting something on disk that would have otherwise been OK.... > Also separately > teach each fs to start returning errors on get_blocks() after shutdown > even though fs/dax.c can figure it out from the return value from > blk_queue_enter? They will already return errors from get_blocks. STATIC int __xfs_get_blocks( struct inode *inode, sector_t iblock, struct buffer_head *bh_result, int create, bool direct, bool dax_fault) { struct xfs_inode *ip = XFS_I(inode); struct xfs_mount *mp = ip->i_mount; xfs_fileoff_t offset_fsb, end_fsb; int error = 0; int lockmode = 0; struct xfs_bmbt_irec imap; int nimaps = 1; xfs_off_t offset; ssize_t size; int new = 0; if (XFS_FORCED_SHUTDOWN(mp)) return -EIO; ..... Hmmm? First thing XFS does in get_blocks() is check for shutdown. In fact, the first thing that every major subsystem entry point does in XFs is check for shutdown. Let's look at a typical allocation get_blocks call in XFS: xfs_get_blocks __xfs_get_blocks - has shutdown check xfs_bmapi_read - shutdown check is second after error injection xfs_iomap_write_direct xfs_trans_reserve - has shutdown check xfs_bmapi_write - has shutdown check xfs_trans_commit - has shutdown check IOWs, there are shutdown checks all through the get_blocks callbacks already, so it's hardly a caase of you having to go an add support to any filesystem for this... Cheers, MDave.
On Mon, Nov 30, 2015 at 8:03 PM, Dave Chinner <david@fromorbit.com> wrote: > On Wed, Nov 25, 2015 at 11:11:00PM -0800, Dan Williams wrote: >> On Wed, Nov 25, 2015 at 10:27 PM, Dave Chinner <david@fromorbit.com> wrote: >> > On Wed, Nov 25, 2015 at 02:09:34PM -0800, Dan Williams wrote: >> >> On Wed, Nov 25, 2015 at 1:50 PM, Dave Chinner <david@fromorbit.com> wrote: >> >> > On Wed, Nov 25, 2015 at 10:37:24AM -0800, Dan Williams wrote: >> [..] >> >> That sounds good in theory. However, in the case of xfs it is already >> >> calling xfs_force_shutdown(), >> > >> > Where? >> >> In the trace I included in the cover letter. >> >> Testing this patch set reveals that xfs needs more XFS_FORCED_SHUTDOWN >> checks, especially in the unmount path. Currently we deadlock here on >> umount after block device removal: > > What's the test case? > > From what you've said, it appears to be another manifestation where > we trying to recovery from non-fatal IO errors that are being > reported by the block device, but that is short-cutting the path > that normally does shutdown detection on metadata buffer IO > completion. > > Filesystem error handling coul dbe a lot more simple if we didnt' > have to try to guess what EIO from the block device actually means. > If device unplug triggered a shutdown, we wouldn't have to care > about retrying in many cases where we do right now because shutdown > then has clear priority of all other errors we need to handle. RIgh > tnow we just have ot guess, and there's a long history of corner > cases where we have guessed wrong.... > > I have patches to fix this particular manifestation, but until we > have have notification that devices hav ebeen unplugged and are not > coming back we're going to continue to struggle to get this right > and hence have re-occurring bugs when users pull USB drives out from > under active filesystems..... > > >> >> None of this fixes the problem of dax mappings leaking past block >> >> device remove. That can be done generically without needing per-fs >> >> code. >> > >> > No, the shutdown is intended to close the race between the device >> > being removed, the mappings being invalidated and the filesytem >> > racing creating new mappings during page faults because it doesn't >> > know the device has been unplugged until it does IO that gets some >> > error in an unrecoverable path... >> >> So you want me to grow a ->sb_shutdown() method for each fs that >> supports dax only to call unmap_mapping_range on each dax inode when >> common code could have already walked the inode list. > > No, I don't want you to implement some whacky, dax only > ->sb_shutdown method. I want a notification method to be implemented > so that each filesystem can take the necessary action it requires > when the underlying block device goes away. > > Yes, that *filesystem specific method* might involve invalidating > all the dirty inodes in the filesystem, and if there are DAX > mappings in the filesystem then invalidating them, too, but that's > something the filesystem needs to take care of because *all > filesystem shutdowns must do this*. > > Do you see the distinction there? This isn't just about device > unplug - there are lots of different triggers for a filesystem > shutdown. However, regardless of the shutdown trigger, we have to > take the same action. That is, we have to prevent all pending and > future IO from being issued to the block device, *even if the block > device is still OK*. Ah, that finally got through to me. I'll refactor this to be a ->shutdown notification with a generic unmap that a filesystem can optionally call under its own discretion. As I now see that generic functionality is just common code that an fs might optionally need/use, but it is counter productive for that to be privately triggered only by the block layer and only from an event like del_gendisk. We need it in potentially all fs shutdown paths.
diff --git a/block/genhd.c b/block/genhd.c index e5cafa51567c..8a743cb81fb4 100644 --- a/block/genhd.c +++ b/block/genhd.c @@ -648,10 +648,12 @@ void del_gendisk(struct gendisk *disk) while ((part = disk_part_iter_next(&piter))) { invalidate_partition(disk, part->partno); delete_partition(disk, part->partno); + kill_bdev_super(disk, part->partno); } disk_part_iter_exit(&piter); invalidate_partition(disk, 0); + kill_bdev_super(disk, 0); set_capacity(disk, 0); disk->flags &= ~GENHD_FL_UP; diff --git a/fs/block_dev.c b/fs/block_dev.c index 1dd416bf72f7..d0233d643d33 100644 --- a/fs/block_dev.c +++ b/fs/block_dev.c @@ -1795,6 +1795,23 @@ int __invalidate_device(struct block_device *bdev, bool kill_dirty) } EXPORT_SYMBOL(__invalidate_device); +void kill_bdev_super(struct gendisk *disk, int partno) +{ + struct block_device *bdev = bdget_disk(disk, partno); + struct super_block *sb; + + if (!bdev) + return; + sb = get_super(bdev); + if (!sb) + goto out; + + sb->s_iflags |= SB_I_BDI_DEAD; + drop_super(sb); + out: + bdput(bdev); +} + void iterate_bdevs(void (*func)(struct block_device *, void *), void *arg) { struct inode *inode, *old_inode = NULL; diff --git a/include/linux/fs.h b/include/linux/fs.h index 3aa514254161..76925e322e87 100644 --- a/include/linux/fs.h +++ b/include/linux/fs.h @@ -1254,6 +1254,7 @@ struct mm_struct; /* sb->s_iflags */ #define SB_I_CGROUPWB 0x00000001 /* cgroup-aware writeback enabled */ #define SB_I_NOEXEC 0x00000002 /* Ignore executables on this fs */ +#define SB_I_BDI_DEAD 0x00000004 /* Give up, backing device is dead */ /* Possible states of 'frozen' field */ enum { @@ -2390,6 +2391,7 @@ extern int revalidate_disk(struct gendisk *); extern int check_disk_change(struct block_device *); extern int __invalidate_device(struct block_device *, bool); extern int invalidate_partition(struct gendisk *, int); +extern void kill_bdev_super(struct gendisk *, int); #endif unsigned long invalidate_mapping_pages(struct address_space *mapping, pgoff_t start, pgoff_t end);
Set SB_I_BDIDEAD when a block device is no longer available to service requests. This will be used in several places where an fs should give up early because the block device is gone. Letting the fs continue on as if the block device is still present can lead to long latencies waiting for an fs to detect the loss of its backing device, trigger crashes, or generate misleasing warnings. Cc: Jan Kara <jack@suse.com> Cc: Jens Axboe <axboe@fb.com> Suggested-by: Dave Chinner <david@fromorbit.com> Signed-off-by: Dan Williams <dan.j.williams@intel.com> --- block/genhd.c | 2 ++ fs/block_dev.c | 17 +++++++++++++++++ include/linux/fs.h | 2 ++ 3 files changed, 21 insertions(+)