Message ID: 20201123004116.2453-1-ruansy.fnst@cn.fujitsu.com (mailing list archive)
Series: fsdax: introduce fs query to support reflink
On Mon, Nov 23, 2020 at 08:41:10AM +0800, Shiyang Ruan wrote:
> This patchset is an attempt to resolve the problem of tracking shared
> pages for fsdax.
>
> Changes from v1:
>   - Introduce ->block_lost() for block devices
>   - Support mapped devices
>   - Add a 'not available' warning for the realtime device in XFS
>   - Rebased to v5.10-rc1
>
> This patchset moves owner tracking from dax_associate_entry() to the
> pmem device by introducing an interface, ->memory_failure(), in
> struct dev_pagemap. The interface is called by memory_failure() in mm
> and implemented by the pmem device. The pmem device then calls its
> ->block_lost() to find the filesystem in which the damaged page is
> located, and calls ->storage_lost() to track the files or metadata
> associated with this page. Finally we are able to try to fix the
> damaged data in the filesystem and do other necessary processing,
> such as killing the processes that are using the affected files.
>
> The call trace is like this:
> memory_failure()
>  pgmap->ops->memory_failure()   => pmem_pgmap_memory_failure()
>   gendisk->fops->block_lost()   => pmem_block_lost() or
>                                    md_blk_block_lost()
>    sb->s_ops->storage_lost()    => xfs_fs_storage_lost()
>     xfs_rmap_query_range()
>      xfs_storage_lost_helper()
>       mf_recover_controller->recover_fn =>
>        memory_failure_dev_pagemap_kill_procs()
>
> The collect_procs() and kill_procs() calls are moved into a callback
> which is passed from memory_failure() to xfs_storage_lost_helper(),
> so we can call it when an associated file is found, instead of
> creating a file list and iterating over it.
>
> The fsdax & reflink support for XFS is not contained in this patchset.

This looks promising - the overall architecture is a lot more generic
and less dependent on knowing about memory, dax or memory failures. A
few comments that I think would further improve understanding of the
patchset and the implementation:

- the order of the patches is inverted. It should start with a single
  patch introducing the mf_recover_controller structure for callbacks,
  then introduce pgmap->ops->memory_failure, then ->block_lost, then
  the pmem and md implementations of ->block_lost, then ->storage_lost
  and the XFS implementation of ->storage_lost.

- I think the names "block_lost" and "storage_lost" are misleading.
  It's more like a "media failure" or a general "data corruption"
  event at a specific physical location. The data may not be "lost"
  but only damaged, so we might be able to recover from it without
  "losing" anything. Hence I think they could be better named, perhaps
  just "->corrupt_range".

- need to pass an {offset, len} pair through the chain, not just a
  single offset. This will allow other types of devices to report
  different ranges of failures, from a single sector to an entire
  device.

- I'm not sure that passing the mf_recover_controller structure
  through the corruption event chain is the right thing to do here. A
  block device could generate this storage failure callback if it
  detects an unrecoverable error (e.g. during a MD media scrub or
  rebuild/resilver failure), and in that case we don't have PFNs or
  memory device failure functions to perform.

  IOWs, I think the action that is taken needs to be independent of
  the source that generated the error. Even for a pmem device, we can
  be using the page cache, so it may be possible to recover the pmem
  error by writing the cached page (if it exists) back over the pmem.

  Hence I think that the recovery function probably needs to be moved
  to the address space ops, because what we do to recover from the
  error is going to be dependent on the type of mapping the filesystem
  is using. If it's a DAX mapping, we call back into a generic DAX
  function that does the vma walk and process kill functions. If it is
  a page cache mapping, then if the page is cached we can try to
  re-write it to disk to fix the bad data; otherwise we treat it like
  a writeback error and report it on the next write/fsync/close
  operation done on that file.

  This gets rid of the mf_recover_controller altogether and allows the
  interface to be used by any sort of block device for any sort of
  bottom-up reporting of media/device failures.

Cheers,

Dave.
Hi Dave,

On 2020/11/30 6:47 am, Dave Chinner wrote:
> On Mon, Nov 23, 2020 at 08:41:10AM +0800, Shiyang Ruan wrote:
>>
>> The call trace is like this:
>> memory_failure()
>>  pgmap->ops->memory_failure()   => pmem_pgmap_memory_failure()
>>   gendisk->fops->block_lost()   => pmem_block_lost() or
>>                                    md_blk_block_lost()
>>    sb->s_ops->storage_lost()    => xfs_fs_storage_lost()
>>     xfs_rmap_query_range()
>>      xfs_storage_lost_helper()
>>       mf_recover_controller->recover_fn =>
>>        memory_failure_dev_pagemap_kill_procs()
>>
>> The collect_procs() and kill_procs() calls are moved into a callback
>> which is passed from memory_failure() to xfs_storage_lost_helper(),
>> so we can call it when an associated file is found, instead of
>> creating a file list and iterating over it.
>>
>> The fsdax & reflink support for XFS is not contained in this patchset.
>
> This looks promising - the overall architecture is a lot more generic
> and less dependent on knowing about memory, dax or memory failures. A
> few comments that I think would further improve understanding of the
> patchset and the implementation:

Thanks for your kind comments. They give me confidence.

> - the order of the patches is inverted. It should start with a single
>   patch introducing the mf_recover_controller structure for
>   callbacks, then introduce pgmap->ops->memory_failure, then
>   ->block_lost, then the pmem and md implementations of ->block_lost,
>   then ->storage_lost and the XFS implementation of ->storage_lost.

Yes, it will be easier to understand the patchset in this order.

But there is something I am unsure about: for example, I introduce
->memory_failure() first, but the implementation of ->memory_failure()
needs to call ->block_lost(), which is supposed to be introduced in the
next patch. So, what is the implementation of ->memory_failure() in
pmem supposed to contain at that point? To avoid this situation, I
committed the patches in the inverted order: lowest level first, then
its caller, and then the caller's caller.

I am trying to sort out the order. How about this:

Patch i:
    Introduce ->memory_failure()
      - just introduce the interface, without an implementation
Patch i+1:
    Introduce ->block_lost()
      - introduce the interface and implement ->memory_failure() in
        pmem, so that it can call ->block_lost()
Patch i+2:
    (similar to the above, skipped...)

> - I think the names "block_lost" and "storage_lost" are misleading.
>   It's more like a "media failure" or a general "data corruption"
>   event at a specific physical location. The data may not be "lost"
>   but only damaged, so we might be able to recover from it without
>   "losing" anything. Hence I think they could be better named,
>   perhaps just "->corrupt_range".

'corrupt' sounds better. (I'm not good at naming functions...)

> - need to pass an {offset, len} pair through the chain, not just a
>   single offset. This will allow other types of devices to report
>   different ranges of failures, from a single sector to an entire
>   device.

Yes, it's better to add the length. I had restrictively assumed that a
memory failure on pmem affects only a single page at a time.

> - I'm not sure that passing the mf_recover_controller structure
>   through the corruption event chain is the right thing to do here.
>   A block device could generate this storage failure callback if it
>   detects an unrecoverable error (e.g. during a MD media scrub or
>   rebuild/resilver failure), and in that case we don't have PFNs or
>   memory device failure functions to perform.
>
>   IOWs, I think the action that is taken needs to be independent of
>   the source that generated the error. Even for a pmem device, we
>   can be using the page cache, so it may be possible to recover the
>   pmem error by writing the cached page (if it exists) back over the
>   pmem.
>
>   Hence I think that the recovery function probably needs to be
>   moved to the address space ops, because what we do to recover from
>   the error is going to be dependent on the type of mapping the
>   filesystem is using. If it's a DAX mapping, we call back into a
>   generic DAX function that does the vma walk and process kill
>   functions. If it is a page cache mapping, then if the page is
>   cached we can try to re-write it to disk to fix the bad data;
>   otherwise we treat it like a writeback error and report it on the
>   next write/fsync/close operation done on that file.
>
>   This gets rid of the mf_recover_controller altogether and allows
>   the interface to be used by any sort of block device for any sort
>   of bottom-up reporting of media/device failures.

Moving the recovery function to the address_space ops looks like a
better idea. But I think the error handling for the page cache mapping
case is already done well in memory-failure, and the memory-failure
code is also reused to handle anonymous pages. If we move the recovery
function to the address_space ops, I think we also need to refactor the
existing handler for page cache mappings, which may affect anonymous
page handling. This makes me confused...

I rewrote the call trace:

memory_failure()
 * dax mapping case
 pgmap->ops->memory_failure()          => pmem_pgmap_memory_failure()
  gendisk->fops->block_corrupt_range() =>
      - pmem_block_corrupt_range()
      - md_blk_block_corrupt_range()
   sb->s_ops->storage_corrupt_range()  => xfs_fs_storage_corrupt_range()
    xfs_rmap_query_range()
     xfs_storage_lost_helper()
      mapping->a_ops->corrupt_range()  => xfs_dax_aops.xfs_dax_corrupt_range
       memory_failure_dev_pagemap_kill_procs()

 * page cache mapping case
 mapping->a_ops->corrupt_range()       => xfs_address_space_operations.xfs_xxx
  memory_failure_generic_kill_procs()

It's rough and not complete yet. Hope for your comments.
On Wed, Dec 02, 2020 at 03:12:20PM +0800, Ruan Shiyang wrote:
> Hi Dave,
>
> On 2020/11/30 6:47 am, Dave Chinner wrote:
> > On Mon, Nov 23, 2020 at 08:41:10AM +0800, Shiyang Ruan wrote:
> > >
> > > The call trace is like this:
> > > memory_failure()
> > >  pgmap->ops->memory_failure()   => pmem_pgmap_memory_failure()
> > >   gendisk->fops->block_lost()   => pmem_block_lost() or
> > >                                    md_blk_block_lost()
> > >    sb->s_ops->storage_lost()    => xfs_fs_storage_lost()
> > >     xfs_rmap_query_range()
> > >      xfs_storage_lost_helper()
> > >       mf_recover_controller->recover_fn =>
> > >        memory_failure_dev_pagemap_kill_procs()
> > >
> > > The collect_procs() and kill_procs() calls are moved into a
> > > callback which is passed from memory_failure() to
> > > xfs_storage_lost_helper(), so we can call it when an associated
> > > file is found, instead of creating a file list and iterating
> > > over it.
> > >
> > > The fsdax & reflink support for XFS is not contained in this
> > > patchset.
> >
> > This looks promising - the overall architecture is a lot more
> > generic and less dependent on knowing about memory, dax or memory
> > failures. A few comments that I think would further improve
> > understanding of the patchset and the implementation:
>
> Thanks for your kind comments. They give me confidence.
>
> > - the order of the patches is inverted. It should start with a
> >   single patch introducing the mf_recover_controller structure for
> >   callbacks, then introduce pgmap->ops->memory_failure, then
> >   ->block_lost, then the pmem and md implementations of
> >   ->block_lost, then ->storage_lost and the XFS implementation of
> >   ->storage_lost.
>
> Yes, it will be easier to understand the patchset in this order.
>
> But there is something I am unsure about: for example, I introduce
> ->memory_failure() first, but the implementation of
> ->memory_failure() needs to call ->block_lost(), which is supposed
> to be introduced in the next patch. So, what is the implementation
> of ->memory_failure() in pmem supposed to contain at that point? To
> avoid this situation, I committed the patches in the inverted order:
> lowest level first, then its caller, and then the caller's caller.

Well, there's two things here. The first is the infrastructure, the
second is the drivers that use the infrastructure. You can introduce a
method in one patch, and then the driver that uses it in another. Or
you can introduce a driver skeleton that does nothing until more
infrastructure is added. So...

> I am trying to sort out the order. How about this:
> Patch i:
>     Introduce ->memory_failure()
>       - just introduce the interface, without an implementation
> Patch i+1:
>     Introduce ->block_lost()
>       - introduce the interface and implement ->memory_failure() in
>         pmem, so that it can call ->block_lost()
> Patch i+2:
>     (similar to the above, skipped...)

So this is better, but you don't need to add the pmem driver use of
->block_lost in the patch that adds the method. IOWs, something like:

P1: introduce the ->memory_failure API, all the required
    documentation, and add the call sites in the infrastructure that
    trigger it
P2: introduce ->corrupted_range to the block device API, all the
    required documentation, and any generic block infrastructure that
    needs to call it
P3: introduce ->corrupted_range to the superblock ops API, plus all
    the required documentation
P4: add the ->corrupted_range() API to the address space ops, plus
    all the required documentation
P5: factor the existing kill-procs code so that it can be called via
    generic_mapping_kill_range()
P6: add dax_mapping_kill_range()
P7: add the pmem driver support for ->memory_failure
P8: add the block device driver support for ->corrupted_range
P9: add filesystem support for sb_ops->corrupted_range
P10: add filesystem support for aops->corrupted_range
> > This gets rid of the mf_recover_controller altogether and allows
> > the interface to be used by any sort of block device for any sort
> > of bottom-up reporting of media/device failures.
>
> Moving the recovery function to the address_space ops looks like a
> better idea. But I think the error handling for the page cache
> mapping case is already done well in memory-failure, and the
> memory-failure code is also reused to handle anonymous pages.

Yes, anonymous page handling can remain there; we're only concerned
about handling file-mapped pages here right now. If we end up sharing
page cache pages across reflink mappings, we'll have exactly the same
issue we have now with DAX....

> If we move the recovery function to the address_space ops, I think
> we also need to refactor the existing handler for page cache
> mappings, which may affect anonymous page handling. This makes me
> confused...

Make the handling of the page the error occurred in conditional on
!PageAnon().

> I rewrote the call trace:
> memory_failure()
>  * dax mapping case
>  pgmap->ops->memory_failure()          => pmem_pgmap_memory_failure()
>   gendisk->fops->block_corrupt_range() =>
>       - pmem_block_corrupt_range()
>       - md_blk_block_corrupt_range()
>    sb->s_ops->storage_corrupt_range()  => xfs_fs_storage_corrupt_range()

No need for block/storage prefixes in these...

>     xfs_rmap_query_range()
>      xfs_storage_lost_helper()
>       mapping->a_ops->corrupt_range()  => xfs_dax_aops.xfs_dax_corrupt_range
>        memory_failure_dev_pagemap_kill_procs()

This assumes we find a user data mapping. We might find that the
corrupted storage contained metadata, in which case we'll be shutting
down the filesystem, not trying to kill user procs...

Also, we don't need aops->corrupt_range() here, as we are already in
the filesystem code, and if we find a mapping in memory we can just do
"if (IS_DAX(mapping->host))" to call the right kill-procs
implementation for the mapping we've found. aops->corrupt_range is for
generic code working on a mapping to inform the filesystem that there
is a bad range in the mapping (my apologies for getting that all mixed
up in my last email).

>  * page cache mapping case
>  mapping->a_ops->corrupt_range()       => xfs_address_space_operations.xfs_xxx
>   memory_failure_generic_kill_procs()

We need aops->corrupted_range() to call into the filesystem so it can
do a similar reverse mapping lookup to sb->s_ops->corrupted_range.
Yes, the page cache should already have a mapping attached to the
page, but we do not know whether it is the only mapping that exists
for that page, e.g. if/when we implement multiply-mapped shared
read-only reflink pages in the page cache, which results in the same
problem we have with DAX pages right now.

Overall, though, it seems like you're on the right path. :)

Cheers,

Dave.
Hi, Shiyang,

On 11/22/2020 4:41 PM, Shiyang Ruan wrote:
> This patchset is an attempt to resolve the problem of tracking shared
> pages for fsdax.
>
> Changes from v1:
>   - Introduce ->block_lost() for block devices
>   - Support mapped devices
>   - Add a 'not available' warning for the realtime device in XFS
>   - Rebased to v5.10-rc1
>
> This patchset moves owner tracking from dax_associate_entry() to the
> pmem device by introducing an interface, ->memory_failure(), in
> struct dev_pagemap. The interface is called by memory_failure() in mm
> and implemented by the pmem device. The pmem device then calls its
> ->block_lost() to find the filesystem in which the damaged page is
> located, and calls ->storage_lost() to track the files or metadata
> associated with this page. Finally we are able to try to fix the
> damaged data in the filesystem and do

Does that mean clearing poison? If so, would you mind elaborating on
which change specifically does that?

Thanks!
-jane

> other necessary processing, such as killing the processes that are
> using the affected files.
Hi Jane,

On 2020/12/15 4:58 am, Jane Chu wrote:
> Hi, Shiyang,
>
> On 11/22/2020 4:41 PM, Shiyang Ruan wrote:
>> This patchset is an attempt to resolve the problem of tracking
>> shared pages for fsdax.
>> [...]
>> Finally we are able to try to fix the damaged data in the
>> filesystem and do
>
> Does that mean clearing poison? If so, would you mind elaborating on
> which change specifically does that?

Recovering data for the filesystem (or pmem device) has not been done
in this patchset... I just trigger the handler for the files sharing
the corrupted page here.

--
Thanks,
Ruan Shiyang.

> Thanks!
> -jane
>
>> other necessary processing, such as killing the processes that are
>> using the affected files.
On 12/15/2020 3:58 AM, Ruan Shiyang wrote:
> Hi Jane,
>
> On 2020/12/15 4:58 am, Jane Chu wrote:
>> Hi, Shiyang,
>>
>> On 11/22/2020 4:41 PM, Shiyang Ruan wrote:
>>> This patchset is an attempt to resolve the problem of tracking
>>> shared pages for fsdax.
>>> [...]
>>> Finally we are able to try to fix the damaged data in the
>>> filesystem and do
>>
>> Does that mean clearing poison? If so, would you mind elaborating
>> on which change specifically does that?
>
> Recovering data for the filesystem (or pmem device) has not been
> done in this patchset... I just trigger the handler for the files
> sharing the corrupted page here.

Thanks! That confirms my understanding.

With the framework provided by the patchset, how do you envision it
easing/simplifying poison recovery from the user's perspective? And
how does it help in dealing with page faults upon a poisoned dax page?

Thanks!
-jane
On Tue, Dec 15, 2020 at 11:05:07AM -0800, Jane Chu wrote:
> On 12/15/2020 3:58 AM, Ruan Shiyang wrote:
> > Hi Jane,
> >
> > On 2020/12/15 4:58 am, Jane Chu wrote:
> > > Hi, Shiyang,
> > >
> > > On 11/22/2020 4:41 PM, Shiyang Ruan wrote:
> > > > This patchset is an attempt to resolve the problem of tracking
> > > > shared pages for fsdax.
> > > > [...]
> > > > Finally we are able to try to fix the damaged data in the
> > > > filesystem and do
> > >
> > > Does that mean clearing poison? If so, would you mind
> > > elaborating on which change specifically does that?
> >
> > Recovering data for the filesystem (or pmem device) has not been
> > done in this patchset... I just trigger the handler for the files
> > sharing the corrupted page here.
>
> Thanks! That confirms my understanding.
>
> With the framework provided by the patchset, how do you envision it
> easing/simplifying poison recovery from the user's perspective?

At the moment, I'd say no change whatsoever. The behaviour is
necessary so that we can kill whatever user application maps
multiply-shared physical blocks if there's a memory error. The
recovery method from that is unchanged. The only advantage may be
that the filesystem (if rmap is enabled) can tell you the exact file
and offset into the file where the data was corrupted.

However, it can be worse, too: it may also now completely shut down
the filesystem if the filesystem discovers the error is in metadata
rather than user data. That's much more complex to recover from, and
right now will require downtime to take the filesystem offline and
run fsck to correct the error. That may trash whatever the
unrecoverable metadata points to, so you still have a user data
recovery process to perform after this...

> And how does it help in dealing with page faults upon a poisoned
> dax page?

It doesn't. If the page is poisoned, the same behaviour will occur as
does now. This is simply error reporting infrastructure, not error
handling.

Future work might change how we correct the faults found in the
storage, but I think the user-visible behaviour is going to be "kill
apps mapping corrupted data" for a long time yet....

Cheers,

Dave.
On Wed, Dec 16, 2020 at 10:10:22AM +1100, Dave Chinner wrote:
> On Tue, Dec 15, 2020 at 11:05:07AM -0800, Jane Chu wrote:
> > On 12/15/2020 3:58 AM, Ruan Shiyang wrote:
> > > Hi Jane,
> > >
> > > On 2020/12/15 4:58 am, Jane Chu wrote:
> > > > Hi, Shiyang,
> > > >
> > > > On 11/22/2020 4:41 PM, Shiyang Ruan wrote:
> > > > > This patchset is an attempt to resolve the problem of
> > > > > tracking shared pages for fsdax.
> > > > > [...]
> > > > > Finally we are able to try to fix the damaged data in the
> > > > > filesystem and do
> > > >
> > > > Does that mean clearing poison? If so, would you mind
> > > > elaborating on which change specifically does that?
> > >
> > > Recovering data for the filesystem (or pmem device) has not been
> > > done in this patchset... I just trigger the handler for the
> > > files sharing the corrupted page here.
> >
> > Thanks! That confirms my understanding.
> >
> > With the framework provided by the patchset, how do you envision
> > it easing/simplifying poison recovery from the user's perspective?
>
> At the moment, I'd say no change whatsoever. The behaviour is
> necessary so that we can kill whatever user application maps
> multiply-shared physical blocks if there's a memory error. The
> recovery method from that is unchanged. The only advantage may be
> that the filesystem (if rmap is enabled) can tell you the exact file
> and offset into the file where the data was corrupted.
>
> However, it can be worse, too: it may also now completely shut down
> the filesystem if the filesystem discovers the error is in metadata
> rather than user data. That's much more complex to recover from, and
> right now will require downtime to take the filesystem offline and
> run fsck to correct the error. That may trash whatever the
> unrecoverable metadata points to, so you still have a user data
> recovery process to perform after this...

...though for the future I'd like to bypass the default behaviors if
there's somebody watching the sb notification that will also kick off
the appropriate repair activities. The xfs auto-repair parts are
coming along nicely. Dunno about userspace, though I figure if we can
do userspace page faults then some people could probably do
autorepair too.

--D

> > And how does it help in dealing with page faults upon a poisoned
> > dax page?
>
> It doesn't. If the page is poisoned, the same behaviour will occur
> as does now. This is simply error reporting infrastructure, not
> error handling.
>
> Future work might change how we correct the faults found in the
> storage, but I think the user-visible behaviour is going to be "kill
> apps mapping corrupted data" for a long time yet....
>
> Cheers,
>
> Dave.
> --
> Dave Chinner
> david@fromorbit.com