Message ID | 20220227120747.711169-2-ruansy.fnst@fujitsu.com (mailing list archive)
State      | Superseded
Series     | fsdax: introduce fs query to support reflink
On Sun, Feb 27, 2022 at 4:08 AM Shiyang Ruan <ruansy.fnst@fujitsu.com> wrote:
>
> To easily track the filesystem from a pmem device, we introduce a holder
> for the dax_device structure, along with its operations. This holder is
> used to remember who is using this dax_device:
>  - When it is the backend of a filesystem, the holder will be the
>    instance of this filesystem.
>  - When this pmem device is one of the targets in a mapped device, the
>    holder will be this mapped device. In this case, the mapped device
>    has its own dax_device which follows the first rule, so we can
>    finally track down to the filesystem we need.
>
> The holder and holder_ops will be set when the filesystem is mounted,
> or when a target device is activated.
>
> Signed-off-by: Shiyang Ruan <ruansy.fnst@fujitsu.com>
[..]
> +int dax_holder_notify_failure(struct dax_device *dax_dev, u64 off,
> +		u64 len, int mf_flags)
> +{
> +	int rc, id;
> +
> +	id = dax_read_lock();
> +	if (!dax_alive(dax_dev)) {
> +		rc = -ENXIO;
> +		goto out;
> +	}
> +
> +	if (!dax_dev->holder_ops) {
> +		rc = -EOPNOTSUPP;

I think it is ok to return success (0) for this case. All a caller of
dax_holder_notify_failure() wants to know is whether the notification
was successfully delivered to the holder. If there is no holder present
then there is nothing to report. This is minor enough for me to fix up
locally if nothing else needs to be changed.

> +		goto out;
> +	}
> +
> +	rc = dax_dev->holder_ops->notify_failure(dax_dev, off, len, mf_flags);
> +out:
> +	dax_read_unlock(id);
> +	return rc;
> +}
> +EXPORT_SYMBOL_GPL(dax_holder_notify_failure);
[..]
> @@ -268,6 +296,10 @@ void kill_dax(struct dax_device *dax_dev)
>
>  	clear_bit(DAXDEV_ALIVE, &dax_dev->flags);
>  	synchronize_srcu(&dax_srcu);
> +
> +	/* clear holder data */
> +	dax_dev->holder_ops = NULL;
> +	dax_dev->holder_data = NULL;

Isn't this another failure scenario? If kill_dax() is called while a
holder is still holding the dax_device, that seems to be another
->notify_failure scenario to tell the holder that the device is going
away and the holder has not released the device yet.

[..]
> +void *dax_holder(struct dax_device *dax_dev)
> +{
> +	if (!dax_alive(dax_dev))
> +		return NULL;

It's safe for the holder to assume that it can dereference
->holder_data freely in its notify_failure callback because
dax_holder_notify_failure() arranges for the callback to run in
dax_read_lock() context.

This is another minor detail that I can fix up locally.

[..]
> +struct dax_holder_operations {
> +	/*
> +	 * notify_failure - notify memory failure into inner holder device
> +	 * @dax_dev: the dax device which contains the holder
> +	 * @offset: offset on this dax device where memory failure occurs
> +	 * @len: length of this memory failure event

Forgive me if this has been discussed before, but since dax_operations
are in terms of pgoff and nr pages, and memory_failure() is in terms of
pfns, what was the rationale for making this function signature
byte-based?

I want to get this series merged into linux-next shortly after
v5.18-rc1. Then we can start working on incremental fixups rather than
resending the full series with these long reply cycles.
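Dan's dax_read_lock() point implies that a holder-side callback can
dereference its private data without extra locking. A minimal sketch of
such a callback follows; xfs_dax_notify_failure() matches the handler
name referenced later in this thread, while xfs_dax_notify_failure_range()
is a placeholder invented here for illustration:

static int xfs_dax_notify_failure(struct dax_device *dax_dev, u64 offset,
		u64 len, int mf_flags)
{
	/*
	 * Safe without further locking: this callback runs inside the
	 * dax_read_lock() (SRCU) section taken by
	 * dax_holder_notify_failure(), and kill_dax() does
	 * synchronize_srcu() before holder_data is cleared.
	 */
	struct xfs_mount *mp = dax_holder(dax_dev);

	if (!mp)
		return -ENXIO;	/* device already dying */

	/* translate device offsets to files and kill affected processes */
	return xfs_dax_notify_failure_range(mp, offset, len, mf_flags);
}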
On 2022/3/12 7:35, Dan Williams wrote:
> On Sun, Feb 27, 2022 at 4:08 AM Shiyang Ruan <ruansy.fnst@fujitsu.com> wrote:
[..]
>> +	if (!dax_dev->holder_ops) {
>> +		rc = -EOPNOTSUPP;
>
> I think it is ok to return success (0) for this case. All a caller of
> dax_holder_notify_failure() wants to know is whether the notification
> was successfully delivered to the holder. If there is no holder present
> then there is nothing to report. This is minor enough for me to fix up
> locally if nothing else needs to be changed.

I thought it could fall back to the generic memory failure handler,
mf_generic_kill_procs(), if holder_ops does not exist.

[..]
>> +
>> +	/* clear holder data */
>> +	dax_dev->holder_ops = NULL;
>> +	dax_dev->holder_data = NULL;
>
> Isn't this another failure scenario? If kill_dax() is called while a
> holder is still holding the dax_device, that seems to be another
> ->notify_failure scenario to tell the holder that the device is going
> away and the holder has not released the device yet.

Yes. I should call dax_holder_notify_failure() and then unregister the
holder.

[..]
>> +	 * @offset: offset on this dax device where memory failure occurs
>> +	 * @len: length of this memory failure event
>
> Forgive me if this has been discussed before, but since dax_operations
> are in terms of pgoff and nr pages, and memory_failure() is in terms of
> pfns, what was the rationale for making this function signature
> byte-based?

Maybe I didn't describe it clearly... The @offset and @len here are
byte-based. And so is ->memory_failure(). You can find the
implementation of ->memory_failure() in the 3rd patch:

+static int pmem_pagemap_memory_failure(struct dev_pagemap *pgmap,
+		phys_addr_t addr, u64 len, int mf_flags)
+{
+	struct pmem_device *pmem =
+			container_of(pgmap, struct pmem_device, pgmap);
+	u64 offset = addr - pmem->phys_addr - pmem->data_offset;
+
+	return dax_holder_notify_failure(pmem->dax_dev, offset, len, mf_flags);
+}

> I want to get this series merged into linux-next shortly after
> v5.18-rc1. Then we can start working on incremental fixups rather than
> resending the full series with these long reply cycles.

Thanks. That really helps.

--
Ruan.
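A minimal sketch of what Ruan proposes for kill_dax(), notifying a
still-registered holder before tearing the association down. The
MF_MEM_PRE_REMOVE flag name is an assumption standing in for whatever
"device is going away" indication the final version uses:

void kill_dax(struct dax_device *dax_dev)
{
	/* let a still-registered holder know the device is going away */
	if (dax_dev->holder_data)
		dax_holder_notify_failure(dax_dev, 0, U64_MAX,
				MF_MEM_PRE_REMOVE);

	clear_bit(DAXDEV_ALIVE, &dax_dev->flags);
	synchronize_srcu(&dax_srcu);

	/* no notifier can still be running; safe to drop holder data */
	dax_dev->holder_ops = NULL;
	dax_dev->holder_data = NULL;
}

Notifying before clearing DAXDEV_ALIVE matters: dax_holder_notify_failure()
checks dax_alive() itself, so the order above keeps the callback reachable.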
On Fri, Mar 11, 2022 at 03:35:13PM -0800, Dan Williams wrote:
> > +	if (!dax_dev->holder_ops) {
> > +		rc = -EOPNOTSUPP;
>
> I think it is ok to return success (0) for this case. All a caller of
> dax_holder_notify_failure() wants to know is whether the notification
> was successfully delivered to the holder. If there is no holder present
> then there is nothing to report. This is minor enough for me to fix up
> locally if nothing else needs to be changed.

The caller needs to know there are no holder ops so that it can fall
back to a different path.

> Isn't this another failure scenario? If kill_dax() is called while a
> holder is still holding the dax_device, that seems to be another
> ->notify_failure scenario to tell the holder that the device is going
> away and the holder has not released the device yet.

Yes.
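The fallback Christoph refers to sits on the caller's side. A sketch of
the shape memory_failure_dev_pagemap() could take under that assumption,
with mf_generic_kill_procs() being the generic fsdax handler added
elsewhere in this series (the exact signatures are illustrative):

	/* try the fsdax/holder path first */
	if (pgmap->ops && pgmap->ops->memory_failure) {
		rc = pgmap->ops->memory_failure(pgmap, PFN_PHYS(pfn),
				PAGE_SIZE, flags);
		/*
		 * -EOPNOTSUPP means no holder is registered, so no
		 * filesystem can map the failure back to files.
		 */
		if (rc != -EOPNOTSUPP)
			goto out;
	}

	/* no holder: fall back to generic page-based process killing */
	rc = mf_generic_kill_procs(pfn, flags, pgmap);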
On Wed, Mar 16, 2022 at 09:46:07PM +0800, Shiyang Ruan wrote:
> > Forgive me if this has been discussed before, but since dax_operations
> > are in terms of pgoff and nr pages, and memory_failure() is in terms of
> > pfns, what was the rationale for making this function signature
> > byte-based?
>
> Maybe I didn't describe it clearly... The @offset and @len here are
> byte-based. And so is ->memory_failure().

Yes, but is there a good reason for that when the rest of the DAX code
tends to work in page chunks?
On 2022/3/30 13:41, Christoph Hellwig wrote:
> On Wed, Mar 16, 2022 at 09:46:07PM +0800, Shiyang Ruan wrote:
>>> Forgive me if this has been discussed before, but since dax_operations
>>> are in terms of pgoff and nr pages, and memory_failure() is in terms of
>>> pfns, what was the rationale for making this function signature
>>> byte-based?
>>
>> Maybe I didn't describe it clearly... The @offset and @len here are
>> byte-based. And so is ->memory_failure().
>
> Yes, but is there a good reason for that when the rest of the DAX code
> tends to work in page chunks?

Because I am not sure if the offset between each layer is page-aligned.
For example, when the pmem driver handles ->memory_failure(), it should
subtract its ->data_offset when it calls dax_holder_notify_failure().
The implementation of ->memory_failure() in the pmem driver:

+static int pmem_pagemap_memory_failure(struct dev_pagemap *pgmap,
+		phys_addr_t addr, u64 len, int mf_flags)
+{
+	struct pmem_device *pmem =
+			container_of(pgmap, struct pmem_device, pgmap);
+	u64 offset = addr - pmem->phys_addr - pmem->data_offset;
+
+	return dax_holder_notify_failure(pmem->dax_dev, offset, len, mf_flags);
+}

So, I chose u64 as the type of @len. And for consistency, @addr uses a
byte-based type as well. The whole call chain looks like this:

memory_failure()
 |* fsdax case
 |------------
 |pgmap->ops->memory_failure()   => pmem_pgmap_memory_failure()
 |  dax_holder_notify_failure()  => the offset from 'pmem driver' to
 |                                  'dax holder'
 |  dax_device->holder_ops->notify_failure() => xfs_dax_notify_failure()
 |
 |* xfs_dax_notify_failure()
 |--------------------------
 |  xfs_rmap_query_range()
 |  xfs_dax_failure_fn()
 |    * corrupted on metadata:
 |        try to recover data, call xfs_force_shutdown()
 |    * corrupted on file data:
 |        try to recover data, call mf_dax_kill_procs()
 |
 |* normal case
 |-------------
 |mf_generic_kill_procs()

--
Thanks,
Ruan.
On Wed, Mar 30, 2022 at 06:03:01PM +0800, Shiyang Ruan wrote:
>
> Because I am not sure if the offset between each layer is page-aligned.
> For example, when the pmem driver handles ->memory_failure(), it should
> subtract its ->data_offset when it calls dax_holder_notify_failure().

If they aren't, none of the DAX machinery would work.
On 2022/3/30 18:13, Christoph Hellwig wrote:
> On Wed, Mar 30, 2022 at 06:03:01PM +0800, Shiyang Ruan wrote:
>>
>> Because I am not sure if the offset between each layer is page-aligned.
>> For example, when the pmem driver handles ->memory_failure(), it should
>> subtract its ->data_offset when it calls dax_holder_notify_failure().
>
> If they aren't, none of the DAX machinery would work.

OK. Got it. So, use a page-based function signature for
->memory_failure():

  int (*memory_failure)(struct dev_pagemap *pgmap, unsigned long pfn,
                        unsigned long nr_pfns, int flags);

As the code I pasted before shows, the pmem driver will subtract its
->data_offset, which is byte-based. And the filesystem that implements
->notify_failure() will calculate the offset in units of bytes again.
So, leave that function signature byte-based, to avoid repeated
conversions:

  int (*notify_failure)(struct dax_device *dax_dev, u64 offset,
                        u64 len, int mf_flags);

What do you think?

--
Thanks,
Ruan.
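Combining the two signatures Ruan proposes, the pmem driver becomes the
single place where pages are converted to bytes. A sketch under that
assumption, reusing the pmem_device fields shown earlier in the thread:

static int pmem_pagemap_memory_failure(struct dev_pagemap *pgmap,
		unsigned long pfn, unsigned long nr_pfns, int mf_flags)
{
	struct pmem_device *pmem =
			container_of(pgmap, struct pmem_device, pgmap);
	/* pages in, bytes out: the only unit conversion, done once here */
	u64 offset = PFN_PHYS(pfn) - pmem->phys_addr - pmem->data_offset;

	return dax_holder_notify_failure(pmem->dax_dev, offset,
			PFN_PHYS(nr_pfns), mf_flags);
}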
On Wed, Mar 30, 2022 at 06:58:21PM +0800, Shiyang Ruan wrote:
> As the code I pasted before shows, the pmem driver will subtract its
> ->data_offset, which is byte-based. And the filesystem that implements
> ->notify_failure() will calculate the offset in units of bytes again.
>
> So, leave that function signature byte-based, to avoid repeated
> conversions.

I'm actually fine either way, so I'll wait for Dan to comment.
On Wed, Mar 30, 2022 at 08:49:29AM -0700, Christoph Hellwig wrote:
> On Wed, Mar 30, 2022 at 06:58:21PM +0800, Shiyang Ruan wrote:
> > As the code I pasted before shows, the pmem driver will subtract its
> > ->data_offset, which is byte-based. And the filesystem that implements
> > ->notify_failure() will calculate the offset in units of bytes again.
> >
> > So, leave that function signature byte-based, to avoid repeated
> > conversions.
>
> I'm actually fine either way, so I'll wait for Dan to comment.

FWIW I'd convinced myself that the reason for using byte units is to
make it possible to reduce the pmem failure blast radius to subpage
units... but then I've also been distracted for months. :/

--D
On 3/30/2022 9:18 AM, Darrick J. Wong wrote:
[..]
> FWIW I'd convinced myself that the reason for using byte units is to
> make it possible to reduce the pmem failure blast radius to subpage
> units... but then I've also been distracted for months. :/

Yes, thanks Darrick! I recall that. Maybe just add a comment about why
byte units are used?

thanks!
-jane
On Tue, Apr 5, 2022 at 5:55 PM Jane Chu <jane.chu@oracle.com> wrote:
[..]
> Yes, thanks Darrick! I recall that. Maybe just add a comment about why
> byte units are used?

I think we start with page failure notification and then figure out how
to get finer-grained through the dax interface in follow-on changes.
Otherwise, for finer-grained error handling support, memory_failure()
would also need to be converted to stop upcasting cacheline-granularity
failures to page granularity. The native MCE notification communicates
a 'struct mce' that can be in terms of sub-page bytes, but the memory
management implications are all page based. I assume the FS
implications are all FS-block-size based?
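The upcast Dan describes is, schematically, the following. The wrapper
name is invented here, but the struct mce field and the memory_failure()
signature are the real ones (compare uc_decode_notifier() on x86):

static void mce_report_page(struct mce *m)
{
	/* byte-accurate MCE address, rounded down to a whole page */
	unsigned long pfn = m->addr >> PAGE_SHIFT;

	memory_failure(pfn, MF_ACTION_REQUIRED);
}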
On Tue, Apr 05, 2022 at 06:22:48PM -0700, Dan Williams wrote:
[..]
> I think we start with page failure notification and then figure out how
> to get finer-grained through the dax interface in follow-on changes.
> Otherwise, for finer-grained error handling support, memory_failure()
> would also need to be converted to stop upcasting cacheline-granularity
> failures to page granularity. The native MCE notification communicates
> a 'struct mce' that can be in terms of sub-page bytes, but the memory
> management implications are all page based. I assume the FS
> implications are all FS-block-size based?

I wouldn't necessarily make that assumption -- for regular files, the
user program is in a better position to figure out how to reset the
file contents.

For fs metadata, it really depends. In principle, if (say) we could get
byte-granularity poison info, we could look up the space usage within
the block to decide if the poisoned part was actually free space, in
which case we can correct the problem by (re)zeroing the affected bytes
to clear the poison.

Obviously, if the blast radius hits the internal space info or
something that was storing useful data, then you'd have to rebuild the
whole block (or the whole data structure), but that's not necessarily a
given.

--D
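A hypothetical shape for the byte-granular handling Darrick sketches;
every helper name below is a placeholder invented for illustration, not
an API from this series:

static int fs_notify_failure_bytes(struct fs_mount *mp, u64 offset,
		u64 len, int mf_flags)
{
	if (fs_range_is_free_space(mp, offset, len))
		/* poison landed in free space: re-zeroing clears it */
		return fs_zero_range(mp, offset, len);

	/* live data or metadata: rebuild the block, or kill the mappers */
	return fs_rebuild_or_kill_procs(mp, offset, len, mf_flags);
}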
[ add Mauro and Tony for RAS discussion ]

On Wed, Apr 6, 2022 at 1:39 PM Darrick J. Wong <djwong@kernel.org> wrote:
[..]
> I wouldn't necessarily make that assumption -- for regular files, the
> user program is in a better position to figure out how to reset the
> file contents.
>
> For fs metadata, it really depends. In principle, if (say) we could get
> byte-granularity poison info, we could look up the space usage within
> the block to decide if the poisoned part was actually free space, in
> which case we can correct the problem by (re)zeroing the affected bytes
> to clear the poison.
>
> Obviously, if the blast radius hits the internal space info or
> something that was storing useful data, then you'd have to rebuild the
> whole block (or the whole data structure), but that's not necessarily a
> given.

tl;dr: dax_holder_notify_failure() != fs->notify_failure()

So I think I see some confusion between what DAX->notify_failure()
needs, what memory_failure() needs, the raw information provided by the
hardware, and the failure granularity the filesystem can make use of.

DAX and memory_failure() need to make immediate page-granularity
decisions. They both need to map out whole pages (in the direct map and
userspace, respectively) to prevent future poison consumption, at least
until the poison is repaired.

The event that leads to a page being failed can be triggered by a
hardware error as small as an individual cacheline. While that is
interesting to a filesystem, it isn't information that memory_failure()
and DAX can utilize.

The reason DAX needs to have a callback into filesystem code is to map
the page failure back to all the processes that might have that page
mapped, because reflink means that page->mapping is not sufficient to
find all the affected 'struct address_space' instances. So it's more of
an address-translation / "help me kill processes" service than a
general failure notification service.

Currently, when a raw hardware event happens, there are mechanisms like
arch-specific notifier chains (powerpc::mce_register_notifier() and
x86::mce_register_decode_chain()) or other platform firmware code like
ghes_edac_report_mem_error() that uplevel the error to a coarse page
granularity failure, while emitting the fine-granularity error event to
userspace.

All of this to say that the interface to ask the fs to do the bottom
half of memory_failure() (walking affected 'struct address_space'
instances and killing processes (mf_dax_kill_procs())) is different
from the general interface to tell the filesystem that memory has gone
bad relative to a device. So if the only caller of the
fs->notify_failure() handler is this code:

+	if (pgmap->ops->memory_failure) {
+		rc = pgmap->ops->memory_failure(pgmap, PFN_PHYS(pfn),
+				PAGE_SIZE, flags);

...then you'll never get fine-grained reports. So, I still think the
DAX, pgmap and memory_failure() interface should be pfn-based. The
interface to the *filesystem* ->notify_failure() can still be
byte-based, but the trigger for that byte-based interface will likely
need to be something driven by another agent. Perhaps something like
rasdaemon in userspace translating all the arch-specific physical
address events back into device-relative offsets and then calling a new
ABI that is serviced by fs->notify_failure() on the backend.
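With the pfn-based interface Dan argues for, the quoted call site would
become something like the sketch below; fine-grained (sub-page) events
would then arrive via a separate byte-based path driven from userspace:

	if (pgmap->ops && pgmap->ops->memory_failure) {
		/* one page at a time; sub-page detail is not passed down */
		rc = pgmap->ops->memory_failure(pgmap, pfn, 1, flags);
		if (rc != -EOPNOTSUPP)
			return rc;
	}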
On 2022/4/8 9:38, Dan Williams wrote:
> [ add Mauro and Tony for RAS discussion ]
[..]
> ...then you'll never get fine-grained reports. So, I still think the
> DAX, pgmap and memory_failure() interface should be pfn-based. The
> interface to the *filesystem* ->notify_failure() can still be
> byte-based, but the trigger for that byte-based interface will likely
> need to be something driven by another agent. Perhaps something like
> rasdaemon in userspace translating all the arch-specific physical
> address events back into device-relative offsets and then calling a new
> ABI that is serviced by fs->notify_failure() on the backend.

Understood. I'll do as you advise. Thanks!

--
Ruan.
To easily track the filesystem from a pmem device, we introduce a holder
for the dax_device structure, along with its operations. This holder is
used to remember who is using this dax_device:
 - When it is the backend of a filesystem, the holder will be the
   instance of this filesystem.
 - When this pmem device is one of the targets in a mapped device, the
   holder will be this mapped device. In this case, the mapped device
   has its own dax_device which follows the first rule, so we can
   finally track down to the filesystem we need.

The holder and holder_ops will be set when the filesystem is mounted,
or when a target device is activated.

Signed-off-by: Shiyang Ruan <ruansy.fnst@fujitsu.com>
---
 drivers/dax/super.c | 89 +++++++++++++++++++++++++++++++++++++++++++++
 include/linux/dax.h | 32 ++++++++++++++++
 2 files changed, 121 insertions(+)

diff --git a/drivers/dax/super.c b/drivers/dax/super.c
index e3029389d809..da5798e19d57 100644
--- a/drivers/dax/super.c
+++ b/drivers/dax/super.c
@@ -21,6 +21,9 @@
  * @cdev: optional character interface for "device dax"
  * @private: dax driver private data
  * @flags: state and boolean properties
+ * @ops: operations for dax_device
+ * @holder_data: holder of a dax_device: could be filesystem or mapped device
+ * @holder_ops: operations for the inner holder
  */
 struct dax_device {
 	struct inode inode;
@@ -28,6 +31,8 @@ struct dax_device {
 	void *private;
 	unsigned long flags;
 	const struct dax_operations *ops;
+	void *holder_data;
+	const struct dax_holder_operations *holder_ops;
 };
 
 static dev_t dax_devt;
@@ -193,6 +198,29 @@ int dax_zero_page_range(struct dax_device *dax_dev, pgoff_t pgoff,
 }
 EXPORT_SYMBOL_GPL(dax_zero_page_range);
 
+int dax_holder_notify_failure(struct dax_device *dax_dev, u64 off,
+		u64 len, int mf_flags)
+{
+	int rc, id;
+
+	id = dax_read_lock();
+	if (!dax_alive(dax_dev)) {
+		rc = -ENXIO;
+		goto out;
+	}
+
+	if (!dax_dev->holder_ops) {
+		rc = -EOPNOTSUPP;
+		goto out;
+	}
+
+	rc = dax_dev->holder_ops->notify_failure(dax_dev, off, len, mf_flags);
+out:
+	dax_read_unlock(id);
+	return rc;
+}
+EXPORT_SYMBOL_GPL(dax_holder_notify_failure);
+
 #ifdef CONFIG_ARCH_HAS_PMEM_API
 void arch_wb_cache_pmem(void *addr, size_t size);
 void dax_flush(struct dax_device *dax_dev, void *addr, size_t size)
@@ -268,6 +296,10 @@ void kill_dax(struct dax_device *dax_dev)
 
 	clear_bit(DAXDEV_ALIVE, &dax_dev->flags);
 	synchronize_srcu(&dax_srcu);
+
+	/* clear holder data */
+	dax_dev->holder_ops = NULL;
+	dax_dev->holder_data = NULL;
 }
 EXPORT_SYMBOL_GPL(kill_dax);
 
@@ -409,6 +441,63 @@ void put_dax(struct dax_device *dax_dev)
 }
 EXPORT_SYMBOL_GPL(put_dax);
 
+/**
+ * dax_holder() - obtain the holder of a dax device
+ * @dax_dev: a dax_device instance
+ *
+ * Return: the holder's data which represents the holder if registered,
+ * otherwise NULL.
+ */
+void *dax_holder(struct dax_device *dax_dev)
+{
+	if (!dax_alive(dax_dev))
+		return NULL;
+
+	return dax_dev->holder_data;
+}
+EXPORT_SYMBOL_GPL(dax_holder);
+
+/**
+ * dax_register_holder() - register a holder to a dax device
+ * @dax_dev: a dax_device instance
+ * @holder: a pointer to a holder's data which represents the holder
+ * @ops: operations of this holder
+ *
+ * Return: negative errno if an error occurs, otherwise 0.
+ */
+int dax_register_holder(struct dax_device *dax_dev, void *holder,
+		const struct dax_holder_operations *ops)
+{
+	if (!dax_alive(dax_dev))
+		return -ENXIO;
+
+	if (cmpxchg(&dax_dev->holder_data, NULL, holder))
+		return -EBUSY;
+
+	dax_dev->holder_ops = ops;
+	return 0;
+}
+EXPORT_SYMBOL_GPL(dax_register_holder);
+
+/**
+ * dax_unregister_holder() - unregister the holder for a dax device
+ * @dax_dev: a dax_device instance
+ * @holder: the holder to be unregistered
+ *
+ * Return: negative errno if an error occurs, otherwise 0.
+ */
+int dax_unregister_holder(struct dax_device *dax_dev, void *holder)
+{
+	if (!dax_alive(dax_dev))
+		return -ENXIO;
+
+	if (cmpxchg(&dax_dev->holder_data, holder, NULL) != holder)
+		return -EBUSY;
+	dax_dev->holder_ops = NULL;
+	return 0;
+}
+EXPORT_SYMBOL_GPL(dax_unregister_holder);
+
 /**
  * inode_dax: convert a public inode into its dax_dev
  * @inode: An inode with i_cdev pointing to a dax_dev
diff --git a/include/linux/dax.h b/include/linux/dax.h
index 9fc5f99a0ae2..262d7bad131a 100644
--- a/include/linux/dax.h
+++ b/include/linux/dax.h
@@ -32,8 +32,24 @@ struct dax_operations {
 	int (*zero_page_range)(struct dax_device *, pgoff_t, size_t);
 };
 
+struct dax_holder_operations {
+	/*
+	 * notify_failure - notify memory failure into inner holder device
+	 * @dax_dev: the dax device which contains the holder
+	 * @offset: offset on this dax device where memory failure occurs
+	 * @len: length of this memory failure event
+	 * @mf_flags: action flags for memory failure handler
+	 */
+	int (*notify_failure)(struct dax_device *dax_dev, u64 offset,
+			u64 len, int mf_flags);
+};
+
 #if IS_ENABLED(CONFIG_DAX)
 struct dax_device *alloc_dax(void *private, const struct dax_operations *ops);
+int dax_register_holder(struct dax_device *dax_dev, void *holder,
+		const struct dax_holder_operations *ops);
+int dax_unregister_holder(struct dax_device *dax_dev, void *holder);
+void *dax_holder(struct dax_device *dax_dev);
 void put_dax(struct dax_device *dax_dev);
 void kill_dax(struct dax_device *dax_dev);
 void dax_write_cache(struct dax_device *dax_dev, bool wc);
@@ -53,6 +69,20 @@ static inline bool daxdev_mapping_supported(struct vm_area_struct *vma,
 	return dax_synchronous(dax_dev);
 }
 #else
+static inline int dax_register_holder(struct dax_device *dax_dev, void *holder,
+		const struct dax_holder_operations *ops)
+{
+	return 0;
+}
+static inline int dax_unregister_holder(struct dax_device *dax_dev,
+		void *holder)
+{
+	return 0;
+}
+static inline void *dax_holder(struct dax_device *dax_dev)
+{
+	return NULL;
+}
 static inline struct dax_device *alloc_dax(void *private,
 		const struct dax_operations *ops)
 {
@@ -185,6 +215,8 @@ size_t dax_copy_to_iter(struct dax_device *dax_dev, pgoff_t pgoff, void *addr,
 		size_t bytes, struct iov_iter *i);
 int dax_zero_page_range(struct dax_device *dax_dev, pgoff_t pgoff,
 		size_t nr_pages);
+int dax_holder_notify_failure(struct dax_device *dax_dev, u64 off, u64 len,
+		int mf_flags);
 void dax_flush(struct dax_device *dax_dev, void *addr, size_t size);
 ssize_t dax_iomap_rw(struct kiocb *iocb, struct iov_iter *iter,
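As a usage sketch of the API above: a filesystem claims the dax_device
at mount time and releases it at unmount. The xfs names here are
assumptions for illustration, not code from this series:

static const struct dax_holder_operations xfs_dax_holder_operations = {
	.notify_failure	= xfs_dax_notify_failure,
};

/* at mount: claim the device so failures can be routed back to the fs */
error = dax_register_holder(dax_dev, mp, &xfs_dax_holder_operations);
if (error)
	return error;	/* -EBUSY: another holder got there first */

/* at unmount, with the same holder pointer that was registered: */
dax_unregister_holder(dax_dev, mp);

The cmpxchg() in dax_register_holder() is what makes the -EBUSY case
race-free: at most one holder can ever install itself, so holder_ops is
only written by the winner.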