Message ID | 20220405194747.2386619-5-jane.chu@oracle.com (mailing list archive) |
---|---|
State | Superseded |
Headers | show |
Series | DAX poison recovery | expand |
On Tue, Apr 05, 2022 at 01:47:45PM -0600, Jane Chu wrote: > Introduce DAX_RECOVERY flag to dax_direct_access(). The flag is > not set by default in dax_direct_access() such that the helper > does not translate a pmem range to kernel virtual address if the > range contains uncorrectable errors. When the flag is set, > the helper ignores the UEs and return kernel virtual adderss so > that the caller may get on with data recovery via write. > > Also introduce a new dev_pagemap_ops .recovery_write function. > The function is applicable to FSDAX device only. The device > page backend driver provides .recovery_write function if the > device has underlying mechanism to clear the uncorrectable > errors on the fly. I know Dan suggested it, but I still think dev_pagemap_ops is the very wrong choice here. It is about VM callbacks to ZONE_DEVICE owners independent of what pagemap type they are. .recovery_write on the other hand is completely specific to the DAX write path and has no MM interactions at all. > /* see "strong" declaration in tools/testing/nvdimm/pmem-dax.c */ > __weak long __pmem_direct_access(struct pmem_device *pmem, pgoff_t pgoff, > - long nr_pages, void **kaddr, pfn_t *pfn) > + long nr_pages, int flags, void **kaddr, pfn_t *pfn) > { > resource_size_t offset = PFN_PHYS(pgoff) + pmem->data_offset; > + sector_t sector = PFN_PHYS(pgoff) >> SECTOR_SHIFT; > + unsigned int num = PFN_PHYS(nr_pages) >> SECTOR_SHIFT; > + struct badblocks *bb = &pmem->bb; > + sector_t first_bad; > + int num_bad; > + bool bad_in_range; > + long actual_nr; > + > + if (!bb->count) > + bad_in_range = false; > + else > + bad_in_range = !!badblocks_check(bb, sector, num, &first_bad, &num_bad); > > - if (unlikely(is_bad_pmem(&pmem->bb, PFN_PHYS(pgoff) / 512, > - PFN_PHYS(nr_pages)))) > + if (bad_in_range && !(flags & DAX_RECOVERY)) > return -EIO; The use of bad_in_range here seems a litle convoluted. See the attached patch on how I would structure the function to avoid the variable and have the reocvery code in a self-contained chunk. > - map_len = dax_direct_access(dax_dev, pgoff, PHYS_PFN(size), > - &kaddr, NULL); > + nrpg = PHYS_PFN(size); > + map_len = dax_direct_access(dax_dev, pgoff, nrpg, 0, &kaddr, NULL); Overly long line here. diff --git a/drivers/nvdimm/pmem.c b/drivers/nvdimm/pmem.c index b868a88a0d589..377e4d59aa90f 100644 --- a/drivers/nvdimm/pmem.c +++ b/drivers/nvdimm/pmem.c @@ -272,42 +272,40 @@ __weak long __pmem_direct_access(struct pmem_device *pmem, pgoff_t pgoff, struct badblocks *bb = &pmem->bb; sector_t first_bad; int num_bad; - bool bad_in_range; - long actual_nr; - - if (!bb->count) - bad_in_range = false; - else - bad_in_range = !!badblocks_check(bb, sector, num, &first_bad, &num_bad); - - if (bad_in_range && !(flags & DAX_RECOVERY)) - return -EIO; if (kaddr) *kaddr = pmem->virt_addr + offset; if (pfn) *pfn = phys_to_pfn_t(pmem->phys_addr + offset, pmem->pfn_flags); - if (!bad_in_range) { + if (bb->count && + badblocks_check(bb, sector, num, &first_bad, &num_bad)) { + long actual_nr; + + if (!(flags & DAX_RECOVERY)) + return -EIO; /* - * If badblock is present but not in the range, limit known good range - * to the requested range. + * Set the recovery stride to the kernel page size because the + * underlying driver and firmware clear poison functions don't + * appear to handle large chunk (such as + * 2MiB) reliably. */ - if (bb->count) - return nr_pages; - return PHYS_PFN(pmem->size - pmem->pfn_pad - offset); + actual_nr = PHYS_PFN( + PAGE_ALIGN((first_bad - sector) << SECTOR_SHIFT)); + dev_dbg(pmem->bb.dev, "start sector(%llu), nr_pages(%ld), first_bad(%llu), actual_nr(%ld)\n", + sector, nr_pages, first_bad, actual_nr); + if (actual_nr) + return actual_nr; + return 1; } /* - * In case poison is found in the given range and DAX_RECOVERY flag is set, - * recovery stride is set to kernel page size because the underlying driver and - * firmware clear poison functions don't appear to handle large chunk (such as - * 2MiB) reliably. + * If badblock is present but not in the range, limit known good range + * to the requested range. */ - actual_nr = PHYS_PFN(PAGE_ALIGN((first_bad - sector) << SECTOR_SHIFT)); - dev_dbg(pmem->bb.dev, "start sector(%llu), nr_pages(%ld), first_bad(%llu), actual_nr(%ld)\n", - sector, nr_pages, first_bad, actual_nr); - return (actual_nr == 0) ? 1 : actual_nr; + if (bb->count) + return nr_pages; + return PHYS_PFN(pmem->size - pmem->pfn_pad - offset); } static const struct block_device_operations pmem_fops = {
On 4/5/2022 10:19 PM, Christoph Hellwig wrote: > On Tue, Apr 05, 2022 at 01:47:45PM -0600, Jane Chu wrote: >> Introduce DAX_RECOVERY flag to dax_direct_access(). The flag is >> not set by default in dax_direct_access() such that the helper >> does not translate a pmem range to kernel virtual address if the >> range contains uncorrectable errors. When the flag is set, >> the helper ignores the UEs and return kernel virtual adderss so >> that the caller may get on with data recovery via write. >> >> Also introduce a new dev_pagemap_ops .recovery_write function. >> The function is applicable to FSDAX device only. The device >> page backend driver provides .recovery_write function if the >> device has underlying mechanism to clear the uncorrectable >> errors on the fly. > > I know Dan suggested it, but I still think dev_pagemap_ops is the very > wrong choice here. It is about VM callbacks to ZONE_DEVICE owners > independent of what pagemap type they are. .recovery_write on the > other hand is completely specific to the DAX write path and has no > MM interactions at all. Yes, I believe Dan was motivated by avoiding the dm dance as a result of adding .recovery_write to dax_operations. I understand your point about .recovery_write is device specific and thus not something appropriate for device agnostic ops. I can see 2 options so far - 1) add .recovery_write to dax_operations and do the dm dance to hunt down to the base device that actually provides the recovery action 2) an ugly but expedient approach based on the observation that dax_direct_access() has already gone through the dm dance and thus could scoop up the .recovery_write function pointer if DAX_RECOVERY flag is set. Like bundle action-flag with action, and if should there need more device specific actions, just add another action with associated flag. I'm thinking about something like this long dax_direct_access(struct dax_device *dax_dev, pgoff_t pgoff, long nr_pages, struct daxdev_specific *action, int flags, void **kaddr, pfn_t *pfn) where struct daxdev_specific { int flags; /* DAX_RECOVERY, etc */ size_t (*recovery_write) (pfn_t pfn, pgoff_t pgoff, void *addr, size_t bytes, void *iter); } __pmem_direct_access() provides the .recovery_write function pointer; dax_iomap_iter() ends up directly invoke the function in pmem.c which finds pgmap from pfn_t, and (struct pmem *) from pgmap->owner; In this way, we get rid of dax_recovery_write() interface as well as the dm dance. What do you think? Dan, could you also chime in ? > >> /* see "strong" declaration in tools/testing/nvdimm/pmem-dax.c */ >> __weak long __pmem_direct_access(struct pmem_device *pmem, pgoff_t pgoff, >> - long nr_pages, void **kaddr, pfn_t *pfn) >> + long nr_pages, int flags, void **kaddr, pfn_t *pfn) >> { >> resource_size_t offset = PFN_PHYS(pgoff) + pmem->data_offset; >> + sector_t sector = PFN_PHYS(pgoff) >> SECTOR_SHIFT; >> + unsigned int num = PFN_PHYS(nr_pages) >> SECTOR_SHIFT; >> + struct badblocks *bb = &pmem->bb; >> + sector_t first_bad; >> + int num_bad; >> + bool bad_in_range; >> + long actual_nr; >> + >> + if (!bb->count) >> + bad_in_range = false; >> + else >> + bad_in_range = !!badblocks_check(bb, sector, num, &first_bad, &num_bad); >> >> - if (unlikely(is_bad_pmem(&pmem->bb, PFN_PHYS(pgoff) / 512, >> - PFN_PHYS(nr_pages)))) >> + if (bad_in_range && !(flags & DAX_RECOVERY)) >> return -EIO; > > The use of bad_in_range here seems a litle convoluted. See the attached > patch on how I would structure the function to avoid the variable and > have the reocvery code in a self-contained chunk. Much better, will use your version, thanks! > >> - map_len = dax_direct_access(dax_dev, pgoff, PHYS_PFN(size), >> - &kaddr, NULL); >> + nrpg = PHYS_PFN(size); >> + map_len = dax_direct_access(dax_dev, pgoff, nrpg, 0, &kaddr, NULL); > > Overly long line here. Okay, will run the checkpatch.pl test again. thanks! -jane
Resend, not sure it didn't go through. On 4/6/2022 10:32 AM, Jane Chu wrote: > On 4/5/2022 10:19 PM, Christoph Hellwig wrote: >> On Tue, Apr 05, 2022 at 01:47:45PM -0600, Jane Chu wrote: >>> Introduce DAX_RECOVERY flag to dax_direct_access(). The flag is >>> not set by default in dax_direct_access() such that the helper >>> does not translate a pmem range to kernel virtual address if the >>> range contains uncorrectable errors. When the flag is set, >>> the helper ignores the UEs and return kernel virtual adderss so >>> that the caller may get on with data recovery via write. >>> >>> Also introduce a new dev_pagemap_ops .recovery_write function. >>> The function is applicable to FSDAX device only. The device >>> page backend driver provides .recovery_write function if the >>> device has underlying mechanism to clear the uncorrectable >>> errors on the fly. >> >> I know Dan suggested it, but I still think dev_pagemap_ops is the very >> wrong choice here. It is about VM callbacks to ZONE_DEVICE owners >> independent of what pagemap type they are. .recovery_write on the >> other hand is completely specific to the DAX write path and has no >> MM interactions at all. > > Yes, I believe Dan was motivated by avoiding the dm dance as a result of > adding .recovery_write to dax_operations. > > I understand your point about .recovery_write is device specific and > thus not something appropriate for device agnostic ops. > > I can see 2 options so far - > > 1) add .recovery_write to dax_operations and do the dm dance to hunt > down to the base device that actually provides the recovery action > > 2) an ugly but expedient approach based on the observation that > dax_direct_access() has already gone through the dm dance and thus could > scoop up the .recovery_write function pointer if DAX_RECOVERY flag is > set. Like bundle action-flag with action, and if should there need more > device specific actions, just add another action with associated flag. > > I'm thinking about something like this > > long dax_direct_access(struct dax_device *dax_dev, pgoff_t pgoff, > long nr_pages, struct daxdev_specific *action, > int flags, void **kaddr, pfn_t *pfn) > > where > struct daxdev_specific { > int flags; /* DAX_RECOVERY, etc */ > size_t (*recovery_write) (pfn_t pfn, pgoff_t pgoff, void *addr, > size_t bytes, void *iter); > } > > __pmem_direct_access() provides the .recovery_write function pointer; > dax_iomap_iter() ends up directly invoke the function in pmem.c > which finds pgmap from pfn_t, and (struct pmem *) from > pgmap->owner; > > In this way, we get rid of dax_recovery_write() interface as well as the > dm dance. > > What do you think? > > Dan, could you also chime in ? > >> >>> /* see "strong" declaration in tools/testing/nvdimm/pmem-dax.c */ >>> __weak long __pmem_direct_access(struct pmem_device *pmem, pgoff_t >>> pgoff, >>> - long nr_pages, void **kaddr, pfn_t *pfn) >>> + long nr_pages, int flags, void **kaddr, pfn_t *pfn) >>> { >>> resource_size_t offset = PFN_PHYS(pgoff) + pmem->data_offset; >>> + sector_t sector = PFN_PHYS(pgoff) >> SECTOR_SHIFT; >>> + unsigned int num = PFN_PHYS(nr_pages) >> SECTOR_SHIFT; >>> + struct badblocks *bb = &pmem->bb; >>> + sector_t first_bad; >>> + int num_bad; >>> + bool bad_in_range; >>> + long actual_nr; >>> + >>> + if (!bb->count) >>> + bad_in_range = false; >>> + else >>> + bad_in_range = !!badblocks_check(bb, sector, num, >>> &first_bad, &num_bad); >>> - if (unlikely(is_bad_pmem(&pmem->bb, PFN_PHYS(pgoff) / 512, >>> - PFN_PHYS(nr_pages)))) >>> + if (bad_in_range && !(flags & DAX_RECOVERY)) >>> return -EIO; >> >> The use of bad_in_range here seems a litle convoluted. See the attached >> patch on how I would structure the function to avoid the variable and >> have the reocvery code in a self-contained chunk. > > Much better, will use your version, thanks! > >> >>> - map_len = dax_direct_access(dax_dev, pgoff, PHYS_PFN(size), >>> - &kaddr, NULL); >>> + nrpg = PHYS_PFN(size); >>> + map_len = dax_direct_access(dax_dev, pgoff, nrpg, 0, &kaddr, >>> NULL); >> >> Overly long line here. > > Okay, will run the checkpatch.pl test again. > > thanks! > -jane
On Wed, Apr 06, 2022 at 05:32:31PM +0000, Jane Chu wrote: > Yes, I believe Dan was motivated by avoiding the dm dance as a result of > adding .recovery_write to dax_operations. > > I understand your point about .recovery_write is device specific and > thus not something appropriate for device agnostic ops. > > I can see 2 options so far - > > 1) add .recovery_write to dax_operations and do the dm dance to hunt > down to the base device that actually provides the recovery action That would be my preference. But I'll wait for Dan to chime in. > Okay, will run the checkpatch.pl test again. Unfortuntely checkpatch.pl is broken in that regard. It treats the exception to occasionally go longer or readability as the default.
On Wed, Apr 6, 2022 at 10:31 PM Christoph Hellwig <hch@infradead.org> wrote: > > On Wed, Apr 06, 2022 at 05:32:31PM +0000, Jane Chu wrote: > > Yes, I believe Dan was motivated by avoiding the dm dance as a result of > > adding .recovery_write to dax_operations. > > > > I understand your point about .recovery_write is device specific and > > thus not something appropriate for device agnostic ops. > > > > I can see 2 options so far - > > > > 1) add .recovery_write to dax_operations and do the dm dance to hunt > > down to the base device that actually provides the recovery action > > That would be my preference. But I'll wait for Dan to chime in. Yeah, so the motivation was avoiding plumbing recovery through stacked lookups when the recovery is specific to a pfn and the provider of that pfn, but I also see it from Christoph's perspective that the only agent that cares about recovery is the fsdax I/O path. Certainly having ->dax_direct_access() take a DAX_RECOVERY flag and the op itself go through the pgmap is a confusing split that I did not anticipate when I made the suggestion. Since that flag must be there, then the ->recovery_write() should also stay relative to a dax device. Apologies for the thrash Jane. One ask though, please separate plumbing the new flag argument to ->dax_direct_access() and plumbing the new operation into preparation patches before filling them in with the new goodness.
On Tue, Apr 5, 2022 at 12:48 PM Jane Chu <jane.chu@oracle.com> wrote: > > Introduce DAX_RECOVERY flag to dax_direct_access(). The flag is > not set by default in dax_direct_access() such that the helper > does not translate a pmem range to kernel virtual address if the > range contains uncorrectable errors. When the flag is set, > the helper ignores the UEs and return kernel virtual adderss so > that the caller may get on with data recovery via write. > > Also introduce a new dev_pagemap_ops .recovery_write function. > The function is applicable to FSDAX device only. The device > page backend driver provides .recovery_write function if the > device has underlying mechanism to clear the uncorrectable > errors on the fly. > > Signed-off-by: Jane Chu <jane.chu@oracle.com> > --- > drivers/dax/super.c | 17 ++++++++-- > drivers/md/dm-linear.c | 4 +-- > drivers/md/dm-log-writes.c | 5 +-- > drivers/md/dm-stripe.c | 4 +-- > drivers/md/dm-target.c | 2 +- > drivers/md/dm-writecache.c | 5 +-- > drivers/md/dm.c | 5 +-- > drivers/nvdimm/pmem.c | 57 +++++++++++++++++++++++++++------ > drivers/nvdimm/pmem.h | 2 +- > drivers/s390/block/dcssblk.c | 4 +-- > fs/dax.c | 24 ++++++++++---- > fs/fuse/dax.c | 4 +-- > include/linux/dax.h | 11 +++++-- > include/linux/device-mapper.h | 2 +- > include/linux/memremap.h | 7 ++++ > tools/testing/nvdimm/pmem-dax.c | 2 +- > 16 files changed, 116 insertions(+), 39 deletions(-) > > diff --git a/drivers/dax/super.c b/drivers/dax/super.c > index 0211e6f7b47a..8252858cd25a 100644 > --- a/drivers/dax/super.c > +++ b/drivers/dax/super.c > @@ -13,6 +13,7 @@ > #include <linux/uio.h> > #include <linux/dax.h> > #include <linux/fs.h> > +#include <linux/memremap.h> > #include "dax-private.h" > > /** > @@ -117,6 +118,7 @@ enum dax_device_flags { > * @dax_dev: a dax_device instance representing the logical memory range > * @pgoff: offset in pages from the start of the device to translate > * @nr_pages: number of consecutive pages caller can handle relative to @pfn > + * @flags: by default 0, set to DAX_RECOVERY to kick start dax recovery > * @kaddr: output parameter that returns a virtual address mapping of pfn > * @pfn: output parameter that returns an absolute pfn translation of @pgoff > * > @@ -124,7 +126,7 @@ enum dax_device_flags { > * pages accessible at the device relative @pgoff. > */ > long dax_direct_access(struct dax_device *dax_dev, pgoff_t pgoff, long nr_pages, > - void **kaddr, pfn_t *pfn) > + int flags, void **kaddr, pfn_t *pfn) > { > long avail; > > @@ -137,7 +139,7 @@ long dax_direct_access(struct dax_device *dax_dev, pgoff_t pgoff, long nr_pages, > if (nr_pages < 0) > return -EINVAL; > > - avail = dax_dev->ops->direct_access(dax_dev, pgoff, nr_pages, > + avail = dax_dev->ops->direct_access(dax_dev, pgoff, nr_pages, flags, > kaddr, pfn); > if (!avail) > return -ERANGE; > @@ -194,6 +196,17 @@ int dax_zero_page_range(struct dax_device *dax_dev, pgoff_t pgoff, > } > EXPORT_SYMBOL_GPL(dax_zero_page_range); > > +size_t dax_recovery_write(struct dax_device *dax_dev, pgoff_t pgoff, > + pfn_t pfn, void *addr, size_t bytes, struct iov_iter *iter) > +{ > + struct dev_pagemap *pgmap = get_dev_pagemap(pfn_t_to_pfn(pfn), NULL); > + > + if (!pgmap || !pgmap->ops || !pgmap->ops->recovery_write) > + return 0; > + return pgmap->ops->recovery_write(pgmap, pgoff, addr, bytes, iter); > +} > +EXPORT_SYMBOL_GPL(dax_recovery_write); > + > #ifdef CONFIG_ARCH_HAS_PMEM_API > void arch_wb_cache_pmem(void *addr, size_t size); > void dax_flush(struct dax_device *dax_dev, void *addr, size_t size) > diff --git a/drivers/md/dm-linear.c b/drivers/md/dm-linear.c > index 76b486e4d2be..9e6d8bdf3b2a 100644 > --- a/drivers/md/dm-linear.c > +++ b/drivers/md/dm-linear.c > @@ -172,11 +172,11 @@ static struct dax_device *linear_dax_pgoff(struct dm_target *ti, pgoff_t *pgoff) > } > > static long linear_dax_direct_access(struct dm_target *ti, pgoff_t pgoff, > - long nr_pages, void **kaddr, pfn_t *pfn) > + long nr_pages, int flags, void **kaddr, pfn_t *pfn) > { > struct dax_device *dax_dev = linear_dax_pgoff(ti, &pgoff); > > - return dax_direct_access(dax_dev, pgoff, nr_pages, kaddr, pfn); > + return dax_direct_access(dax_dev, pgoff, nr_pages, flags, kaddr, pfn); > } > > static int linear_dax_zero_page_range(struct dm_target *ti, pgoff_t pgoff, > diff --git a/drivers/md/dm-log-writes.c b/drivers/md/dm-log-writes.c > index c9d036d6bb2e..e23f062ade5f 100644 > --- a/drivers/md/dm-log-writes.c > +++ b/drivers/md/dm-log-writes.c > @@ -889,11 +889,12 @@ static struct dax_device *log_writes_dax_pgoff(struct dm_target *ti, > } > > static long log_writes_dax_direct_access(struct dm_target *ti, pgoff_t pgoff, > - long nr_pages, void **kaddr, pfn_t *pfn) > + long nr_pages, int flags, > + void **kaddr, pfn_t *pfn) > { > struct dax_device *dax_dev = log_writes_dax_pgoff(ti, &pgoff); > > - return dax_direct_access(dax_dev, pgoff, nr_pages, kaddr, pfn); > + return dax_direct_access(dax_dev, pgoff, nr_pages, flags, kaddr, pfn); > } > > static int log_writes_dax_zero_page_range(struct dm_target *ti, pgoff_t pgoff, > diff --git a/drivers/md/dm-stripe.c b/drivers/md/dm-stripe.c > index c81d331d1afe..b89339c78702 100644 > --- a/drivers/md/dm-stripe.c > +++ b/drivers/md/dm-stripe.c > @@ -315,11 +315,11 @@ static struct dax_device *stripe_dax_pgoff(struct dm_target *ti, pgoff_t *pgoff) > } > > static long stripe_dax_direct_access(struct dm_target *ti, pgoff_t pgoff, > - long nr_pages, void **kaddr, pfn_t *pfn) > + long nr_pages, int flags, void **kaddr, pfn_t *pfn) > { > struct dax_device *dax_dev = stripe_dax_pgoff(ti, &pgoff); > > - return dax_direct_access(dax_dev, pgoff, nr_pages, kaddr, pfn); > + return dax_direct_access(dax_dev, pgoff, nr_pages, flags, kaddr, pfn); > } > > static int stripe_dax_zero_page_range(struct dm_target *ti, pgoff_t pgoff, > diff --git a/drivers/md/dm-target.c b/drivers/md/dm-target.c > index 64dd0b34fcf4..24b1e5628f3a 100644 > --- a/drivers/md/dm-target.c > +++ b/drivers/md/dm-target.c > @@ -142,7 +142,7 @@ static void io_err_release_clone_rq(struct request *clone, > } > > static long io_err_dax_direct_access(struct dm_target *ti, pgoff_t pgoff, > - long nr_pages, void **kaddr, pfn_t *pfn) > + long nr_pages, int flags, void **kaddr, pfn_t *pfn) > { > return -EIO; > } > diff --git a/drivers/md/dm-writecache.c b/drivers/md/dm-writecache.c > index 5630b470ba42..180ca8fa383e 100644 > --- a/drivers/md/dm-writecache.c > +++ b/drivers/md/dm-writecache.c > @@ -286,7 +286,8 @@ static int persistent_memory_claim(struct dm_writecache *wc) > > id = dax_read_lock(); > > - da = dax_direct_access(wc->ssd_dev->dax_dev, offset, p, &wc->memory_map, &pfn); > + da = dax_direct_access(wc->ssd_dev->dax_dev, offset, p, 0, > + &wc->memory_map, &pfn); > if (da < 0) { > wc->memory_map = NULL; > r = da; > @@ -309,7 +310,7 @@ static int persistent_memory_claim(struct dm_writecache *wc) > do { > long daa; > daa = dax_direct_access(wc->ssd_dev->dax_dev, offset + i, p - i, > - NULL, &pfn); > + 0, NULL, &pfn); > if (daa <= 0) { > r = daa ? daa : -EINVAL; > goto err3; > diff --git a/drivers/md/dm.c b/drivers/md/dm.c > index ad2e0bbeb559..a8c697bb6603 100644 > --- a/drivers/md/dm.c > +++ b/drivers/md/dm.c > @@ -1087,7 +1087,8 @@ static struct dm_target *dm_dax_get_live_target(struct mapped_device *md, > } > > static long dm_dax_direct_access(struct dax_device *dax_dev, pgoff_t pgoff, > - long nr_pages, void **kaddr, pfn_t *pfn) > + long nr_pages, int flags, void **kaddr, > + pfn_t *pfn) > { > struct mapped_device *md = dax_get_private(dax_dev); > sector_t sector = pgoff * PAGE_SECTORS; > @@ -1105,7 +1106,7 @@ static long dm_dax_direct_access(struct dax_device *dax_dev, pgoff_t pgoff, > if (len < 1) > goto out; > nr_pages = min(len, nr_pages); > - ret = ti->type->direct_access(ti, pgoff, nr_pages, kaddr, pfn); > + ret = ti->type->direct_access(ti, pgoff, nr_pages, flags, kaddr, pfn); > > out: > dm_put_live_table(md, srcu_idx); > diff --git a/drivers/nvdimm/pmem.c b/drivers/nvdimm/pmem.c > index 30c71a68175b..0400c5a7ba39 100644 > --- a/drivers/nvdimm/pmem.c > +++ b/drivers/nvdimm/pmem.c > @@ -238,12 +238,23 @@ static int pmem_rw_page(struct block_device *bdev, sector_t sector, > > /* see "strong" declaration in tools/testing/nvdimm/pmem-dax.c */ > __weak long __pmem_direct_access(struct pmem_device *pmem, pgoff_t pgoff, > - long nr_pages, void **kaddr, pfn_t *pfn) > + long nr_pages, int flags, void **kaddr, pfn_t *pfn) > { > resource_size_t offset = PFN_PHYS(pgoff) + pmem->data_offset; > + sector_t sector = PFN_PHYS(pgoff) >> SECTOR_SHIFT; > + unsigned int num = PFN_PHYS(nr_pages) >> SECTOR_SHIFT; > + struct badblocks *bb = &pmem->bb; > + sector_t first_bad; > + int num_bad; > + bool bad_in_range; > + long actual_nr; > + > + if (!bb->count) > + bad_in_range = false; > + else > + bad_in_range = !!badblocks_check(bb, sector, num, &first_bad, &num_bad); Why all this change... > > - if (unlikely(is_bad_pmem(&pmem->bb, PFN_PHYS(pgoff) / 512, > - PFN_PHYS(nr_pages)))) ...instead of adding "&& !(flags & DAX_RECOVERY)" to this statement? > + if (bad_in_range && !(flags & DAX_RECOVERY)) > return -EIO; > > if (kaddr) > @@ -251,13 +262,26 @@ __weak long __pmem_direct_access(struct pmem_device *pmem, pgoff_t pgoff, > if (pfn) > *pfn = phys_to_pfn_t(pmem->phys_addr + offset, pmem->pfn_flags); > > + if (!bad_in_range) { > + /* > + * If badblock is present but not in the range, limit known good range > + * to the requested range. > + */ > + if (bb->count) > + return nr_pages; > + return PHYS_PFN(pmem->size - pmem->pfn_pad - offset); > + } > + > /* > - * If badblocks are present, limit known good range to the > - * requested range. > + * In case poison is found in the given range and DAX_RECOVERY flag is set, > + * recovery stride is set to kernel page size because the underlying driver and > + * firmware clear poison functions don't appear to handle large chunk (such as > + * 2MiB) reliably. > */ > - if (unlikely(pmem->bb.count)) > - return nr_pages; > - return PHYS_PFN(pmem->size - pmem->pfn_pad - offset); > + actual_nr = PHYS_PFN(PAGE_ALIGN((first_bad - sector) << SECTOR_SHIFT)); > + dev_dbg(pmem->bb.dev, "start sector(%llu), nr_pages(%ld), first_bad(%llu), actual_nr(%ld)\n", > + sector, nr_pages, first_bad, actual_nr); > + return (actual_nr == 0) ? 1 : actual_nr; Similar feedback as above that this is more change than I would expect. I think just adding... if (flags & DAX_RECOVERY) return 1; ...before the typical return path is enough.
On Tue, Apr 5, 2022 at 12:48 PM Jane Chu <jane.chu@oracle.com> wrote: > > Introduce DAX_RECOVERY flag to dax_direct_access(). The flag is > not set by default in dax_direct_access() such that the helper > does not translate a pmem range to kernel virtual address if the > range contains uncorrectable errors. When the flag is set, > the helper ignores the UEs and return kernel virtual adderss so > that the caller may get on with data recovery via write. It strikes me that there is likely never going to be any other flags to dax_direct_access() and what this option really is an access type. I also find code changes like this error prone to read: - rc = dax_direct_access(iter->iomap.dax_dev, pgoff, 1, &kaddr, NULL); + rc = dax_direct_access(iter->iomap.dax_dev, pgoff, 1, 0, &kaddr, NULL); ...i.e. without looking at the prototype, which option is the nr_pages and which is the flags? So how about change 'int flags' to 'enum dax_access_mode mode' where dax_access_mode is: /** * enum dax_access_mode - operational mode for dax_direct_access() * @DAX_ACCESS: nominal access, fail / trim access on encountering poison * @DAX_RECOVERY_WRITE: ignore poison and provide a pointer suitable for use with dax_recovery_write() */ enum dax_access_mode { DAX_ACCESS, DAX_RECOVERY_WRITE, }; Then the conversions look like this: - rc = dax_direct_access(iter->iomap.dax_dev, pgoff, 1, &kaddr, NULL); + rc = dax_direct_access(iter->iomap.dax_dev, pgoff, 1, DAX_ACCESS, &kaddr, NULL); ...and there's less chance of confusion with the @nr_pages argument.
On Mon, Apr 11, 2022 at 09:57:36PM -0700, Dan Williams wrote: > So how about change 'int flags' to 'enum dax_access_mode mode' where > dax_access_mode is: > > /** > * enum dax_access_mode - operational mode for dax_direct_access() > * @DAX_ACCESS: nominal access, fail / trim access on encountering poison > * @DAX_RECOVERY_WRITE: ignore poison and provide a pointer suitable > for use with dax_recovery_write() > */ > enum dax_access_mode { > DAX_ACCESS, > DAX_RECOVERY_WRITE, > }; > > Then the conversions look like this: > > - rc = dax_direct_access(iter->iomap.dax_dev, pgoff, 1, &kaddr, NULL); > + rc = dax_direct_access(iter->iomap.dax_dev, pgoff, 1, > DAX_ACCESS, &kaddr, NULL); > > ...and there's less chance of confusion with the @nr_pages argument. Yes, this might be a little nicer.
On 4/6/2022 10:30 PM, Christoph Hellwig wrote: > On Wed, Apr 06, 2022 at 05:32:31PM +0000, Jane Chu wrote: >> Yes, I believe Dan was motivated by avoiding the dm dance as a result of >> adding .recovery_write to dax_operations. >> >> I understand your point about .recovery_write is device specific and >> thus not something appropriate for device agnostic ops. >> >> I can see 2 options so far - >> >> 1) add .recovery_write to dax_operations and do the dm dance to hunt >> down to the base device that actually provides the recovery action > > That would be my preference. But I'll wait for Dan to chime in. Okay. > >> Okay, will run the checkpatch.pl test again. > > Unfortuntely checkpatch.pl is broken in that regard. It treats the > exception to occasionally go longer or readability as the default. I see. thanks, -jane
On 4/11/2022 4:55 PM, Dan Williams wrote: > On Wed, Apr 6, 2022 at 10:31 PM Christoph Hellwig <hch@infradead.org> wrote: >> >> On Wed, Apr 06, 2022 at 05:32:31PM +0000, Jane Chu wrote: >>> Yes, I believe Dan was motivated by avoiding the dm dance as a result of >>> adding .recovery_write to dax_operations. >>> >>> I understand your point about .recovery_write is device specific and >>> thus not something appropriate for device agnostic ops. >>> >>> I can see 2 options so far - >>> >>> 1) add .recovery_write to dax_operations and do the dm dance to hunt >>> down to the base device that actually provides the recovery action >> >> That would be my preference. But I'll wait for Dan to chime in. > > Yeah, so the motivation was avoiding plumbing recovery through stacked > lookups when the recovery is specific to a pfn and the provider of > that pfn, but I also see it from Christoph's perspective that the only > agent that cares about recovery is the fsdax I/O path. Certainly > having ->dax_direct_access() take a DAX_RECOVERY flag and the op > itself go through the pgmap is a confusing split that I did not > anticipate when I made the suggestion. Since that flag must be there, > then the ->recovery_write() should also stay relative to a dax device. > > Apologies for the thrash Jane. > > One ask though, please separate plumbing the new flag argument to > ->dax_direct_access() and plumbing the new operation into preparation > patches before filling them in with the new goodness. Okay, will do in next revision. thanks! -jane
On 4/11/2022 5:08 PM, Dan Williams wrote: > On Tue, Apr 5, 2022 at 12:48 PM Jane Chu <jane.chu@oracle.com> wrote: >> >> Introduce DAX_RECOVERY flag to dax_direct_access(). The flag is >> not set by default in dax_direct_access() such that the helper >> does not translate a pmem range to kernel virtual address if the >> range contains uncorrectable errors. When the flag is set, >> the helper ignores the UEs and return kernel virtual adderss so >> that the caller may get on with data recovery via write. >> >> Also introduce a new dev_pagemap_ops .recovery_write function. >> The function is applicable to FSDAX device only. The device >> page backend driver provides .recovery_write function if the >> device has underlying mechanism to clear the uncorrectable >> errors on the fly. >> >> Signed-off-by: Jane Chu <jane.chu@oracle.com> >> --- >> drivers/dax/super.c | 17 ++++++++-- >> drivers/md/dm-linear.c | 4 +-- >> drivers/md/dm-log-writes.c | 5 +-- >> drivers/md/dm-stripe.c | 4 +-- >> drivers/md/dm-target.c | 2 +- >> drivers/md/dm-writecache.c | 5 +-- >> drivers/md/dm.c | 5 +-- >> drivers/nvdimm/pmem.c | 57 +++++++++++++++++++++++++++------ >> drivers/nvdimm/pmem.h | 2 +- >> drivers/s390/block/dcssblk.c | 4 +-- >> fs/dax.c | 24 ++++++++++---- >> fs/fuse/dax.c | 4 +-- >> include/linux/dax.h | 11 +++++-- >> include/linux/device-mapper.h | 2 +- >> include/linux/memremap.h | 7 ++++ >> tools/testing/nvdimm/pmem-dax.c | 2 +- >> 16 files changed, 116 insertions(+), 39 deletions(-) >> >> diff --git a/drivers/dax/super.c b/drivers/dax/super.c >> index 0211e6f7b47a..8252858cd25a 100644 >> --- a/drivers/dax/super.c >> +++ b/drivers/dax/super.c >> @@ -13,6 +13,7 @@ >> #include <linux/uio.h> >> #include <linux/dax.h> >> #include <linux/fs.h> >> +#include <linux/memremap.h> >> #include "dax-private.h" >> >> /** >> @@ -117,6 +118,7 @@ enum dax_device_flags { >> * @dax_dev: a dax_device instance representing the logical memory range >> * @pgoff: offset in pages from the start of the device to translate >> * @nr_pages: number of consecutive pages caller can handle relative to @pfn >> + * @flags: by default 0, set to DAX_RECOVERY to kick start dax recovery >> * @kaddr: output parameter that returns a virtual address mapping of pfn >> * @pfn: output parameter that returns an absolute pfn translation of @pgoff >> * >> @@ -124,7 +126,7 @@ enum dax_device_flags { >> * pages accessible at the device relative @pgoff. >> */ >> long dax_direct_access(struct dax_device *dax_dev, pgoff_t pgoff, long nr_pages, >> - void **kaddr, pfn_t *pfn) >> + int flags, void **kaddr, pfn_t *pfn) >> { >> long avail; >> >> @@ -137,7 +139,7 @@ long dax_direct_access(struct dax_device *dax_dev, pgoff_t pgoff, long nr_pages, >> if (nr_pages < 0) >> return -EINVAL; >> >> - avail = dax_dev->ops->direct_access(dax_dev, pgoff, nr_pages, >> + avail = dax_dev->ops->direct_access(dax_dev, pgoff, nr_pages, flags, >> kaddr, pfn); >> if (!avail) >> return -ERANGE; >> @@ -194,6 +196,17 @@ int dax_zero_page_range(struct dax_device *dax_dev, pgoff_t pgoff, >> } >> EXPORT_SYMBOL_GPL(dax_zero_page_range); >> >> +size_t dax_recovery_write(struct dax_device *dax_dev, pgoff_t pgoff, >> + pfn_t pfn, void *addr, size_t bytes, struct iov_iter *iter) >> +{ >> + struct dev_pagemap *pgmap = get_dev_pagemap(pfn_t_to_pfn(pfn), NULL); >> + >> + if (!pgmap || !pgmap->ops || !pgmap->ops->recovery_write) >> + return 0; >> + return pgmap->ops->recovery_write(pgmap, pgoff, addr, bytes, iter); >> +} >> +EXPORT_SYMBOL_GPL(dax_recovery_write); >> + >> #ifdef CONFIG_ARCH_HAS_PMEM_API >> void arch_wb_cache_pmem(void *addr, size_t size); >> void dax_flush(struct dax_device *dax_dev, void *addr, size_t size) >> diff --git a/drivers/md/dm-linear.c b/drivers/md/dm-linear.c >> index 76b486e4d2be..9e6d8bdf3b2a 100644 >> --- a/drivers/md/dm-linear.c >> +++ b/drivers/md/dm-linear.c >> @@ -172,11 +172,11 @@ static struct dax_device *linear_dax_pgoff(struct dm_target *ti, pgoff_t *pgoff) >> } >> >> static long linear_dax_direct_access(struct dm_target *ti, pgoff_t pgoff, >> - long nr_pages, void **kaddr, pfn_t *pfn) >> + long nr_pages, int flags, void **kaddr, pfn_t *pfn) >> { >> struct dax_device *dax_dev = linear_dax_pgoff(ti, &pgoff); >> >> - return dax_direct_access(dax_dev, pgoff, nr_pages, kaddr, pfn); >> + return dax_direct_access(dax_dev, pgoff, nr_pages, flags, kaddr, pfn); >> } >> >> static int linear_dax_zero_page_range(struct dm_target *ti, pgoff_t pgoff, >> diff --git a/drivers/md/dm-log-writes.c b/drivers/md/dm-log-writes.c >> index c9d036d6bb2e..e23f062ade5f 100644 >> --- a/drivers/md/dm-log-writes.c >> +++ b/drivers/md/dm-log-writes.c >> @@ -889,11 +889,12 @@ static struct dax_device *log_writes_dax_pgoff(struct dm_target *ti, >> } >> >> static long log_writes_dax_direct_access(struct dm_target *ti, pgoff_t pgoff, >> - long nr_pages, void **kaddr, pfn_t *pfn) >> + long nr_pages, int flags, >> + void **kaddr, pfn_t *pfn) >> { >> struct dax_device *dax_dev = log_writes_dax_pgoff(ti, &pgoff); >> >> - return dax_direct_access(dax_dev, pgoff, nr_pages, kaddr, pfn); >> + return dax_direct_access(dax_dev, pgoff, nr_pages, flags, kaddr, pfn); >> } >> >> static int log_writes_dax_zero_page_range(struct dm_target *ti, pgoff_t pgoff, >> diff --git a/drivers/md/dm-stripe.c b/drivers/md/dm-stripe.c >> index c81d331d1afe..b89339c78702 100644 >> --- a/drivers/md/dm-stripe.c >> +++ b/drivers/md/dm-stripe.c >> @@ -315,11 +315,11 @@ static struct dax_device *stripe_dax_pgoff(struct dm_target *ti, pgoff_t *pgoff) >> } >> >> static long stripe_dax_direct_access(struct dm_target *ti, pgoff_t pgoff, >> - long nr_pages, void **kaddr, pfn_t *pfn) >> + long nr_pages, int flags, void **kaddr, pfn_t *pfn) >> { >> struct dax_device *dax_dev = stripe_dax_pgoff(ti, &pgoff); >> >> - return dax_direct_access(dax_dev, pgoff, nr_pages, kaddr, pfn); >> + return dax_direct_access(dax_dev, pgoff, nr_pages, flags, kaddr, pfn); >> } >> >> static int stripe_dax_zero_page_range(struct dm_target *ti, pgoff_t pgoff, >> diff --git a/drivers/md/dm-target.c b/drivers/md/dm-target.c >> index 64dd0b34fcf4..24b1e5628f3a 100644 >> --- a/drivers/md/dm-target.c >> +++ b/drivers/md/dm-target.c >> @@ -142,7 +142,7 @@ static void io_err_release_clone_rq(struct request *clone, >> } >> >> static long io_err_dax_direct_access(struct dm_target *ti, pgoff_t pgoff, >> - long nr_pages, void **kaddr, pfn_t *pfn) >> + long nr_pages, int flags, void **kaddr, pfn_t *pfn) >> { >> return -EIO; >> } >> diff --git a/drivers/md/dm-writecache.c b/drivers/md/dm-writecache.c >> index 5630b470ba42..180ca8fa383e 100644 >> --- a/drivers/md/dm-writecache.c >> +++ b/drivers/md/dm-writecache.c >> @@ -286,7 +286,8 @@ static int persistent_memory_claim(struct dm_writecache *wc) >> >> id = dax_read_lock(); >> >> - da = dax_direct_access(wc->ssd_dev->dax_dev, offset, p, &wc->memory_map, &pfn); >> + da = dax_direct_access(wc->ssd_dev->dax_dev, offset, p, 0, >> + &wc->memory_map, &pfn); >> if (da < 0) { >> wc->memory_map = NULL; >> r = da; >> @@ -309,7 +310,7 @@ static int persistent_memory_claim(struct dm_writecache *wc) >> do { >> long daa; >> daa = dax_direct_access(wc->ssd_dev->dax_dev, offset + i, p - i, >> - NULL, &pfn); >> + 0, NULL, &pfn); >> if (daa <= 0) { >> r = daa ? daa : -EINVAL; >> goto err3; >> diff --git a/drivers/md/dm.c b/drivers/md/dm.c >> index ad2e0bbeb559..a8c697bb6603 100644 >> --- a/drivers/md/dm.c >> +++ b/drivers/md/dm.c >> @@ -1087,7 +1087,8 @@ static struct dm_target *dm_dax_get_live_target(struct mapped_device *md, >> } >> >> static long dm_dax_direct_access(struct dax_device *dax_dev, pgoff_t pgoff, >> - long nr_pages, void **kaddr, pfn_t *pfn) >> + long nr_pages, int flags, void **kaddr, >> + pfn_t *pfn) >> { >> struct mapped_device *md = dax_get_private(dax_dev); >> sector_t sector = pgoff * PAGE_SECTORS; >> @@ -1105,7 +1106,7 @@ static long dm_dax_direct_access(struct dax_device *dax_dev, pgoff_t pgoff, >> if (len < 1) >> goto out; >> nr_pages = min(len, nr_pages); >> - ret = ti->type->direct_access(ti, pgoff, nr_pages, kaddr, pfn); >> + ret = ti->type->direct_access(ti, pgoff, nr_pages, flags, kaddr, pfn); >> >> out: >> dm_put_live_table(md, srcu_idx); >> diff --git a/drivers/nvdimm/pmem.c b/drivers/nvdimm/pmem.c >> index 30c71a68175b..0400c5a7ba39 100644 >> --- a/drivers/nvdimm/pmem.c >> +++ b/drivers/nvdimm/pmem.c >> @@ -238,12 +238,23 @@ static int pmem_rw_page(struct block_device *bdev, sector_t sector, >> >> /* see "strong" declaration in tools/testing/nvdimm/pmem-dax.c */ >> __weak long __pmem_direct_access(struct pmem_device *pmem, pgoff_t pgoff, >> - long nr_pages, void **kaddr, pfn_t *pfn) >> + long nr_pages, int flags, void **kaddr, pfn_t *pfn) >> { >> resource_size_t offset = PFN_PHYS(pgoff) + pmem->data_offset; >> + sector_t sector = PFN_PHYS(pgoff) >> SECTOR_SHIFT; >> + unsigned int num = PFN_PHYS(nr_pages) >> SECTOR_SHIFT; >> + struct badblocks *bb = &pmem->bb; >> + sector_t first_bad; >> + int num_bad; >> + bool bad_in_range; >> + long actual_nr; >> + >> + if (!bb->count) >> + bad_in_range = false; >> + else >> + bad_in_range = !!badblocks_check(bb, sector, num, &first_bad, &num_bad); > > Why all this change. > >> >> - if (unlikely(is_bad_pmem(&pmem->bb, PFN_PHYS(pgoff) / 512, >> - PFN_PHYS(nr_pages)))) > > ...instead of adding "&& !(flags & DAX_RECOVERY)" to this statement? > >> + if (bad_in_range && !(flags & DAX_RECOVERY)) >> return -EIO; >> >> if (kaddr) >> @@ -251,13 +262,26 @@ __weak long __pmem_direct_access(struct pmem_device *pmem, pgoff_t pgoff, >> if (pfn) >> *pfn = phys_to_pfn_t(pmem->phys_addr + offset, pmem->pfn_flags); >> >> + if (!bad_in_range) { >> + /* >> + * If badblock is present but not in the range, limit known good range >> + * to the requested range. >> + */ >> + if (bb->count) >> + return nr_pages; >> + return PHYS_PFN(pmem->size - pmem->pfn_pad - offset); >> + } >> + >> /* >> - * If badblocks are present, limit known good range to the >> - * requested range. >> + * In case poison is found in the given range and DAX_RECOVERY flag is set, >> + * recovery stride is set to kernel page size because the underlying driver and >> + * firmware clear poison functions don't appear to handle large chunk (such as >> + * 2MiB) reliably. >> */ >> - if (unlikely(pmem->bb.count)) >> - return nr_pages; >> - return PHYS_PFN(pmem->size - pmem->pfn_pad - offset); >> + actual_nr = PHYS_PFN(PAGE_ALIGN((first_bad - sector) << SECTOR_SHIFT)); >> + dev_dbg(pmem->bb.dev, "start sector(%llu), nr_pages(%ld), first_bad(%llu), actual_nr(%ld)\n", >> + sector, nr_pages, first_bad, actual_nr); >> + return (actual_nr == 0) ? 1 : actual_nr; > > Similar feedback as above that this is more change than I would expect. > > I think just adding... > > if (flags & DAX_RECOVERY) > return 1; > > ...before the typical return path is enough. I plan to use the re-organized code by Christoph - looks cleaner to me. thanks! -jane
On 4/11/2022 10:02 PM, Christoph Hellwig wrote: > On Mon, Apr 11, 2022 at 09:57:36PM -0700, Dan Williams wrote: >> So how about change 'int flags' to 'enum dax_access_mode mode' where >> dax_access_mode is: >> >> /** >> * enum dax_access_mode - operational mode for dax_direct_access() >> * @DAX_ACCESS: nominal access, fail / trim access on encountering poison >> * @DAX_RECOVERY_WRITE: ignore poison and provide a pointer suitable >> for use with dax_recovery_write() >> */ >> enum dax_access_mode { >> DAX_ACCESS, >> DAX_RECOVERY_WRITE, >> }; >> >> Then the conversions look like this: >> >> - rc = dax_direct_access(iter->iomap.dax_dev, pgoff, 1, &kaddr, NULL); >> + rc = dax_direct_access(iter->iomap.dax_dev, pgoff, 1, >> DAX_ACCESS, &kaddr, NULL); >> >> ...and there's less chance of confusion with the @nr_pages argument. > > Yes, this might be a little nicer. Will do. Thanks! -jane
diff --git a/drivers/dax/super.c b/drivers/dax/super.c index 0211e6f7b47a..8252858cd25a 100644 --- a/drivers/dax/super.c +++ b/drivers/dax/super.c @@ -13,6 +13,7 @@ #include <linux/uio.h> #include <linux/dax.h> #include <linux/fs.h> +#include <linux/memremap.h> #include "dax-private.h" /** @@ -117,6 +118,7 @@ enum dax_device_flags { * @dax_dev: a dax_device instance representing the logical memory range * @pgoff: offset in pages from the start of the device to translate * @nr_pages: number of consecutive pages caller can handle relative to @pfn + * @flags: by default 0, set to DAX_RECOVERY to kick start dax recovery * @kaddr: output parameter that returns a virtual address mapping of pfn * @pfn: output parameter that returns an absolute pfn translation of @pgoff * @@ -124,7 +126,7 @@ enum dax_device_flags { * pages accessible at the device relative @pgoff. */ long dax_direct_access(struct dax_device *dax_dev, pgoff_t pgoff, long nr_pages, - void **kaddr, pfn_t *pfn) + int flags, void **kaddr, pfn_t *pfn) { long avail; @@ -137,7 +139,7 @@ long dax_direct_access(struct dax_device *dax_dev, pgoff_t pgoff, long nr_pages, if (nr_pages < 0) return -EINVAL; - avail = dax_dev->ops->direct_access(dax_dev, pgoff, nr_pages, + avail = dax_dev->ops->direct_access(dax_dev, pgoff, nr_pages, flags, kaddr, pfn); if (!avail) return -ERANGE; @@ -194,6 +196,17 @@ int dax_zero_page_range(struct dax_device *dax_dev, pgoff_t pgoff, } EXPORT_SYMBOL_GPL(dax_zero_page_range); +size_t dax_recovery_write(struct dax_device *dax_dev, pgoff_t pgoff, + pfn_t pfn, void *addr, size_t bytes, struct iov_iter *iter) +{ + struct dev_pagemap *pgmap = get_dev_pagemap(pfn_t_to_pfn(pfn), NULL); + + if (!pgmap || !pgmap->ops || !pgmap->ops->recovery_write) + return 0; + return pgmap->ops->recovery_write(pgmap, pgoff, addr, bytes, iter); +} +EXPORT_SYMBOL_GPL(dax_recovery_write); + #ifdef CONFIG_ARCH_HAS_PMEM_API void arch_wb_cache_pmem(void *addr, size_t size); void dax_flush(struct dax_device *dax_dev, void *addr, size_t size) diff --git a/drivers/md/dm-linear.c b/drivers/md/dm-linear.c index 76b486e4d2be..9e6d8bdf3b2a 100644 --- a/drivers/md/dm-linear.c +++ b/drivers/md/dm-linear.c @@ -172,11 +172,11 @@ static struct dax_device *linear_dax_pgoff(struct dm_target *ti, pgoff_t *pgoff) } static long linear_dax_direct_access(struct dm_target *ti, pgoff_t pgoff, - long nr_pages, void **kaddr, pfn_t *pfn) + long nr_pages, int flags, void **kaddr, pfn_t *pfn) { struct dax_device *dax_dev = linear_dax_pgoff(ti, &pgoff); - return dax_direct_access(dax_dev, pgoff, nr_pages, kaddr, pfn); + return dax_direct_access(dax_dev, pgoff, nr_pages, flags, kaddr, pfn); } static int linear_dax_zero_page_range(struct dm_target *ti, pgoff_t pgoff, diff --git a/drivers/md/dm-log-writes.c b/drivers/md/dm-log-writes.c index c9d036d6bb2e..e23f062ade5f 100644 --- a/drivers/md/dm-log-writes.c +++ b/drivers/md/dm-log-writes.c @@ -889,11 +889,12 @@ static struct dax_device *log_writes_dax_pgoff(struct dm_target *ti, } static long log_writes_dax_direct_access(struct dm_target *ti, pgoff_t pgoff, - long nr_pages, void **kaddr, pfn_t *pfn) + long nr_pages, int flags, + void **kaddr, pfn_t *pfn) { struct dax_device *dax_dev = log_writes_dax_pgoff(ti, &pgoff); - return dax_direct_access(dax_dev, pgoff, nr_pages, kaddr, pfn); + return dax_direct_access(dax_dev, pgoff, nr_pages, flags, kaddr, pfn); } static int log_writes_dax_zero_page_range(struct dm_target *ti, pgoff_t pgoff, diff --git a/drivers/md/dm-stripe.c b/drivers/md/dm-stripe.c index c81d331d1afe..b89339c78702 100644 --- a/drivers/md/dm-stripe.c +++ b/drivers/md/dm-stripe.c @@ -315,11 +315,11 @@ static struct dax_device *stripe_dax_pgoff(struct dm_target *ti, pgoff_t *pgoff) } static long stripe_dax_direct_access(struct dm_target *ti, pgoff_t pgoff, - long nr_pages, void **kaddr, pfn_t *pfn) + long nr_pages, int flags, void **kaddr, pfn_t *pfn) { struct dax_device *dax_dev = stripe_dax_pgoff(ti, &pgoff); - return dax_direct_access(dax_dev, pgoff, nr_pages, kaddr, pfn); + return dax_direct_access(dax_dev, pgoff, nr_pages, flags, kaddr, pfn); } static int stripe_dax_zero_page_range(struct dm_target *ti, pgoff_t pgoff, diff --git a/drivers/md/dm-target.c b/drivers/md/dm-target.c index 64dd0b34fcf4..24b1e5628f3a 100644 --- a/drivers/md/dm-target.c +++ b/drivers/md/dm-target.c @@ -142,7 +142,7 @@ static void io_err_release_clone_rq(struct request *clone, } static long io_err_dax_direct_access(struct dm_target *ti, pgoff_t pgoff, - long nr_pages, void **kaddr, pfn_t *pfn) + long nr_pages, int flags, void **kaddr, pfn_t *pfn) { return -EIO; } diff --git a/drivers/md/dm-writecache.c b/drivers/md/dm-writecache.c index 5630b470ba42..180ca8fa383e 100644 --- a/drivers/md/dm-writecache.c +++ b/drivers/md/dm-writecache.c @@ -286,7 +286,8 @@ static int persistent_memory_claim(struct dm_writecache *wc) id = dax_read_lock(); - da = dax_direct_access(wc->ssd_dev->dax_dev, offset, p, &wc->memory_map, &pfn); + da = dax_direct_access(wc->ssd_dev->dax_dev, offset, p, 0, + &wc->memory_map, &pfn); if (da < 0) { wc->memory_map = NULL; r = da; @@ -309,7 +310,7 @@ static int persistent_memory_claim(struct dm_writecache *wc) do { long daa; daa = dax_direct_access(wc->ssd_dev->dax_dev, offset + i, p - i, - NULL, &pfn); + 0, NULL, &pfn); if (daa <= 0) { r = daa ? daa : -EINVAL; goto err3; diff --git a/drivers/md/dm.c b/drivers/md/dm.c index ad2e0bbeb559..a8c697bb6603 100644 --- a/drivers/md/dm.c +++ b/drivers/md/dm.c @@ -1087,7 +1087,8 @@ static struct dm_target *dm_dax_get_live_target(struct mapped_device *md, } static long dm_dax_direct_access(struct dax_device *dax_dev, pgoff_t pgoff, - long nr_pages, void **kaddr, pfn_t *pfn) + long nr_pages, int flags, void **kaddr, + pfn_t *pfn) { struct mapped_device *md = dax_get_private(dax_dev); sector_t sector = pgoff * PAGE_SECTORS; @@ -1105,7 +1106,7 @@ static long dm_dax_direct_access(struct dax_device *dax_dev, pgoff_t pgoff, if (len < 1) goto out; nr_pages = min(len, nr_pages); - ret = ti->type->direct_access(ti, pgoff, nr_pages, kaddr, pfn); + ret = ti->type->direct_access(ti, pgoff, nr_pages, flags, kaddr, pfn); out: dm_put_live_table(md, srcu_idx); diff --git a/drivers/nvdimm/pmem.c b/drivers/nvdimm/pmem.c index 30c71a68175b..0400c5a7ba39 100644 --- a/drivers/nvdimm/pmem.c +++ b/drivers/nvdimm/pmem.c @@ -238,12 +238,23 @@ static int pmem_rw_page(struct block_device *bdev, sector_t sector, /* see "strong" declaration in tools/testing/nvdimm/pmem-dax.c */ __weak long __pmem_direct_access(struct pmem_device *pmem, pgoff_t pgoff, - long nr_pages, void **kaddr, pfn_t *pfn) + long nr_pages, int flags, void **kaddr, pfn_t *pfn) { resource_size_t offset = PFN_PHYS(pgoff) + pmem->data_offset; + sector_t sector = PFN_PHYS(pgoff) >> SECTOR_SHIFT; + unsigned int num = PFN_PHYS(nr_pages) >> SECTOR_SHIFT; + struct badblocks *bb = &pmem->bb; + sector_t first_bad; + int num_bad; + bool bad_in_range; + long actual_nr; + + if (!bb->count) + bad_in_range = false; + else + bad_in_range = !!badblocks_check(bb, sector, num, &first_bad, &num_bad); - if (unlikely(is_bad_pmem(&pmem->bb, PFN_PHYS(pgoff) / 512, - PFN_PHYS(nr_pages)))) + if (bad_in_range && !(flags & DAX_RECOVERY)) return -EIO; if (kaddr) @@ -251,13 +262,26 @@ __weak long __pmem_direct_access(struct pmem_device *pmem, pgoff_t pgoff, if (pfn) *pfn = phys_to_pfn_t(pmem->phys_addr + offset, pmem->pfn_flags); + if (!bad_in_range) { + /* + * If badblock is present but not in the range, limit known good range + * to the requested range. + */ + if (bb->count) + return nr_pages; + return PHYS_PFN(pmem->size - pmem->pfn_pad - offset); + } + /* - * If badblocks are present, limit known good range to the - * requested range. + * In case poison is found in the given range and DAX_RECOVERY flag is set, + * recovery stride is set to kernel page size because the underlying driver and + * firmware clear poison functions don't appear to handle large chunk (such as + * 2MiB) reliably. */ - if (unlikely(pmem->bb.count)) - return nr_pages; - return PHYS_PFN(pmem->size - pmem->pfn_pad - offset); + actual_nr = PHYS_PFN(PAGE_ALIGN((first_bad - sector) << SECTOR_SHIFT)); + dev_dbg(pmem->bb.dev, "start sector(%llu), nr_pages(%ld), first_bad(%llu), actual_nr(%ld)\n", + sector, nr_pages, first_bad, actual_nr); + return (actual_nr == 0) ? 1 : actual_nr; } static const struct block_device_operations pmem_fops = { @@ -277,11 +301,12 @@ static int pmem_dax_zero_page_range(struct dax_device *dax_dev, pgoff_t pgoff, } static long pmem_dax_direct_access(struct dax_device *dax_dev, - pgoff_t pgoff, long nr_pages, void **kaddr, pfn_t *pfn) + pgoff_t pgoff, long nr_pages, int flags, void **kaddr, + pfn_t *pfn) { struct pmem_device *pmem = dax_get_private(dax_dev); - return __pmem_direct_access(pmem, pgoff, nr_pages, kaddr, pfn); + return __pmem_direct_access(pmem, pgoff, nr_pages, flags, kaddr, pfn); } static const struct dax_operations pmem_dax_ops = { @@ -289,6 +314,12 @@ static const struct dax_operations pmem_dax_ops = { .zero_page_range = pmem_dax_zero_page_range, }; +static size_t pmem_recovery_write(struct dev_pagemap *pgmap, pgoff_t pgoff, + void *addr, size_t bytes, void *iter) +{ + return 0; +} + static ssize_t write_cache_show(struct device *dev, struct device_attribute *attr, char *buf) { @@ -349,6 +380,10 @@ static void pmem_release_disk(void *__pmem) blk_cleanup_disk(pmem->disk); } +static const struct dev_pagemap_ops pmem_pgmap_ops = { + .recovery_write = pmem_recovery_write, +}; + static int pmem_attach_disk(struct device *dev, struct nd_namespace_common *ndns) { @@ -380,6 +415,8 @@ static int pmem_attach_disk(struct device *dev, rc = nvdimm_setup_pfn(nd_pfn, &pmem->pgmap); if (rc) return rc; + if (nd_pfn->mode == PFN_MODE_PMEM) + pmem->pgmap.ops = &pmem_pgmap_ops; } /* we're attaching a block device, disable raw namespace access */ diff --git a/drivers/nvdimm/pmem.h b/drivers/nvdimm/pmem.h index 1f51a2361429..e9c53d42c488 100644 --- a/drivers/nvdimm/pmem.h +++ b/drivers/nvdimm/pmem.h @@ -28,7 +28,7 @@ struct pmem_device { }; long __pmem_direct_access(struct pmem_device *pmem, pgoff_t pgoff, - long nr_pages, void **kaddr, pfn_t *pfn); + long nr_pages, int flag, void **kaddr, pfn_t *pfn); #ifdef CONFIG_MEMORY_FAILURE static inline bool test_and_clear_pmem_poison(struct page *page) diff --git a/drivers/s390/block/dcssblk.c b/drivers/s390/block/dcssblk.c index d614843caf6c..c3fbf500868f 100644 --- a/drivers/s390/block/dcssblk.c +++ b/drivers/s390/block/dcssblk.c @@ -32,7 +32,7 @@ static int dcssblk_open(struct block_device *bdev, fmode_t mode); static void dcssblk_release(struct gendisk *disk, fmode_t mode); static void dcssblk_submit_bio(struct bio *bio); static long dcssblk_dax_direct_access(struct dax_device *dax_dev, pgoff_t pgoff, - long nr_pages, void **kaddr, pfn_t *pfn); + long nr_pages, int flags, void **kaddr, pfn_t *pfn); static char dcssblk_segments[DCSSBLK_PARM_LEN] = "\0"; @@ -927,7 +927,7 @@ __dcssblk_direct_access(struct dcssblk_dev_info *dev_info, pgoff_t pgoff, static long dcssblk_dax_direct_access(struct dax_device *dax_dev, pgoff_t pgoff, - long nr_pages, void **kaddr, pfn_t *pfn) + long nr_pages, int flags, void **kaddr, pfn_t *pfn) { struct dcssblk_dev_info *dev_info = dax_get_private(dax_dev); diff --git a/fs/dax.c b/fs/dax.c index 67a08a32fccb..e8900e92990b 100644 --- a/fs/dax.c +++ b/fs/dax.c @@ -721,7 +721,7 @@ static int copy_cow_page_dax(struct vm_fault *vmf, const struct iomap_iter *iter int id; id = dax_read_lock(); - rc = dax_direct_access(iter->iomap.dax_dev, pgoff, 1, &kaddr, NULL); + rc = dax_direct_access(iter->iomap.dax_dev, pgoff, 1, 0, &kaddr, NULL); if (rc < 0) { dax_read_unlock(id); return rc; @@ -1012,7 +1012,7 @@ static int dax_iomap_pfn(const struct iomap *iomap, loff_t pos, size_t size, long length; id = dax_read_lock(); - length = dax_direct_access(iomap->dax_dev, pgoff, PHYS_PFN(size), + length = dax_direct_access(iomap->dax_dev, pgoff, PHYS_PFN(size), 0, NULL, pfnp); if (length < 0) { rc = length; @@ -1122,7 +1122,7 @@ static int dax_memzero(struct dax_device *dax_dev, pgoff_t pgoff, void *kaddr; long ret; - ret = dax_direct_access(dax_dev, pgoff, 1, &kaddr, NULL); + ret = dax_direct_access(dax_dev, pgoff, 1, 0, &kaddr, NULL); if (ret > 0) { memset(kaddr + offset, 0, size); dax_flush(dax_dev, kaddr + offset, size); @@ -1239,15 +1239,24 @@ static loff_t dax_iomap_iter(const struct iomap_iter *iomi, const size_t size = ALIGN(length + offset, PAGE_SIZE); pgoff_t pgoff = dax_iomap_pgoff(iomap, pos); ssize_t map_len; + bool recovery = false; void *kaddr; + long nrpg; + pfn_t pfn; if (fatal_signal_pending(current)) { ret = -EINTR; break; } - map_len = dax_direct_access(dax_dev, pgoff, PHYS_PFN(size), - &kaddr, NULL); + nrpg = PHYS_PFN(size); + map_len = dax_direct_access(dax_dev, pgoff, nrpg, 0, &kaddr, NULL); + if (map_len == -EIO && iov_iter_rw(iter) == WRITE) { + map_len = dax_direct_access(dax_dev, pgoff, nrpg, + DAX_RECOVERY, &kaddr, &pfn); + if (map_len > 0) + recovery = true; + } if (map_len < 0) { ret = map_len; break; @@ -1259,7 +1268,10 @@ static loff_t dax_iomap_iter(const struct iomap_iter *iomi, if (map_len > end - pos) map_len = end - pos; - if (iov_iter_rw(iter) == WRITE) + if (recovery) + xfer = dax_recovery_write(dax_dev, pgoff, pfn, kaddr, + map_len, iter); + else if (iov_iter_rw(iter) == WRITE) xfer = dax_copy_from_iter(dax_dev, pgoff, kaddr, map_len, iter); else diff --git a/fs/fuse/dax.c b/fs/fuse/dax.c index d7d3a7f06862..bf5e40a0707b 100644 --- a/fs/fuse/dax.c +++ b/fs/fuse/dax.c @@ -1241,8 +1241,8 @@ static int fuse_dax_mem_range_init(struct fuse_conn_dax *fcd) INIT_DELAYED_WORK(&fcd->free_work, fuse_dax_free_mem_worker); id = dax_read_lock(); - nr_pages = dax_direct_access(fcd->dev, 0, PHYS_PFN(dax_size), NULL, - NULL); + nr_pages = dax_direct_access(fcd->dev, 0, PHYS_PFN(dax_size), 0, + NULL, NULL); dax_read_unlock(id); if (nr_pages < 0) { pr_debug("dax_direct_access() returned %ld\n", nr_pages); diff --git a/include/linux/dax.h b/include/linux/dax.h index 9fc5f99a0ae2..fc9ee886de89 100644 --- a/include/linux/dax.h +++ b/include/linux/dax.h @@ -14,14 +14,17 @@ struct iomap_ops; struct iomap_iter; struct iomap; +/* Flag to communicate for DAX recovery operation */ +#define DAX_RECOVERY 0x1 + struct dax_operations { /* * direct_access: translate a device-relative * logical-page-offset into an absolute physical pfn. Return the * number of pages available for DAX at that pfn. */ - long (*direct_access)(struct dax_device *, pgoff_t, long, - void **, pfn_t *); + long (*direct_access)(struct dax_device *dax_dev, pgoff_t pgoff, + long nr_pages, int flags, void **kaddr, pfn_t *pfn); /* * Validate whether this device is usable as an fsdax backing * device. @@ -40,6 +43,8 @@ void dax_write_cache(struct dax_device *dax_dev, bool wc); bool dax_write_cache_enabled(struct dax_device *dax_dev); bool dax_synchronous(struct dax_device *dax_dev); void set_dax_synchronous(struct dax_device *dax_dev); +size_t dax_recovery_write(struct dax_device *dax_dev, pgoff_t pgoff, pfn_t pfn, + void *addr, size_t bytes, struct iov_iter *i); /* * Check if given mapping is supported by the file / underlying device. */ @@ -178,7 +183,7 @@ static inline void dax_read_unlock(int id) bool dax_alive(struct dax_device *dax_dev); void *dax_get_private(struct dax_device *dax_dev); long dax_direct_access(struct dax_device *dax_dev, pgoff_t pgoff, long nr_pages, - void **kaddr, pfn_t *pfn); + int flags, void **kaddr, pfn_t *pfn); size_t dax_copy_from_iter(struct dax_device *dax_dev, pgoff_t pgoff, void *addr, size_t bytes, struct iov_iter *i); size_t dax_copy_to_iter(struct dax_device *dax_dev, pgoff_t pgoff, void *addr, diff --git a/include/linux/device-mapper.h b/include/linux/device-mapper.h index c2a3758c4aaa..45ad013294a3 100644 --- a/include/linux/device-mapper.h +++ b/include/linux/device-mapper.h @@ -146,7 +146,7 @@ typedef int (*dm_busy_fn) (struct dm_target *ti); * >= 0 : the number of bytes accessible at the address */ typedef long (*dm_dax_direct_access_fn) (struct dm_target *ti, pgoff_t pgoff, - long nr_pages, void **kaddr, pfn_t *pfn); + long nr_pages, int flags, void **kaddr, pfn_t *pfn); typedef int (*dm_dax_zero_page_range_fn)(struct dm_target *ti, pgoff_t pgoff, size_t nr_pages); diff --git a/include/linux/memremap.h b/include/linux/memremap.h index 8af304f6b504..79a170cb49ef 100644 --- a/include/linux/memremap.h +++ b/include/linux/memremap.h @@ -79,6 +79,13 @@ struct dev_pagemap_ops { * the page back to a CPU accessible page. */ vm_fault_t (*migrate_to_ram)(struct vm_fault *vmf); + + /* + * Used for FS DAX device only. For synchronous recovery from DAX media + * encountering Uncorrectable Error. + */ + size_t (*recovery_write)(struct dev_pagemap *pgmap, pgoff_t pgoff, + void *addr, size_t bytes, void *iter); }; #define PGMAP_ALTMAP_VALID (1 << 0) diff --git a/tools/testing/nvdimm/pmem-dax.c b/tools/testing/nvdimm/pmem-dax.c index af19c85558e7..287db5e3e293 100644 --- a/tools/testing/nvdimm/pmem-dax.c +++ b/tools/testing/nvdimm/pmem-dax.c @@ -8,7 +8,7 @@ #include <nd.h> long __pmem_direct_access(struct pmem_device *pmem, pgoff_t pgoff, - long nr_pages, void **kaddr, pfn_t *pfn) + long nr_pages, int flags, void **kaddr, pfn_t *pfn) { resource_size_t offset = PFN_PHYS(pgoff) + pmem->data_offset;
Introduce DAX_RECOVERY flag to dax_direct_access(). The flag is not set by default in dax_direct_access() such that the helper does not translate a pmem range to kernel virtual address if the range contains uncorrectable errors. When the flag is set, the helper ignores the UEs and return kernel virtual adderss so that the caller may get on with data recovery via write. Also introduce a new dev_pagemap_ops .recovery_write function. The function is applicable to FSDAX device only. The device page backend driver provides .recovery_write function if the device has underlying mechanism to clear the uncorrectable errors on the fly. Signed-off-by: Jane Chu <jane.chu@oracle.com> --- drivers/dax/super.c | 17 ++++++++-- drivers/md/dm-linear.c | 4 +-- drivers/md/dm-log-writes.c | 5 +-- drivers/md/dm-stripe.c | 4 +-- drivers/md/dm-target.c | 2 +- drivers/md/dm-writecache.c | 5 +-- drivers/md/dm.c | 5 +-- drivers/nvdimm/pmem.c | 57 +++++++++++++++++++++++++++------ drivers/nvdimm/pmem.h | 2 +- drivers/s390/block/dcssblk.c | 4 +-- fs/dax.c | 24 ++++++++++---- fs/fuse/dax.c | 4 +-- include/linux/dax.h | 11 +++++-- include/linux/device-mapper.h | 2 +- include/linux/memremap.h | 7 ++++ tools/testing/nvdimm/pmem-dax.c | 2 +- 16 files changed, 116 insertions(+), 39 deletions(-)