Message ID: 20201123004116.2453-1-ruansy.fnst@cn.fujitsu.com (mailing list archive)
Series: fsdax: introduce fs query to support reflink
On Mon, Nov 23, 2020 at 08:41:10AM +0800, Shiyang Ruan wrote:
> This patchset is an attempt to resolve the problem of tracking shared
> pages for fsdax.
>
> Changes from v1:
>   - Introduce ->block_lost() for block devices
>   - Support mapped devices
>   - Add a 'not available' warning for the realtime device in XFS
>   - Rebased to v5.10-rc1
>
> This patchset moves owner tracking from dax_associate_entry() to the
> pmem device by introducing an interface, ->memory_failure(), in
> struct dev_pagemap. The interface is called by memory_failure() in mm
> and implemented by the pmem device. The pmem device then calls its
> ->block_lost() to find the filesystem in which the damaged page is
> located, and calls ->storage_lost() to track the files or metadata
> associated with this page. Finally we are able to try to fix the
> damaged data in the filesystem and do other necessary processing,
> such as killing the processes that are using the affected files.
>
> The call trace is like this:
> memory_failure()
>  pgmap->ops->memory_failure()   => pmem_pgmap_memory_failure()
>   gendisk->fops->block_lost()   => pmem_block_lost() or
>                                    md_blk_block_lost()
>    sb->s_ops->storage_lost()    => xfs_fs_storage_lost()
>     xfs_rmap_query_range()
>      xfs_storage_lost_helper()
>       mf_recover_controller->recover_fn =>
>        memory_failure_dev_pagemap_kill_procs()
>
> The collect_procs() and kill_procs() calls are moved into a callback
> which is passed from memory_failure() to xfs_storage_lost_helper(),
> so we can call it when an associated file is found, instead of
> creating a file list and iterating over it.
>
> The fsdax & reflink support for XFS is not contained in this patchset.

This looks promising - the overall architecture is a lot more generic
and less dependent on knowing about memory, dax or memory failures. A
few comments that I think would further improve understanding of the
patchset and the implementation:

- the order of the patches is inverted. It should start with a single
  patch introducing the mf_recover_controller structure for callbacks,
  then introduce pgmap->ops->memory_failure, then ->block_lost, then
  the pmem and md implementations of ->block_lost, then ->storage_lost
  and the XFS implementation of ->storage_lost.

- I think the names "block_lost" and "storage_lost" are misleading.
  It's more like a "media failure" or a general "data corruption"
  event at a specific physical location. The data may not be "lost"
  but only damaged, so we might be able to recover from it without
  "losing" anything. Hence I think they could be better named, perhaps
  just "->corrupt_range".

- need to pass an {offset, len} pair through the chain, not just a
  single offset. This will allow other types of devices to report
  different ranges of failures, from a single sector to an entire
  device.

- I'm not sure that passing the mf_recover_controller structure
  through the corruption event chain is the right thing to do here. A
  block device could generate this storage failure callback if it
  detects an unrecoverable error (e.g. during a MD media scrub or
  rebuild/resilver failure), and in that case we don't have PFNs or
  memory device failure functions to perform.

  IOWs, I think the action that is taken needs to be independent of
  the source that generated the error. Even for a pmem device, we can
  be using the page cache, so it may be possible to recover the pmem
  error by writing the cached page (if it exists) back over the pmem.

  Hence I think that the recovery function probably needs to be moved
  to the address space ops, because what we do to recover from the
  error is going to be dependent on the type of mapping the filesystem
  is using. If it's a DAX mapping, we call back into a generic DAX
  function that does the vma walk and process kill functions. If it is
  a page cache mapping, then if the page is cached we can try to
  re-write it to disk to fix the bad data; otherwise we treat it like
  a writeback error and report it on the next write/fsync/close
  operation done on that file.

  This gets rid of the mf_recover_controller altogether and allows the
  interface to be used by any sort of block device for any sort of
  bottom-up reporting of media/device failures.

Cheers,

Dave.
Hi Dave,

On 2020/11/30 6:47 am, Dave Chinner wrote:
> On Mon, Nov 23, 2020 at 08:41:10AM +0800, Shiyang Ruan wrote:
>>
>> The call trace is like this:
>> memory_failure()
>>  pgmap->ops->memory_failure()   => pmem_pgmap_memory_failure()
>>   gendisk->fops->block_lost()   => pmem_block_lost() or
>>                                    md_blk_block_lost()
>>    sb->s_ops->storage_lost()    => xfs_fs_storage_lost()
>>     xfs_rmap_query_range()
>>      xfs_storage_lost_helper()
>>       mf_recover_controller->recover_fn =>
>>        memory_failure_dev_pagemap_kill_procs()
>>
>> The collect_procs() and kill_procs() calls are moved into a callback
>> which is passed from memory_failure() to xfs_storage_lost_helper(),
>> so we can call it when an associated file is found, instead of
>> creating a file list and iterating over it.
>>
>> The fsdax & reflink support for XFS is not contained in this patchset.
>
> This looks promising - the overall architecture is a lot more generic
> and less dependent on knowing about memory, dax or memory failures. A
> few comments that I think would further improve understanding of the
> patchset and the implementation:

Thanks for your kind comments. They give me confidence.

> - the order of the patches is inverted. It should start with a single
>   patch introducing the mf_recover_controller structure for
>   callbacks, then introduce pgmap->ops->memory_failure, then
>   ->block_lost, then the pmem and md implementations of ->block_lost,
>   then ->storage_lost and the XFS implementation of ->storage_lost.

Yes, it will be easier to understand the patchset in this order.

But there is something I am unsure about: for example, I introduce
->memory_failure() first, but the implementation of ->memory_failure()
needs to call ->block_lost(), which is supposed to be introduced in the
next patch. So, what is the implementation of ->memory_failure() in
pmem supposed to contain at that point? To avoid this situation, I
committed the patches in the inverted order: lowest level first, then
its caller, and then the caller's caller.

I am trying to sort out the order. How about this:

Patch i:
    Introduce ->memory_failure()
      - just introduce the interface, without an implementation
Patch i+1:
    Introduce ->block_lost()
      - introduce the interface and implement ->memory_failure() in
        pmem, so that it can call ->block_lost()
Patch i+2:
    (similar to the above, skipped...)

> - I think the names "block_lost" and "storage_lost" are misleading.
>   It's more like a "media failure" or a general "data corruption"
>   event at a specific physical location. The data may not be "lost"
>   but only damaged, so we might be able to recover from it without
>   "losing" anything. Hence I think they could be better named,
>   perhaps just "->corrupt_range".

'corrupt' sounds better. (I'm not good at naming functions...)

> - need to pass an {offset, len} pair through the chain, not just a
>   single offset. This will allow other types of devices to report
>   different ranges of failures, from a single sector to an entire
>   device.

Yes, it's better to add the length. I had restrictively assumed that a
memory failure on pmem affects only a single page at a time.

> - I'm not sure that passing the mf_recover_controller structure
>   through the corruption event chain is the right thing to do here.
>   A block device could generate this storage failure callback if it
>   detects an unrecoverable error (e.g. during a MD media scrub or
>   rebuild/resilver failure), and in that case we don't have PFNs or
>   memory device failure functions to perform.
>
>   IOWs, I think the action that is taken needs to be independent of
>   the source that generated the error. Even for a pmem device, we
>   can be using the page cache, so it may be possible to recover the
>   pmem error by writing the cached page (if it exists) back over the
>   pmem.
>
>   Hence I think that the recovery function probably needs to be
>   moved to the address space ops, because what we do to recover from
>   the error is going to be dependent on the type of mapping the
>   filesystem is using. If it's a DAX mapping, we call back into a
>   generic DAX function that does the vma walk and process kill
>   functions. If it is a page cache mapping, then if the page is
>   cached we can try to re-write it to disk to fix the bad data;
>   otherwise we treat it like a writeback error and report it on the
>   next write/fsync/close operation done on that file.
>
>   This gets rid of the mf_recover_controller altogether and allows
>   the interface to be used by any sort of block device for any sort
>   of bottom-up reporting of media/device failures.

Moving the recovery function to the address_space ops looks like a
better idea. But I think the error handling for the page cache mapping
case is already done well in memory-failure, and the memory-failure
code is also reused to handle anonymous pages. If we move the recovery
function to the address_space ops, I think we also need to refactor the
existing handler for page cache mappings, which may affect anonymous
page handling. This makes me confused...

I rewrote the call trace:

memory_failure()
 * dax mapping case
 pgmap->ops->memory_failure()          => pmem_pgmap_memory_failure()
  gendisk->fops->block_corrupt_range() =>
      - pmem_block_corrupt_range()
      - md_blk_block_corrupt_range()
   sb->s_ops->storage_corrupt_range()  => xfs_fs_storage_corrupt_range()
    xfs_rmap_query_range()
     xfs_storage_lost_helper()
      mapping->a_ops->corrupt_range()  => xfs_dax_aops.xfs_dax_corrupt_range
       memory_failure_dev_pagemap_kill_procs()

 * page cache mapping case
 mapping->a_ops->corrupt_range()       => xfs_address_space_operations.xfs_xxx
  memory_failure_generic_kill_procs()

It's rough and not complete yet. Hope for your comments.
On Wed, Dec 02, 2020 at 03:12:20PM +0800, Ruan Shiyang wrote:
> Hi Dave,
>
> On 2020/11/30 6:47 am, Dave Chinner wrote:
> > On Mon, Nov 23, 2020 at 08:41:10AM +0800, Shiyang Ruan wrote:
> > >
> > > The call trace is like this:
> > > memory_failure()
> > >  pgmap->ops->memory_failure()   => pmem_pgmap_memory_failure()
> > >   gendisk->fops->block_lost()   => pmem_block_lost() or
> > >                                    md_blk_block_lost()
> > >    sb->s_ops->storage_lost()    => xfs_fs_storage_lost()
> > >     xfs_rmap_query_range()
> > >      xfs_storage_lost_helper()
> > >       mf_recover_controller->recover_fn =>
> > >        memory_failure_dev_pagemap_kill_procs()
> > >
> > > The collect_procs() and kill_procs() calls are moved into a
> > > callback which is passed from memory_failure() to
> > > xfs_storage_lost_helper(), so we can call it when an associated
> > > file is found, instead of creating a file list and iterating
> > > over it.
> > >
> > > The fsdax & reflink support for XFS is not contained in this
> > > patchset.
> >
> > This looks promising - the overall architecture is a lot more
> > generic and less dependent on knowing about memory, dax or memory
> > failures. A few comments that I think would further improve
> > understanding of the patchset and the implementation:
>
> Thanks for your kind comments. They give me confidence.
>
> > - the order of the patches is inverted. It should start with a
> >   single patch introducing the mf_recover_controller structure for
> >   callbacks, then introduce pgmap->ops->memory_failure, then
> >   ->block_lost, then the pmem and md implementations of
> >   ->block_lost, then ->storage_lost and the XFS implementation of
> >   ->storage_lost.
>
> Yes, it will be easier to understand the patchset in this order.
>
> But there is something I am unsure about: for example, I introduce
> ->memory_failure() first, but the implementation of
> ->memory_failure() needs to call ->block_lost(), which is supposed
> to be introduced in the next patch. So, what is the implementation
> of ->memory_failure() in pmem supposed to contain at that point? To
> avoid this situation, I committed the patches in the inverted order:
> lowest level first, then its caller, and then the caller's caller.

Well, there's two things here. The first is the infrastructure, the
second is the drivers that use the infrastructure. You can introduce a
method in one patch, and then the driver that uses it in another. Or
you can introduce a driver skeleton that does nothing until more
infrastructure is added. So...

> I am trying to sort out the order. How about this:
> Patch i:
>     Introduce ->memory_failure()
>       - just introduce the interface, without an implementation
> Patch i+1:
>     Introduce ->block_lost()
>       - introduce the interface and implement ->memory_failure() in
>         pmem, so that it can call ->block_lost()
> Patch i+2:
>     (similar to the above, skipped...)

So this is better, but you don't need to add the pmem driver use of
->block_lost in the patch that adds the method. IOWs, something like:

P1: introduce the ->memory_failure API, all the required
    documentation, and add the call sites in the infrastructure that
    trigger it
P2: introduce ->corrupted_range to the block device API, all the
    required documentation, and any generic block infrastructure that
    needs to call it
P3: introduce ->corrupted_range to the superblock ops API, plus all
    the required documentation
P4: add the ->corrupted_range() API to the address space ops, plus
    all the required documentation
P5: factor the existing kill-procs code so that it can be called via
    generic_mapping_kill_range()
P6: add dax_mapping_kill_range()
P7: add the pmem driver support for ->memory_failure
P8: add the block device driver support for ->corrupted_range
P9: add filesystem support for sb_ops->corrupted_range
P10: add filesystem support for aops->corrupted_range
> > This gets rid of the mf_recover_controller altogether and allows
> > the interface to be used by any sort of block device for any sort
> > of bottom-up reporting of media/device failures.
>
> Moving the recovery function to the address_space ops looks like a
> better idea. But I think the error handling for the page cache
> mapping case is already done well in memory-failure, and the
> memory-failure code is also reused to handle anonymous pages.

Yes, anonymous page handling can remain there; we're only concerned
about handling file-mapped pages here right now. If we end up sharing
page cache pages across reflink mappings, we'll have exactly the same
issue we have now with DAX....

> If we move the recovery function to the address_space ops, I think
> we also need to refactor the existing handler for page cache
> mappings, which may affect anonymous page handling. This makes me
> confused...

Make the handling of the page the error occurred in conditional on
!PageAnon().

> I rewrote the call trace:
> memory_failure()
>  * dax mapping case
>  pgmap->ops->memory_failure()          => pmem_pgmap_memory_failure()
>   gendisk->fops->block_corrupt_range() =>
>       - pmem_block_corrupt_range()
>       - md_blk_block_corrupt_range()
>    sb->s_ops->storage_corrupt_range()  => xfs_fs_storage_corrupt_range()

No need for block/storage prefixes in these...

>     xfs_rmap_query_range()
>      xfs_storage_lost_helper()
>       mapping->a_ops->corrupt_range()  => xfs_dax_aops.xfs_dax_corrupt_range
>        memory_failure_dev_pagemap_kill_procs()

This assumes we find a user data mapping. We might find that the
corrupted storage contained metadata, in which case we'll be shutting
down the filesystem, not trying to kill user procs...

Also, we don't need aops->corrupt_range() here, as we are already in
the filesystem code, and if we find a mapping in memory we can just do
"if (IS_DAX(mapping->host))" to call the right kill-procs
implementation for the mapping we've found. aops->corrupt_range is for
generic code working on a mapping to inform the filesystem that there
is a bad range in the mapping (my apologies for getting that all mixed
up in my last email).

>  * page cache mapping case
>  mapping->a_ops->corrupt_range()       => xfs_address_space_operations.xfs_xxx
>   memory_failure_generic_kill_procs()

We need aops->corrupted_range() to call into the filesystem so it can
do a similar reverse mapping lookup to sb->s_ops->corrupted_range.
Yes, the page cache should already have a mapping attached to the
page, but we do not know whether it is the only mapping that exists
for that page, e.g. if/when we implement multiply-mapped shared
read-only reflink pages in the page cache, which results in the same
problem we have with DAX pages right now.

Overall, though, it seems like you're on the right path. :)

Cheers,

Dave.
Hi, Shiyang,

On 11/22/2020 4:41 PM, Shiyang Ruan wrote:
> This patchset is an attempt to resolve the problem of tracking shared
> pages for fsdax.
>
> Changes from v1:
>   - Introduce ->block_lost() for block devices
>   - Support mapped devices
>   - Add a 'not available' warning for the realtime device in XFS
>   - Rebased to v5.10-rc1
>
> This patchset moves owner tracking from dax_associate_entry() to the
> pmem device by introducing an interface, ->memory_failure(), in
> struct dev_pagemap. The interface is called by memory_failure() in mm
> and implemented by the pmem device. The pmem device then calls its
> ->block_lost() to find the filesystem in which the damaged page is
> located, and calls ->storage_lost() to track the files or metadata
> associated with this page. Finally we are able to try to fix the
> damaged data in the filesystem and do

Does that mean clearing poison? If so, would you mind elaborating on
which change specifically does that?

Thanks!
-jane

> other necessary processing, such as killing the processes that are
> using the affected files.
Hi Jane,

On 2020/12/15 4:58 am, Jane Chu wrote:
> Hi, Shiyang,
>
> On 11/22/2020 4:41 PM, Shiyang Ruan wrote:
>> This patchset is an attempt to resolve the problem of tracking
>> shared pages for fsdax.
>> [...]
>> Finally we are able to try to fix the damaged data in the
>> filesystem and do
>
> Does that mean clearing poison? If so, would you mind elaborating on
> which change specifically does that?

Recovering data for the filesystem (or pmem device) has not been done
in this patchset... I just trigger the handler for the files sharing
the corrupted page here.

--
Thanks,
Ruan Shiyang.

> Thanks!
> -jane
>
>> other necessary processing, such as killing the processes that are
>> using the affected files.
On 12/15/2020 3:58 AM, Ruan Shiyang wrote:
> Hi Jane,
>
> On 2020/12/15 4:58 am, Jane Chu wrote:
>> Hi, Shiyang,
>>
>> On 11/22/2020 4:41 PM, Shiyang Ruan wrote:
>>> This patchset is an attempt to resolve the problem of tracking
>>> shared pages for fsdax.
>>> [...]
>>> Finally we are able to try to fix the damaged data in the
>>> filesystem and do
>>
>> Does that mean clearing poison? If so, would you mind elaborating
>> on which change specifically does that?
>
> Recovering data for the filesystem (or pmem device) has not been
> done in this patchset... I just trigger the handler for the files
> sharing the corrupted page here.

Thanks! That confirms my understanding.

With the framework provided by the patchset, how do you envision it
easing/simplifying poison recovery from the user's perspective? And
how does it help in dealing with page faults upon a poisoned dax page?

Thanks!
-jane
On Tue, Dec 15, 2020 at 11:05:07AM -0800, Jane Chu wrote:
> On 12/15/2020 3:58 AM, Ruan Shiyang wrote:
> > Hi Jane,
> >
> > On 2020/12/15 4:58 am, Jane Chu wrote:
> > > Hi, Shiyang,
> > >
> > > On 11/22/2020 4:41 PM, Shiyang Ruan wrote:
> > > > This patchset is an attempt to resolve the problem of tracking
> > > > shared pages for fsdax.
> > > > [...]
> > > > Finally we are able to try to fix the damaged data in the
> > > > filesystem and do
> > >
> > > Does that mean clearing poison? If so, would you mind
> > > elaborating on which change specifically does that?
> >
> > Recovering data for the filesystem (or pmem device) has not been
> > done in this patchset... I just trigger the handler for the files
> > sharing the corrupted page here.
>
> Thanks! That confirms my understanding.
>
> With the framework provided by the patchset, how do you envision it
> easing/simplifying poison recovery from the user's perspective?

At the moment, I'd say no change whatsoever. The behaviour is
necessary so that we can kill whatever user application maps
multiply-shared physical blocks if there's a memory error. The
recovery method from that is unchanged. The only advantage may be
that the filesystem (if rmap is enabled) can tell you the exact file
and offset into the file where the data was corrupted.

However, it can be worse, too: it may also now completely shut down
the filesystem if the filesystem discovers the error is in metadata
rather than user data. That's much more complex to recover from, and
right now will require downtime to take the filesystem offline and
run fsck to correct the error. That may trash whatever the
unrecoverable metadata points to, so you still have a user data
recovery process to perform after this...

> And how does it help in dealing with page faults upon a poisoned
> dax page?

It doesn't. If the page is poisoned, the same behaviour will occur as
does now. This is simply error reporting infrastructure, not error
handling.

Future work might change how we correct the faults found in the
storage, but I think the user-visible behaviour is going to be "kill
apps mapping corrupted data" for a long time yet....

Cheers,

Dave.
On Wed, Dec 16, 2020 at 10:10:22AM +1100, Dave Chinner wrote:
> On Tue, Dec 15, 2020 at 11:05:07AM -0800, Jane Chu wrote:
> > On 12/15/2020 3:58 AM, Ruan Shiyang wrote:
> > > Hi Jane,
> > >
> > > On 2020/12/15 4:58 am, Jane Chu wrote:
> > > > Hi, Shiyang,
> > > >
> > > > On 11/22/2020 4:41 PM, Shiyang Ruan wrote:
> > > > > This patchset is an attempt to resolve the problem of
> > > > > tracking shared pages for fsdax.
> > > > > [...]
> > > > > Finally we are able to try to fix the damaged data in the
> > > > > filesystem and do
> > > >
> > > > Does that mean clearing poison? If so, would you mind
> > > > elaborating on which change specifically does that?
> > >
> > > Recovering data for the filesystem (or pmem device) has not been
> > > done in this patchset... I just trigger the handler for the
> > > files sharing the corrupted page here.
> >
> > Thanks! That confirms my understanding.
> >
> > With the framework provided by the patchset, how do you envision
> > it easing/simplifying poison recovery from the user's perspective?
>
> At the moment, I'd say no change whatsoever. The behaviour is
> necessary so that we can kill whatever user application maps
> multiply-shared physical blocks if there's a memory error. The
> recovery method from that is unchanged. The only advantage may be
> that the filesystem (if rmap is enabled) can tell you the exact file
> and offset into the file where the data was corrupted.
>
> However, it can be worse, too: it may also now completely shut down
> the filesystem if the filesystem discovers the error is in metadata
> rather than user data. That's much more complex to recover from, and
> right now will require downtime to take the filesystem offline and
> run fsck to correct the error. That may trash whatever the
> unrecoverable metadata points to, so you still have a user data
> recovery process to perform after this...

...though for the future I'd like to bypass the default behaviors if
there's somebody watching the sb notification that will also kick off
the appropriate repair activities. The xfs auto-repair parts are
coming along nicely. Dunno about userspace, though I figure if we can
do userspace page faults then some people could probably do
autorepair too.

--D

> > And how does it help in dealing with page faults upon a poisoned
> > dax page?
>
> It doesn't. If the page is poisoned, the same behaviour will occur
> as does now. This is simply error reporting infrastructure, not
> error handling.
>
> Future work might change how we correct the faults found in the
> storage, but I think the user-visible behaviour is going to be "kill
> apps mapping corrupted data" for a long time yet....
>
> Cheers,
>
> Dave.
> --
> Dave Chinner
> david@fromorbit.com