Message ID: 20200427084750.136031-1-ruansy.fnst@cn.fujitsu.com (mailing list archive)
Series: dax: Add a dax-rmap tree to support reflink
On Mon, Apr 27, 2020 at 04:47:42PM +0800, Shiyang Ruan wrote:
> This patchset is a try to resolve the shared 'page cache' problem for
> fsdax.
>
> In order to track multiple mappings and indexes on one page, I
> introduced a dax-rmap rb-tree to manage the relationship. A dax entry
> will be associated more than once if is shared. At the second time we
> associate this entry, we create this rb-tree and store its root in
> page->private(not used in fsdax). Insert (->mapping, ->index) when
> dax_associate_entry() and delete it when dax_disassociate_entry().

Do we really want to track all of this on a per-page basis? I would
have thought a per-extent basis was more useful. Essentially, create
a new address_space for each shared extent. Per page just seems like
a huge overhead.
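[Editor's note: to make the scheme under discussion concrete, here is a
minimal userspace model of the per-page reverse map being proposed. The
names and the list (standing in for the rb-tree rooted at page->private)
are illustrative only, not the patchset's code.]

```c
#include <assert.h>
#include <stdlib.h>

/* Hypothetical model: each shared fsdax page keeps a set of
 * (mapping, index) pairs, one per file+offset that maps it. */
struct dax_owner {
	void *mapping;            /* stand-in for struct address_space * */
	unsigned long index;      /* page offset within that mapping */
	struct dax_owner *next;
};

struct dax_page {
	struct dax_owner *owners; /* NULL until the page is shared */
	int nr_owners;
};

/* dax_associate_entry() analogue: record one more owner of the page */
static void page_associate(struct dax_page *pg, void *mapping,
			   unsigned long index)
{
	struct dax_owner *o = malloc(sizeof(*o));
	o->mapping = mapping;
	o->index = index;
	o->next = pg->owners;
	pg->owners = o;
	pg->nr_owners++;
}

/* dax_disassociate_entry() analogue: drop one (mapping, index) pair */
static void page_disassociate(struct dax_page *pg, void *mapping,
			      unsigned long index)
{
	struct dax_owner **p = &pg->owners;
	while (*p) {
		if ((*p)->mapping == mapping && (*p)->index == index) {
			struct dax_owner *victim = *p;
			*p = victim->next;
			free(victim);
			pg->nr_owners--;
			return;
		}
		p = &(*p)->next;
	}
}
```

Matthew's objection is precisely that every shared page carries this
per-page bookkeeping, whereas a per-extent or callout-based scheme pays
no per-page cost at all.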
On 2020/4/27 20:28:36, "Matthew Wilcox" <willy@infradead.org> wrote:
>On Mon, Apr 27, 2020 at 04:47:42PM +0800, Shiyang Ruan wrote:
>> This patchset is a try to resolve the shared 'page cache' problem for
>> fsdax.
>>
>> In order to track multiple mappings and indexes on one page, I
>> introduced a dax-rmap rb-tree to manage the relationship. A dax entry
>> will be associated more than once if is shared. At the second time we
>> associate this entry, we create this rb-tree and store its root in
>> page->private(not used in fsdax). Insert (->mapping, ->index) when
>> dax_associate_entry() and delete it when dax_disassociate_entry().
>
>Do we really want to track all of this on a per-page basis? I would
>have thought a per-extent basis was more useful. Essentially, create
>a new address_space for each shared extent. Per page just seems like
>a huge overhead.
>
Per-extent tracking is a nice idea for me. I haven't thought of it
yet...

But the extent info is maintained by filesystem. I think we need a way
to obtain this info from FS when associating a page. May be a bit
complicated. Let me think about it...

--
Thanks,
Ruan Shiyang.
On Tue, Apr 28, 2020 at 06:09:47AM +0000, Ruan, Shiyang wrote:
>
> On 2020/4/27 20:28:36, "Matthew Wilcox" <willy@infradead.org> wrote:
>
> >On Mon, Apr 27, 2020 at 04:47:42PM +0800, Shiyang Ruan wrote:
> >> This patchset is a try to resolve the shared 'page cache' problem for
> >> fsdax.
> >>
> >> In order to track multiple mappings and indexes on one page, I
> >> introduced a dax-rmap rb-tree to manage the relationship. A dax entry
> >> will be associated more than once if is shared. At the second time we
> >> associate this entry, we create this rb-tree and store its root in
> >> page->private(not used in fsdax). Insert (->mapping, ->index) when
> >> dax_associate_entry() and delete it when dax_disassociate_entry().
> >
> >Do we really want to track all of this on a per-page basis? I would
> >have thought a per-extent basis was more useful. Essentially, create
> >a new address_space for each shared extent. Per page just seems like
> >a huge overhead.
> >
> Per-extent tracking is a nice idea for me. I haven't thought of it
> yet...
>
> But the extent info is maintained by filesystem. I think we need a way
> to obtain this info from FS when associating a page. May be a bit
> complicated. Let me think about it...

That's why I want the -user of this association- to do a filesystem
callout instead of keeping its own naive tracking infrastructure.
The filesystem can do an efficient, on-demand reverse mapping lookup
from its own extent tracking infrastructure, and there's zero
runtime overhead when there are no errors present.

At the moment, this "dax association" is used to "report" a storage
media error directly to userspace. I say "report" because what it
does is kill userspace processes dead. The storage media error
actually needs to be reported to the owner of the storage media,
which in the case of FS-DAX is the filesystem.

That way the filesystem can then look up all the owners of that bad
media range (i.e. the filesystem block it corresponds to) and take
appropriate action. e.g.

- if it falls in filesystem metadata, shut down the filesystem
- if it falls in user data, call the "kill userspace dead" routines
  for each mapping/index tuple the filesystem finds for the given
  LBA address that the media error occurred at.

Right now if the media error is in filesystem metadata, the
filesystem isn't even told about it. The filesystem can't even shut
down - the error is just dropped on the floor and it won't be until
the filesystem next tries to reference that metadata that we notice
there is an issue.

Cheers,

Dave.
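[Editor's note: Dave's two-way classification above can be sketched as a
small decision function. All names here are hypothetical illustrations,
not an interface from any posted patch.]

```c
#include <assert.h>
#include <stddef.h>

enum owner_type { OWNER_NONE, OWNER_METADATA, OWNER_USER_DATA };

enum mf_action { ACT_NOTHING, ACT_SHUTDOWN_FS, ACT_KILL_MAPPERS };

/* Hypothetical result of a filesystem rmap lookup for a bad LBA range */
struct media_error_owner {
	enum owner_type type;
	void *mapping;          /* valid when type == OWNER_USER_DATA */
	unsigned long index;
};

/* Classify per the outline above:
 *  - filesystem metadata -> shut the filesystem down
 *  - user data           -> kill processes mapping (mapping, index)
 *  - no owner            -> nothing maps the bad range */
static enum mf_action media_error_action(const struct media_error_owner *o)
{
	switch (o->type) {
	case OWNER_METADATA:  return ACT_SHUTDOWN_FS;
	case OWNER_USER_DATA: return ACT_KILL_MAPPERS;
	default:              return ACT_NOTHING;
	}
}
```

The point of the callout design is that this classification lives in the
filesystem, which already owns the extent and rmap metadata needed to
answer "who owns this LBA?".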
On 2020/4/28 2:43 PM, Dave Chinner wrote:
> On Tue, Apr 28, 2020 at 06:09:47AM +0000, Ruan, Shiyang wrote:
>>
>> On 2020/4/27 20:28:36, "Matthew Wilcox" <willy@infradead.org> wrote:
>>
>>> On Mon, Apr 27, 2020 at 04:47:42PM +0800, Shiyang Ruan wrote:
>>>> This patchset is a try to resolve the shared 'page cache' problem for
>>>> fsdax.
>>>>
>>>> In order to track multiple mappings and indexes on one page, I
>>>> introduced a dax-rmap rb-tree to manage the relationship. A dax entry
>>>> will be associated more than once if is shared. At the second time we
>>>> associate this entry, we create this rb-tree and store its root in
>>>> page->private(not used in fsdax). Insert (->mapping, ->index) when
>>>> dax_associate_entry() and delete it when dax_disassociate_entry().
>>>
>>> Do we really want to track all of this on a per-page basis? I would
>>> have thought a per-extent basis was more useful. Essentially, create
>>> a new address_space for each shared extent. Per page just seems like
>>> a huge overhead.
>>>
>> Per-extent tracking is a nice idea for me. I haven't thought of it
>> yet...
>>
>> But the extent info is maintained by filesystem. I think we need a way
>> to obtain this info from FS when associating a page. May be a bit
>> complicated. Let me think about it...
>
> That's why I want the -user of this association- to do a filesystem
> callout instead of keeping it's own naive tracking infrastructure.
> The filesystem can do an efficient, on-demand reverse mapping lookup
> from it's own extent tracking infrastructure, and there's zero
> runtime overhead when there are no errors present.
>
> At the moment, this "dax association" is used to "report" a storage
> media error directly to userspace. I say "report" because what it
> does is kill userspace processes dead. The storage media error
> actually needs to be reported to the owner of the storage media,
> which in the case of FS-DAX is the filesytem.

Understood.

BTW, this is the usage in memory-failure, so what about rmap? I have
not found how to use this tracking in rmap. Do you have any ideas?

>
> That way the filesystem can then look up all the owners of that bad
> media range (i.e. the filesystem block it corresponds to) and take
> appropriate action. e.g.

I tried writing a function to look up all the owners' info of one
block in xfs for memory-failure use. It was dropped in this patchset
because I found out that this lookup function needs 'rmapbt' to be
enabled when mkfs. But by default, rmapbt is disabled. I am not sure
if it matters...

>
> - if it falls in filesytem metadata, shutdown the filesystem
> - if it falls in user data, call the "kill userspace dead" routines
>   for each mapping/index tuple the filesystem finds for the given
>   LBA address that the media error occurred
>
> Right now if the media error is in filesystem metadata, the
> filesystem isn't even told about it. The filesystem can't even shut
> down - the error is just dropped on the floor and it won't be until
> the filesystem next tries to reference that metadata that we notice
> there is an issue.

Understood. Thanks.

>
> Cheers,
>
> Dave.
>

--
Thanks,
Ruan Shiyang.
On Tue, Apr 28, 2020 at 05:32:41PM +0800, Ruan Shiyang wrote: > On 2020/4/28 下午2:43, Dave Chinner wrote: > > On Tue, Apr 28, 2020 at 06:09:47AM +0000, Ruan, Shiyang wrote: > > > 在 2020/4/27 20:28:36, "Matthew Wilcox" <willy@infradead.org> 写道: > > > > On Mon, Apr 27, 2020 at 04:47:42PM +0800, Shiyang Ruan wrote: > > > > > This patchset is a try to resolve the shared 'page cache' problem for > > > > > fsdax. > > > > > > > > > > In order to track multiple mappings and indexes on one page, I > > > > > introduced a dax-rmap rb-tree to manage the relationship. A dax entry > > > > > will be associated more than once if is shared. At the second time we > > > > > associate this entry, we create this rb-tree and store its root in > > > > > page->private(not used in fsdax). Insert (->mapping, ->index) when > > > > > dax_associate_entry() and delete it when dax_disassociate_entry(). > > > > > > > > Do we really want to track all of this on a per-page basis? I would > > > > have thought a per-extent basis was more useful. Essentially, create > > > > a new address_space for each shared extent. Per page just seems like > > > > a huge overhead. > > > > > > > Per-extent tracking is a nice idea for me. I haven't thought of it > > > yet... > > > > > > But the extent info is maintained by filesystem. I think we need a way > > > to obtain this info from FS when associating a page. May be a bit > > > complicated. Let me think about it... > > > > That's why I want the -user of this association- to do a filesystem > > callout instead of keeping it's own naive tracking infrastructure. > > The filesystem can do an efficient, on-demand reverse mapping lookup > > from it's own extent tracking infrastructure, and there's zero > > runtime overhead when there are no errors present. > > > > At the moment, this "dax association" is used to "report" a storage > > media error directly to userspace. I say "report" because what it > > does is kill userspace processes dead. 
The storage media error > > actually needs to be reported to the owner of the storage media, > > which in the case of FS-DAX is the filesytem. > > Understood. > > BTW, this is the usage in memory-failure, so what about rmap? I have not > found how to use this tracking in rmap. Do you have any ideas? > > > > > That way the filesystem can then look up all the owners of that bad > > media range (i.e. the filesystem block it corresponds to) and take > > appropriate action. e.g. > > I tried writing a function to look up all the owners' info of one block in > xfs for memory-failure use. It was dropped in this patchset because I found > out that this lookup function needs 'rmapbt' to be enabled when mkfs. But > by default, rmapbt is disabled. I am not sure if it matters... I'm pretty sure you can't have shared extents on an XFS filesystem if you _don't_ have the rmapbt feature enabled. I mean, that's why it exists.
On Tue, Apr 28, 2020 at 04:16:36AM -0700, Matthew Wilcox wrote: > On Tue, Apr 28, 2020 at 05:32:41PM +0800, Ruan Shiyang wrote: > > On 2020/4/28 下午2:43, Dave Chinner wrote: > > > On Tue, Apr 28, 2020 at 06:09:47AM +0000, Ruan, Shiyang wrote: > > > > 在 2020/4/27 20:28:36, "Matthew Wilcox" <willy@infradead.org> 写道: > > > > > On Mon, Apr 27, 2020 at 04:47:42PM +0800, Shiyang Ruan wrote: > > > > > > This patchset is a try to resolve the shared 'page cache' problem for > > > > > > fsdax. > > > > > > > > > > > > In order to track multiple mappings and indexes on one page, I > > > > > > introduced a dax-rmap rb-tree to manage the relationship. A dax entry > > > > > > will be associated more than once if is shared. At the second time we > > > > > > associate this entry, we create this rb-tree and store its root in > > > > > > page->private(not used in fsdax). Insert (->mapping, ->index) when > > > > > > dax_associate_entry() and delete it when dax_disassociate_entry(). > > > > > > > > > > Do we really want to track all of this on a per-page basis? I would > > > > > have thought a per-extent basis was more useful. Essentially, create > > > > > a new address_space for each shared extent. Per page just seems like > > > > > a huge overhead. > > > > > > > > > Per-extent tracking is a nice idea for me. I haven't thought of it > > > > yet... > > > > > > > > But the extent info is maintained by filesystem. I think we need a way > > > > to obtain this info from FS when associating a page. May be a bit > > > > complicated. Let me think about it... > > > > > > That's why I want the -user of this association- to do a filesystem > > > callout instead of keeping it's own naive tracking infrastructure. > > > The filesystem can do an efficient, on-demand reverse mapping lookup > > > from it's own extent tracking infrastructure, and there's zero > > > runtime overhead when there are no errors present. 
> > > > > > At the moment, this "dax association" is used to "report" a storage > > > media error directly to userspace. I say "report" because what it > > > does is kill userspace processes dead. The storage media error > > > actually needs to be reported to the owner of the storage media, > > > which in the case of FS-DAX is the filesytem. > > > > Understood. > > > > BTW, this is the usage in memory-failure, so what about rmap? I have not > > found how to use this tracking in rmap. Do you have any ideas? > > > > > > > > That way the filesystem can then look up all the owners of that bad > > > media range (i.e. the filesystem block it corresponds to) and take > > > appropriate action. e.g. > > > > I tried writing a function to look up all the owners' info of one block in > > xfs for memory-failure use. It was dropped in this patchset because I found > > out that this lookup function needs 'rmapbt' to be enabled when mkfs. But > > by default, rmapbt is disabled. I am not sure if it matters... > > I'm pretty sure you can't have shared extents on an XFS filesystem if you > _don't_ have the rmapbt feature enabled. I mean, that's why it exists. You're confusing reflink with rmap. :) rmapbt does all the reverse mapping tracking, reflink just does the shared data extent tracking. But given that anyone who wants to use DAX with reflink is going to have to mkfs their filesystem anyway (to turn on reflink) requiring that rmapbt is also turned on is not a big deal. Especially as we can check it at mount time in the kernel... Cheers, Dave.
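[Editor's note: both features Dave refers to are mkfs-time options in
xfsprogs. A reflink+rmapbt filesystem would be created along these
lines; the device and mount point are illustrative.]

```shell
# Create an XFS filesystem with shared data extent tracking (reflink)
# and the reverse mapping btree (rmapbt) enabled.
mkfs.xfs -m reflink=1,rmapbt=1 /dev/pmem0
mount -o dax /dev/pmem0 /mnt

# Confirm both features are present on the mounted filesystem.
xfs_info /mnt | grep -E 'reflink=1|rmapbt=1'
```

The mount-time check Dave mentions would be the kernel verifying the
rmapbt feature bit in the superblock before allowing DAX on a
reflink-enabled filesystem.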
On Tue, Apr 28, 2020 at 09:24:41PM +1000, Dave Chinner wrote: > On Tue, Apr 28, 2020 at 04:16:36AM -0700, Matthew Wilcox wrote: > > On Tue, Apr 28, 2020 at 05:32:41PM +0800, Ruan Shiyang wrote: > > > On 2020/4/28 下午2:43, Dave Chinner wrote: > > > > On Tue, Apr 28, 2020 at 06:09:47AM +0000, Ruan, Shiyang wrote: > > > > > 在 2020/4/27 20:28:36, "Matthew Wilcox" <willy@infradead.org> 写道: > > > > > > On Mon, Apr 27, 2020 at 04:47:42PM +0800, Shiyang Ruan wrote: > > > > > > > This patchset is a try to resolve the shared 'page cache' problem for > > > > > > > fsdax. > > > > > > > > > > > > > > In order to track multiple mappings and indexes on one page, I > > > > > > > introduced a dax-rmap rb-tree to manage the relationship. A dax entry > > > > > > > will be associated more than once if is shared. At the second time we > > > > > > > associate this entry, we create this rb-tree and store its root in > > > > > > > page->private(not used in fsdax). Insert (->mapping, ->index) when > > > > > > > dax_associate_entry() and delete it when dax_disassociate_entry(). > > > > > > > > > > > > Do we really want to track all of this on a per-page basis? I would > > > > > > have thought a per-extent basis was more useful. Essentially, create > > > > > > a new address_space for each shared extent. Per page just seems like > > > > > > a huge overhead. > > > > > > > > > > > Per-extent tracking is a nice idea for me. I haven't thought of it > > > > > yet... > > > > > > > > > > But the extent info is maintained by filesystem. I think we need a way > > > > > to obtain this info from FS when associating a page. May be a bit > > > > > complicated. Let me think about it... > > > > > > > > That's why I want the -user of this association- to do a filesystem > > > > callout instead of keeping it's own naive tracking infrastructure. 
> > > > The filesystem can do an efficient, on-demand reverse mapping lookup > > > > from it's own extent tracking infrastructure, and there's zero > > > > runtime overhead when there are no errors present. > > > > > > > > At the moment, this "dax association" is used to "report" a storage > > > > media error directly to userspace. I say "report" because what it > > > > does is kill userspace processes dead. The storage media error > > > > actually needs to be reported to the owner of the storage media, > > > > which in the case of FS-DAX is the filesytem. > > > > > > Understood. > > > > > > BTW, this is the usage in memory-failure, so what about rmap? I have not > > > found how to use this tracking in rmap. Do you have any ideas? > > > > > > > > > > > That way the filesystem can then look up all the owners of that bad > > > > media range (i.e. the filesystem block it corresponds to) and take > > > > appropriate action. e.g. > > > > > > I tried writing a function to look up all the owners' info of one block in > > > xfs for memory-failure use. It was dropped in this patchset because I found > > > out that this lookup function needs 'rmapbt' to be enabled when mkfs. But > > > by default, rmapbt is disabled. I am not sure if it matters... > > > > I'm pretty sure you can't have shared extents on an XFS filesystem if you > > _don't_ have the rmapbt feature enabled. I mean, that's why it exists. > > You're confusing reflink with rmap. :) > > rmapbt does all the reverse mapping tracking, reflink just does the > shared data extent tracking. > > But given that anyone who wants to use DAX with reflink is going to > have to mkfs their filesystem anyway (to turn on reflink) requiring > that rmapbt is also turned on is not a big deal. Especially as we > can check it at mount time in the kernel... Are we going to turn on rmap by default? The last I checked, it did have a 10-20% performance cost on extreme metadata-heavy workloads. 
Or do we only enable it by default if mkfs detects a pmem device? (Admittedly, most people do not run fsx as a productivity app; the normal hit is usually 3-5% which might not be such a big deal since you also get (half of) online fsck. :P) --D > Cheers, > > Dave. > -- > Dave Chinner > david@fromorbit.com
On Tue, Apr 28, 2020 at 08:37:32AM -0700, Darrick J. Wong wrote: > On Tue, Apr 28, 2020 at 09:24:41PM +1000, Dave Chinner wrote: > > On Tue, Apr 28, 2020 at 04:16:36AM -0700, Matthew Wilcox wrote: > > > On Tue, Apr 28, 2020 at 05:32:41PM +0800, Ruan Shiyang wrote: > > > > On 2020/4/28 下午2:43, Dave Chinner wrote: > > > > > On Tue, Apr 28, 2020 at 06:09:47AM +0000, Ruan, Shiyang wrote: > > > > > > 在 2020/4/27 20:28:36, "Matthew Wilcox" <willy@infradead.org> 写道: > > > > > > > On Mon, Apr 27, 2020 at 04:47:42PM +0800, Shiyang Ruan wrote: > > > > > > > > This patchset is a try to resolve the shared 'page cache' problem for > > > > > > > > fsdax. > > > > > > > > > > > > > > > > In order to track multiple mappings and indexes on one page, I > > > > > > > > introduced a dax-rmap rb-tree to manage the relationship. A dax entry > > > > > > > > will be associated more than once if is shared. At the second time we > > > > > > > > associate this entry, we create this rb-tree and store its root in > > > > > > > > page->private(not used in fsdax). Insert (->mapping, ->index) when > > > > > > > > dax_associate_entry() and delete it when dax_disassociate_entry(). > > > > > > > > > > > > > > Do we really want to track all of this on a per-page basis? I would > > > > > > > have thought a per-extent basis was more useful. Essentially, create > > > > > > > a new address_space for each shared extent. Per page just seems like > > > > > > > a huge overhead. > > > > > > > > > > > > > Per-extent tracking is a nice idea for me. I haven't thought of it > > > > > > yet... > > > > > > > > > > > > But the extent info is maintained by filesystem. I think we need a way > > > > > > to obtain this info from FS when associating a page. May be a bit > > > > > > complicated. Let me think about it... > > > > > > > > > > That's why I want the -user of this association- to do a filesystem > > > > > callout instead of keeping it's own naive tracking infrastructure. 
> > > > > The filesystem can do an efficient, on-demand reverse mapping lookup > > > > > from it's own extent tracking infrastructure, and there's zero > > > > > runtime overhead when there are no errors present. > > > > > > > > > > At the moment, this "dax association" is used to "report" a storage > > > > > media error directly to userspace. I say "report" because what it > > > > > does is kill userspace processes dead. The storage media error > > > > > actually needs to be reported to the owner of the storage media, > > > > > which in the case of FS-DAX is the filesytem. > > > > > > > > Understood. > > > > > > > > BTW, this is the usage in memory-failure, so what about rmap? I have not > > > > found how to use this tracking in rmap. Do you have any ideas? > > > > > > > > > > > > > > That way the filesystem can then look up all the owners of that bad > > > > > media range (i.e. the filesystem block it corresponds to) and take > > > > > appropriate action. e.g. > > > > > > > > I tried writing a function to look up all the owners' info of one block in > > > > xfs for memory-failure use. It was dropped in this patchset because I found > > > > out that this lookup function needs 'rmapbt' to be enabled when mkfs. But > > > > by default, rmapbt is disabled. I am not sure if it matters... > > > > > > I'm pretty sure you can't have shared extents on an XFS filesystem if you > > > _don't_ have the rmapbt feature enabled. I mean, that's why it exists. > > > > You're confusing reflink with rmap. :) > > > > rmapbt does all the reverse mapping tracking, reflink just does the > > shared data extent tracking. > > > > But given that anyone who wants to use DAX with reflink is going to > > have to mkfs their filesystem anyway (to turn on reflink) requiring > > that rmapbt is also turned on is not a big deal. Especially as we > > can check it at mount time in the kernel... > > Are we going to turn on rmap by default? 
> The last I checked, it did have a 10-20% performance cost on extreme
> metadata-heavy workloads. Or do we only enable it by default if mkfs
> detects a pmem device?

Just have the kernel refuse to mount a reflink enabled filesystem on
a DAX capable device unless -o dax=never or rmapbt is enabled.
That'll get the message across pretty quickly....

> (Admittedly, most people do not run fsx as a productivity app; the
> normal hit is usually 3-5% which might not be such a big deal since
> you also get (half of) online fsck. :P)

I have not noticed the overhead at all on any of my production
machines since I enabled it on all of them way back when....

And, really, pmem is a _very poor choice_ for metadata intensive
applications on XFS as pmem is completely synchronous. XFS has an
async IO model for its metadata that *must* be buffered (so no DAX!)
and the synchronous nature of pmem completely defeats the
architectural IO pipelining XFS uses to allow thousands of concurrent
metadata IOs in flight. OTOH, pmem IO depth is limited to the number
of CPUs that are concurrently issuing IO, so it really, really sucks
compared to a handful of high end nvme SSDs on PCIe 4.0....

So with that in mind, I see little reason to care about the small
additional overhead of rmapbt on FS-DAX installations that require
reflink...

Cheers,

Dave.
On 2020/4/28 2:43 PM, Dave Chinner wrote:
> On Tue, Apr 28, 2020 at 06:09:47AM +0000, Ruan, Shiyang wrote:
>>
>> On 2020/4/27 20:28:36, "Matthew Wilcox" <willy@infradead.org> wrote:
>>
>>> On Mon, Apr 27, 2020 at 04:47:42PM +0800, Shiyang Ruan wrote:
>>>> This patchset is a try to resolve the shared 'page cache' problem for
>>>> fsdax.
>>>>
>>>> In order to track multiple mappings and indexes on one page, I
>>>> introduced a dax-rmap rb-tree to manage the relationship. A dax entry
>>>> will be associated more than once if is shared. At the second time we
>>>> associate this entry, we create this rb-tree and store its root in
>>>> page->private(not used in fsdax). Insert (->mapping, ->index) when
>>>> dax_associate_entry() and delete it when dax_disassociate_entry().
>>>
>>> Do we really want to track all of this on a per-page basis? I would
>>> have thought a per-extent basis was more useful. Essentially, create
>>> a new address_space for each shared extent. Per page just seems like
>>> a huge overhead.
>>>
>> Per-extent tracking is a nice idea for me. I haven't thought of it
>> yet...
>>
>> But the extent info is maintained by filesystem. I think we need a way
>> to obtain this info from FS when associating a page. May be a bit
>> complicated. Let me think about it...
>
> That's why I want the -user of this association- to do a filesystem
> callout instead of keeping it's own naive tracking infrastructure.
> The filesystem can do an efficient, on-demand reverse mapping lookup
> from it's own extent tracking infrastructure, and there's zero
> runtime overhead when there are no errors present.

Hi Dave,

I ran into some difficulties when trying to implement the per-extent
rmap tracking. So, I re-read your comments and found that I was
misunderstanding what you described here.

I think what you mean is: we don't need the in-memory dax-rmap
tracking now. Just ask the FS for the owner's information associated
with a page when a memory failure occurs. So, the per-page (even
per-extent) dax-rmap is needless in this case. Is this right?

Based on this, we only need to store the extent information of a fsdax
page in its ->mapping (by searching from FS). Then obtain the owners
of this page (also by searching from FS) when memory-failure or
another rmap case occurs.

So, a fsdax page is no longer associated with a specific file, but
with a FS (or the pmem device). I think it's easier to understand and
implement.

--
Thanks,
Ruan Shiyang.

>
> At the moment, this "dax association" is used to "report" a storage
> media error directly to userspace. I say "report" because what it
> does is kill userspace processes dead. The storage media error
> actually needs to be reported to the owner of the storage media,
> which in the case of FS-DAX is the filesytem.
>
> That way the filesystem can then look up all the owners of that bad
> media range (i.e. the filesystem block it corresponds to) and take
> appropriate action. e.g.
>
> - if it falls in filesytem metadata, shutdown the filesystem
> - if it falls in user data, call the "kill userspace dead" routines
>   for each mapping/index tuple the filesystem finds for the given
>   LBA address that the media error occurred.
>
> Right now if the media error is in filesystem metadata, the
> filesystem isn't even told about it. The filesystem can't even shut
> down - the error is just dropped on the floor and it won't be until
> the filesystem next tries to reference that metadata that we notice
> there is an issue.
>
> Cheers,
>
> Dave.
>
On Thu, Jun 04, 2020 at 03:37:42PM +0800, Ruan Shiyang wrote: > > > On 2020/4/28 下午2:43, Dave Chinner wrote: > > On Tue, Apr 28, 2020 at 06:09:47AM +0000, Ruan, Shiyang wrote: > > > > > > 在 2020/4/27 20:28:36, "Matthew Wilcox" <willy@infradead.org> 写道: > > > > > > > On Mon, Apr 27, 2020 at 04:47:42PM +0800, Shiyang Ruan wrote: > > > > > This patchset is a try to resolve the shared 'page cache' problem for > > > > > fsdax. > > > > > > > > > > In order to track multiple mappings and indexes on one page, I > > > > > introduced a dax-rmap rb-tree to manage the relationship. A dax entry > > > > > will be associated more than once if is shared. At the second time we > > > > > associate this entry, we create this rb-tree and store its root in > > > > > page->private(not used in fsdax). Insert (->mapping, ->index) when > > > > > dax_associate_entry() and delete it when dax_disassociate_entry(). > > > > > > > > Do we really want to track all of this on a per-page basis? I would > > > > have thought a per-extent basis was more useful. Essentially, create > > > > a new address_space for each shared extent. Per page just seems like > > > > a huge overhead. > > > > > > > Per-extent tracking is a nice idea for me. I haven't thought of it > > > yet... > > > > > > But the extent info is maintained by filesystem. I think we need a way > > > to obtain this info from FS when associating a page. May be a bit > > > complicated. Let me think about it... > > > > That's why I want the -user of this association- to do a filesystem > > callout instead of keeping it's own naive tracking infrastructure. > > The filesystem can do an efficient, on-demand reverse mapping lookup > > from it's own extent tracking infrastructure, and there's zero > > runtime overhead when there are no errors present. > > Hi Dave, > > I ran into some difficulties when trying to implement the per-extent rmap > tracking. So, I re-read your comments and found that I was misunderstanding > what you described here. 
> > I think what you mean is: we don't need the in-memory dax-rmap tracking now. > Just ask the FS for the owner's information that associate with one page > when memory-failure. So, the per-page (even per-extent) dax-rmap is > needless in this case. Is this right? Right. XFS already has its own rmap tree. > Based on this, we only need to store the extent information of a fsdax page > in its ->mapping (by searching from FS). Then obtain the owners of this > page (also by searching from FS) when memory-failure or other rmap case > occurs. I don't even think you need that much. All you need is the "physical" offset of that page within the pmem device (e.g. 'this is the 307th 4k page == offset 1257472 since the start of /dev/pmem0') and xfs can look up the owner of that range of physical storage and deal with it as needed. > So, a fsdax page is no longer associated with a specific file, but with a > FS(or the pmem device). I think it's easier to understand and implement. Yes. I also suspect this will be necessary to support reflink... --D > > -- > Thanks, > Ruan Shiyang. > > > > At the moment, this "dax association" is used to "report" a storage > > media error directly to userspace. I say "report" because what it > > does is kill userspace processes dead. The storage media error > > actually needs to be reported to the owner of the storage media, > > which in the case of FS-DAX is the filesytem. > > > > That way the filesystem can then look up all the owners of that bad > > media range (i.e. the filesystem block it corresponds to) and take > > appropriate action. e.g. > > > > - if it falls in filesytem metadata, shutdown the filesystem > > - if it falls in user data, call the "kill userspace dead" routines > > for each mapping/index tuple the filesystem finds for the given > > LBA address that the media error occurred. > > > > Right now if the media error is in filesystem metadata, the > > filesystem isn't even told about it. 
The filesystem can't even shut > > down - the error is just dropped on the floor and it won't be until > > the filesystem next tries to reference that metadata that we notice > > there is an issue. > > > > Cheers, > > > > Dave. > > > >
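[Editor's note: Darrick's "307th 4k page == offset 1257472" above is
plain page-index-to-byte-offset arithmetic relative to the start of the
pmem device. A minimal version, with hypothetical parameter names:]

```c
#include <assert.h>
#include <stdint.h>

#define PAGE_SHIFT 12  /* 4 KiB pages, as in the example above */

/* Byte offset of a page within the pmem device, given the page's pfn
 * and the pfn at which the device's memory begins. This single number
 * is all the filesystem needs to run an rmap lookup for the bad page. */
static uint64_t page_dev_offset(uint64_t pfn, uint64_t dev_start_pfn)
{
	return (pfn - dev_start_pfn) << PAGE_SHIFT;
}
```

This is why no per-page association is needed: the offset is derivable
from information the failure path already has in hand.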
On Thu, Jun 04, 2020 at 07:51:07AM -0700, Darrick J. Wong wrote: > On Thu, Jun 04, 2020 at 03:37:42PM +0800, Ruan Shiyang wrote: > > > > > > On 2020/4/28 下午2:43, Dave Chinner wrote: > > > On Tue, Apr 28, 2020 at 06:09:47AM +0000, Ruan, Shiyang wrote: > > > > > > > > 在 2020/4/27 20:28:36, "Matthew Wilcox" <willy@infradead.org> 写道: > > > > > > > > > On Mon, Apr 27, 2020 at 04:47:42PM +0800, Shiyang Ruan wrote: > > > > > > This patchset is a try to resolve the shared 'page cache' problem for > > > > > > fsdax. > > > > > > > > > > > > In order to track multiple mappings and indexes on one page, I > > > > > > introduced a dax-rmap rb-tree to manage the relationship. A dax entry > > > > > > will be associated more than once if is shared. At the second time we > > > > > > associate this entry, we create this rb-tree and store its root in > > > > > > page->private(not used in fsdax). Insert (->mapping, ->index) when > > > > > > dax_associate_entry() and delete it when dax_disassociate_entry(). > > > > > > > > > > Do we really want to track all of this on a per-page basis? I would > > > > > have thought a per-extent basis was more useful. Essentially, create > > > > > a new address_space for each shared extent. Per page just seems like > > > > > a huge overhead. > > > > > > > > > Per-extent tracking is a nice idea for me. I haven't thought of it > > > > yet... > > > > > > > > But the extent info is maintained by filesystem. I think we need a way > > > > to obtain this info from FS when associating a page. May be a bit > > > > complicated. Let me think about it... > > > > > > That's why I want the -user of this association- to do a filesystem > > > callout instead of keeping it's own naive tracking infrastructure. > > > The filesystem can do an efficient, on-demand reverse mapping lookup > > > from it's own extent tracking infrastructure, and there's zero > > > runtime overhead when there are no errors present. 
> > > > Hi Dave, > > > > I ran into some difficulties when trying to implement the per-extent rmap > > tracking. So, I re-read your comments and found that I was misunderstanding > > what you described here. > > > > I think what you mean is: we don't need the in-memory dax-rmap tracking now. > > Just ask the FS for the owner's information that associate with one page > > when memory-failure. So, the per-page (even per-extent) dax-rmap is > > needless in this case. Is this right? > > Right. XFS already has its own rmap tree. *nod* > > Based on this, we only need to store the extent information of a fsdax page > > in its ->mapping (by searching from FS). Then obtain the owners of this > > page (also by searching from FS) when memory-failure or other rmap case > > occurs. > > I don't even think you need that much. All you need is the "physical" > offset of that page within the pmem device (e.g. 'this is the 307th 4k > page == offset 1257472 since the start of /dev/pmem0') and xfs can look > up the owner of that range of physical storage and deal with it as > needed. Right. If we have the dax device associated with the page that had the failure, then we can determine the offset of the page into the block device address space and that's all we need to find the owner of the page in the filesystem. Note that there may actually be no owner - the page that had the fault might land in free space, in which case we can simply zero the page and clear the error. > > So, a fsdax page is no longer associated with a specific file, but with a > > FS(or the pmem device). I think it's easier to understand and implement. Effectively, yes. But we shouldn't need to actually associate the page with anything at the filesystem level because it is already associated with a DAX device at a lower level via a dev_pagemap. 
The hardware page fault already runs through this code, memory_failure_dev_pagemap(), before it gets to the DAX code, so really all we need to do is have that function pass us the page, the offset into the device and, say, the struct dax_device associated with that page, so we can get to the filesystem superblock we can then use for rmap lookups on... Cheers, Dave.
On 2020/6/4 下午10:51, Darrick J. Wong wrote: > On Thu, Jun 04, 2020 at 03:37:42PM +0800, Ruan Shiyang wrote: >> >> >> On 2020/4/28 下午2:43, Dave Chinner wrote: >>> On Tue, Apr 28, 2020 at 06:09:47AM +0000, Ruan, Shiyang wrote: >>>> >>>> 在 2020/4/27 20:28:36, "Matthew Wilcox" <willy@infradead.org> 写道: >>>> >>>>> On Mon, Apr 27, 2020 at 04:47:42PM +0800, Shiyang Ruan wrote: >>>>>> This patchset is a try to resolve the shared 'page cache' problem for >>>>>> fsdax. >>>>>> >>>>>> In order to track multiple mappings and indexes on one page, I >>>>>> introduced a dax-rmap rb-tree to manage the relationship. A dax entry >>>>>> will be associated more than once if is shared. At the second time we >>>>>> associate this entry, we create this rb-tree and store its root in >>>>>> page->private(not used in fsdax). Insert (->mapping, ->index) when >>>>>> dax_associate_entry() and delete it when dax_disassociate_entry(). >>>>> >>>>> Do we really want to track all of this on a per-page basis? I would >>>>> have thought a per-extent basis was more useful. Essentially, create >>>>> a new address_space for each shared extent. Per page just seems like >>>>> a huge overhead. >>>>> >>>> Per-extent tracking is a nice idea for me. I haven't thought of it >>>> yet... >>>> >>>> But the extent info is maintained by filesystem. I think we need a way >>>> to obtain this info from FS when associating a page. May be a bit >>>> complicated. Let me think about it... >>> >>> That's why I want the -user of this association- to do a filesystem >>> callout instead of keeping it's own naive tracking infrastructure. >>> The filesystem can do an efficient, on-demand reverse mapping lookup >>> from it's own extent tracking infrastructure, and there's zero >>> runtime overhead when there are no errors present. >> >> Hi Dave, >> >> I ran into some difficulties when trying to implement the per-extent rmap >> tracking. So, I re-read your comments and found that I was misunderstanding >> what you described here. 
>> >> I think what you mean is: we don't need the in-memory dax-rmap tracking now. >> Just ask the FS for the owner's information that associate with one page >> when memory-failure. So, the per-page (even per-extent) dax-rmap is >> needless in this case. Is this right? > > Right. XFS already has its own rmap tree. > >> Based on this, we only need to store the extent information of a fsdax page >> in its ->mapping (by searching from FS). Then obtain the owners of this >> page (also by searching from FS) when memory-failure or other rmap case >> occurs. > > I don't even think you need that much. All you need is the "physical" > offset of that page within the pmem device (e.g. 'this is the 307th 4k > page == offset 1257472 since the start of /dev/pmem0') and xfs can look > up the owner of that range of physical storage and deal with it as > needed. Yes, I think so. > >> So, a fsdax page is no longer associated with a specific file, but with a >> FS(or the pmem device). I think it's easier to understand and implement. > > Yes. I also suspect this will be necessary to support reflink... > > --D OK, Thank you very much. -- Thanks, Ruan Shiyang. > >> >> -- >> Thanks, >> Ruan Shiyang. >>> >>> At the moment, this "dax association" is used to "report" a storage >>> media error directly to userspace. I say "report" because what it >>> does is kill userspace processes dead. The storage media error >>> actually needs to be reported to the owner of the storage media, >>> which in the case of FS-DAX is the filesytem. >>> >>> That way the filesystem can then look up all the owners of that bad >>> media range (i.e. the filesystem block it corresponds to) and take >>> appropriate action. e.g. >>> >>> - if it falls in filesytem metadata, shutdown the filesystem >>> - if it falls in user data, call the "kill userspace dead" routines >>> for each mapping/index tuple the filesystem finds for the given >>> LBA address that the media error occurred. 
>>> >>> Right now if the media error is in filesystem metadata, the >>> filesystem isn't even told about it. The filesystem can't even shut >>> down - the error is just dropped on the floor and it won't be until >>> the filesystem next tries to reference that metadata that we notice >>> there is an issue. >>> >>> Cheers, >>> >>> Dave. >>> >> >> > >
On 2020/6/5 上午9:30, Dave Chinner wrote: > On Thu, Jun 04, 2020 at 07:51:07AM -0700, Darrick J. Wong wrote: >> On Thu, Jun 04, 2020 at 03:37:42PM +0800, Ruan Shiyang wrote: >>> >>> >>> On 2020/4/28 下午2:43, Dave Chinner wrote: >>>> On Tue, Apr 28, 2020 at 06:09:47AM +0000, Ruan, Shiyang wrote: >>>>> >>>>> 在 2020/4/27 20:28:36, "Matthew Wilcox" <willy@infradead.org> 写道: >>>>> >>>>>> On Mon, Apr 27, 2020 at 04:47:42PM +0800, Shiyang Ruan wrote: >>>>>>> This patchset is a try to resolve the shared 'page cache' problem for >>>>>>> fsdax. >>>>>>> >>>>>>> In order to track multiple mappings and indexes on one page, I >>>>>>> introduced a dax-rmap rb-tree to manage the relationship. A dax entry >>>>>>> will be associated more than once if is shared. At the second time we >>>>>>> associate this entry, we create this rb-tree and store its root in >>>>>>> page->private(not used in fsdax). Insert (->mapping, ->index) when >>>>>>> dax_associate_entry() and delete it when dax_disassociate_entry(). >>>>>> >>>>>> Do we really want to track all of this on a per-page basis? I would >>>>>> have thought a per-extent basis was more useful. Essentially, create >>>>>> a new address_space for each shared extent. Per page just seems like >>>>>> a huge overhead. >>>>>> >>>>> Per-extent tracking is a nice idea for me. I haven't thought of it >>>>> yet... >>>>> >>>>> But the extent info is maintained by filesystem. I think we need a way >>>>> to obtain this info from FS when associating a page. May be a bit >>>>> complicated. Let me think about it... >>>> >>>> That's why I want the -user of this association- to do a filesystem >>>> callout instead of keeping it's own naive tracking infrastructure. >>>> The filesystem can do an efficient, on-demand reverse mapping lookup >>>> from it's own extent tracking infrastructure, and there's zero >>>> runtime overhead when there are no errors present. 
>>> >>> Hi Dave, >>> >>> I ran into some difficulties when trying to implement the per-extent rmap >>> tracking. So, I re-read your comments and found that I was misunderstanding >>> what you described here. >>> >>> I think what you mean is: we don't need the in-memory dax-rmap tracking now. >>> Just ask the FS for the owner's information that associate with one page >>> when memory-failure. So, the per-page (even per-extent) dax-rmap is >>> needless in this case. Is this right? >> >> Right. XFS already has its own rmap tree. > > *nod* > >>> Based on this, we only need to store the extent information of a fsdax page >>> in its ->mapping (by searching from FS). Then obtain the owners of this >>> page (also by searching from FS) when memory-failure or other rmap case >>> occurs. >> >> I don't even think you need that much. All you need is the "physical" >> offset of that page within the pmem device (e.g. 'this is the 307th 4k >> page == offset 1257472 since the start of /dev/pmem0') and xfs can look >> up the owner of that range of physical storage and deal with it as >> needed. > > Right. If we have the dax device associated with the page that had > the failure, then we can determine the offset of the page into the > block device address space and that's all we need to find the owner > of the page in the filesystem. > > Note that there may actually be no owner - the page that had the > fault might land in free space, in which case we can simply zero > the page and clear the error. OK. Thanks for pointing out. > >>> So, a fsdax page is no longer associated with a specific file, but with a >>> FS(or the pmem device). I think it's easier to understand and implement. > > Effectively, yes. But we shouldn't need to actually associate the > page with anything at the filesystem level because it is already > associated with a DAX device at a lower level via a dev_pagemap. 
> The hardware page fault already runs through this code, > memory_failure_dev_pagemap(), before it gets to the DAX code, so > really all we need to do is have that function pass us the page, the offset > into the device and, say, the struct dax_device associated with that > page, so we can get to the filesystem superblock we can then use for > rmap lookups on... > OK. I was just thinking about how I can run the FS rmap search from the memory-failure path. Thanks again for pointing this out. :) -- Thanks, Ruan Shiyang. > Cheers, > > Dave. >