Message ID: 20200427084750.136031-1-ruansy.fnst@cn.fujitsu.com (mailing list archive)
Series: dax: Add a dax-rmap tree to support reflink
On Mon, Apr 27, 2020 at 04:47:42PM +0800, Shiyang Ruan wrote:
> This patchset is a try to resolve the shared 'page cache' problem for
> fsdax.
>
> In order to track multiple mappings and indexes on one page, I
> introduced a dax-rmap rb-tree to manage the relationship. A dax entry
> will be associated more than once if is shared. At the second time we
> associate this entry, we create this rb-tree and store its root in
> page->private(not used in fsdax). Insert (->mapping, ->index) when
> dax_associate_entry() and delete it when dax_disassociate_entry().

Do we really want to track all of this on a per-page basis? I would
have thought a per-extent basis was more useful. Essentially, create
a new address_space for each shared extent. Per page just seems like
a huge overhead.
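[Editor's note: to make the scheme under discussion concrete, here is a
minimal userspace model of the per-page reverse map being proposed. The
names and the list (standing in for the rb-tree rooted at page->private)
are illustrative only, not the patchset's code.]

```c
#include <assert.h>
#include <stdlib.h>

/* Hypothetical model: each shared fsdax page keeps a set of
 * (mapping, index) pairs, one per file+offset that maps it. */
struct dax_owner {
	void *mapping;            /* stand-in for struct address_space * */
	unsigned long index;      /* page offset within that mapping */
	struct dax_owner *next;
};

struct dax_page {
	struct dax_owner *owners; /* NULL until the page is shared */
	int nr_owners;
};

/* dax_associate_entry() analogue: record one more owner of the page */
static void page_associate(struct dax_page *pg, void *mapping,
			   unsigned long index)
{
	struct dax_owner *o = malloc(sizeof(*o));
	o->mapping = mapping;
	o->index = index;
	o->next = pg->owners;
	pg->owners = o;
	pg->nr_owners++;
}

/* dax_disassociate_entry() analogue: drop one (mapping, index) pair */
static void page_disassociate(struct dax_page *pg, void *mapping,
			      unsigned long index)
{
	struct dax_owner **p = &pg->owners;
	while (*p) {
		if ((*p)->mapping == mapping && (*p)->index == index) {
			struct dax_owner *victim = *p;
			*p = victim->next;
			free(victim);
			pg->nr_owners--;
			return;
		}
		p = &(*p)->next;
	}
}
```

Matthew's objection is precisely that every shared page carries this
per-page bookkeeping, whereas a per-extent or callout-based scheme pays
no per-page cost at all.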
On 2020/4/27 20:28:36, "Matthew Wilcox" <willy@infradead.org> wrote:
>On Mon, Apr 27, 2020 at 04:47:42PM +0800, Shiyang Ruan wrote:
>> This patchset is a try to resolve the shared 'page cache' problem for
>> fsdax.
>>
>> In order to track multiple mappings and indexes on one page, I
>> introduced a dax-rmap rb-tree to manage the relationship. A dax entry
>> will be associated more than once if is shared. At the second time we
>> associate this entry, we create this rb-tree and store its root in
>> page->private(not used in fsdax). Insert (->mapping, ->index) when
>> dax_associate_entry() and delete it when dax_disassociate_entry().
>
>Do we really want to track all of this on a per-page basis? I would
>have thought a per-extent basis was more useful. Essentially, create
>a new address_space for each shared extent. Per page just seems like
>a huge overhead.
>
Per-extent tracking is a nice idea for me. I haven't thought of it
yet...

But the extent info is maintained by filesystem. I think we need a way
to obtain this info from FS when associating a page. May be a bit
complicated. Let me think about it...

--
Thanks,
Ruan Shiyang.
On Tue, Apr 28, 2020 at 06:09:47AM +0000, Ruan, Shiyang wrote:
>
> On 2020/4/27 20:28:36, "Matthew Wilcox" <willy@infradead.org> wrote:
>
> >On Mon, Apr 27, 2020 at 04:47:42PM +0800, Shiyang Ruan wrote:
> >> This patchset is a try to resolve the shared 'page cache' problem for
> >> fsdax.
> >>
> >> In order to track multiple mappings and indexes on one page, I
> >> introduced a dax-rmap rb-tree to manage the relationship. A dax entry
> >> will be associated more than once if is shared. At the second time we
> >> associate this entry, we create this rb-tree and store its root in
> >> page->private(not used in fsdax). Insert (->mapping, ->index) when
> >> dax_associate_entry() and delete it when dax_disassociate_entry().
> >
> >Do we really want to track all of this on a per-page basis? I would
> >have thought a per-extent basis was more useful. Essentially, create
> >a new address_space for each shared extent. Per page just seems like
> >a huge overhead.
> >
> Per-extent tracking is a nice idea for me. I haven't thought of it
> yet...
>
> But the extent info is maintained by filesystem. I think we need a way
> to obtain this info from FS when associating a page. May be a bit
> complicated. Let me think about it...

That's why I want the -user of this association- to do a filesystem
callout instead of keeping its own naive tracking infrastructure.
The filesystem can do an efficient, on-demand reverse mapping lookup
from its own extent tracking infrastructure, and there's zero
runtime overhead when there are no errors present.

At the moment, this "dax association" is used to "report" a storage
media error directly to userspace. I say "report" because what it
does is kill userspace processes dead. The storage media error
actually needs to be reported to the owner of the storage media,
which in the case of FS-DAX is the filesystem.

That way the filesystem can then look up all the owners of that bad
media range (i.e. the filesystem block it corresponds to) and take
appropriate action. e.g.

- if it falls in filesystem metadata, shut down the filesystem
- if it falls in user data, call the "kill userspace dead" routines
  for each mapping/index tuple the filesystem finds for the given
  LBA address that the media error occurred at.

Right now if the media error is in filesystem metadata, the
filesystem isn't even told about it. The filesystem can't even shut
down - the error is just dropped on the floor and it won't be until
the filesystem next tries to reference that metadata that we notice
there is an issue.

Cheers,

Dave.
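[Editor's note: Dave's two-way classification above can be sketched as a
small decision function. All names here are hypothetical illustrations,
not an interface from any posted patch.]

```c
#include <assert.h>
#include <stddef.h>

enum owner_type { OWNER_NONE, OWNER_METADATA, OWNER_USER_DATA };

enum mf_action { ACT_NOTHING, ACT_SHUTDOWN_FS, ACT_KILL_MAPPERS };

/* Hypothetical result of a filesystem rmap lookup for a bad LBA range */
struct media_error_owner {
	enum owner_type type;
	void *mapping;          /* valid when type == OWNER_USER_DATA */
	unsigned long index;
};

/* Classify per the outline above:
 *  - filesystem metadata -> shut the filesystem down
 *  - user data           -> kill processes mapping (mapping, index)
 *  - no owner            -> nothing maps the bad range */
static enum mf_action media_error_action(const struct media_error_owner *o)
{
	switch (o->type) {
	case OWNER_METADATA:  return ACT_SHUTDOWN_FS;
	case OWNER_USER_DATA: return ACT_KILL_MAPPERS;
	default:              return ACT_NOTHING;
	}
}
```

The point of the callout design is that this classification lives in the
filesystem, which already owns the extent and rmap metadata needed to
answer "who owns this LBA?".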
On 2020/4/28 2:43 PM, Dave Chinner wrote:
> On Tue, Apr 28, 2020 at 06:09:47AM +0000, Ruan, Shiyang wrote:
>>
>> On 2020/4/27 20:28:36, "Matthew Wilcox" <willy@infradead.org> wrote:
>>
>>> On Mon, Apr 27, 2020 at 04:47:42PM +0800, Shiyang Ruan wrote:
>>>> This patchset is a try to resolve the shared 'page cache' problem for
>>>> fsdax.
>>>>
>>>> In order to track multiple mappings and indexes on one page, I
>>>> introduced a dax-rmap rb-tree to manage the relationship. A dax entry
>>>> will be associated more than once if is shared. At the second time we
>>>> associate this entry, we create this rb-tree and store its root in
>>>> page->private(not used in fsdax). Insert (->mapping, ->index) when
>>>> dax_associate_entry() and delete it when dax_disassociate_entry().
>>>
>>> Do we really want to track all of this on a per-page basis? I would
>>> have thought a per-extent basis was more useful. Essentially, create
>>> a new address_space for each shared extent. Per page just seems like
>>> a huge overhead.
>>>
>> Per-extent tracking is a nice idea for me. I haven't thought of it
>> yet...
>>
>> But the extent info is maintained by filesystem. I think we need a way
>> to obtain this info from FS when associating a page. May be a bit
>> complicated. Let me think about it...
>
> That's why I want the -user of this association- to do a filesystem
> callout instead of keeping it's own naive tracking infrastructure.
> The filesystem can do an efficient, on-demand reverse mapping lookup
> from it's own extent tracking infrastructure, and there's zero
> runtime overhead when there are no errors present.
>
> At the moment, this "dax association" is used to "report" a storage
> media error directly to userspace. I say "report" because what it
> does is kill userspace processes dead. The storage media error
> actually needs to be reported to the owner of the storage media,
> which in the case of FS-DAX is the filesytem.

Understood.

BTW, this is the usage in memory-failure, so what about rmap? I have
not found how to use this tracking in rmap. Do you have any ideas?

>
> That way the filesystem can then look up all the owners of that bad
> media range (i.e. the filesystem block it corresponds to) and take
> appropriate action. e.g.

I tried writing a function to look up all the owners' info of one
block in xfs for memory-failure use. It was dropped in this patchset
because I found out that this lookup function needs 'rmapbt' to be
enabled when mkfs. But by default, rmapbt is disabled. I am not sure
if it matters...

>
> - if it falls in filesytem metadata, shutdown the filesystem
> - if it falls in user data, call the "kill userspace dead" routines
>   for each mapping/index tuple the filesystem finds for the given
>   LBA address that the media error occurred
>
> Right now if the media error is in filesystem metadata, the
> filesystem isn't even told about it. The filesystem can't even shut
> down - the error is just dropped on the floor and it won't be until
> the filesystem next tries to reference that metadata that we notice
> there is an issue.

Understood. Thanks.

>
> Cheers,
>
> Dave.
>

--
Thanks,
Ruan Shiyang.
On Tue, Apr 28, 2020 at 05:32:41PM +0800, Ruan Shiyang wrote: > On 2020/4/28 下午2:43, Dave Chinner wrote: > > On Tue, Apr 28, 2020 at 06:09:47AM +0000, Ruan, Shiyang wrote: > > > 在 2020/4/27 20:28:36, "Matthew Wilcox" <willy@infradead.org> 写道: > > > > On Mon, Apr 27, 2020 at 04:47:42PM +0800, Shiyang Ruan wrote: > > > > > This patchset is a try to resolve the shared 'page cache' problem for > > > > > fsdax. > > > > > > > > > > In order to track multiple mappings and indexes on one page, I > > > > > introduced a dax-rmap rb-tree to manage the relationship. A dax entry > > > > > will be associated more than once if is shared. At the second time we > > > > > associate this entry, we create this rb-tree and store its root in > > > > > page->private(not used in fsdax). Insert (->mapping, ->index) when > > > > > dax_associate_entry() and delete it when dax_disassociate_entry(). > > > > > > > > Do we really want to track all of this on a per-page basis? I would > > > > have thought a per-extent basis was more useful. Essentially, create > > > > a new address_space for each shared extent. Per page just seems like > > > > a huge overhead. > > > > > > > Per-extent tracking is a nice idea for me. I haven't thought of it > > > yet... > > > > > > But the extent info is maintained by filesystem. I think we need a way > > > to obtain this info from FS when associating a page. May be a bit > > > complicated. Let me think about it... > > > > That's why I want the -user of this association- to do a filesystem > > callout instead of keeping it's own naive tracking infrastructure. > > The filesystem can do an efficient, on-demand reverse mapping lookup > > from it's own extent tracking infrastructure, and there's zero > > runtime overhead when there are no errors present. > > > > At the moment, this "dax association" is used to "report" a storage > > media error directly to userspace. I say "report" because what it > > does is kill userspace processes dead. 
The storage media error > > actually needs to be reported to the owner of the storage media, > > which in the case of FS-DAX is the filesytem. > > Understood. > > BTW, this is the usage in memory-failure, so what about rmap? I have not > found how to use this tracking in rmap. Do you have any ideas? > > > > > That way the filesystem can then look up all the owners of that bad > > media range (i.e. the filesystem block it corresponds to) and take > > appropriate action. e.g. > > I tried writing a function to look up all the owners' info of one block in > xfs for memory-failure use. It was dropped in this patchset because I found > out that this lookup function needs 'rmapbt' to be enabled when mkfs. But > by default, rmapbt is disabled. I am not sure if it matters... I'm pretty sure you can't have shared extents on an XFS filesystem if you _don't_ have the rmapbt feature enabled. I mean, that's why it exists.
On Tue, Apr 28, 2020 at 04:16:36AM -0700, Matthew Wilcox wrote: > On Tue, Apr 28, 2020 at 05:32:41PM +0800, Ruan Shiyang wrote: > > On 2020/4/28 下午2:43, Dave Chinner wrote: > > > On Tue, Apr 28, 2020 at 06:09:47AM +0000, Ruan, Shiyang wrote: > > > > 在 2020/4/27 20:28:36, "Matthew Wilcox" <willy@infradead.org> 写道: > > > > > On Mon, Apr 27, 2020 at 04:47:42PM +0800, Shiyang Ruan wrote: > > > > > > This patchset is a try to resolve the shared 'page cache' problem for > > > > > > fsdax. > > > > > > > > > > > > In order to track multiple mappings and indexes on one page, I > > > > > > introduced a dax-rmap rb-tree to manage the relationship. A dax entry > > > > > > will be associated more than once if is shared. At the second time we > > > > > > associate this entry, we create this rb-tree and store its root in > > > > > > page->private(not used in fsdax). Insert (->mapping, ->index) when > > > > > > dax_associate_entry() and delete it when dax_disassociate_entry(). > > > > > > > > > > Do we really want to track all of this on a per-page basis? I would > > > > > have thought a per-extent basis was more useful. Essentially, create > > > > > a new address_space for each shared extent. Per page just seems like > > > > > a huge overhead. > > > > > > > > > Per-extent tracking is a nice idea for me. I haven't thought of it > > > > yet... > > > > > > > > But the extent info is maintained by filesystem. I think we need a way > > > > to obtain this info from FS when associating a page. May be a bit > > > > complicated. Let me think about it... > > > > > > That's why I want the -user of this association- to do a filesystem > > > callout instead of keeping it's own naive tracking infrastructure. > > > The filesystem can do an efficient, on-demand reverse mapping lookup > > > from it's own extent tracking infrastructure, and there's zero > > > runtime overhead when there are no errors present. 
> > > > > > At the moment, this "dax association" is used to "report" a storage > > > media error directly to userspace. I say "report" because what it > > > does is kill userspace processes dead. The storage media error > > > actually needs to be reported to the owner of the storage media, > > > which in the case of FS-DAX is the filesytem. > > > > Understood. > > > > BTW, this is the usage in memory-failure, so what about rmap? I have not > > found how to use this tracking in rmap. Do you have any ideas? > > > > > > > > That way the filesystem can then look up all the owners of that bad > > > media range (i.e. the filesystem block it corresponds to) and take > > > appropriate action. e.g. > > > > I tried writing a function to look up all the owners' info of one block in > > xfs for memory-failure use. It was dropped in this patchset because I found > > out that this lookup function needs 'rmapbt' to be enabled when mkfs. But > > by default, rmapbt is disabled. I am not sure if it matters... > > I'm pretty sure you can't have shared extents on an XFS filesystem if you > _don't_ have the rmapbt feature enabled. I mean, that's why it exists. You're confusing reflink with rmap. :) rmapbt does all the reverse mapping tracking, reflink just does the shared data extent tracking. But given that anyone who wants to use DAX with reflink is going to have to mkfs their filesystem anyway (to turn on reflink) requiring that rmapbt is also turned on is not a big deal. Especially as we can check it at mount time in the kernel... Cheers, Dave.
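[Editor's note: both features Dave refers to are mkfs-time options in
xfsprogs. A reflink+rmapbt filesystem would be created along these
lines; the device and mount point are illustrative.]

```shell
# Create an XFS filesystem with shared data extent tracking (reflink)
# and the reverse mapping btree (rmapbt) enabled.
mkfs.xfs -m reflink=1,rmapbt=1 /dev/pmem0
mount -o dax /dev/pmem0 /mnt

# Confirm both features are present on the mounted filesystem.
xfs_info /mnt | grep -E 'reflink=1|rmapbt=1'
```

The mount-time check Dave mentions would be the kernel verifying the
rmapbt feature bit in the superblock before allowing DAX on a
reflink-enabled filesystem.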
On Tue, Apr 28, 2020 at 09:24:41PM +1000, Dave Chinner wrote: > On Tue, Apr 28, 2020 at 04:16:36AM -0700, Matthew Wilcox wrote: > > On Tue, Apr 28, 2020 at 05:32:41PM +0800, Ruan Shiyang wrote: > > > On 2020/4/28 下午2:43, Dave Chinner wrote: > > > > On Tue, Apr 28, 2020 at 06:09:47AM +0000, Ruan, Shiyang wrote: > > > > > 在 2020/4/27 20:28:36, "Matthew Wilcox" <willy@infradead.org> 写道: > > > > > > On Mon, Apr 27, 2020 at 04:47:42PM +0800, Shiyang Ruan wrote: > > > > > > > This patchset is a try to resolve the shared 'page cache' problem for > > > > > > > fsdax. > > > > > > > > > > > > > > In order to track multiple mappings and indexes on one page, I > > > > > > > introduced a dax-rmap rb-tree to manage the relationship. A dax entry > > > > > > > will be associated more than once if is shared. At the second time we > > > > > > > associate this entry, we create this rb-tree and store its root in > > > > > > > page->private(not used in fsdax). Insert (->mapping, ->index) when > > > > > > > dax_associate_entry() and delete it when dax_disassociate_entry(). > > > > > > > > > > > > Do we really want to track all of this on a per-page basis? I would > > > > > > have thought a per-extent basis was more useful. Essentially, create > > > > > > a new address_space for each shared extent. Per page just seems like > > > > > > a huge overhead. > > > > > > > > > > > Per-extent tracking is a nice idea for me. I haven't thought of it > > > > > yet... > > > > > > > > > > But the extent info is maintained by filesystem. I think we need a way > > > > > to obtain this info from FS when associating a page. May be a bit > > > > > complicated. Let me think about it... > > > > > > > > That's why I want the -user of this association- to do a filesystem > > > > callout instead of keeping it's own naive tracking infrastructure. 
> > > > The filesystem can do an efficient, on-demand reverse mapping lookup > > > > from it's own extent tracking infrastructure, and there's zero > > > > runtime overhead when there are no errors present. > > > > > > > > At the moment, this "dax association" is used to "report" a storage > > > > media error directly to userspace. I say "report" because what it > > > > does is kill userspace processes dead. The storage media error > > > > actually needs to be reported to the owner of the storage media, > > > > which in the case of FS-DAX is the filesytem. > > > > > > Understood. > > > > > > BTW, this is the usage in memory-failure, so what about rmap? I have not > > > found how to use this tracking in rmap. Do you have any ideas? > > > > > > > > > > > That way the filesystem can then look up all the owners of that bad > > > > media range (i.e. the filesystem block it corresponds to) and take > > > > appropriate action. e.g. > > > > > > I tried writing a function to look up all the owners' info of one block in > > > xfs for memory-failure use. It was dropped in this patchset because I found > > > out that this lookup function needs 'rmapbt' to be enabled when mkfs. But > > > by default, rmapbt is disabled. I am not sure if it matters... > > > > I'm pretty sure you can't have shared extents on an XFS filesystem if you > > _don't_ have the rmapbt feature enabled. I mean, that's why it exists. > > You're confusing reflink with rmap. :) > > rmapbt does all the reverse mapping tracking, reflink just does the > shared data extent tracking. > > But given that anyone who wants to use DAX with reflink is going to > have to mkfs their filesystem anyway (to turn on reflink) requiring > that rmapbt is also turned on is not a big deal. Especially as we > can check it at mount time in the kernel... Are we going to turn on rmap by default? The last I checked, it did have a 10-20% performance cost on extreme metadata-heavy workloads. 
Or do we only enable it by default if mkfs detects a pmem device? (Admittedly, most people do not run fsx as a productivity app; the normal hit is usually 3-5% which might not be such a big deal since you also get (half of) online fsck. :P) --D > Cheers, > > Dave. > -- > Dave Chinner > david@fromorbit.com
On Tue, Apr 28, 2020 at 08:37:32AM -0700, Darrick J. Wong wrote: > On Tue, Apr 28, 2020 at 09:24:41PM +1000, Dave Chinner wrote: > > On Tue, Apr 28, 2020 at 04:16:36AM -0700, Matthew Wilcox wrote: > > > On Tue, Apr 28, 2020 at 05:32:41PM +0800, Ruan Shiyang wrote: > > > > On 2020/4/28 下午2:43, Dave Chinner wrote: > > > > > On Tue, Apr 28, 2020 at 06:09:47AM +0000, Ruan, Shiyang wrote: > > > > > > 在 2020/4/27 20:28:36, "Matthew Wilcox" <willy@infradead.org> 写道: > > > > > > > On Mon, Apr 27, 2020 at 04:47:42PM +0800, Shiyang Ruan wrote: > > > > > > > > This patchset is a try to resolve the shared 'page cache' problem for > > > > > > > > fsdax. > > > > > > > > > > > > > > > > In order to track multiple mappings and indexes on one page, I > > > > > > > > introduced a dax-rmap rb-tree to manage the relationship. A dax entry > > > > > > > > will be associated more than once if is shared. At the second time we > > > > > > > > associate this entry, we create this rb-tree and store its root in > > > > > > > > page->private(not used in fsdax). Insert (->mapping, ->index) when > > > > > > > > dax_associate_entry() and delete it when dax_disassociate_entry(). > > > > > > > > > > > > > > Do we really want to track all of this on a per-page basis? I would > > > > > > > have thought a per-extent basis was more useful. Essentially, create > > > > > > > a new address_space for each shared extent. Per page just seems like > > > > > > > a huge overhead. > > > > > > > > > > > > > Per-extent tracking is a nice idea for me. I haven't thought of it > > > > > > yet... > > > > > > > > > > > > But the extent info is maintained by filesystem. I think we need a way > > > > > > to obtain this info from FS when associating a page. May be a bit > > > > > > complicated. Let me think about it... > > > > > > > > > > That's why I want the -user of this association- to do a filesystem > > > > > callout instead of keeping it's own naive tracking infrastructure. 
> > > > > The filesystem can do an efficient, on-demand reverse mapping lookup > > > > > from it's own extent tracking infrastructure, and there's zero > > > > > runtime overhead when there are no errors present. > > > > > > > > > > At the moment, this "dax association" is used to "report" a storage > > > > > media error directly to userspace. I say "report" because what it > > > > > does is kill userspace processes dead. The storage media error > > > > > actually needs to be reported to the owner of the storage media, > > > > > which in the case of FS-DAX is the filesytem. > > > > > > > > Understood. > > > > > > > > BTW, this is the usage in memory-failure, so what about rmap? I have not > > > > found how to use this tracking in rmap. Do you have any ideas? > > > > > > > > > > > > > > That way the filesystem can then look up all the owners of that bad > > > > > media range (i.e. the filesystem block it corresponds to) and take > > > > > appropriate action. e.g. > > > > > > > > I tried writing a function to look up all the owners' info of one block in > > > > xfs for memory-failure use. It was dropped in this patchset because I found > > > > out that this lookup function needs 'rmapbt' to be enabled when mkfs. But > > > > by default, rmapbt is disabled. I am not sure if it matters... > > > > > > I'm pretty sure you can't have shared extents on an XFS filesystem if you > > > _don't_ have the rmapbt feature enabled. I mean, that's why it exists. > > > > You're confusing reflink with rmap. :) > > > > rmapbt does all the reverse mapping tracking, reflink just does the > > shared data extent tracking. > > > > But given that anyone who wants to use DAX with reflink is going to > > have to mkfs their filesystem anyway (to turn on reflink) requiring > > that rmapbt is also turned on is not a big deal. Especially as we > > can check it at mount time in the kernel... > > Are we going to turn on rmap by default? 
> The last I checked, it did have a 10-20% performance cost on extreme
> metadata-heavy workloads. Or do we only enable it by default if mkfs
> detects a pmem device?

Just have the kernel refuse to mount a reflink enabled filesystem on
a DAX capable device unless -o dax=never or rmapbt is enabled.
That'll get the message across pretty quickly....

> (Admittedly, most people do not run fsx as a productivity app; the
> normal hit is usually 3-5% which might not be such a big deal since
> you also get (half of) online fsck. :P)

I have not noticed the overhead at all on any of my production
machines since I enabled it on all of them way back when....

And, really, pmem is a _very poor choice_ for metadata intensive
applications on XFS as pmem is completely synchronous. XFS has an
async IO model for its metadata that *must* be buffered (so no DAX!)
and the synchronous nature of pmem completely defeats the
architectural IO pipelining XFS uses to allow thousands of concurrent
metadata IOs in flight. OTOH, pmem IO depth is limited to the number
of CPUs that are concurrently issuing IO, so it really, really sucks
compared to a handful of high end nvme SSDs on PCIe 4.0....

So with that in mind, I see little reason to care about the small
additional overhead of rmapbt on FS-DAX installations that require
reflink...

Cheers,

Dave.
On 2020/4/28 2:43 PM, Dave Chinner wrote:
> On Tue, Apr 28, 2020 at 06:09:47AM +0000, Ruan, Shiyang wrote:
>>
>> On 2020/4/27 20:28:36, "Matthew Wilcox" <willy@infradead.org> wrote:
>>
>>> On Mon, Apr 27, 2020 at 04:47:42PM +0800, Shiyang Ruan wrote:
>>>> This patchset is a try to resolve the shared 'page cache' problem for
>>>> fsdax.
>>>>
>>>> In order to track multiple mappings and indexes on one page, I
>>>> introduced a dax-rmap rb-tree to manage the relationship. A dax entry
>>>> will be associated more than once if is shared. At the second time we
>>>> associate this entry, we create this rb-tree and store its root in
>>>> page->private(not used in fsdax). Insert (->mapping, ->index) when
>>>> dax_associate_entry() and delete it when dax_disassociate_entry().
>>>
>>> Do we really want to track all of this on a per-page basis? I would
>>> have thought a per-extent basis was more useful. Essentially, create
>>> a new address_space for each shared extent. Per page just seems like
>>> a huge overhead.
>>>
>> Per-extent tracking is a nice idea for me. I haven't thought of it
>> yet...
>>
>> But the extent info is maintained by filesystem. I think we need a way
>> to obtain this info from FS when associating a page. May be a bit
>> complicated. Let me think about it...
>
> That's why I want the -user of this association- to do a filesystem
> callout instead of keeping it's own naive tracking infrastructure.
> The filesystem can do an efficient, on-demand reverse mapping lookup
> from it's own extent tracking infrastructure, and there's zero
> runtime overhead when there are no errors present.

Hi Dave,

I ran into some difficulties when trying to implement the per-extent
rmap tracking. So, I re-read your comments and found that I was
misunderstanding what you described here.

I think what you mean is: we don't need the in-memory dax-rmap
tracking now. Just ask the FS for the owner's information associated
with a page when a memory failure occurs. So, the per-page (even
per-extent) dax-rmap is needless in this case. Is this right?

Based on this, we only need to store the extent information of a fsdax
page in its ->mapping (by searching from FS). Then obtain the owners
of this page (also by searching from FS) when memory-failure or
another rmap case occurs.

So, a fsdax page is no longer associated with a specific file, but
with a FS (or the pmem device). I think it's easier to understand and
implement.

--
Thanks,
Ruan Shiyang.

>
> At the moment, this "dax association" is used to "report" a storage
> media error directly to userspace. I say "report" because what it
> does is kill userspace processes dead. The storage media error
> actually needs to be reported to the owner of the storage media,
> which in the case of FS-DAX is the filesytem.
>
> That way the filesystem can then look up all the owners of that bad
> media range (i.e. the filesystem block it corresponds to) and take
> appropriate action. e.g.
>
> - if it falls in filesytem metadata, shutdown the filesystem
> - if it falls in user data, call the "kill userspace dead" routines
>   for each mapping/index tuple the filesystem finds for the given
>   LBA address that the media error occurred.
>
> Right now if the media error is in filesystem metadata, the
> filesystem isn't even told about it. The filesystem can't even shut
> down - the error is just dropped on the floor and it won't be until
> the filesystem next tries to reference that metadata that we notice
> there is an issue.
>
> Cheers,
>
> Dave.
>
On Thu, Jun 04, 2020 at 03:37:42PM +0800, Ruan Shiyang wrote: > > > On 2020/4/28 下午2:43, Dave Chinner wrote: > > On Tue, Apr 28, 2020 at 06:09:47AM +0000, Ruan, Shiyang wrote: > > > > > > 在 2020/4/27 20:28:36, "Matthew Wilcox" <willy@infradead.org> 写道: > > > > > > > On Mon, Apr 27, 2020 at 04:47:42PM +0800, Shiyang Ruan wrote: > > > > > This patchset is a try to resolve the shared 'page cache' problem for > > > > > fsdax. > > > > > > > > > > In order to track multiple mappings and indexes on one page, I > > > > > introduced a dax-rmap rb-tree to manage the relationship. A dax entry > > > > > will be associated more than once if is shared. At the second time we > > > > > associate this entry, we create this rb-tree and store its root in > > > > > page->private(not used in fsdax). Insert (->mapping, ->index) when > > > > > dax_associate_entry() and delete it when dax_disassociate_entry(). > > > > > > > > Do we really want to track all of this on a per-page basis? I would > > > > have thought a per-extent basis was more useful. Essentially, create > > > > a new address_space for each shared extent. Per page just seems like > > > > a huge overhead. > > > > > > > Per-extent tracking is a nice idea for me. I haven't thought of it > > > yet... > > > > > > But the extent info is maintained by filesystem. I think we need a way > > > to obtain this info from FS when associating a page. May be a bit > > > complicated. Let me think about it... > > > > That's why I want the -user of this association- to do a filesystem > > callout instead of keeping it's own naive tracking infrastructure. > > The filesystem can do an efficient, on-demand reverse mapping lookup > > from it's own extent tracking infrastructure, and there's zero > > runtime overhead when there are no errors present. > > Hi Dave, > > I ran into some difficulties when trying to implement the per-extent rmap > tracking. So, I re-read your comments and found that I was misunderstanding > what you described here. 
> > I think what you mean is: we don't need the in-memory dax-rmap tracking now. > Just ask the FS for the owner's information that associate with one page > when memory-failure. So, the per-page (even per-extent) dax-rmap is > needless in this case. Is this right? Right. XFS already has its own rmap tree. > Based on this, we only need to store the extent information of a fsdax page > in its ->mapping (by searching from FS). Then obtain the owners of this > page (also by searching from FS) when memory-failure or other rmap case > occurs. I don't even think you need that much. All you need is the "physical" offset of that page within the pmem device (e.g. 'this is the 307th 4k page == offset 1257472 since the start of /dev/pmem0') and xfs can look up the owner of that range of physical storage and deal with it as needed. > So, a fsdax page is no longer associated with a specific file, but with a > FS(or the pmem device). I think it's easier to understand and implement. Yes. I also suspect this will be necessary to support reflink... --D > > -- > Thanks, > Ruan Shiyang. > > > > At the moment, this "dax association" is used to "report" a storage > > media error directly to userspace. I say "report" because what it > > does is kill userspace processes dead. The storage media error > > actually needs to be reported to the owner of the storage media, > > which in the case of FS-DAX is the filesytem. > > > > That way the filesystem can then look up all the owners of that bad > > media range (i.e. the filesystem block it corresponds to) and take > > appropriate action. e.g. > > > > - if it falls in filesytem metadata, shutdown the filesystem > > - if it falls in user data, call the "kill userspace dead" routines > > for each mapping/index tuple the filesystem finds for the given > > LBA address that the media error occurred. > > > > Right now if the media error is in filesystem metadata, the > > filesystem isn't even told about it. 
The filesystem can't even shut > > down - the error is just dropped on the floor and it won't be until > > the filesystem next tries to reference that metadata that we notice > > there is an issue. > > > > Cheers, > > > > Dave. > > > >
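[Editor's note: Darrick's "307th 4k page == offset 1257472" above is
plain page-index-to-byte-offset arithmetic relative to the start of the
pmem device. A minimal version, with hypothetical parameter names:]

```c
#include <assert.h>
#include <stdint.h>

#define PAGE_SHIFT 12  /* 4 KiB pages, as in the example above */

/* Byte offset of a page within the pmem device, given the page's pfn
 * and the pfn at which the device's memory begins. This single number
 * is all the filesystem needs to run an rmap lookup for the bad page. */
static uint64_t page_dev_offset(uint64_t pfn, uint64_t dev_start_pfn)
{
	return (pfn - dev_start_pfn) << PAGE_SHIFT;
}
```

This is why no per-page association is needed: the offset is derivable
from information the failure path already has in hand.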
On Thu, Jun 04, 2020 at 07:51:07AM -0700, Darrick J. Wong wrote: > On Thu, Jun 04, 2020 at 03:37:42PM +0800, Ruan Shiyang wrote: > > > > > > On 2020/4/28 下午2:43, Dave Chinner wrote: > > > On Tue, Apr 28, 2020 at 06:09:47AM +0000, Ruan, Shiyang wrote: > > > > > > > > 在 2020/4/27 20:28:36, "Matthew Wilcox" <willy@infradead.org> 写道: > > > > > > > > > On Mon, Apr 27, 2020 at 04:47:42PM +0800, Shiyang Ruan wrote: > > > > > > This patchset is a try to resolve the shared 'page cache' problem for > > > > > > fsdax. > > > > > > > > > > > > In order to track multiple mappings and indexes on one page, I > > > > > > introduced a dax-rmap rb-tree to manage the relationship. A dax entry > > > > > > will be associated more than once if is shared. At the second time we > > > > > > associate this entry, we create this rb-tree and store its root in > > > > > > page->private(not used in fsdax). Insert (->mapping, ->index) when > > > > > > dax_associate_entry() and delete it when dax_disassociate_entry(). > > > > > > > > > > Do we really want to track all of this on a per-page basis? I would > > > > > have thought a per-extent basis was more useful. Essentially, create > > > > > a new address_space for each shared extent. Per page just seems like > > > > > a huge overhead. > > > > > > > > > Per-extent tracking is a nice idea for me. I haven't thought of it > > > > yet... > > > > > > > > But the extent info is maintained by filesystem. I think we need a way > > > > to obtain this info from FS when associating a page. May be a bit > > > > complicated. Let me think about it... > > > > > > That's why I want the -user of this association- to do a filesystem > > > callout instead of keeping it's own naive tracking infrastructure. > > > The filesystem can do an efficient, on-demand reverse mapping lookup > > > from it's own extent tracking infrastructure, and there's zero > > > runtime overhead when there are no errors present. 
> > > > Hi Dave, > > > > I ran into some difficulties when trying to implement the per-extent rmap > > tracking. So, I re-read your comments and found that I was misunderstanding > > what you described here. > > > > I think what you mean is: we don't need the in-memory dax-rmap tracking now. > > Just ask the FS for the owner's information that associate with one page > > when memory-failure. So, the per-page (even per-extent) dax-rmap is > > needless in this case. Is this right? > > Right. XFS already has its own rmap tree. *nod* > > Based on this, we only need to store the extent information of a fsdax page > > in its ->mapping (by searching from FS). Then obtain the owners of this > > page (also by searching from FS) when memory-failure or other rmap case > > occurs. > > I don't even think you need that much. All you need is the "physical" > offset of that page within the pmem device (e.g. 'this is the 307th 4k > page == offset 1257472 since the start of /dev/pmem0') and xfs can look > up the owner of that range of physical storage and deal with it as > needed. Right. If we have the dax device associated with the page that had the failure, then we can determine the offset of the page into the block device address space and that's all we need to find the owner of the page in the filesystem. Note that there may actually be no owner - the page that had the fault might land in free space, in which case we can simply zero the page and clear the error. > > So, a fsdax page is no longer associated with a specific file, but with a > > FS(or the pmem device). I think it's easier to understand and implement. Effectively, yes. But we shouldn't need to actually associate the page with anything at the filesystem level because it is already associated with a DAX device at a lower level via a dev_pagemap. 
The hardware page fault already runs through this code, memory_failure_dev_pagemap(), before it gets to the DAX code, so really all we need to do is have that function pass us the page, the offset into the device and, say, the struct dax_device associated with that page, so we can get to the filesystem superblock we can then use for rmap lookups on... Cheers, Dave.
On 2020/6/4 下午10:51, Darrick J. Wong wrote: > On Thu, Jun 04, 2020 at 03:37:42PM +0800, Ruan Shiyang wrote: >> >> >> On 2020/4/28 下午2:43, Dave Chinner wrote: >>> On Tue, Apr 28, 2020 at 06:09:47AM +0000, Ruan, Shiyang wrote: >>>> >>>> 在 2020/4/27 20:28:36, "Matthew Wilcox" <willy@infradead.org> 写道: >>>> >>>>> On Mon, Apr 27, 2020 at 04:47:42PM +0800, Shiyang Ruan wrote: >>>>>> This patchset is a try to resolve the shared 'page cache' problem for >>>>>> fsdax. >>>>>> >>>>>> In order to track multiple mappings and indexes on one page, I >>>>>> introduced a dax-rmap rb-tree to manage the relationship. A dax entry >>>>>> will be associated more than once if is shared. At the second time we >>>>>> associate this entry, we create this rb-tree and store its root in >>>>>> page->private(not used in fsdax). Insert (->mapping, ->index) when >>>>>> dax_associate_entry() and delete it when dax_disassociate_entry(). >>>>> >>>>> Do we really want to track all of this on a per-page basis? I would >>>>> have thought a per-extent basis was more useful. Essentially, create >>>>> a new address_space for each shared extent. Per page just seems like >>>>> a huge overhead. >>>>> >>>> Per-extent tracking is a nice idea for me. I haven't thought of it >>>> yet... >>>> >>>> But the extent info is maintained by filesystem. I think we need a way >>>> to obtain this info from FS when associating a page. May be a bit >>>> complicated. Let me think about it... >>> >>> That's why I want the -user of this association- to do a filesystem >>> callout instead of keeping it's own naive tracking infrastructure. >>> The filesystem can do an efficient, on-demand reverse mapping lookup >>> from it's own extent tracking infrastructure, and there's zero >>> runtime overhead when there are no errors present. >> >> Hi Dave, >> >> I ran into some difficulties when trying to implement the per-extent rmap >> tracking. So, I re-read your comments and found that I was misunderstanding >> what you described here. 
>> >> I think what you mean is: we don't need the in-memory dax-rmap tracking now. >> Just ask the FS for the owner's information that associate with one page >> when memory-failure. So, the per-page (even per-extent) dax-rmap is >> needless in this case. Is this right? > > Right. XFS already has its own rmap tree. > >> Based on this, we only need to store the extent information of a fsdax page >> in its ->mapping (by searching from FS). Then obtain the owners of this >> page (also by searching from FS) when memory-failure or other rmap case >> occurs. > > I don't even think you need that much. All you need is the "physical" > offset of that page within the pmem device (e.g. 'this is the 307th 4k > page == offset 1257472 since the start of /dev/pmem0') and xfs can look > up the owner of that range of physical storage and deal with it as > needed. Yes, I think so. > >> So, a fsdax page is no longer associated with a specific file, but with a >> FS(or the pmem device). I think it's easier to understand and implement. > > Yes. I also suspect this will be necessary to support reflink... > > --D OK, Thank you very much. -- Thanks, Ruan Shiyang. > >> >> -- >> Thanks, >> Ruan Shiyang. >>> >>> At the moment, this "dax association" is used to "report" a storage >>> media error directly to userspace. I say "report" because what it >>> does is kill userspace processes dead. The storage media error >>> actually needs to be reported to the owner of the storage media, >>> which in the case of FS-DAX is the filesytem. >>> >>> That way the filesystem can then look up all the owners of that bad >>> media range (i.e. the filesystem block it corresponds to) and take >>> appropriate action. e.g. >>> >>> - if it falls in filesytem metadata, shutdown the filesystem >>> - if it falls in user data, call the "kill userspace dead" routines >>> for each mapping/index tuple the filesystem finds for the given >>> LBA address that the media error occurred. 
>>> >>> Right now if the media error is in filesystem metadata, the >>> filesystem isn't even told about it. The filesystem can't even shut >>> down - the error is just dropped on the floor and it won't be until >>> the filesystem next tries to reference that metadata that we notice >>> there is an issue. >>> >>> Cheers, >>> >>> Dave. >>> >> >> > >
On 2020/6/5 上午9:30, Dave Chinner wrote: > On Thu, Jun 04, 2020 at 07:51:07AM -0700, Darrick J. Wong wrote: >> On Thu, Jun 04, 2020 at 03:37:42PM +0800, Ruan Shiyang wrote: >>> >>> >>> On 2020/4/28 下午2:43, Dave Chinner wrote: >>>> On Tue, Apr 28, 2020 at 06:09:47AM +0000, Ruan, Shiyang wrote: >>>>> >>>>> 在 2020/4/27 20:28:36, "Matthew Wilcox" <willy@infradead.org> 写道: >>>>> >>>>>> On Mon, Apr 27, 2020 at 04:47:42PM +0800, Shiyang Ruan wrote: >>>>>>> This patchset is a try to resolve the shared 'page cache' problem for >>>>>>> fsdax. >>>>>>> >>>>>>> In order to track multiple mappings and indexes on one page, I >>>>>>> introduced a dax-rmap rb-tree to manage the relationship. A dax entry >>>>>>> will be associated more than once if is shared. At the second time we >>>>>>> associate this entry, we create this rb-tree and store its root in >>>>>>> page->private(not used in fsdax). Insert (->mapping, ->index) when >>>>>>> dax_associate_entry() and delete it when dax_disassociate_entry(). >>>>>> >>>>>> Do we really want to track all of this on a per-page basis? I would >>>>>> have thought a per-extent basis was more useful. Essentially, create >>>>>> a new address_space for each shared extent. Per page just seems like >>>>>> a huge overhead. >>>>>> >>>>> Per-extent tracking is a nice idea for me. I haven't thought of it >>>>> yet... >>>>> >>>>> But the extent info is maintained by filesystem. I think we need a way >>>>> to obtain this info from FS when associating a page. May be a bit >>>>> complicated. Let me think about it... >>>> >>>> That's why I want the -user of this association- to do a filesystem >>>> callout instead of keeping it's own naive tracking infrastructure. >>>> The filesystem can do an efficient, on-demand reverse mapping lookup >>>> from it's own extent tracking infrastructure, and there's zero >>>> runtime overhead when there are no errors present. 
>>> >>> Hi Dave, >>> >>> I ran into some difficulties when trying to implement the per-extent rmap >>> tracking. So, I re-read your comments and found that I was misunderstanding >>> what you described here. >>> >>> I think what you mean is: we don't need the in-memory dax-rmap tracking now. >>> Just ask the FS for the owner's information that associate with one page >>> when memory-failure. So, the per-page (even per-extent) dax-rmap is >>> needless in this case. Is this right? >> >> Right. XFS already has its own rmap tree. > > *nod* > >>> Based on this, we only need to store the extent information of a fsdax page >>> in its ->mapping (by searching from FS). Then obtain the owners of this >>> page (also by searching from FS) when memory-failure or other rmap case >>> occurs. >> >> I don't even think you need that much. All you need is the "physical" >> offset of that page within the pmem device (e.g. 'this is the 307th 4k >> page == offset 1257472 since the start of /dev/pmem0') and xfs can look >> up the owner of that range of physical storage and deal with it as >> needed. > > Right. If we have the dax device associated with the page that had > the failure, then we can determine the offset of the page into the > block device address space and that's all we need to find the owner > of the page in the filesystem. > > Note that there may actually be no owner - the page that had the > fault might land in free space, in which case we can simply zero > the page and clear the error. OK. Thanks for pointing out. > >>> So, a fsdax page is no longer associated with a specific file, but with a >>> FS(or the pmem device). I think it's easier to understand and implement. > > Effectively, yes. But we shouldn't need to actually associate the > page with anything at the filesystem level because it is already > associated with a DAX device at a lower level via a dev_pagemap. 
> The hardware page fault already runs through this code, > memory_failure_dev_pagemap(), before it gets to the DAX code, so > really all we need to do is have that function pass us the page, the offset > into the device and, say, the struct dax_device associated with that > page, so we can get to the filesystem superblock we can then use for > rmap lookups on... > OK. I was just thinking about how I can run the FS rmap search from the memory-failure path. Thanks again for pointing this out. :) -- Thanks, Ruan Shiyang. > Cheers, > > Dave. >