[05/11] filesystem-dax: set page->index

Message ID	152699999778.24093.18007971664703285330.stgit@dwillia2-desk3.amr.corp.intel.com (mailing list archive)
State	New, archived
Headers	show Return-Path: <owner-linux-mm@kvack.org> Received-SPF: pass (google.com: domain of dan.j.williams@intel.com designates 134.134.136.100 as permitted sender) client-ip=134.134.136.100; Subject: [PATCH 05/11] filesystem-dax: set page->index From: Dan Williams <dan.j.williams@intel.com> To: linux-nvdimm@lists.01.org Cc: Jan Kara <jack@suse.cz>, Christoph Hellwig <hch@lst.de>, Matthew Wilcox <mawilcox@microsoft.com>, Ross Zwisler <ross.zwisler@linux.intel.com>, linux-mm@kvack.org, linux-fsdevel@vger.kernel.org, tony.luck@intel.com Date: Tue, 22 May 2018 07:39:57 -0700 Message-ID: <152699999778.24093.18007971664703285330.stgit@dwillia2-desk3.amr.corp.intel.com> In-Reply-To: <152699997165.24093.12194490924829406111.stgit@dwillia2-desk3.amr.corp.intel.com> References: <152699997165.24093.12194490924829406111.stgit@dwillia2-desk3.amr.corp.intel.com> User-Agent: StGit/0.18-2-gc94f MIME-Version: 1.0 Content-Type: text/plain; charset="utf-8" Content-Transfer-Encoding: 7bit Sender: owner-linux-mm@kvack.org Precedence: bulk

Dan Williams May 22, 2018, 2:39 p.m. UTC

In support of enabling memory_failure() handling for filesystem-dax
mappings, set ->index to the pgoff of the page. The rmap implementation
requires ->index to bound the search through the vma interval tree. The
index is set and cleared at dax_associate_entry() and
dax_disassociate_entry() time respectively.

Cc: Jan Kara <jack@suse.cz>
Cc: Christoph Hellwig <hch@lst.de>
Cc: Matthew Wilcox <mawilcox@microsoft.com>
Cc: Ross Zwisler <ross.zwisler@linux.intel.com>
Signed-off-by: Dan Williams <dan.j.williams@intel.com>
---
 fs/dax.c |   11 ++++++++---
 1 file changed, 8 insertions(+), 3 deletions(-)

Jan Kara May 23, 2018, 8:40 a.m. UTC | #1

On Tue 22-05-18 07:39:57, Dan Williams wrote:
> In support of enabling memory_failure() handling for filesystem-dax
> mappings, set ->index to the pgoff of the page. The rmap implementation
> requires ->index to bound the search through the vma interval tree. The
> index is set and cleared at dax_associate_entry() and
> dax_disassociate_entry() time respectively.
> 
> Cc: Jan Kara <jack@suse.cz>
> Cc: Christoph Hellwig <hch@lst.de>
> Cc: Matthew Wilcox <mawilcox@microsoft.com>
> Cc: Ross Zwisler <ross.zwisler@linux.intel.com>
> Signed-off-by: Dan Williams <dan.j.williams@intel.com>
> ---
>  fs/dax.c |   11 ++++++++---
>  1 file changed, 8 insertions(+), 3 deletions(-)
> 
> diff --git a/fs/dax.c b/fs/dax.c
> index aaec72ded1b6..2e4682cd7c69 100644
> --- a/fs/dax.c
> +++ b/fs/dax.c
> @@ -319,18 +319,22 @@ static unsigned long dax_radix_end_pfn(void *entry)
>  	for (pfn = dax_radix_pfn(entry); \
>  			pfn < dax_radix_end_pfn(entry); pfn++)
>  
> -static void dax_associate_entry(void *entry, struct address_space *mapping)
> +static void dax_associate_entry(void *entry, struct address_space *mapping,
> +		struct vm_area_struct *vma, unsigned long address)
>  {
> -	unsigned long pfn;
> +	unsigned long size = dax_entry_size(entry), pfn, index;
> +	int i = 0;
>  
>  	if (IS_ENABLED(CONFIG_FS_DAX_LIMITED))
>  		return;
>  
> +	index = linear_page_index(vma, address & ~(size - 1));
>  	for_each_mapped_pfn(entry, pfn) {
>  		struct page *page = pfn_to_page(pfn);
>  
>  		WARN_ON_ONCE(page->mapping);
>  		page->mapping = mapping;
> +		page->index = index + i++;
>  	}
>  }

Hum, this just made me think: How is this going to work with XFS reflink?
In fact is not the page->mapping association already broken by XFS reflink?
Because with reflink we can have two or more mappings pointing to the same
physical blocks (i.e., pages in DAX case)...

								Honza

Dan Williams May 30, 2018, 1:38 a.m. UTC | #2

On Wed, May 23, 2018 at 1:40 AM, Jan Kara <jack@suse.cz> wrote:
> On Tue 22-05-18 07:39:57, Dan Williams wrote:
>> In support of enabling memory_failure() handling for filesystem-dax
>> mappings, set ->index to the pgoff of the page. The rmap implementation
>> requires ->index to bound the search through the vma interval tree. The
>> index is set and cleared at dax_associate_entry() and
>> dax_disassociate_entry() time respectively.
>>
>> Cc: Jan Kara <jack@suse.cz>
>> Cc: Christoph Hellwig <hch@lst.de>
>> Cc: Matthew Wilcox <mawilcox@microsoft.com>
>> Cc: Ross Zwisler <ross.zwisler@linux.intel.com>
>> Signed-off-by: Dan Williams <dan.j.williams@intel.com>
>> ---
>>  fs/dax.c |   11 ++++++++---
>>  1 file changed, 8 insertions(+), 3 deletions(-)
>>
>> diff --git a/fs/dax.c b/fs/dax.c
>> index aaec72ded1b6..2e4682cd7c69 100644
>> --- a/fs/dax.c
>> +++ b/fs/dax.c
>> @@ -319,18 +319,22 @@ static unsigned long dax_radix_end_pfn(void *entry)
>>       for (pfn = dax_radix_pfn(entry); \
>>                       pfn < dax_radix_end_pfn(entry); pfn++)
>>
>> -static void dax_associate_entry(void *entry, struct address_space *mapping)
>> +static void dax_associate_entry(void *entry, struct address_space *mapping,
>> +             struct vm_area_struct *vma, unsigned long address)
>>  {
>> -     unsigned long pfn;
>> +     unsigned long size = dax_entry_size(entry), pfn, index;
>> +     int i = 0;
>>
>>       if (IS_ENABLED(CONFIG_FS_DAX_LIMITED))
>>               return;
>>
>> +     index = linear_page_index(vma, address & ~(size - 1));
>>       for_each_mapped_pfn(entry, pfn) {
>>               struct page *page = pfn_to_page(pfn);
>>
>>               WARN_ON_ONCE(page->mapping);
>>               page->mapping = mapping;
>> +             page->index = index + i++;
>>       }
>>  }
>
> Hum, this just made me think: How is this going to work with XFS reflink?
> In fact is not the page->mapping association already broken by XFS reflink?
> Because with reflink we can have two or more mappings pointing to the same
> physical blocks (i.e., pages in DAX case)...

Good question. I assume we are ok in the non-DAX reflink case because
rmap of failing / poison pages is only relative to the specific page
cache page for a given inode in the reflink. However, DAX would seem
to break this because we only get one shared 'struct page' for all
possible mappings of the physical file block. I think this means for
iterating over the rmap of "where is this page mapped" would require
iterating over the other "sibling" inodes that know about the given
physical file block.

As far as I can see reflink+dax would require teaching kernel code
paths that ->mapping may not be a singular relationship. Something
along the line's of what Jerome was presenting at LSF to create a
special value to indicate, "call back into the filesystem (or the page
owner)" to perform this operation.

In the meantime the kernel crashes when userspace accesses poisoned
pmem via DAX. I assume that reworking rmap for the dax+reflink case
should not block dax poison handling? Yell if you disagree.

Jan Kara May 30, 2018, 8:13 a.m. UTC | #3

On Tue 29-05-18 18:38:41, Dan Williams wrote:
> On Wed, May 23, 2018 at 1:40 AM, Jan Kara <jack@suse.cz> wrote:
> > On Tue 22-05-18 07:39:57, Dan Williams wrote:
> >> In support of enabling memory_failure() handling for filesystem-dax
> >> mappings, set ->index to the pgoff of the page. The rmap implementation
> >> requires ->index to bound the search through the vma interval tree. The
> >> index is set and cleared at dax_associate_entry() and
> >> dax_disassociate_entry() time respectively.
> >>
> >> Cc: Jan Kara <jack@suse.cz>
> >> Cc: Christoph Hellwig <hch@lst.de>
> >> Cc: Matthew Wilcox <mawilcox@microsoft.com>
> >> Cc: Ross Zwisler <ross.zwisler@linux.intel.com>
> >> Signed-off-by: Dan Williams <dan.j.williams@intel.com>
> >> ---
> >>  fs/dax.c |   11 ++++++++---
> >>  1 file changed, 8 insertions(+), 3 deletions(-)
> >>
> >> diff --git a/fs/dax.c b/fs/dax.c
> >> index aaec72ded1b6..2e4682cd7c69 100644
> >> --- a/fs/dax.c
> >> +++ b/fs/dax.c
> >> @@ -319,18 +319,22 @@ static unsigned long dax_radix_end_pfn(void *entry)
> >>       for (pfn = dax_radix_pfn(entry); \
> >>                       pfn < dax_radix_end_pfn(entry); pfn++)
> >>
> >> -static void dax_associate_entry(void *entry, struct address_space *mapping)
> >> +static void dax_associate_entry(void *entry, struct address_space *mapping,
> >> +             struct vm_area_struct *vma, unsigned long address)
> >>  {
> >> -     unsigned long pfn;
> >> +     unsigned long size = dax_entry_size(entry), pfn, index;
> >> +     int i = 0;
> >>
> >>       if (IS_ENABLED(CONFIG_FS_DAX_LIMITED))
> >>               return;
> >>
> >> +     index = linear_page_index(vma, address & ~(size - 1));
> >>       for_each_mapped_pfn(entry, pfn) {
> >>               struct page *page = pfn_to_page(pfn);
> >>
> >>               WARN_ON_ONCE(page->mapping);
> >>               page->mapping = mapping;
> >> +             page->index = index + i++;
> >>       }
> >>  }
> >
> > Hum, this just made me think: How is this going to work with XFS reflink?
> > In fact is not the page->mapping association already broken by XFS reflink?
> > Because with reflink we can have two or more mappings pointing to the same
> > physical blocks (i.e., pages in DAX case)...
> 
> Good question. I assume we are ok in the non-DAX reflink case because
> rmap of failing / poison pages is only relative to the specific page
> cache page for a given inode in the reflink. However, DAX would seem
> to break this because we only get one shared 'struct page' for all
> possible mappings of the physical file block. I think this means for
> iterating over the rmap of "where is this page mapped" would require
> iterating over the other "sibling" inodes that know about the given
> physical file block.
> 
> As far as I can see reflink+dax would require teaching kernel code
> paths that ->mapping may not be a singular relationship. Something
> along the line's of what Jerome was presenting at LSF to create a
> special value to indicate, "call back into the filesystem (or the page
> owner)" to perform this operation.
> 
> In the meantime the kernel crashes when userspace accesses poisoned
> pmem via DAX. I assume that reworking rmap for the dax+reflink case
> should not block dax poison handling? Yell if you disagree.

The thing is, up until get_user_pages() vs truncate series ("fs, dax: use
page->mapping to warn if truncate collides with a busy page" in
particular), DAX was perfectly fine with reflinks since we never needed
page->mapping. Now this series adds even page->index dependency which makes
life for rmap with reflinks even harder. So if nothing else we should at
least make sure reflinked filesystems cannot be mounted with dax mount
option for now and seriously start looking into how to implement rmap with
reflinked files for DAX because this noticeably reduces its usefulness.

								Honza

Dan Williams May 30, 2018, 11:21 p.m. UTC | #4

On Wed, May 30, 2018 at 1:13 AM, Jan Kara <jack@suse.cz> wrote:
> On Tue 29-05-18 18:38:41, Dan Williams wrote:
>> On Wed, May 23, 2018 at 1:40 AM, Jan Kara <jack@suse.cz> wrote:
>> > On Tue 22-05-18 07:39:57, Dan Williams wrote:
>> >> In support of enabling memory_failure() handling for filesystem-dax
>> >> mappings, set ->index to the pgoff of the page. The rmap implementation
>> >> requires ->index to bound the search through the vma interval tree. The
>> >> index is set and cleared at dax_associate_entry() and
>> >> dax_disassociate_entry() time respectively.
>> >>
>> >> Cc: Jan Kara <jack@suse.cz>
>> >> Cc: Christoph Hellwig <hch@lst.de>
>> >> Cc: Matthew Wilcox <mawilcox@microsoft.com>
>> >> Cc: Ross Zwisler <ross.zwisler@linux.intel.com>
>> >> Signed-off-by: Dan Williams <dan.j.williams@intel.com>
>> >> ---
>> >>  fs/dax.c |   11 ++++++++---
>> >>  1 file changed, 8 insertions(+), 3 deletions(-)
>> >>
>> >> diff --git a/fs/dax.c b/fs/dax.c
>> >> index aaec72ded1b6..2e4682cd7c69 100644
>> >> --- a/fs/dax.c
>> >> +++ b/fs/dax.c
>> >> @@ -319,18 +319,22 @@ static unsigned long dax_radix_end_pfn(void *entry)
>> >>       for (pfn = dax_radix_pfn(entry); \
>> >>                       pfn < dax_radix_end_pfn(entry); pfn++)
>> >>
>> >> -static void dax_associate_entry(void *entry, struct address_space *mapping)
>> >> +static void dax_associate_entry(void *entry, struct address_space *mapping,
>> >> +             struct vm_area_struct *vma, unsigned long address)
>> >>  {
>> >> -     unsigned long pfn;
>> >> +     unsigned long size = dax_entry_size(entry), pfn, index;
>> >> +     int i = 0;
>> >>
>> >>       if (IS_ENABLED(CONFIG_FS_DAX_LIMITED))
>> >>               return;
>> >>
>> >> +     index = linear_page_index(vma, address & ~(size - 1));
>> >>       for_each_mapped_pfn(entry, pfn) {
>> >>               struct page *page = pfn_to_page(pfn);
>> >>
>> >>               WARN_ON_ONCE(page->mapping);
>> >>               page->mapping = mapping;
>> >> +             page->index = index + i++;
>> >>       }
>> >>  }
>> >
>> > Hum, this just made me think: How is this going to work with XFS reflink?
>> > In fact is not the page->mapping association already broken by XFS reflink?
>> > Because with reflink we can have two or more mappings pointing to the same
>> > physical blocks (i.e., pages in DAX case)...
>>
>> Good question. I assume we are ok in the non-DAX reflink case because
>> rmap of failing / poison pages is only relative to the specific page
>> cache page for a given inode in the reflink. However, DAX would seem
>> to break this because we only get one shared 'struct page' for all
>> possible mappings of the physical file block. I think this means for
>> iterating over the rmap of "where is this page mapped" would require
>> iterating over the other "sibling" inodes that know about the given
>> physical file block.
>>
>> As far as I can see reflink+dax would require teaching kernel code
>> paths that ->mapping may not be a singular relationship. Something
>> along the line's of what Jerome was presenting at LSF to create a
>> special value to indicate, "call back into the filesystem (or the page
>> owner)" to perform this operation.
>>
>> In the meantime the kernel crashes when userspace accesses poisoned
>> pmem via DAX. I assume that reworking rmap for the dax+reflink case
>> should not block dax poison handling? Yell if you disagree.
>
> The thing is, up until get_user_pages() vs truncate series ("fs, dax: use
> page->mapping to warn if truncate collides with a busy page" in
> particular), DAX was perfectly fine with reflinks since we never needed
> page->mapping.

Sure, but if this rmap series had come first I still would have needed
to implement ->mapping. So unless we invent a general ->mapping
replacement and switch all mapping users, it was always going to
collide with DAX eventually.

> Now this series adds even page->index dependency which makes
> life for rmap with reflinks even harder. So if nothing else we should at
> least make sure reflinked filesystems cannot be mounted with dax mount
> option for now and seriously start looking into how to implement rmap with
> reflinked files for DAX because this noticeably reduces its usefulness.

This restriction is already in place. In xfs_reflink_remap_range() we have:

        /* Don't share DAX file data for now. */
        if (IS_DAX(inode_in) || IS_DAX(inode_out))
                goto out_unlock;

All this said, perhaps we don't need to set ->link, it would just mean
a wider search through the rmap tree to find if the given page is
mapped. So, I think I can forgo setting ->link if I teach the rmap
code to search the entire ->mapping.

Jan Kara May 31, 2018, 10:08 a.m. UTC | #5

On Wed 30-05-18 16:21:33, Dan Williams wrote:
> On Wed, May 30, 2018 at 1:13 AM, Jan Kara <jack@suse.cz> wrote:
> > On Tue 29-05-18 18:38:41, Dan Williams wrote:
> >> On Wed, May 23, 2018 at 1:40 AM, Jan Kara <jack@suse.cz> wrote:
> >> > On Tue 22-05-18 07:39:57, Dan Williams wrote:
> >> >> In support of enabling memory_failure() handling for filesystem-dax
> >> >> mappings, set ->index to the pgoff of the page. The rmap implementation
> >> >> requires ->index to bound the search through the vma interval tree. The
> >> >> index is set and cleared at dax_associate_entry() and
> >> >> dax_disassociate_entry() time respectively.
> >> >>
> >> >> Cc: Jan Kara <jack@suse.cz>
> >> >> Cc: Christoph Hellwig <hch@lst.de>
> >> >> Cc: Matthew Wilcox <mawilcox@microsoft.com>
> >> >> Cc: Ross Zwisler <ross.zwisler@linux.intel.com>
> >> >> Signed-off-by: Dan Williams <dan.j.williams@intel.com>
> >> >> ---
> >> >>  fs/dax.c |   11 ++++++++---
> >> >>  1 file changed, 8 insertions(+), 3 deletions(-)
> >> >>
> >> >> diff --git a/fs/dax.c b/fs/dax.c
> >> >> index aaec72ded1b6..2e4682cd7c69 100644
> >> >> --- a/fs/dax.c
> >> >> +++ b/fs/dax.c
> >> >> @@ -319,18 +319,22 @@ static unsigned long dax_radix_end_pfn(void *entry)
> >> >>       for (pfn = dax_radix_pfn(entry); \
> >> >>                       pfn < dax_radix_end_pfn(entry); pfn++)
> >> >>
> >> >> -static void dax_associate_entry(void *entry, struct address_space *mapping)
> >> >> +static void dax_associate_entry(void *entry, struct address_space *mapping,
> >> >> +             struct vm_area_struct *vma, unsigned long address)
> >> >>  {
> >> >> -     unsigned long pfn;
> >> >> +     unsigned long size = dax_entry_size(entry), pfn, index;
> >> >> +     int i = 0;
> >> >>
> >> >>       if (IS_ENABLED(CONFIG_FS_DAX_LIMITED))
> >> >>               return;
> >> >>
> >> >> +     index = linear_page_index(vma, address & ~(size - 1));
> >> >>       for_each_mapped_pfn(entry, pfn) {
> >> >>               struct page *page = pfn_to_page(pfn);
> >> >>
> >> >>               WARN_ON_ONCE(page->mapping);
> >> >>               page->mapping = mapping;
> >> >> +             page->index = index + i++;
> >> >>       }
> >> >>  }
> >> >
> >> > Hum, this just made me think: How is this going to work with XFS reflink?
> >> > In fact is not the page->mapping association already broken by XFS reflink?
> >> > Because with reflink we can have two or more mappings pointing to the same
> >> > physical blocks (i.e., pages in DAX case)...
> >>
> >> Good question. I assume we are ok in the non-DAX reflink case because
> >> rmap of failing / poison pages is only relative to the specific page
> >> cache page for a given inode in the reflink. However, DAX would seem
> >> to break this because we only get one shared 'struct page' for all
> >> possible mappings of the physical file block. I think this means for
> >> iterating over the rmap of "where is this page mapped" would require
> >> iterating over the other "sibling" inodes that know about the given
> >> physical file block.
> >>
> >> As far as I can see reflink+dax would require teaching kernel code
> >> paths that ->mapping may not be a singular relationship. Something
> >> along the line's of what Jerome was presenting at LSF to create a
> >> special value to indicate, "call back into the filesystem (or the page
> >> owner)" to perform this operation.
> >>
> >> In the meantime the kernel crashes when userspace accesses poisoned
> >> pmem via DAX. I assume that reworking rmap for the dax+reflink case
> >> should not block dax poison handling? Yell if you disagree.
> >
> > The thing is, up until get_user_pages() vs truncate series ("fs, dax: use
> > page->mapping to warn if truncate collides with a busy page" in
> > particular), DAX was perfectly fine with reflinks since we never needed
> > page->mapping.
> 
> Sure, but if this rmap series had come first I still would have needed
> to implement ->mapping. So unless we invent a general ->mapping
> replacement and switch all mapping users, it was always going to
> collide with DAX eventually.

Yes, my comment was more in direction that life would be easier if we could
keep DAX without rmap support but I guess that's just too cumbersome.

> > Now this series adds even page->index dependency which makes
> > life for rmap with reflinks even harder. So if nothing else we should at
> > least make sure reflinked filesystems cannot be mounted with dax mount
> > option for now and seriously start looking into how to implement rmap with
> > reflinked files for DAX because this noticeably reduces its usefulness.
> 
> This restriction is already in place. In xfs_reflink_remap_range() we have:
> 
>         /* Don't share DAX file data for now. */
>         if (IS_DAX(inode_in) || IS_DAX(inode_out))
>                 goto out_unlock;

OK, good.

> All this said, perhaps we don't need to set ->link, it would just mean
> a wider search through the rmap tree to find if the given page is
> mapped. So, I think I can forgo setting ->link if I teach the rmap
> code to search the entire ->mapping.

I guess you mean ->index in the above. And now when thinking about it I don't
think it is worth the complications to avoid using ->index.

								Honza

Dan Williams May 31, 2018, 9:49 p.m. UTC | #6

On Thu, May 31, 2018 at 3:08 AM, Jan Kara <jack@suse.cz> wrote:
[..]
>> >> As far as I can see reflink+dax would require teaching kernel code
>> >> paths that ->mapping may not be a singular relationship. Something
>> >> along the line's of what Jerome was presenting at LSF to create a
>> >> special value to indicate, "call back into the filesystem (or the page
>> >> owner)" to perform this operation.
>> >>
>> >> In the meantime the kernel crashes when userspace accesses poisoned
>> >> pmem via DAX. I assume that reworking rmap for the dax+reflink case
>> >> should not block dax poison handling? Yell if you disagree.
>> >
>> > The thing is, up until get_user_pages() vs truncate series ("fs, dax: use
>> > page->mapping to warn if truncate collides with a busy page" in
>> > particular), DAX was perfectly fine with reflinks since we never needed
>> > page->mapping.
>>
>> Sure, but if this rmap series had come first I still would have needed
>> to implement ->mapping. So unless we invent a general ->mapping
>> replacement and switch all mapping users, it was always going to
>> collide with DAX eventually.
>
> Yes, my comment was more in direction that life would be easier if we could
> keep DAX without rmap support but I guess that's just too cumbersome.

I'm open to deeper reworks later. As it stands currently just calling
madvise(..., MADV_HWPOISON) on a DAX mapping causes a page reference
to be leaked because the madvise code has no clue about proper
handling of DAX pages, and consuming real poison leads to a fatal
condition / reset.

I think fixing those bugs with the current rmap dependencies on
->mapping and ->index is step1 and step2 is a longer term solution for
dax rmap that does also allow reflink. I.e. it's an rmap > reflink
argument for now.

>
>> > Now this series adds even page->index dependency which makes
>> > life for rmap with reflinks even harder. So if nothing else we should at
>> > least make sure reflinked filesystems cannot be mounted with dax mount
>> > option for now and seriously start looking into how to implement rmap with
>> > reflinked files for DAX because this noticeably reduces its usefulness.
>>
>> This restriction is already in place. In xfs_reflink_remap_range() we have:
>>
>>         /* Don't share DAX file data for now. */
>>         if (IS_DAX(inode_in) || IS_DAX(inode_out))
>>                 goto out_unlock;
>
> OK, good.
>
>> All this said, perhaps we don't need to set ->link, it would just mean
>> a wider search through the rmap tree to find if the given page is
>> mapped. So, I think I can forgo setting ->link if I teach the rmap
>> code to search the entire ->mapping.
>
> I guess you mean ->index in the above. And now when thinking about it I don't
> think it is worth the complications to avoid using ->index.

Ok, and yes I meant ->index... sorry, too much struct page on the
brain presently.

[05/11] filesystem-dax: set page->index

Commit Message

Comments

Patch