dax: Split pmd map when fallback on COW

Message ID 1448309120-20911-1-git-send-email-toshi.kani@hpe.com (mailing list archive)
State Superseded

Commit Message

Kani, Toshi Nov. 23, 2015, 8:05 p.m. UTC
An infinite loop of PMD faults was observed when attempting to
mlock() a private read-only PMD mmap'd range of a DAX file.

__dax_pmd_fault() simply returns with VM_FAULT_FALLBACK when
falling back to PTE on COW.  However, __handle_mm_fault()
returns without falling back to handle_pte_fault() because
a PMD map is present in this case.

Change __dax_pmd_fault() to split the PMD map, if present,
before returning with VM_FAULT_FALLBACK.

Signed-off-by: Toshi Kani <toshi.kani@hpe.com>
Cc: Dan Williams <dan.j.williams@intel.com>
Cc: Kirill A. Shutemov <kirill.shutemov@linux.intel.com>
Cc: Matthew Wilcox <willy@linux.intel.com>
Cc: Ross Zwisler <ross.zwisler@linux.intel.com>
---
 fs/dax.c |    4 +++-
 1 file changed, 3 insertions(+), 1 deletion(-)

Comments

Dan Williams Nov. 23, 2015, 8:45 p.m. UTC | #1
On Mon, Nov 23, 2015 at 12:05 PM, Toshi Kani <toshi.kani@hpe.com> wrote:
> An infinite loop of PMD faults was observed when attempted to
> mlock() a private read-only PMD mmap'd range of a DAX file.
>
> __dax_pmd_fault() simply returns with VM_FAULT_FALLBACK when
> falling back to PTE on COW.  However, __handle_mm_fault()
> returns without falling back to handle_pte_fault() because
> a PMD map is present in this case.
>
> Change __dax_pmd_fault() to split the PMD map, if present,
> before returning with VM_FAULT_FALLBACK.
>
> Signed-off-by: Toshi Kani <toshi.kani@hpe.com>
> Cc: Dan Williams <dan.j.williams@intel.com>
> Cc: Kirill A. Shutemov <kirill.shutemov@linux.intel.com>
> Cc: Matthew Wilcox <willy@linux.intel.com>
> Cc: Ross Zwisler <ross.zwisler@linux.intel.com>

I thought the patch from Ross already addressed the infinite loop:

https://patchwork.kernel.org/patch/7653731/

> ---
>  fs/dax.c |    4 +++-
>  1 file changed, 3 insertions(+), 1 deletion(-)
>
> diff --git a/fs/dax.c b/fs/dax.c
> index 43671b6..3405583 100644
> --- a/fs/dax.c
> +++ b/fs/dax.c
> @@ -546,8 +546,10 @@ int __dax_pmd_fault(struct vm_area_struct *vma, unsigned long address,
>                 return VM_FAULT_FALLBACK;
>
>         /* Fall back to PTEs if we're going to COW */
> -       if (write && !(vma->vm_flags & VM_SHARED))
> +       if (write && !(vma->vm_flags & VM_SHARED)) {
> +               split_huge_page_pmd(vma, address, pmd);
>                 return VM_FAULT_FALLBACK;
> +       }
>         /* If the PMD would extend outside the VMA */
>         if (pmd_addr < vma->vm_start)
>                 return VM_FAULT_FALLBACK;

This is a nop if CONFIG_TRANSPARENT_HUGEPAGE=n, so I don't think it's
a complete fix.
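
To illustrate the concern: as far as I recall, when
CONFIG_TRANSPARENT_HUGEPAGE is not set the kernel headers define
split_huge_page_pmd() as an empty statement, so the added call would
compile to nothing.  The standalone userspace sketch below only mimics
that pattern.  MODEL_THP, model_split_pmd() and struct model_pmd are
made-up names for illustration, not kernel identifiers.

```c
#include <assert.h>
#include <stdbool.h>

/* Hypothetical stand-in for a PMD entry: only tracks "is a huge mapping". */
struct model_pmd {
	bool huge;
};

#ifdef MODEL_THP
/* "THP enabled" build: the split really clears the huge mapping. */
#define model_split_pmd(pmd) do { (pmd)->huge = false; } while (0)
#else
/* "THP disabled" build: the call compiles away to an empty statement. */
#define model_split_pmd(pmd) do { } while (0)
#endif

/* Returns true if a huge mapping is still present after the split call. */
static bool still_huge_after_split(void)
{
	struct model_pmd pmd = { .huge = true };

	model_split_pmd(&pmd);
	return pmd.huge;
}
```

Built without -DMODEL_THP, still_huge_after_split() returns true, which
is the "nop" case; built with -DMODEL_THP it returns false.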

Kani, Toshi Nov. 23, 2015, 8:45 p.m. UTC | #2
On Mon, 2015-11-23 at 12:45 -0800, Dan Williams wrote:
> On Mon, Nov 23, 2015 at 12:05 PM, Toshi Kani <toshi.kani@hpe.com> wrote:
> > An infinite loop of PMD faults was observed when attempted to
> > mlock() a private read-only PMD mmap'd range of a DAX file.
> > 
> > __dax_pmd_fault() simply returns with VM_FAULT_FALLBACK when
> > falling back to PTE on COW.  However, __handle_mm_fault()
> > returns without falling back to handle_pte_fault() because
> > a PMD map is present in this case.
> > 
> > Change __dax_pmd_fault() to split the PMD map, if present,
> > before returning with VM_FAULT_FALLBACK.
> > 
> > Signed-off-by: Toshi Kani <toshi.kani@hpe.com>
> > Cc: Dan Williams <dan.j.williams@intel.com>
> > Cc: Kirill A. Shutemov <kirill.shutemov@linux.intel.com>
> > Cc: Matthew Wilcox <willy@linux.intel.com>
> > Cc: Ross Zwisler <ross.zwisler@linux.intel.com>
> 
> I thought the patch from Ross already addressed the infinite loop:
> 
> https://patchwork.kernel.org/patch/7653731/

This fixes a different issue.  I hit this one while testing my other patch along
with Ross's patch.

> > ---
> >  fs/dax.c |    4 +++-
> >  1 file changed, 3 insertions(+), 1 deletion(-)
> > 
> > diff --git a/fs/dax.c b/fs/dax.c
> > index 43671b6..3405583 100644
> > --- a/fs/dax.c
> > +++ b/fs/dax.c
> > @@ -546,8 +546,10 @@ int __dax_pmd_fault(struct vm_area_struct *vma, unsigned long address,
> >                 return VM_FAULT_FALLBACK;
> > 
> >         /* Fall back to PTEs if we're going to COW */
> > -       if (write && !(vma->vm_flags & VM_SHARED))
> > +       if (write && !(vma->vm_flags & VM_SHARED)) {
> > +               split_huge_page_pmd(vma, address, pmd);
> >                 return VM_FAULT_FALLBACK;
> > +       }
> >         /* If the PMD would extend outside the VMA */
> >         if (pmd_addr < vma->vm_start)
> >                 return VM_FAULT_FALLBACK;
> 
> This is a nop if CONFIG_TRANSPARENT_HUGEPAGE=n, so I don't think it's
> a complete fix.

Well, __dax_pmd_fault() itself depends on CONFIG_TRANSPARENT_HUGEPAGE.

Thanks,
-Toshi

Dan Williams Nov. 23, 2015, 8:56 p.m. UTC | #3
On Mon, Nov 23, 2015 at 12:45 PM, Toshi Kani <toshi.kani@hpe.com> wrote:
> On Mon, 2015-11-23 at 12:45 -0800, Dan Williams wrote:
>> On Mon, Nov 23, 2015 at 12:05 PM, Toshi Kani <toshi.kani@hpe.com> wrote:
[..]
>> This is a nop if CONFIG_TRANSPARENT_HUGEPAGE=n, so I don't think it's
>> a complete fix.
>
> Well, __dax_pmd_fault() itself depends on CONFIG_TRANSPARENT_HUGEPAGE.
>

Indeed it is... I think that's wrong because transparent huge pages
rely on struct page??

Kani, Toshi Nov. 23, 2015, 9:04 p.m. UTC | #4
On Mon, 2015-11-23 at 12:56 -0800, Dan Williams wrote:
> On Mon, Nov 23, 2015 at 12:45 PM, Toshi Kani <toshi.kani@hpe.com> wrote:
> > On Mon, 2015-11-23 at 12:45 -0800, Dan Williams wrote:
> > > On Mon, Nov 23, 2015 at 12:05 PM, Toshi Kani <toshi.kani@hpe.com> wrote:
> [..]
> > > This is a nop if CONFIG_TRANSPARENT_HUGEPAGE=n, so I don't think it's
> > > a complete fix.
> > 
> > Well, __dax_pmd_fault() itself depends on CONFIG_TRANSPARENT_HUGEPAGE.
> > 
> 
> Indeed it is... I think that's wrong because transparent huge pages
> rely on struct page??

I do not think this issue is related to struct page.  wp_huge_pmd() calls
either do_huge_pmd_wp_page() or dax_pmd_fault().  do_huge_pmd_wp_page() splits
the PMD page when it returns VM_FAULT_FALLBACK, so this change keeps the two
consistent on VM_FAULT_FALLBACK.
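
A toy userspace model of the behavior described above may help.  It is
only a sketch of my reading of this thread, not kernel code: every
model_* name and MODEL_FAULT_FALLBACK are invented for illustration.
The handler falls back on COW, the caller returns early while a huge
PMD is still mapped, and the same fault is taken again.

```c
#include <assert.h>
#include <stdbool.h>

#define MODEL_FAULT_FALLBACK 0x1	/* stand-in for VM_FAULT_FALLBACK */

struct model_pmd {
	bool huge;	/* a huge PMD mapping is present */
};

/* Models the COW branch of the PMD fault handler, with or without the split. */
static int model_dax_pmd_fault(struct model_pmd *pmd, bool split_on_fallback)
{
	if (split_on_fallback)
		pmd->huge = false;	/* patched behavior: split before falling back */
	return MODEL_FAULT_FALLBACK;
}

/*
 * Models repeated write faults on a private mapping.  Returns the number
 * of faults taken until the PTE path is reached, or -1 if max_faults
 * faults all ended with the huge PMD still in place.
 */
static int model_faults_until_pte(bool split_on_fallback, int max_faults)
{
	struct model_pmd pmd = { .huge = true };

	for (int fault = 1; fault <= max_faults; fault++) {
		if (pmd.huge) {
			int ret = model_dax_pmd_fault(&pmd, split_on_fallback);

			if (!(ret & MODEL_FAULT_FALLBACK))
				return fault;	/* not taken in this model */
			if (pmd.huge)
				continue;	/* caller returns early; fault repeats */
		}
		return fault;		/* PTE-level handler would run here */
	}
	return -1;
}
```

With the split, model_faults_until_pte(true, 100) reaches the PTE path on
the first fault.  Without it, model_faults_until_pte(false, 100) exhausts
all 100 attempts and returns -1, which corresponds to the endless loop.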

Thanks,
-Toshi

Kani, Toshi Nov. 23, 2015, 10:58 p.m. UTC | #5
On Mon, 2015-11-23 at 13:05 -0700, Toshi Kani wrote:
> An infinite loop of PMD faults was observed when attempted to
> mlock() a private read-only PMD mmap'd range of a DAX file.

Typo: the description above should read as follows (with "read-only" removed):

An infinite loop of PMD faults was observed when attempting to mlock() a private
PMD mmap'd range of a DAX file.
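
For reference, a rough sketch of the kind of userspace sequence this
description refers to.  This is not the actual test program; the helper
name mlock_private_mapping() and the PROT_READ | PROT_WRITE protection
are my assumptions.  Whether the kernel really uses a PMD mapping depends
on alignment and file layout, which the sketch does not control.

```c
#include <assert.h>
#include <errno.h>
#include <fcntl.h>
#include <stddef.h>
#include <sys/mman.h>
#include <unistd.h>

/*
 * Map `len` bytes of `path` with MAP_PRIVATE and mlock() the range.
 * Returns 0 on success or a negative errno value on failure.
 */
static int mlock_private_mapping(const char *path, size_t len)
{
	int ret = 0;
	int fd = open(path, O_RDWR);

	if (fd < 0)
		return -errno;

	void *addr = mmap(NULL, len, PROT_READ | PROT_WRITE, MAP_PRIVATE, fd, 0);

	if (addr == MAP_FAILED) {
		ret = -errno;
	} else {
		/* mlock() populates the range, which is where the faults occur. */
		if (mlock(addr, len) != 0)
			ret = -errno;
		munmap(addr, len);
	}
	close(fd);
	return ret;
}
```

The caller would pass a file on a DAX-mounted filesystem and a length of
at least one PMD (2MB on x86_64); the path is left to the tester.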

-Toshi

> __dax_pmd_fault() simply returns with VM_FAULT_FALLBACK when
> falling back to PTE on COW.  However, __handle_mm_fault()
> returns without falling back to handle_pte_fault() because
> a PMD map is present in this case.
> 
> Change __dax_pmd_fault() to split the PMD map, if present,
> before returning with VM_FAULT_FALLBACK.

Dan Williams Nov. 24, 2015, 5:08 p.m. UTC | #6
On Mon, Nov 23, 2015 at 12:05 PM, Toshi Kani <toshi.kani@hpe.com> wrote:
> An infinite loop of PMD faults was observed when attempted to
> mlock() a private read-only PMD mmap'd range of a DAX file.
>
> __dax_pmd_fault() simply returns with VM_FAULT_FALLBACK when
> falling back to PTE on COW.  However, __handle_mm_fault()
> returns without falling back to handle_pte_fault() because
> a PMD map is present in this case.
>
> Change __dax_pmd_fault() to split the PMD map, if present,
> before returning with VM_FAULT_FALLBACK.
>
> Signed-off-by: Toshi Kani <toshi.kani@hpe.com>
> Cc: Dan Williams <dan.j.williams@intel.com>
> Cc: Kirill A. Shutemov <kirill.shutemov@linux.intel.com>
> Cc: Matthew Wilcox <willy@linux.intel.com>
> Cc: Ross Zwisler <ross.zwisler@linux.intel.com>
> ---
>  fs/dax.c |    4 +++-
>  1 file changed, 3 insertions(+), 1 deletion(-)
>
> diff --git a/fs/dax.c b/fs/dax.c
> index 43671b6..3405583 100644
> --- a/fs/dax.c
> +++ b/fs/dax.c
> @@ -546,8 +546,10 @@ int __dax_pmd_fault(struct vm_area_struct *vma, unsigned long address,
>                 return VM_FAULT_FALLBACK;
>
>         /* Fall back to PTEs if we're going to COW */
> -       if (write && !(vma->vm_flags & VM_SHARED))
> +       if (write && !(vma->vm_flags & VM_SHARED)) {
> +               split_huge_page_pmd(vma, address, pmd);
>                 return VM_FAULT_FALLBACK;
> +       }

Reviewed-by: Dan Williams <dan.j.williams@intel.com>

I took a closer look at dax's CONFIG_TRANSPARENT_HUGEPAGE interactions
and it turns out THP is a performance enhancement, not a functional
dependency.  That is, using a huge_zero_page where available improves
performance, but it is not a requirement.

I'll fold this in with my series to make pmd_trans_huge() return false
for non-huge_zero_page dax mappings, and in that case I'll need to
up-level the call to pmdp_huge_clear_flush_notify() from
__split_huge_page_pmd.

Patch

diff --git a/fs/dax.c b/fs/dax.c
index 43671b6..3405583 100644
--- a/fs/dax.c
+++ b/fs/dax.c
@@ -546,8 +546,10 @@  int __dax_pmd_fault(struct vm_area_struct *vma, unsigned long address,
 		return VM_FAULT_FALLBACK;
 
 	/* Fall back to PTEs if we're going to COW */
-	if (write && !(vma->vm_flags & VM_SHARED))
+	if (write && !(vma->vm_flags & VM_SHARED)) {
+		split_huge_page_pmd(vma, address, pmd);
 		return VM_FAULT_FALLBACK;
+	}
 	/* If the PMD would extend outside the VMA */
 	if (pmd_addr < vma->vm_start)
 		return VM_FAULT_FALLBACK;