
mm: Fix mmap MAP_POPULATE for DAX pmd mapping

Message ID 1448309082-20851-1-git-send-email-toshi.kani@hpe.com (mailing list archive)
State Superseded

Commit Message

Kani, Toshi Nov. 23, 2015, 8:04 p.m. UTC
The following oops was observed when mmap() with MAP_POPULATE
pre-faulted pmd mappings of a DAX file.  follow_trans_huge_pmd()
expects that a target address has a struct page.

  BUG: unable to handle kernel paging request at ffffea0012220000
  follow_trans_huge_pmd+0xba/0x390
  follow_page_mask+0x33d/0x420
  __get_user_pages+0xdc/0x800
  populate_vma_page_range+0xb5/0xe0
  __mm_populate+0xc5/0x150
  vm_mmap_pgoff+0xd5/0xe0
  SyS_mmap_pgoff+0x1c1/0x290
  SyS_mmap+0x1b/0x30

Fix it by making the PMD pre-fault handling consistent with PTE.
After the entry is pre-faulted in faultin_page(), follow_page_mask()
calls follow_trans_huge_pmd(), which is changed to call
follow_pfn_pmd() for VM_PFNMAP or VM_MIXEDMAP vmas.  follow_pfn_pmd()
handles FOLL_TOUCH and returns -EEXIST.

Reported-by: Mauricio Porto <mauricio.porto@hpe.com>
Signed-off-by: Toshi Kani <toshi.kani@hpe.com>
Cc: Andrew Morton <akpm@linux-foundation.org>
Cc: Kirill A. Shutemov <kirill.shutemov@linux.intel.com>
Cc: Matthew Wilcox <willy@linux.intel.com>
Cc: Dan Williams <dan.j.williams@intel.com>
Cc: Ross Zwisler <ross.zwisler@linux.intel.com>
---
 mm/huge_memory.c |   34 ++++++++++++++++++++++++++++++++++
 1 file changed, 34 insertions(+)

Comments

Dan Williams Nov. 23, 2015, 8:53 p.m. UTC | #1
On Mon, Nov 23, 2015 at 12:04 PM, Toshi Kani <toshi.kani@hpe.com> wrote:
> The following oops was observed when mmap() with MAP_POPULATE
> pre-faulted pmd mappings of a DAX file.  follow_trans_huge_pmd()
> expects that a target address has a struct page.
>
>   BUG: unable to handle kernel paging request at ffffea0012220000
>   follow_trans_huge_pmd+0xba/0x390
>   follow_page_mask+0x33d/0x420
>   __get_user_pages+0xdc/0x800
>   populate_vma_page_range+0xb5/0xe0
>   __mm_populate+0xc5/0x150
>   vm_mmap_pgoff+0xd5/0xe0
>   SyS_mmap_pgoff+0x1c1/0x290
>   SyS_mmap+0x1b/0x30
>
> Fix it by making the PMD pre-fault handling consistent with PTE.
> After pre-faulted in faultin_page(), follow_page_mask() calls
> follow_trans_huge_pmd(), which is changed to call follow_pfn_pmd()
> for VM_PFNMAP or VM_MIXEDMAP.  follow_pfn_pmd() handles FOLL_TOUCH
> and returns with -EEXIST.

As of 4.4-rc2 DAX pmd mappings are disabled.  So we have time to do
something more comprehensive in 4.5.

>
> Reported-by: Mauricio Porto <mauricio.porto@hpe.com>
> Signed-off-by: Toshi Kani <toshi.kani@hpe.com>
> Cc: Andrew Morton <akpm@linux-foundation.org>
> Cc: Kirill A. Shutemov <kirill.shutemov@linux.intel.com>
> Cc: Matthew Wilcox <willy@linux.intel.com>
> Cc: Dan Williams <dan.j.williams@intel.com>
> Cc: Ross Zwisler <ross.zwisler@linux.intel.com>
> ---
>  mm/huge_memory.c |   34 ++++++++++++++++++++++++++++++++++
>  1 file changed, 34 insertions(+)
>
> diff --git a/mm/huge_memory.c b/mm/huge_memory.c
> index d5b8920..f56e034 100644
> --- a/mm/huge_memory.c
> +++ b/mm/huge_memory.c
[..]
> @@ -1288,6 +1315,13 @@ struct page *follow_trans_huge_pmd(struct vm_area_struct *vma,
>         if ((flags & FOLL_NUMA) && pmd_protnone(*pmd))
>                 goto out;
>
> +       /* pfn map does not have a struct page */
> +       if (vma->vm_flags & (VM_PFNMAP | VM_MIXEDMAP)) {
> +               ret = follow_pfn_pmd(vma, addr, pmd, flags);
> +               page = ERR_PTR(ret);
> +               goto out;
> +       }
> +
>         page = pmd_page(*pmd);
>         VM_BUG_ON_PAGE(!PageHead(page), page);
>         if (flags & FOLL_TOUCH) {

I think it is already problematic that dax pmd mappings are getting
confused with transparent huge pages.  They're more closely related to
hugetlbfs pmd mappings in that they are mapping an explicit
allocation.  I have some pending patches to address this dax-pmd vs
hugetlb-pmd vs thp-pmd classification that I will post shortly.

By the way, I'm collecting DAX pmd regression tests [1], is this just
a simple crash upon using MAP_POPULATE?

[1]: https://github.com/pmem/ndctl/blob/master/lib/test-dax-pmd.c
Kani, Toshi Nov. 23, 2015, 10:15 p.m. UTC | #2
On Mon, 2015-11-23 at 12:53 -0800, Dan Williams wrote:
> On Mon, Nov 23, 2015 at 12:04 PM, Toshi Kani <toshi.kani@hpe.com> wrote:
> > The following oops was observed when mmap() with MAP_POPULATE
> > pre-faulted pmd mappings of a DAX file.  follow_trans_huge_pmd()
> > expects that a target address has a struct page.
> > 
> >   BUG: unable to handle kernel paging request at ffffea0012220000
> >   follow_trans_huge_pmd+0xba/0x390
> >   follow_page_mask+0x33d/0x420
> >   __get_user_pages+0xdc/0x800
> >   populate_vma_page_range+0xb5/0xe0
> >   __mm_populate+0xc5/0x150
> >   vm_mmap_pgoff+0xd5/0xe0
> >   SyS_mmap_pgoff+0x1c1/0x290
> >   SyS_mmap+0x1b/0x30
> > 
> > Fix it by making the PMD pre-fault handling consistent with PTE.
> > After pre-faulted in faultin_page(), follow_page_mask() calls
> > follow_trans_huge_pmd(), which is changed to call follow_pfn_pmd()
> > for VM_PFNMAP or VM_MIXEDMAP.  follow_pfn_pmd() handles FOLL_TOUCH
> > and returns with -EEXIST.
> 
> As of 4.4.-rc2 DAX pmd mappings are disabled.  So we have time to do
> something more comprehensive in 4.5.

Yes, I noticed during my testing that I could not use pmd...

> > Reported-by: Mauricio Porto <mauricio.porto@hpe.com>
> > Signed-off-by: Toshi Kani <toshi.kani@hpe.com>
> > Cc: Andrew Morton <akpm@linux-foundation.org>
> > Cc: Kirill A. Shutemov <kirill.shutemov@linux.intel.com>
> > Cc: Matthew Wilcox <willy@linux.intel.com>
> > Cc: Dan Williams <dan.j.williams@intel.com>
> > Cc: Ross Zwisler <ross.zwisler@linux.intel.com>
> > ---
> >  mm/huge_memory.c |   34 ++++++++++++++++++++++++++++++++++
> >  1 file changed, 34 insertions(+)
> > 
> > diff --git a/mm/huge_memory.c b/mm/huge_memory.c
> > index d5b8920..f56e034 100644
> > --- a/mm/huge_memory.c
> > +++ b/mm/huge_memory.c
> [..]
> > @@ -1288,6 +1315,13 @@ struct page *follow_trans_huge_pmd(struct
> > vm_area_struct *vma,
> >         if ((flags & FOLL_NUMA) && pmd_protnone(*pmd))
> >                 goto out;
> > 
> > +       /* pfn map does not have a struct page */
> > +       if (vma->vm_flags & (VM_PFNMAP | VM_MIXEDMAP)) {
> > +               ret = follow_pfn_pmd(vma, addr, pmd, flags);
> > +               page = ERR_PTR(ret);
> > +               goto out;
> > +       }
> > +
> >         page = pmd_page(*pmd);
> >         VM_BUG_ON_PAGE(!PageHead(page), page);
> >         if (flags & FOLL_TOUCH) {
> 
> I think it is already problematic that dax pmd mappings are getting
> confused with transparent huge pages.  

We had the same issue with dax pte mappings [1], and this change extends the pfn
map handling to pmd.  So this problem is not specific to pmd.

[1] https://lkml.org/lkml/2015/6/23/181

> They're more closely related to
> a hugetlbfs pmd mappings in that they are mapping an explicit
> allocation.  I have some pending patches to address this dax-pmd vs
> hugetlb-pmd vs thp-pmd classification that I will post shortly.

Not sure which way is better, but I am certainly interested in your changes.

> By the way, I'm collecting DAX pmd regression tests [1], is this just
> a simple crash upon using MAP_POPULATE?
> 
> [1]: https://github.com/pmem/ndctl/blob/master/lib/test-dax-pmd.c

Yes, this issue is easy to reproduce with MAP_POPULATE.  In case it helps,
attached are the tests I used for testing the patches.  Sorry, the code is messy
since it was only intended for my internal use...

 - The test was originally written for the pte change [1], and comments in
test.sh (e.g. mlock fail, ok) reflect the results without the pte change.
 - For the pmd test, I modified test-mmap.c to call posix_memalign() before
mmap().  After free(), the 2MB-aligned address from posix_memalign() can be
passed as the hint to mmap(), which keeps the mmap'd address aligned on 2MB
(see the sketch after this list).
 - I created test file(s) with dd (i.e. all blocks written) in my test.
 - The other infinite loop issue (fixed by my other patch) was found by the test
case with option "-LMSr".
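
For reference, here is a minimal sketch of that alignment trick (illustrative
only, not the actual test-mmap.c; the file argument, sizes, and error handling
are assumptions):

/*
 * Minimal reproducer sketch (not the actual test-mmap.c).  Assumes
 * argv[1] names an existing file, at least 2MB long, on a DAX-mounted
 * filesystem.
 */
#include <fcntl.h>
#include <stdio.h>
#include <stdlib.h>
#include <sys/mman.h>
#include <unistd.h>

#define PMD_SIZE	(2UL * 1024 * 1024)

int main(int argc, char **argv)
{
	void *hint, *addr;
	int fd;

	if (argc < 2)
		return 1;

	fd = open(argv[1], O_RDWR);
	if (fd < 0) {
		perror("open");
		return 1;
	}

	/* Grab a 2MB-aligned virtual address, then free it so the same
	 * range can be passed to mmap() as a hint.  This is only a hint
	 * (no MAP_FIXED), but it keeps the mapping 2MB-aligned in
	 * practice. */
	if (posix_memalign(&hint, PMD_SIZE, PMD_SIZE))
		return 1;
	free(hint);

	/* MAP_POPULATE pre-faults the mapping; with a 2MB-aligned
	 * address and file offset this exercises the DAX pmd fault
	 * path that oopsed above. */
	addr = mmap(hint, PMD_SIZE, PROT_READ | PROT_WRITE,
		    MAP_SHARED | MAP_POPULATE, fd, 0);
	if (addr == MAP_FAILED) {
		perror("mmap");
		return 1;
	}

	munmap(addr, PMD_SIZE);
	close(fd);
	return 0;
}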

Thanks,
-Toshi
Dan Williams Nov. 30, 2015, 10:08 p.m. UTC | #3
On Mon, Nov 23, 2015 at 12:04 PM, Toshi Kani <toshi.kani@hpe.com> wrote:
> The following oops was observed when mmap() with MAP_POPULATE
> pre-faulted pmd mappings of a DAX file.  follow_trans_huge_pmd()
> expects that a target address has a struct page.
>
>   BUG: unable to handle kernel paging request at ffffea0012220000
>   follow_trans_huge_pmd+0xba/0x390
>   follow_page_mask+0x33d/0x420
>   __get_user_pages+0xdc/0x800
>   populate_vma_page_range+0xb5/0xe0
>   __mm_populate+0xc5/0x150
>   vm_mmap_pgoff+0xd5/0xe0
>   SyS_mmap_pgoff+0x1c1/0x290
>   SyS_mmap+0x1b/0x30
>
> Fix it by making the PMD pre-fault handling consistent with PTE.
> After pre-faulted in faultin_page(), follow_page_mask() calls
> follow_trans_huge_pmd(), which is changed to call follow_pfn_pmd()
> for VM_PFNMAP or VM_MIXEDMAP.  follow_pfn_pmd() handles FOLL_TOUCH
> and returns with -EEXIST.
>
> Reported-by: Mauricio Porto <mauricio.porto@hpe.com>
> Signed-off-by: Toshi Kani <toshi.kani@hpe.com>
> Cc: Andrew Morton <akpm@linux-foundation.org>
> Cc: Kirill A. Shutemov <kirill.shutemov@linux.intel.com>
> Cc: Matthew Wilcox <willy@linux.intel.com>
> Cc: Dan Williams <dan.j.williams@intel.com>
> Cc: Ross Zwisler <ross.zwisler@linux.intel.com>
> ---

Hey Toshi,

I ended up fixing this differently with follow_pmd_devmap() introduced
in this series:

https://lists.01.org/pipermail/linux-nvdimm/2015-November/003033.html

Does the latest libnvdimm-pending branch [1] pass your test case?

[1]: git://git.kernel.org/pub/scm/linux/kernel/git/djbw/nvdimm libnvdimm-pending
Kani, Toshi Dec. 2, 2015, 2:19 a.m. UTC | #4
On Mon, 2015-11-30 at 14:08 -0800, Dan Williams wrote:
> On Mon, Nov 23, 2015 at 12:04 PM, Toshi Kani <toshi.kani@hpe.com> wrote:
> > The following oops was observed when mmap() with MAP_POPULATE
> > pre-faulted pmd mappings of a DAX file.  follow_trans_huge_pmd()
> > expects that a target address has a struct page.
> > 
> >   BUG: unable to handle kernel paging request at ffffea0012220000
> >   follow_trans_huge_pmd+0xba/0x390
> >   follow_page_mask+0x33d/0x420
> >   __get_user_pages+0xdc/0x800
> >   populate_vma_page_range+0xb5/0xe0
> >   __mm_populate+0xc5/0x150
> >   vm_mmap_pgoff+0xd5/0xe0
> >   SyS_mmap_pgoff+0x1c1/0x290
> >   SyS_mmap+0x1b/0x30
> > 
> > Fix it by making the PMD pre-fault handling consistent with PTE.
> > After pre-faulted in faultin_page(), follow_page_mask() calls
> > follow_trans_huge_pmd(), which is changed to call follow_pfn_pmd()
> > for VM_PFNMAP or VM_MIXEDMAP.  follow_pfn_pmd() handles FOLL_TOUCH
> > and returns with -EEXIST.
> > 
> > Reported-by: Mauricio Porto <mauricio.porto@hpe.com>
> > Signed-off-by: Toshi Kani <toshi.kani@hpe.com>
> > Cc: Andrew Morton <akpm@linux-foundation.org>
> > Cc: Kirill A. Shutemov <kirill.shutemov@linux.intel.com>
> > Cc: Matthew Wilcox <willy@linux.intel.com>
> > Cc: Dan Williams <dan.j.williams@intel.com>
> > Cc: Ross Zwisler <ross.zwisler@linux.intel.com>
> > ---
> 
> Hey Toshi,
> 
> I ended up fixing this differently with follow_pmd_devmap() introduced
> in this series:
> 
> https://lists.01.org/pipermail/linux-nvdimm/2015-November/003033.html
> 
> Does the latest libnvdimm-pending branch [1] pass your test case?

Hi Dan,

I ran several test cases, and they all hit the case "pfn not in memmap" in
__dax_pmd_fault() during mmap(MAP_POPULATE).  Looking at dax.pfn, PFN_DEV is
set but PFN_MAP is not.  I have not looked into why, but I thought I'd let you
know first.  I've also seen the test thread get hung up at the end sometimes.

I also noticed that reason is not set in the case below.

                if (length < PMD_SIZE
                                || (pfn_t_to_pfn(dax.pfn) & PG_PMD_COLOUR)) {
                        dax_unmap_atomic(bdev, &dax);
                        goto fallback;
                }

Thanks,
-Toshi
Dan Williams Dec. 2, 2015, 3:45 a.m. UTC | #5
On Tue, Dec 1, 2015 at 6:19 PM, Toshi Kani <toshi.kani@hpe.com> wrote:
> On Mon, 2015-11-30 at 14:08 -0800, Dan Williams wrote:
>> On Mon, Nov 23, 2015 at 12:04 PM, Toshi Kani <toshi.kani@hpe.com> wrote:
>> > The following oops was observed when mmap() with MAP_POPULATE
>> > pre-faulted pmd mappings of a DAX file.  follow_trans_huge_pmd()
>> > expects that a target address has a struct page.
>> >
>> >   BUG: unable to handle kernel paging request at ffffea0012220000
>> >   follow_trans_huge_pmd+0xba/0x390
>> >   follow_page_mask+0x33d/0x420
>> >   __get_user_pages+0xdc/0x800
>> >   populate_vma_page_range+0xb5/0xe0
>> >   __mm_populate+0xc5/0x150
>> >   vm_mmap_pgoff+0xd5/0xe0
>> >   SyS_mmap_pgoff+0x1c1/0x290
>> >   SyS_mmap+0x1b/0x30
>> >
>> > Fix it by making the PMD pre-fault handling consistent with PTE.
>> > After pre-faulted in faultin_page(), follow_page_mask() calls
>> > follow_trans_huge_pmd(), which is changed to call follow_pfn_pmd()
>> > for VM_PFNMAP or VM_MIXEDMAP.  follow_pfn_pmd() handles FOLL_TOUCH
>> > and returns with -EEXIST.
>> >
>> > Reported-by: Mauricio Porto <mauricio.porto@hpe.com>
>> > Signed-off-by: Toshi Kani <toshi.kani@hpe.com>
>> > Cc: Andrew Morton <akpm@linux-foundation.org>
>> > Cc: Kirill A. Shutemov <kirill.shutemov@linux.intel.com>
>> > Cc: Matthew Wilcox <willy@linux.intel.com>
>> > Cc: Dan Williams <dan.j.williams@intel.com>
>> > Cc: Ross Zwisler <ross.zwisler@linux.intel.com>
>> > ---
>>
>> Hey Toshi,
>>
>> I ended up fixing this differently with follow_pmd_devmap() introduced
>> in this series:
>>
>> https://lists.01.org/pipermail/linux-nvdimm/2015-November/003033.html
>>
>> Does the latest libnvdimm-pending branch [1] pass your test case?
>
> Hi Dan,
>
> I ran several test cases, and they all hit the case "pfn not in memmap" in
> __dax_pmd_fault() during mmap(MAP_POPULATE).  Looking at the dax.pfn, PFN_DEV is
> set but PFN_MAP is not.  I have not looked into why, but I thought I let you
> know first.  I've also seen the test thread got hung up at the end sometime.

That PFN_MAP flag will not be set by default for NFIT-defined
persistent memory.  See pmem_should_map_pages() for pmem namespaces
that will have it set by default, currently only e820 type-12 memory
ranges.

NFIT-defined persistent memory can have a memmap array dynamically
allocated by setting up a pfn device (similar to setting up a btt).
We don't map it by default because the NFIT may describe hundreds of
gigabytes of persistent memory, and the overhead of the memmap may be too
large to locate the memmap in ram.

I have a pending patch in libnvdimm-pending that allows the capacity
for the memmap to come from pmem instead of ram:

https://git.kernel.org/cgit/linux/kernel/git/djbw/nvdimm.git/commit/?h=libnvdimm-pending&id=3117a24e07fe

> I also noticed that reason is not set in the case below.
>
>                 if (length < PMD_SIZE
>                                 || (pfn_t_to_pfn(dax.pfn) & PG_PMD_COLOUR)) {
>                         dax_unmap_atomic(bdev, &dax);
>                           goto fallback;
>                 }

Thanks, I'll fix that up.
Dan Williams Dec. 2, 2015, 5:01 p.m. UTC | #6
On Wed, Dec 2, 2015 at 9:43 AM, Toshi Kani <toshi.kani@hpe.com> wrote:
> Oh, I see.  I will setup the memmap array and run the tests again.
>
> But, why does the PMD mapping depend on the memmap array?  We have observed
> major performance improvement with PMD.  This feature should always be enabled
> with DAX regardless of the option to allocate the memmap array.
>

Several factors drove this decision; I'm open to considering
alternatives, but here's the reasoning:

1/ DAX pmd mappings caused crashes in the get_user_pages path leading
to commit e82c9ed41e8 "dax: disable pmd mappings".  The reason pte
mappings don't crash and instead trigger -EFAULT is due to the
_PAGE_SPECIAL pte bit.

2/ To enable get_user_pages for DAX, in both the page and huge-page
case, we need a new pte bit _PAGE_DEVMAP.

3/ Given that pte bits are hard to come by, I'm assuming we won't get two,
i.e. both _PAGE_DEVMAP and a new _PAGE_SPECIAL for pmds.  Even if we
could get a _PAGE_SPECIAL for pmds I'm not in favor of pursuing it.

The end result is that DAX pmd mappings must be fully enabled through the
get_user_pages paths with _PAGE_DEVMAP, or turned off completely.  In
general I think the "page less" DAX implementation was a good starting
point, but we need to shift to page-backed by default until we can
teach more of the kernel to operate on bare pfns.  That "default" will
need to be enforced by userspace tooling.
Kani, Toshi Dec. 2, 2015, 5:43 p.m. UTC | #7
On Tue, 2015-12-01 at 19:45 -0800, Dan Williams wrote:
> On Tue, Dec 1, 2015 at 6:19 PM, Toshi Kani <toshi.kani@hpe.com> wrote:
> > On Mon, 2015-11-30 at 14:08 -0800, Dan Williams wrote:
> > > On Mon, Nov 23, 2015 at 12:04 PM, Toshi Kani <toshi.kani@hpe.com> wrote:
> > > > The following oops was observed when mmap() with MAP_POPULATE
> > > > pre-faulted pmd mappings of a DAX file.  follow_trans_huge_pmd()
> > > > expects that a target address has a struct page.
> > > > 
> > > >   BUG: unable to handle kernel paging request at ffffea0012220000
> > > >   follow_trans_huge_pmd+0xba/0x390
> > > >   follow_page_mask+0x33d/0x420
> > > >   __get_user_pages+0xdc/0x800
> > > >   populate_vma_page_range+0xb5/0xe0
> > > >   __mm_populate+0xc5/0x150
> > > >   vm_mmap_pgoff+0xd5/0xe0
> > > >   SyS_mmap_pgoff+0x1c1/0x290
> > > >   SyS_mmap+0x1b/0x30
> > > > 
> > > > Fix it by making the PMD pre-fault handling consistent with PTE.
> > > > After pre-faulted in faultin_page(), follow_page_mask() calls
> > > > follow_trans_huge_pmd(), which is changed to call follow_pfn_pmd()
> > > > for VM_PFNMAP or VM_MIXEDMAP.  follow_pfn_pmd() handles FOLL_TOUCH
> > > > and returns with -EEXIST.
> > > > 
> > > > Reported-by: Mauricio Porto <mauricio.porto@hpe.com>
> > > > Signed-off-by: Toshi Kani <toshi.kani@hpe.com>
> > > > Cc: Andrew Morton <akpm@linux-foundation.org>
> > > > Cc: Kirill A. Shutemov <kirill.shutemov@linux.intel.com>
> > > > Cc: Matthew Wilcox <willy@linux.intel.com>
> > > > Cc: Dan Williams <dan.j.williams@intel.com>
> > > > Cc: Ross Zwisler <ross.zwisler@linux.intel.com>
> > > > ---
> > > 
> > > Hey Toshi,
> > > 
> > > I ended up fixing this differently with follow_pmd_devmap() introduced
> > > in this series:
> > > 
> > > https://lists.01.org/pipermail/linux-nvdimm/2015-November/003033.html
> > > 
> > > Does the latest libnvdimm-pending branch [1] pass your test case?
> > 
> > Hi Dan,
> > 
> > I ran several test cases, and they all hit the case "pfn not in memmap" in
> > __dax_pmd_fault() during mmap(MAP_POPULATE).  Looking at the dax.pfn,
> > PFN_DEV is
> > set but PFN_MAP is not.  I have not looked into why, but I thought I let you
> > know first.  I've also seen the test thread got hung up at the end sometime.
> 
> That PFN_MAP flag will not be set by default for NFIT-defined
> persistent memory.  See pmem_should_map_pages() for pmem namespaces
> that will have it set by default, currently only e820 type-12 memory
> ranges.
> 
> NFIT-defined persistent memory can have a memmap array dynamically
> allocated by setting up a pfn device (similar to setting up a btt).
> We don't map it by default because the NFIT may describe hundreds of
> gigabytes of persistent and the overhead of the memmap may be too
> large to locate the memmap in ram.

Oh, I see.  I will set up the memmap array and run the tests again.

But why does the PMD mapping depend on the memmap array?  We have observed
a major performance improvement with PMD.  This feature should always be enabled
with DAX regardless of the option to allocate the memmap array.

Thanks,
-Toshi
Dan Williams Dec. 2, 2015, 6:06 p.m. UTC | #8
On Wed, Dec 2, 2015 at 9:01 AM, Dan Williams <dan.j.williams@intel.com> wrote:
> On Wed, Dec 2, 2015 at 9:43 AM, Toshi Kani <toshi.kani@hpe.com> wrote:
>> Oh, I see.  I will setup the memmap array and run the tests again.
>>
>> But, why does the PMD mapping depend on the memmap array?  We have observed
>> major performance improvement with PMD.  This feature should always be enabled
>> with DAX regardless of the option to allocate the memmap array.
>>
>
> Several factors drove this decision, I'm open to considering
> alternatives but here's the reasoning:
>
> 1/ DAX pmd mappings caused crashes in the get_user_pages path leading
> to commit e82c9ed41e8 "dax: disable pmd mappings".  The reason pte
> mappings don't crash and instead trigger -EFAULT is due to the
> _PAGE_SPECIAL pte bit.
>
> 2/ To enable get_user_pages for DAX, in both the page and huge-page
> case, we need a new pte bit _PAGE_DEVMAP.
>
> 3/ Given the pte bits are hard to come I'm assuming we won't get two,
> i.e. both _PAGE_DEVMAP and a new _PAGE_SPECIAL for pmds.  Even if we
> could get a _PAGE_SPECIAL for pmds I'm not in favor of pursuing it.

Actually, Dave says they aren't that hard to come by for pmds, so we
could go add _PMD_SPECIAL if we really wanted to support the limited
page-less DAX-pmd case.

But I'm still of the opinion that we run away from the page-less case
until it can be made a first-class citizen with O_DIRECT for pfn
support.
Dan Williams Dec. 2, 2015, 7 p.m. UTC | #9
On Wed, Dec 2, 2015 at 11:26 AM, Toshi Kani <toshi.kani@hpe.com> wrote:
> On Wed, 2015-12-02 at 10:06 -0800, Dan Williams wrote:
>> On Wed, Dec 2, 2015 at 9:01 AM, Dan Williams <dan.j.williams@intel.com> wrote:
>> > On Wed, Dec 2, 2015 at 9:43 AM, Toshi Kani <toshi.kani@hpe.com> wrote:
>> > > Oh, I see.  I will setup the memmap array and run the tests again.
>> > >
>> > > But, why does the PMD mapping depend on the memmap array?  We have
>> > > observed major performance improvement with PMD.  This feature should
>> > > always be enabled with DAX regardless of the option to allocate the memmap
>> > > array.
>> > >
>> >
>> > Several factors drove this decision, I'm open to considering
>> > alternatives but here's the reasoning:
>> >
>> > 1/ DAX pmd mappings caused crashes in the get_user_pages path leading
>> > to commit e82c9ed41e8 "dax: disable pmd mappings".  The reason pte
>> > mappings don't crash and instead trigger -EFAULT is due to the
>> > _PAGE_SPECIAL pte bit.
>> >
>> > 2/ To enable get_user_pages for DAX, in both the page and huge-page
>> > case, we need a new pte bit _PAGE_DEVMAP.
>> >
>> > 3/ Given the pte bits are hard to come I'm assuming we won't get two,
>> > i.e. both _PAGE_DEVMAP and a new _PAGE_SPECIAL for pmds.  Even if we
>> > could get a _PAGE_SPECIAL for pmds I'm not in favor of pursuing it.
>>
>> Actually, Dave says they aren't that hard to come by for pmds, so we
>> could go add _PMD_SPECIAL if we really wanted to support the limited
>> page-less DAX-pmd case.
>>
>> But I'm still of the opinion that we run away from the page-less case
>> until it can be made a full class citizen with O_DIRECT for pfn
>> support.
>
> I may be missing something, but per vm_normal_page(), I think _PAGE_SPECIAL can
> be substituted by the following check when we do not have the memmap.
>
>         if ((vma->vm_flags & VM_PFNMAP) ||
>             ((vma->vm_flags & VM_MIXEDMAP) && (!pfn_valid(pfn)))) {
>
> This is what I did in this patch for follow_trans_huge_pmd(), although I missed
> the pfn_valid() check.

That works for __get_user_pages but not __get_user_pages_fast where we
don't have access to the vma.
Kani, Toshi Dec. 2, 2015, 7:26 p.m. UTC | #10
On Wed, 2015-12-02 at 10:06 -0800, Dan Williams wrote:
> On Wed, Dec 2, 2015 at 9:01 AM, Dan Williams <dan.j.williams@intel.com> wrote:
> > On Wed, Dec 2, 2015 at 9:43 AM, Toshi Kani <toshi.kani@hpe.com> wrote:
> > > Oh, I see.  I will setup the memmap array and run the tests again.
> > > 
> > > But, why does the PMD mapping depend on the memmap array?  We have 
> > > observed major performance improvement with PMD.  This feature should 
> > > always be enabled with DAX regardless of the option to allocate the memmap
> > > array.
> > > 
> > 
> > Several factors drove this decision, I'm open to considering
> > alternatives but here's the reasoning:
> > 
> > 1/ DAX pmd mappings caused crashes in the get_user_pages path leading
> > to commit e82c9ed41e8 "dax: disable pmd mappings".  The reason pte
> > mappings don't crash and instead trigger -EFAULT is due to the
> > _PAGE_SPECIAL pte bit.
> > 
> > 2/ To enable get_user_pages for DAX, in both the page and huge-page
> > case, we need a new pte bit _PAGE_DEVMAP.
> > 
> > 3/ Given the pte bits are hard to come I'm assuming we won't get two,
> > i.e. both _PAGE_DEVMAP and a new _PAGE_SPECIAL for pmds.  Even if we
> > could get a _PAGE_SPECIAL for pmds I'm not in favor of pursuing it.
> 
> Actually, Dave says they aren't that hard to come by for pmds, so we
> could go add _PMD_SPECIAL if we really wanted to support the limited
> page-less DAX-pmd case.
> 
> But I'm still of the opinion that we run away from the page-less case
> until it can be made a full class citizen with O_DIRECT for pfn
> support.

I may be missing something, but per vm_normal_page(), I think _PAGE_SPECIAL can
be substituted by the following check when we do not have the memmap.

	if ((vma->vm_flags & VM_PFNMAP) ||
	    ((vma->vm_flags & VM_MIXEDMAP) && (!pfn_valid(pfn)))) {

This is what I did in this patch for follow_trans_huge_pmd(), although I missed
the pfn_valid() check.
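
Concretely, the combined check inside follow_trans_huge_pmd() would look
roughly like this (a sketch only; the pfn_valid() part is the piece missing
from the posted patch):

	/* Sketch: treat the pmd as a pfn mapping when the vma is VM_PFNMAP,
	 * or VM_MIXEDMAP with no struct page behind the pfn. */
	if ((vma->vm_flags & VM_PFNMAP) ||
	    ((vma->vm_flags & VM_MIXEDMAP) && !pfn_valid(pmd_pfn(*pmd)))) {
		ret = follow_pfn_pmd(vma, addr, pmd, flags);
		page = ERR_PTR(ret);
		goto out;
	}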

Thanks,
-Toshi
Dan Williams Dec. 2, 2015, 7:57 p.m. UTC | #11
On Wed, Dec 2, 2015 at 12:12 PM, Toshi Kani <toshi.kani@hpe.com> wrote:
> On Wed, 2015-12-02 at 13:02 -0700, Toshi Kani wrote:
>> On Wed, 2015-12-02 at 11:00 -0800, Dan Williams wrote:
>> > On Wed, Dec 2, 2015 at 11:26 AM, Toshi Kani <toshi.kani@hpe.com> wrote:
>> > > On Wed, 2015-12-02 at 10:06 -0800, Dan Williams wrote:
>> > > > On Wed, Dec 2, 2015 at 9:01 AM, Dan Williams <dan.j.williams@intel.com>
>> > > > wrote:
>> > > > > On Wed, Dec 2, 2015 at 9:43 AM, Toshi Kani <toshi.kani@hpe.com> wrote:
>> > > > > > Oh, I see.  I will setup the memmap array and run the tests again.
>> > > > > >
>> > > > > > But, why does the PMD mapping depend on the memmap array?  We have
>> > > > > > observed major performance improvement with PMD.  This feature
>> > > > > > should always be enabled with DAX regardless of the option to
>> > > > > > allocate the memmap array.
>> > > > > >
>> > > > >
>> > > > > Several factors drove this decision, I'm open to considering
>> > > > > alternatives but here's the reasoning:
>> > > > >
>> > > > > 1/ DAX pmd mappings caused crashes in the get_user_pages path leading
>> > > > > to commit e82c9ed41e8 "dax: disable pmd mappings".  The reason pte
>> > > > > mappings don't crash and instead trigger -EFAULT is due to the
>> > > > > _PAGE_SPECIAL pte bit.
>> > > > >
>> > > > > 2/ To enable get_user_pages for DAX, in both the page and huge-page
>> > > > > case, we need a new pte bit _PAGE_DEVMAP.
>> > > > >
>> > > > > 3/ Given the pte bits are hard to come I'm assuming we won't get two,
>> > > > > i.e. both _PAGE_DEVMAP and a new _PAGE_SPECIAL for pmds.  Even if we
>> > > > > could get a _PAGE_SPECIAL for pmds I'm not in favor of pursuing it.
>> > > >
>> > > > Actually, Dave says they aren't that hard to come by for pmds, so we
>> > > > could go add _PMD_SPECIAL if we really wanted to support the limited
>> > > > page-less DAX-pmd case.
>> > > >
>> > > > But I'm still of the opinion that we run away from the page-less case
>> > > > until it can be made a full class citizen with O_DIRECT for pfn
>> > > > support.
>> > >
>> > > I may be missing something, but per vm_normal_page(), I think
>> > > _PAGE_SPECIAL can be substituted by the following check when we do not
>> > > have the memmap.
>> > >
>> > >         if ((vma->vm_flags & VM_PFNMAP) ||
>> > >             ((vma->vm_flags & VM_MIXEDMAP) && (!pfn_valid(pfn)))) {
>> > >
>> > > This is what I did in this patch for follow_trans_huge_pmd(), although I
>> > > missed the pfn_valid() check.
>> >
>> > That works for __get_user_pages but not __get_user_pages_fast where we
>> > don't have access to the vma.
>>
>> __get_user_page_fast already refers current->mm, so we should be able to get
>> the vma, and pass it down to gup_pud_range().
>
> Alternatively, we can obtain the vma from current->mm in gup_huge_pmd() when the
> !pfn_valid() condition is met, so that we do not add the code to the main path
> of __get_user_pages_fast.

The whole point of __get_user_pages_fast() is to avoid the overhead of
taking the mm semaphore to access the vma.  _PAGE_SPECIAL simply tells
__get_user_pages_fast that it needs to fall back to the
__get_user_pages slow path.
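
Roughly, that bail-out looks like this (simplified from the x86
gup_pte_range() of that era, not the exact code):

	/* A non-present or "special" pte -- one with no struct page
	 * behind it -- makes the lockless fast path give up.  Returning
	 * 0 sends the caller back to the __get_user_pages slow path,
	 * which does have access to the vma. */
	pte_t pte = *ptep;

	if (!pte_present(pte) || pte_special(pte))
		return 0;	/* fall back to the slow path */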
Kani, Toshi Dec. 2, 2015, 8:02 p.m. UTC | #12
On Wed, 2015-12-02 at 11:00 -0800, Dan Williams wrote:
> On Wed, Dec 2, 2015 at 11:26 AM, Toshi Kani <toshi.kani@hpe.com> wrote:
> > On Wed, 2015-12-02 at 10:06 -0800, Dan Williams wrote:
> > > On Wed, Dec 2, 2015 at 9:01 AM, Dan Williams <dan.j.williams@intel.com>
> > > wrote:
> > > > On Wed, Dec 2, 2015 at 9:43 AM, Toshi Kani <toshi.kani@hpe.com> wrote:
> > > > > Oh, I see.  I will setup the memmap array and run the tests again.
> > > > > 
> > > > > But, why does the PMD mapping depend on the memmap array?  We have
> > > > > observed major performance improvement with PMD.  This feature should
> > > > > always be enabled with DAX regardless of the option to allocate the
> > > > > memmap
> > > > > array.
> > > > > 
> > > > 
> > > > Several factors drove this decision, I'm open to considering
> > > > alternatives but here's the reasoning:
> > > > 
> > > > 1/ DAX pmd mappings caused crashes in the get_user_pages path leading
> > > > to commit e82c9ed41e8 "dax: disable pmd mappings".  The reason pte
> > > > mappings don't crash and instead trigger -EFAULT is due to the
> > > > _PAGE_SPECIAL pte bit.
> > > > 
> > > > 2/ To enable get_user_pages for DAX, in both the page and huge-page
> > > > case, we need a new pte bit _PAGE_DEVMAP.
> > > > 
> > > > 3/ Given the pte bits are hard to come I'm assuming we won't get two,
> > > > i.e. both _PAGE_DEVMAP and a new _PAGE_SPECIAL for pmds.  Even if we
> > > > could get a _PAGE_SPECIAL for pmds I'm not in favor of pursuing it.
> > > 
> > > Actually, Dave says they aren't that hard to come by for pmds, so we
> > > could go add _PMD_SPECIAL if we really wanted to support the limited
> > > page-less DAX-pmd case.
> > > 
> > > But I'm still of the opinion that we run away from the page-less case
> > > until it can be made a full class citizen with O_DIRECT for pfn
> > > support.
> > 
> > I may be missing something, but per vm_normal_page(), I think _PAGE_SPECIAL
> > can
> > be substituted by the following check when we do not have the memmap.
> > 
> >         if ((vma->vm_flags & VM_PFNMAP) ||
> >             ((vma->vm_flags & VM_MIXEDMAP) && (!pfn_valid(pfn)))) {
> > 
> > This is what I did in this patch for follow_trans_huge_pmd(), although I
> > missed
> > the pfn_valid() check.
> 
> That works for __get_user_pages but not __get_user_pages_fast where we
> don't have access to the vma.

__get_user_pages_fast() already refers to current->mm, so we should be able to
get the vma and pass it down to gup_pud_range().

Thanks,
-Toshi
Kani, Toshi Dec. 2, 2015, 8:12 p.m. UTC | #13
On Wed, 2015-12-02 at 13:02 -0700, Toshi Kani wrote:
> On Wed, 2015-12-02 at 11:00 -0800, Dan Williams wrote:
> > On Wed, Dec 2, 2015 at 11:26 AM, Toshi Kani <toshi.kani@hpe.com> wrote:
> > > On Wed, 2015-12-02 at 10:06 -0800, Dan Williams wrote:
> > > > On Wed, Dec 2, 2015 at 9:01 AM, Dan Williams <dan.j.williams@intel.com>
> > > > wrote:
> > > > > On Wed, Dec 2, 2015 at 9:43 AM, Toshi Kani <toshi.kani@hpe.com> wrote:
> > > > > > Oh, I see.  I will setup the memmap array and run the tests again.
> > > > > > 
> > > > > > But, why does the PMD mapping depend on the memmap array?  We have
> > > > > > observed major performance improvement with PMD.  This feature 
> > > > > > should always be enabled with DAX regardless of the option to 
> > > > > > allocate the memmap array.
> > > > > > 
> > > > > 
> > > > > Several factors drove this decision, I'm open to considering
> > > > > alternatives but here's the reasoning:
> > > > > 
> > > > > 1/ DAX pmd mappings caused crashes in the get_user_pages path leading
> > > > > to commit e82c9ed41e8 "dax: disable pmd mappings".  The reason pte
> > > > > mappings don't crash and instead trigger -EFAULT is due to the
> > > > > _PAGE_SPECIAL pte bit.
> > > > > 
> > > > > 2/ To enable get_user_pages for DAX, in both the page and huge-page
> > > > > case, we need a new pte bit _PAGE_DEVMAP.
> > > > > 
> > > > > 3/ Given the pte bits are hard to come I'm assuming we won't get two,
> > > > > i.e. both _PAGE_DEVMAP and a new _PAGE_SPECIAL for pmds.  Even if we
> > > > > could get a _PAGE_SPECIAL for pmds I'm not in favor of pursuing it.
> > > > 
> > > > Actually, Dave says they aren't that hard to come by for pmds, so we
> > > > could go add _PMD_SPECIAL if we really wanted to support the limited
> > > > page-less DAX-pmd case.
> > > > 
> > > > But I'm still of the opinion that we run away from the page-less case
> > > > until it can be made a full class citizen with O_DIRECT for pfn
> > > > support.
> > > 
> > > I may be missing something, but per vm_normal_page(), I think 
> > > _PAGE_SPECIAL can be substituted by the following check when we do not
> > > have the memmap.
> > > 
> > >         if ((vma->vm_flags & VM_PFNMAP) ||
> > >             ((vma->vm_flags & VM_MIXEDMAP) && (!pfn_valid(pfn)))) {
> > > 
> > > This is what I did in this patch for follow_trans_huge_pmd(), although I
> > > missed the pfn_valid() check.
> > 
> > That works for __get_user_pages but not __get_user_pages_fast where we
> > don't have access to the vma.
> 
> __get_user_page_fast already refers current->mm, so we should be able to get 
> the vma, and pass it down to gup_pud_range().

Alternatively, we can obtain the vma from current->mm in gup_huge_pmd() when the
!pfn_valid() condition is met, so that we do not add the code to the main path
of __get_user_pages_fast.

Thanks,
-Toshi
Dan Williams Dec. 2, 2015, 8:54 p.m. UTC | #14
On Wed, Dec 2, 2015 at 1:37 PM, Toshi Kani <toshi.kani@hpe.com> wrote:
> On Wed, 2015-12-02 at 11:57 -0800, Dan Williams wrote:
[..]
>> The whole point of __get_user_page_fast() is to avoid the overhead of
>> taking the mm semaphore to access the vma.  _PAGE_SPECIAL simply tells
>> __get_user_pages_fast that it needs to fallback to the
>> __get_user_pages slow path.
>
> I see.  Then, I think gup_huge_pmd() can simply return 0 when !pfn_valid(),
> instead of VM_BUG_ON.

Is pfn_valid() a reliable check?  It seems to be based on a max_pfn
per node... what happens when pmem is located below that point.  I
haven't been able to convince myself that we won't get false
positives, but maybe I'm missing something.
Kani, Toshi Dec. 2, 2015, 9:37 p.m. UTC | #15
On Wed, 2015-12-02 at 11:57 -0800, Dan Williams wrote:
> On Wed, Dec 2, 2015 at 12:12 PM, Toshi Kani <toshi.kani@hpe.com> wrote:
> > On Wed, 2015-12-02 at 13:02 -0700, Toshi Kani wrote:
> > > On Wed, 2015-12-02 at 11:00 -0800, Dan Williams wrote:
> > > > On Wed, Dec 2, 2015 at 11:26 AM, Toshi Kani <toshi.kani@hpe.com> wrote:
> > > > > On Wed, 2015-12-02 at 10:06 -0800, Dan Williams wrote:
> > > > > > On Wed, Dec 2, 2015 at 9:01 AM, Dan Williams <
> > > > > > dan.j.williams@intel.com>
> > > > > > wrote:
> > > > > > > On Wed, Dec 2, 2015 at 9:43 AM, Toshi Kani <toshi.kani@hpe.com>
> > > > > > > wrote:
> > > > > > > > Oh, I see.  I will setup the memmap array and run the tests
> > > > > > > > again.
> > > > > > > > 
> > > > > > > > But, why does the PMD mapping depend on the memmap array?  We 
> > > > > > > > have observed major performance improvement with PMD.  This 
> > > > > > > > feature should always be enabled with DAX regardless of the 
> > > > > > > > option to allocate the memmap array.
> > > > > > > > 
> > > > > > > 
> > > > > > > Several factors drove this decision, I'm open to considering
> > > > > > > alternatives but here's the reasoning:
> > > > > > > 
> > > > > > > 1/ DAX pmd mappings caused crashes in the get_user_pages path 
> > > > > > > leading to commit e82c9ed41e8 "dax: disable pmd mappings".  The 
> > > > > > > reason pte mappings don't crash and instead trigger -EFAULT is due 
> > > > > > > to the _PAGE_SPECIAL pte bit.
> > > > > > > 
> > > > > > > 2/ To enable get_user_pages for DAX, in both the page and huge
> > > > > > > -page case, we need a new pte bit _PAGE_DEVMAP.
> > > > > > > 
> > > > > > > 3/ Given the pte bits are hard to come I'm assuming we won't get 
> > > > > > > two, i.e. both _PAGE_DEVMAP and a new _PAGE_SPECIAL for pmds. 
> > > > > > >  Even if we could get a _PAGE_SPECIAL for pmds I'm not in favor of 
> > > > > > > pursuing it.
> > > > > > 
> > > > > > Actually, Dave says they aren't that hard to come by for pmds, so we
> > > > > > could go add _PMD_SPECIAL if we really wanted to support the limited
> > > > > > page-less DAX-pmd case.
> > > > > > 
> > > > > > But I'm still of the opinion that we run away from the page-less 
> > > > > > case until it can be made a full class citizen with O_DIRECT for pfn
> > > > > > support.
> > > > > 
> > > > > I may be missing something, but per vm_normal_page(), I think
> > > > > _PAGE_SPECIAL can be substituted by the following check when we do not
> > > > > have the memmap.
> > > > > 
> > > > >         if ((vma->vm_flags & VM_PFNMAP) ||
> > > > >             ((vma->vm_flags & VM_MIXEDMAP) && (!pfn_valid(pfn)))) {
> > > > > 
> > > > > This is what I did in this patch for follow_trans_huge_pmd(), although 
> > > > > I missed the pfn_valid() check.
> > > > 
> > > > That works for __get_user_pages but not __get_user_pages_fast where we
> > > > don't have access to the vma.
> > > 
> > > __get_user_page_fast already refers current->mm, so we should be able to 
> > > get the vma, and pass it down to gup_pud_range().
> > 
> > Alternatively, we can obtain the vma from current->mm in gup_huge_pmd() when 
> > the !pfn_valid() condition is met, so that we do not add the code to the 
> > main path of __get_user_pages_fast.
> 
> The whole point of __get_user_page_fast() is to avoid the overhead of
> taking the mm semaphore to access the vma.  _PAGE_SPECIAL simply tells
> __get_user_pages_fast that it needs to fallback to the
> __get_user_pages slow path.

I see.  Then, I think gup_huge_pmd() can simply return 0 when !pfn_valid(),
instead of VM_BUG_ON.
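
Something like the following, as a sketch of that suggestion against the x86
gup_huge_pmd() (illustrative only):

	/* Sketch: bail to the slow path instead of tripping VM_BUG_ON
	 * when the huge pmd's pfn has no memmap entry. */
	if (!pfn_valid(pmd_pfn(pmd)))
		return 0;	/* fall back to __get_user_pages() */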

Thanks,
-Toshi
Kani, Toshi Dec. 2, 2015, 9:55 p.m. UTC | #16
On Wed, 2015-12-02 at 12:54 -0800, Dan Williams wrote:
> On Wed, Dec 2, 2015 at 1:37 PM, Toshi Kani <toshi.kani@hpe.com> wrote:
> > On Wed, 2015-12-02 at 11:57 -0800, Dan Williams wrote:
> [..]
> > > The whole point of __get_user_page_fast() is to avoid the overhead of
> > > taking the mm semaphore to access the vma.  _PAGE_SPECIAL simply tells
> > > __get_user_pages_fast that it needs to fallback to the
> > > __get_user_pages slow path.
> > 
> > I see.  Then, I think gup_huge_pmd() can simply return 0 when !pfn_valid(),
> > instead of VM_BUG_ON.
> 
> Is pfn_valid() a reliable check?  It seems to be based on a max_pfn
> per node... what happens when pmem is located below that point.  I
> haven't been able to convince myself that we won't get false
> positives, but maybe I'm missing something.

I believe we use the version of pfn_valid() in linux/mmzone.h.

Thanks,
-Toshi
Dave Hansen Dec. 2, 2015, 10 p.m. UTC | #17
On 12/02/2015 12:54 PM, Dan Williams wrote:
> On Wed, Dec 2, 2015 at 1:37 PM, Toshi Kani <toshi.kani@hpe.com> wrote:
>> > On Wed, 2015-12-02 at 11:57 -0800, Dan Williams wrote:
> [..]
>>> >> The whole point of __get_user_page_fast() is to avoid the overhead of
>>> >> taking the mm semaphore to access the vma.  _PAGE_SPECIAL simply tells
>>> >> __get_user_pages_fast that it needs to fallback to the
>>> >> __get_user_pages slow path.
>> >
>> > I see.  Then, I think gup_huge_pmd() can simply return 0 when !pfn_valid(),
>> > instead of VM_BUG_ON.
> Is pfn_valid() a reliable check?  It seems to be based on a max_pfn
> per node... what happens when pmem is located below that point.  I
> haven't been able to convince myself that we won't get false
> positives, but maybe I'm missing something.

With sparsemem at least, it makes sure that you're looking at a valid
_section_.  See the pfn_valid() at ~include/linux/mmzone.h:1222.
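
For reference, the sparsemem definition being referred to is roughly
(paraphrasing the CONFIG_SPARSEMEM branch of include/linux/mmzone.h):

static inline int pfn_valid(unsigned long pfn)
{
	/* Valid only if the pfn falls inside a present mem_section */
	if (pfn_to_section_nr(pfn) >= NR_MEM_SECTIONS)
		return 0;
	return valid_section(__nr_to_section(pfn_to_section_nr(pfn)));
}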
Dan Williams Dec. 2, 2015, 10:03 p.m. UTC | #18
On Wed, Dec 2, 2015 at 2:00 PM, Dave Hansen <dave.hansen@intel.com> wrote:
> On 12/02/2015 12:54 PM, Dan Williams wrote:
>> On Wed, Dec 2, 2015 at 1:37 PM, Toshi Kani <toshi.kani@hpe.com> wrote:
>>> > On Wed, 2015-12-02 at 11:57 -0800, Dan Williams wrote:
>> [..]
>>>> >> The whole point of __get_user_page_fast() is to avoid the overhead of
>>>> >> taking the mm semaphore to access the vma.  _PAGE_SPECIAL simply tells
>>>> >> __get_user_pages_fast that it needs to fallback to the
>>>> >> __get_user_pages slow path.
>>> >
>>> > I see.  Then, I think gup_huge_pmd() can simply return 0 when !pfn_valid(),
>>> > instead of VM_BUG_ON.
>> Is pfn_valid() a reliable check?  It seems to be based on a max_pfn
>> per node... what happens when pmem is located below that point.  I
>> haven't been able to convince myself that we won't get false
>> positives, but maybe I'm missing something.
>
> With sparsemem at least, it makes sure that you're looking at a valid
> _section_.  See the pfn_valid() at ~include/linux/mmzone.h:1222.

At a minimum we would need to add "depends on SPARSEMEM" to "config FS_DAX_PMD".
Dave Hansen Dec. 2, 2015, 10:09 p.m. UTC | #19
On 12/02/2015 02:03 PM, Dan Williams wrote:
>>> >> Is pfn_valid() a reliable check?  It seems to be based on a max_pfn
>>> >> per node... what happens when pmem is located below that point.  I
>>> >> haven't been able to convince myself that we won't get false
>>> >> positives, but maybe I'm missing something.
>> >
>> > With sparsemem at least, it makes sure that you're looking at a valid
>> > _section_.  See the pfn_valid() at ~include/linux/mmzone.h:1222.
> At a minimum we would need to add "depends on SPARSEMEM" to "config FS_DAX_PMD".

Yeah, it seems like an awful layering violation.  But, sparsemem is
turned on everywhere (all the distros/users) that we care about, as far
as I know.
Dan Williams Dec. 2, 2015, 11:33 p.m. UTC | #20
On Wed, Dec 2, 2015 at 4:21 PM, Toshi Kani <toshi.kani@hpe.com> wrote:
> On Wed, 2015-12-02 at 10:43 -0700, Toshi Kani wrote:
>> On Tue, 2015-12-01 at 19:45 -0800, Dan Williams wrote:
>> > On Tue, Dec 1, 2015 at 6:19 PM, Toshi Kani <toshi.kani@hpe.com> wrote:
>> > > On Mon, 2015-11-30 at 14:08 -0800, Dan Williams wrote:
>  :
>> > > >
>> > > > Hey Toshi,
>> > > >
>> > > > I ended up fixing this differently with follow_pmd_devmap() introduced
>> > > > in this series:
>> > > >
>> > > > https://lists.01.org/pipermail/linux-nvdimm/2015-November/003033.html
>> > > >
>> > > > Does the latest libnvdimm-pending branch [1] pass your test case?
>> > >
>> > > Hi Dan,
>> > >
>> > > I ran several test cases, and they all hit the case "pfn not in memmap" in
>> > > __dax_pmd_fault() during mmap(MAP_POPULATE).  Looking at the dax.pfn,
>> > > PFN_DEV is set but PFN_MAP is not.  I have not looked into why, but I
>> > > thought I let you know first.  I've also seen the test thread got hung up
>> > > at the end sometime.
>> >
>> > That PFN_MAP flag will not be set by default for NFIT-defined
>> > persistent memory.  See pmem_should_map_pages() for pmem namespaces
>> > that will have it set by default, currently only e820 type-12 memory
>> > ranges.
>> >
>> > NFIT-defined persistent memory can have a memmap array dynamically
>> > allocated by setting up a pfn device (similar to setting up a btt).
>> > We don't map it by default because the NFIT may describe hundreds of
>> > gigabytes of persistent and the overhead of the memmap may be too
>> > large to locate the memmap in ram.
>>
>> Oh, I see.  I will setup the memmap array and run the tests again.
>
> I setup a pfn device, and ran a few test cases again.  Yes, it solved the
> PFN_MAP issue.  However, I am no longer able to allocate FS blocks aligned by
> 2MB, so PMD faults fall back to PTE.  They are off by 2 pages, which I suspect
> due to the pfn metadata.  If I pass a 2MB-aligned+2pages virtual address to
> mmap(MAP_POPULATE), the mmap() call gets hung up.

Ok, I need to switch over from my memmap=ss!nn config.  We just need
to pad the info block reservation to 2M.  As for the MAP_POPULATE
hang, I'll take a look.

Right now I'm in the process of rebasing the whole set on top of -mm
which has a pending THP re-works from Kirill.
Kani, Toshi Dec. 3, 2015, 12:21 a.m. UTC | #21
On Wed, 2015-12-02 at 10:43 -0700, Toshi Kani wrote:
> On Tue, 2015-12-01 at 19:45 -0800, Dan Williams wrote:
> > On Tue, Dec 1, 2015 at 6:19 PM, Toshi Kani <toshi.kani@hpe.com> wrote:
> > > On Mon, 2015-11-30 at 14:08 -0800, Dan Williams wrote:
 :
> > > > 
> > > > Hey Toshi,
> > > > 
> > > > I ended up fixing this differently with follow_pmd_devmap() introduced
> > > > in this series:
> > > > 
> > > > https://lists.01.org/pipermail/linux-nvdimm/2015-November/003033.html
> > > > 
> > > > Does the latest libnvdimm-pending branch [1] pass your test case?
> > > 
> > > Hi Dan,
> > > 
> > > I ran several test cases, and they all hit the case "pfn not in memmap" in
> > > __dax_pmd_fault() during mmap(MAP_POPULATE).  Looking at the dax.pfn,
> > > PFN_DEV is set but PFN_MAP is not.  I have not looked into why, but I 
> > > thought I let you know first.  I've also seen the test thread got hung up 
> > > at the end sometime.
> > 
> > That PFN_MAP flag will not be set by default for NFIT-defined
> > persistent memory.  See pmem_should_map_pages() for pmem namespaces
> > that will have it set by default, currently only e820 type-12 memory
> > ranges.
> > 
> > NFIT-defined persistent memory can have a memmap array dynamically
> > allocated by setting up a pfn device (similar to setting up a btt).
> > We don't map it by default because the NFIT may describe hundreds of
> > gigabytes of persistent and the overhead of the memmap may be too
> > large to locate the memmap in ram.
> 
> Oh, I see.  I will setup the memmap array and run the tests again.

I set up a pfn device and ran a few test cases again.  Yes, it solved the
PFN_MAP issue.  However, I am no longer able to allocate FS blocks aligned on
2MB, so PMD faults fall back to PTE.  They are off by 2 pages, which I suspect
is due to the pfn metadata.  If I pass a 2MB-aligned+2pages virtual address to
mmap(MAP_POPULATE), the mmap() call gets hung up.

Thanks,
-Toshi
Dan Williams Dec. 3, 2015, 11:43 p.m. UTC | #22
On Wed, Dec 2, 2015 at 1:55 PM, Toshi Kani <toshi.kani@hpe.com> wrote:
> On Wed, 2015-12-02 at 12:54 -0800, Dan Williams wrote:
>> On Wed, Dec 2, 2015 at 1:37 PM, Toshi Kani <toshi.kani@hpe.com> wrote:
>> > On Wed, 2015-12-02 at 11:57 -0800, Dan Williams wrote:
>> [..]
>> > > The whole point of __get_user_page_fast() is to avoid the overhead of
>> > > taking the mm semaphore to access the vma.  _PAGE_SPECIAL simply tells
>> > > __get_user_pages_fast that it needs to fallback to the
>> > > __get_user_pages slow path.
>> >
>> > I see.  Then, I think gup_huge_pmd() can simply return 0 when !pfn_valid(),
>> > instead of VM_BUG_ON.
>>
>> Is pfn_valid() a reliable check?  It seems to be based on a max_pfn
>> per node... what happens when pmem is located below that point.  I
>> haven't been able to convince myself that we won't get false
>> positives, but maybe I'm missing something.
>
> I believe we use the version of pfn_valid() in linux/mmzone.h.

Talking this over with Dave, we came to the conclusion that it would be
safer to be explicit about the pmd not being mapped.  He points out
that unless a platform can guarantee that persistent memory is always
section aligned, we might get false positive pfn_valid() indications.
Given that the get_user_pages_fast() path is arch-specific, we can simply
have an arch-specific pmd bit and not worry about generically enabling
a "pmd special" bit for now.
Kani, Toshi Dec. 4, 2015, 4:55 p.m. UTC | #23
On Thu, 2015-12-03 at 15:43 -0800, Dan Williams wrote:
> On Wed, Dec 2, 2015 at 1:55 PM, Toshi Kani <toshi.kani@hpe.com> wrote:
> > On Wed, 2015-12-02 at 12:54 -0800, Dan Williams wrote:
> > > On Wed, Dec 2, 2015 at 1:37 PM, Toshi Kani <toshi.kani@hpe.com>
> > > wrote:
> > > > On Wed, 2015-12-02 at 11:57 -0800, Dan Williams wrote:
> > > [..]
> > > > > The whole point of __get_user_page_fast() is to avoid the 
> > > > > overhead of taking the mm semaphore to access the vma. 
> > > > >  _PAGE_SPECIAL simply tells
> > > > > __get_user_pages_fast that it needs to fallback to the
> > > > > __get_user_pages slow path.
> > > > 
> > > > I see.  Then, I think gup_huge_pmd() can simply return 0 when 
> > > > !pfn_valid(), instead of VM_BUG_ON.
> > > 
> > > Is pfn_valid() a reliable check?  It seems to be based on a max_pfn
> > > per node... what happens when pmem is located below that point.  I
> > > haven't been able to convince myself that we won't get false
> > > positives, but maybe I'm missing something.
> > 
> > I believe we use the version of pfn_valid() in linux/mmzone.h.
> 
> Talking this over with Dave we came to the conclusion that it would be
> safer to be explicit about the pmd not being mapped.  He points out
> that unless a platform can guarantee that persistent memory is always
> section aligned we might get false positive pfn_valid() indications.
> Given the get_user_pages_fast() path is arch specific we can simply
> have an arch specific pmd bit and not worry about generically enabling
> a "pmd special" bit for now.

Sounds good to me.  Thanks!
-Toshi

Patch

diff --git a/mm/huge_memory.c b/mm/huge_memory.c
index d5b8920..f56e034 100644
--- a/mm/huge_memory.c
+++ b/mm/huge_memory.c
@@ -1267,6 +1267,32 @@  out_unlock:
 	return ret;
 }
 
+/*
+ * Follow a pmd inserted by vmf_insert_pfn_pmd(). See follow_pfn_pte() for pte.
+ */
+static int follow_pfn_pmd(struct vm_area_struct *vma, unsigned long address,
+		pmd_t *pmd, unsigned int flags)
+{
+	/* No page to get reference */
+	if (flags & FOLL_GET)
+		return -EFAULT;
+
+	if (flags & FOLL_TOUCH) {
+		pmd_t entry = *pmd;
+
+		/* Set the dirty bit per follow_trans_huge_pmd() */
+		entry = pmd_mkyoung(pmd_mkdirty(entry));
+
+		if (!pmd_same(*pmd, entry)) {
+			set_pmd_at(vma->vm_mm, address, pmd, entry);
+			update_mmu_cache_pmd(vma, address, pmd);
+		}
+	}
+
+	/* Proper page table entry exists, but no corresponding struct page */
+	return -EEXIST;
+}
+
 struct page *follow_trans_huge_pmd(struct vm_area_struct *vma,
 				   unsigned long addr,
 				   pmd_t *pmd,
@@ -1274,6 +1300,7 @@  struct page *follow_trans_huge_pmd(struct vm_area_struct *vma,
 {
 	struct mm_struct *mm = vma->vm_mm;
 	struct page *page = NULL;
+	int ret;
 
 	assert_spin_locked(pmd_lockptr(mm, pmd));
 
@@ -1288,6 +1315,13 @@  struct page *follow_trans_huge_pmd(struct vm_area_struct *vma,
 	if ((flags & FOLL_NUMA) && pmd_protnone(*pmd))
 		goto out;
 
+	/* pfn map does not have a struct page */
+	if (vma->vm_flags & (VM_PFNMAP | VM_MIXEDMAP)) {
+		ret = follow_pfn_pmd(vma, addr, pmd, flags);
+		page = ERR_PTR(ret);
+		goto out;
+	}
+
 	page = pmd_page(*pmd);
 	VM_BUG_ON_PAGE(!PageHead(page), page);
 	if (flags & FOLL_TOUCH) {