Message ID | 1448309082-20851-1-git-send-email-toshi.kani@hpe.com (mailing list archive) |
---|---|
State | Superseded |
On Mon, Nov 23, 2015 at 12:04 PM, Toshi Kani <toshi.kani@hpe.com> wrote: > The following oops was observed when mmap() with MAP_POPULATE > pre-faulted pmd mappings of a DAX file. follow_trans_huge_pmd() > expects that a target address has a struct page. > > BUG: unable to handle kernel paging request at ffffea0012220000 > follow_trans_huge_pmd+0xba/0x390 > follow_page_mask+0x33d/0x420 > __get_user_pages+0xdc/0x800 > populate_vma_page_range+0xb5/0xe0 > __mm_populate+0xc5/0x150 > vm_mmap_pgoff+0xd5/0xe0 > SyS_mmap_pgoff+0x1c1/0x290 > SyS_mmap+0x1b/0x30 > > Fix it by making the PMD pre-fault handling consistent with PTE. > After pre-faulted in faultin_page(), follow_page_mask() calls > follow_trans_huge_pmd(), which is changed to call follow_pfn_pmd() > for VM_PFNMAP or VM_MIXEDMAP. follow_pfn_pmd() handles FOLL_TOUCH > and returns with -EEXIST. As of 4.4.-rc2 DAX pmd mappings are disabled. So we have time to do something more comprehensive in 4.5. > > Reported-by: Mauricio Porto <mauricio.porto@hpe.com> > Signed-off-by: Toshi Kani <toshi.kani@hpe.com> > Cc: Andrew Morton <akpm@linux-foundation.org> > Cc: Kirill A. Shutemov <kirill.shutemov@linux.intel.com> > Cc: Matthew Wilcox <willy@linux.intel.com> > Cc: Dan Williams <dan.j.williams@intel.com> > Cc: Ross Zwisler <ross.zwisler@linux.intel.com> > --- > mm/huge_memory.c | 34 ++++++++++++++++++++++++++++++++++ > 1 file changed, 34 insertions(+) > > diff --git a/mm/huge_memory.c b/mm/huge_memory.c > index d5b8920..f56e034 100644 > --- a/mm/huge_memory.c > +++ b/mm/huge_memory.c [..] > @@ -1288,6 +1315,13 @@ struct page *follow_trans_huge_pmd(struct vm_area_struct *vma, > if ((flags & FOLL_NUMA) && pmd_protnone(*pmd)) > goto out; > > + /* pfn map does not have a struct page */ > + if (vma->vm_flags & (VM_PFNMAP | VM_MIXEDMAP)) { > + ret = follow_pfn_pmd(vma, addr, pmd, flags); > + page = ERR_PTR(ret); > + goto out; > + } > + > page = pmd_page(*pmd); > VM_BUG_ON_PAGE(!PageHead(page), page); > if (flags & FOLL_TOUCH) { I think it is already problematic that dax pmd mappings are getting confused with transparent huge pages. They're more closely related to a hugetlbfs pmd mappings in that they are mapping an explicit allocation. I have some pending patches to address this dax-pmd vs hugetlb-pmd vs thp-pmd classification that I will post shortly. By the way, I'm collecting DAX pmd regression tests [1], is this just a simple crash upon using MAP_POPULATE? [1]: https://github.com/pmem/ndctl/blob/master/lib/test-dax-pmd.c
On Mon, 2015-11-23 at 12:53 -0800, Dan Williams wrote: > On Mon, Nov 23, 2015 at 12:04 PM, Toshi Kani <toshi.kani@hpe.com> wrote: > > The following oops was observed when mmap() with MAP_POPULATE > > pre-faulted pmd mappings of a DAX file. follow_trans_huge_pmd() > > expects that a target address has a struct page. > > > > BUG: unable to handle kernel paging request at ffffea0012220000 > > follow_trans_huge_pmd+0xba/0x390 > > follow_page_mask+0x33d/0x420 > > __get_user_pages+0xdc/0x800 > > populate_vma_page_range+0xb5/0xe0 > > __mm_populate+0xc5/0x150 > > vm_mmap_pgoff+0xd5/0xe0 > > SyS_mmap_pgoff+0x1c1/0x290 > > SyS_mmap+0x1b/0x30 > > > > Fix it by making the PMD pre-fault handling consistent with PTE. > > After pre-faulted in faultin_page(), follow_page_mask() calls > > follow_trans_huge_pmd(), which is changed to call follow_pfn_pmd() > > for VM_PFNMAP or VM_MIXEDMAP. follow_pfn_pmd() handles FOLL_TOUCH > > and returns with -EEXIST. > > As of 4.4.-rc2 DAX pmd mappings are disabled. So we have time to do > something more comprehensive in 4.5. Yes, I noticed during my testing that I could not use pmd... > > Reported-by: Mauricio Porto <mauricio.porto@hpe.com> > > Signed-off-by: Toshi Kani <toshi.kani@hpe.com> > > Cc: Andrew Morton <akpm@linux-foundation.org> > > Cc: Kirill A. Shutemov <kirill.shutemov@linux.intel.com> > > Cc: Matthew Wilcox <willy@linux.intel.com> > > Cc: Dan Williams <dan.j.williams@intel.com> > > Cc: Ross Zwisler <ross.zwisler@linux.intel.com> > > --- > > mm/huge_memory.c | 34 ++++++++++++++++++++++++++++++++++ > > 1 file changed, 34 insertions(+) > > > > diff --git a/mm/huge_memory.c b/mm/huge_memory.c > > index d5b8920..f56e034 100644 > > --- a/mm/huge_memory.c > > +++ b/mm/huge_memory.c > [..] > > @@ -1288,6 +1315,13 @@ struct page *follow_trans_huge_pmd(struct > > vm_area_struct *vma, > > if ((flags & FOLL_NUMA) && pmd_protnone(*pmd)) > > goto out; > > > > + /* pfn map does not have a struct page */ > > + if (vma->vm_flags & (VM_PFNMAP | VM_MIXEDMAP)) { > > + ret = follow_pfn_pmd(vma, addr, pmd, flags); > > + page = ERR_PTR(ret); > > + goto out; > > + } > > + > > page = pmd_page(*pmd); > > VM_BUG_ON_PAGE(!PageHead(page), page); > > if (flags & FOLL_TOUCH) { > > I think it is already problematic that dax pmd mappings are getting > confused with transparent huge pages. We had the same issue with dax pte mapping [1], and this change extends the pfn map handling to pmd. So, this problem is not specific to pmd. [1] https://lkml.org/lkml/2015/6/23/181 > They're more closely related to > a hugetlbfs pmd mappings in that they are mapping an explicit > allocation. I have some pending patches to address this dax-pmd vs > hugetlb-pmd vs thp-pmd classification that I will post shortly. Not sure which way is better, but I am certainly interested in your changes. > By the way, I'm collecting DAX pmd regression tests [1], is this just > a simple crash upon using MAP_POPULATE? > > [1]: https://github.com/pmem/ndctl/blob/master/lib/test-dax-pmd.c Yes, this issue is easy to reproduce with MAP_POPULATE. In case it helps, attached are the test I used for testing the patches. Sorry, the code is messy since it was only intended for my internal use... - The test was originally written for the pte change [1] and comments in test.sh (ex. mlock fail, ok) reflect the results without the pte change. - For the pmd test, I modified test-mmap.c to call posix_memalign() before mmap(). By calling free(), the 2MB-aligned address from posix_memalign() can be used for mmap(). 
This keeps the mmap'd address aligned on 2MB.
- I created test file(s) with dd (i.e. all blocks written) in my test.
- The other infinite loop issue (fixed by my other patch) was found by the test case with option "-LMSr".

Thanks,
-Toshi
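In case a concrete picture helps, below is a minimal, hypothetical sketch of the alignment trick described above: posix_memalign() to obtain a 2MB-aligned address, free() so the range is available again, then reuse it as the mmap() hint with MAP_POPULATE. The file path, the 2MB sizes, and the assumption that the file was pre-created with dd on a DAX mount are illustrative only; this is not the actual test-mmap.c attached to the mail.

```c
#include <fcntl.h>
#include <stdio.h>
#include <stdlib.h>
#include <sys/mman.h>
#include <unistd.h>

#define PMD_SIZE	(2UL * 1024 * 1024)	/* 2MB */

int main(void)
{
	void *hint, *addr;
	int fd;

	/* Reserve a 2MB-aligned block, then release it; the freed range
	 * can then be handed to mmap() as an aligned hint address. */
	if (posix_memalign(&hint, PMD_SIZE, PMD_SIZE))
		return 1;
	free(hint);

	/* Assumed: the file lives on a DAX-mounted fs, was created with
	 * dd so all blocks are allocated, and is at least 2MB long. */
	fd = open("/mnt/dax/testfile", O_RDWR);
	if (fd < 0) {
		perror("open");
		return 1;
	}

	/* MAP_POPULATE pre-faults the mapping; a 2MB-aligned address and
	 * file offset 0 let DAX attempt a pmd (huge page) mapping. */
	addr = mmap(hint, PMD_SIZE, PROT_READ | PROT_WRITE,
		    MAP_SHARED | MAP_POPULATE, fd, 0);
	if (addr == MAP_FAILED) {
		perror("mmap");
		return 1;
	}

	*(volatile char *)addr = 1;	/* touch the mapping */

	munmap(addr, PMD_SIZE);
	close(fd);
	return 0;
}
```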
On Mon, Nov 23, 2015 at 12:04 PM, Toshi Kani <toshi.kani@hpe.com> wrote: > The following oops was observed when mmap() with MAP_POPULATE > pre-faulted pmd mappings of a DAX file. follow_trans_huge_pmd() > expects that a target address has a struct page. > > BUG: unable to handle kernel paging request at ffffea0012220000 > follow_trans_huge_pmd+0xba/0x390 > follow_page_mask+0x33d/0x420 > __get_user_pages+0xdc/0x800 > populate_vma_page_range+0xb5/0xe0 > __mm_populate+0xc5/0x150 > vm_mmap_pgoff+0xd5/0xe0 > SyS_mmap_pgoff+0x1c1/0x290 > SyS_mmap+0x1b/0x30 > > Fix it by making the PMD pre-fault handling consistent with PTE. > After pre-faulted in faultin_page(), follow_page_mask() calls > follow_trans_huge_pmd(), which is changed to call follow_pfn_pmd() > for VM_PFNMAP or VM_MIXEDMAP. follow_pfn_pmd() handles FOLL_TOUCH > and returns with -EEXIST. > > Reported-by: Mauricio Porto <mauricio.porto@hpe.com> > Signed-off-by: Toshi Kani <toshi.kani@hpe.com> > Cc: Andrew Morton <akpm@linux-foundation.org> > Cc: Kirill A. Shutemov <kirill.shutemov@linux.intel.com> > Cc: Matthew Wilcox <willy@linux.intel.com> > Cc: Dan Williams <dan.j.williams@intel.com> > Cc: Ross Zwisler <ross.zwisler@linux.intel.com> > --- Hey Toshi, I ended up fixing this differently with follow_pmd_devmap() introduced in this series: https://lists.01.org/pipermail/linux-nvdimm/2015-November/003033.html Does the latest libnvdimm-pending branch [1] pass your test case? [1]: git://git.kernel.org/pub/scm/linux/kernel/git/djbw/nvdimm libnvdimm-pending
On Mon, 2015-11-30 at 14:08 -0800, Dan Williams wrote: > On Mon, Nov 23, 2015 at 12:04 PM, Toshi Kani <toshi.kani@hpe.com> wrote: > > The following oops was observed when mmap() with MAP_POPULATE > > pre-faulted pmd mappings of a DAX file. follow_trans_huge_pmd() > > expects that a target address has a struct page. > > > > BUG: unable to handle kernel paging request at ffffea0012220000 > > follow_trans_huge_pmd+0xba/0x390 > > follow_page_mask+0x33d/0x420 > > __get_user_pages+0xdc/0x800 > > populate_vma_page_range+0xb5/0xe0 > > __mm_populate+0xc5/0x150 > > vm_mmap_pgoff+0xd5/0xe0 > > SyS_mmap_pgoff+0x1c1/0x290 > > SyS_mmap+0x1b/0x30 > > > > Fix it by making the PMD pre-fault handling consistent with PTE. > > After pre-faulted in faultin_page(), follow_page_mask() calls > > follow_trans_huge_pmd(), which is changed to call follow_pfn_pmd() > > for VM_PFNMAP or VM_MIXEDMAP. follow_pfn_pmd() handles FOLL_TOUCH > > and returns with -EEXIST. > > > > Reported-by: Mauricio Porto <mauricio.porto@hpe.com> > > Signed-off-by: Toshi Kani <toshi.kani@hpe.com> > > Cc: Andrew Morton <akpm@linux-foundation.org> > > Cc: Kirill A. Shutemov <kirill.shutemov@linux.intel.com> > > Cc: Matthew Wilcox <willy@linux.intel.com> > > Cc: Dan Williams <dan.j.williams@intel.com> > > Cc: Ross Zwisler <ross.zwisler@linux.intel.com> > > --- > > Hey Toshi, > > I ended up fixing this differently with follow_pmd_devmap() introduced > in this series: > > https://lists.01.org/pipermail/linux-nvdimm/2015-November/003033.html > > Does the latest libnvdimm-pending branch [1] pass your test case? Hi Dan, I ran several test cases, and they all hit the case "pfn not in memmap" in __dax_pmd_fault() during mmap(MAP_POPULATE). Looking at the dax.pfn, PFN_DEV is set but PFN_MAP is not. I have not looked into why, but I thought I let you know first. I've also seen the test thread got hung up at the end sometime. I also noticed that reason is not set in the case below. if (length < PMD_SIZE || (pfn_t_to_pfn(dax.pfn) & PG_PMD_COLOUR)) { dax_unmap_atomic(bdev, &dax); goto fallback; } Thanks, -Toshi
On Tue, Dec 1, 2015 at 6:19 PM, Toshi Kani <toshi.kani@hpe.com> wrote: > On Mon, 2015-11-30 at 14:08 -0800, Dan Williams wrote: >> On Mon, Nov 23, 2015 at 12:04 PM, Toshi Kani <toshi.kani@hpe.com> wrote: >> > The following oops was observed when mmap() with MAP_POPULATE >> > pre-faulted pmd mappings of a DAX file. follow_trans_huge_pmd() >> > expects that a target address has a struct page. >> > >> > BUG: unable to handle kernel paging request at ffffea0012220000 >> > follow_trans_huge_pmd+0xba/0x390 >> > follow_page_mask+0x33d/0x420 >> > __get_user_pages+0xdc/0x800 >> > populate_vma_page_range+0xb5/0xe0 >> > __mm_populate+0xc5/0x150 >> > vm_mmap_pgoff+0xd5/0xe0 >> > SyS_mmap_pgoff+0x1c1/0x290 >> > SyS_mmap+0x1b/0x30 >> > >> > Fix it by making the PMD pre-fault handling consistent with PTE. >> > After pre-faulted in faultin_page(), follow_page_mask() calls >> > follow_trans_huge_pmd(), which is changed to call follow_pfn_pmd() >> > for VM_PFNMAP or VM_MIXEDMAP. follow_pfn_pmd() handles FOLL_TOUCH >> > and returns with -EEXIST. >> > >> > Reported-by: Mauricio Porto <mauricio.porto@hpe.com> >> > Signed-off-by: Toshi Kani <toshi.kani@hpe.com> >> > Cc: Andrew Morton <akpm@linux-foundation.org> >> > Cc: Kirill A. Shutemov <kirill.shutemov@linux.intel.com> >> > Cc: Matthew Wilcox <willy@linux.intel.com> >> > Cc: Dan Williams <dan.j.williams@intel.com> >> > Cc: Ross Zwisler <ross.zwisler@linux.intel.com> >> > --- >> >> Hey Toshi, >> >> I ended up fixing this differently with follow_pmd_devmap() introduced >> in this series: >> >> https://lists.01.org/pipermail/linux-nvdimm/2015-November/003033.html >> >> Does the latest libnvdimm-pending branch [1] pass your test case? > > Hi Dan, > > I ran several test cases, and they all hit the case "pfn not in memmap" in > __dax_pmd_fault() during mmap(MAP_POPULATE). Looking at the dax.pfn, PFN_DEV is > set but PFN_MAP is not. I have not looked into why, but I thought I let you > know first. I've also seen the test thread got hung up at the end sometime. That PFN_MAP flag will not be set by default for NFIT-defined persistent memory. See pmem_should_map_pages() for pmem namespaces that will have it set by default, currently only e820 type-12 memory ranges. NFIT-defined persistent memory can have a memmap array dynamically allocated by setting up a pfn device (similar to setting up a btt). We don't map it by default because the NFIT may describe hundreds of gigabytes of persistent and the overhead of the memmap may be too large to locate the memmap in ram. I have a pending patch in libnvdimm-pending that allows the capacity for the memmap to come from pmem instead of ram: https://git.kernel.org/cgit/linux/kernel/git/djbw/nvdimm.git/commit/?h=libnvdimm-pending&id=3117a24e07fe > I also noticed that reason is not set in the case below. > > if (length < PMD_SIZE > || (pfn_t_to_pfn(dax.pfn) & PG_PMD_COLOUR)) { > dax_unmap_atomic(bdev, &dax); > goto fallback; > } Thanks, I'll fix that up.
On Wed, Dec 2, 2015 at 9:43 AM, Toshi Kani <toshi.kani@hpe.com> wrote:
> Oh, I see. I will setup the memmap array and run the tests again.
>
> But, why does the PMD mapping depend on the memmap array? We have observed
> major performance improvement with PMD. This feature should always be enabled
> with DAX regardless of the option to allocate the memmap array.
>

Several factors drove this decision; I'm open to considering alternatives, but here's the reasoning:

1/ DAX pmd mappings caused crashes in the get_user_pages path, leading to commit e82c9ed41e8 "dax: disable pmd mappings". The reason pte mappings don't crash and instead trigger -EFAULT is the _PAGE_SPECIAL pte bit.

2/ To enable get_user_pages for DAX, in both the page and huge-page case, we need a new pte bit, _PAGE_DEVMAP.

3/ Given that pte bits are hard to come by, I'm assuming we won't get two, i.e. both _PAGE_DEVMAP and a new _PAGE_SPECIAL for pmds. Even if we could get a _PAGE_SPECIAL for pmds, I'm not in favor of pursuing it.

End result is that DAX pmd mappings must be fully enabled through the get_user_pages paths with _PAGE_DEVMAP or turned off completely. In general I think the "page-less" DAX implementation was a good starting point, but we need to shift to page-backed by default until we can teach more of the kernel to operate on bare pfns. That "default" will need to be enforced by userspace tooling.
On Tue, 2015-12-01 at 19:45 -0800, Dan Williams wrote: > On Tue, Dec 1, 2015 at 6:19 PM, Toshi Kani <toshi.kani@hpe.com> wrote: > > On Mon, 2015-11-30 at 14:08 -0800, Dan Williams wrote: > > > On Mon, Nov 23, 2015 at 12:04 PM, Toshi Kani <toshi.kani@hpe.com> wrote: > > > > The following oops was observed when mmap() with MAP_POPULATE > > > > pre-faulted pmd mappings of a DAX file. follow_trans_huge_pmd() > > > > expects that a target address has a struct page. > > > > > > > > BUG: unable to handle kernel paging request at ffffea0012220000 > > > > follow_trans_huge_pmd+0xba/0x390 > > > > follow_page_mask+0x33d/0x420 > > > > __get_user_pages+0xdc/0x800 > > > > populate_vma_page_range+0xb5/0xe0 > > > > __mm_populate+0xc5/0x150 > > > > vm_mmap_pgoff+0xd5/0xe0 > > > > SyS_mmap_pgoff+0x1c1/0x290 > > > > SyS_mmap+0x1b/0x30 > > > > > > > > Fix it by making the PMD pre-fault handling consistent with PTE. > > > > After pre-faulted in faultin_page(), follow_page_mask() calls > > > > follow_trans_huge_pmd(), which is changed to call follow_pfn_pmd() > > > > for VM_PFNMAP or VM_MIXEDMAP. follow_pfn_pmd() handles FOLL_TOUCH > > > > and returns with -EEXIST. > > > > > > > > Reported-by: Mauricio Porto <mauricio.porto@hpe.com> > > > > Signed-off-by: Toshi Kani <toshi.kani@hpe.com> > > > > Cc: Andrew Morton <akpm@linux-foundation.org> > > > > Cc: Kirill A. Shutemov <kirill.shutemov@linux.intel.com> > > > > Cc: Matthew Wilcox <willy@linux.intel.com> > > > > Cc: Dan Williams <dan.j.williams@intel.com> > > > > Cc: Ross Zwisler <ross.zwisler@linux.intel.com> > > > > --- > > > > > > Hey Toshi, > > > > > > I ended up fixing this differently with follow_pmd_devmap() introduced > > > in this series: > > > > > > https://lists.01.org/pipermail/linux-nvdimm/2015-November/003033.html > > > > > > Does the latest libnvdimm-pending branch [1] pass your test case? > > > > Hi Dan, > > > > I ran several test cases, and they all hit the case "pfn not in memmap" in > > __dax_pmd_fault() during mmap(MAP_POPULATE). Looking at the dax.pfn, > > PFN_DEV is > > set but PFN_MAP is not. I have not looked into why, but I thought I let you > > know first. I've also seen the test thread got hung up at the end sometime. > > That PFN_MAP flag will not be set by default for NFIT-defined > persistent memory. See pmem_should_map_pages() for pmem namespaces > that will have it set by default, currently only e820 type-12 memory > ranges. > > NFIT-defined persistent memory can have a memmap array dynamically > allocated by setting up a pfn device (similar to setting up a btt). > We don't map it by default because the NFIT may describe hundreds of > gigabytes of persistent and the overhead of the memmap may be too > large to locate the memmap in ram. Oh, I see. I will setup the memmap array and run the tests again. But, why does the PMD mapping depend on the memmap array? We have observed major performance improvement with PMD. This feature should always be enabled with DAX regardless of the option to allocate the memmap array. Thanks, -Toshi
On Wed, Dec 2, 2015 at 9:01 AM, Dan Williams <dan.j.williams@intel.com> wrote: > On Wed, Dec 2, 2015 at 9:43 AM, Toshi Kani <toshi.kani@hpe.com> wrote: >> Oh, I see. I will setup the memmap array and run the tests again. >> >> But, why does the PMD mapping depend on the memmap array? We have observed >> major performance improvement with PMD. This feature should always be enabled >> with DAX regardless of the option to allocate the memmap array. >> > > Several factors drove this decision, I'm open to considering > alternatives but here's the reasoning: > > 1/ DAX pmd mappings caused crashes in the get_user_pages path leading > to commit e82c9ed41e8 "dax: disable pmd mappings". The reason pte > mappings don't crash and instead trigger -EFAULT is due to the > _PAGE_SPECIAL pte bit. > > 2/ To enable get_user_pages for DAX, in both the page and huge-page > case, we need a new pte bit _PAGE_DEVMAP. > > 3/ Given the pte bits are hard to come I'm assuming we won't get two, > i.e. both _PAGE_DEVMAP and a new _PAGE_SPECIAL for pmds. Even if we > could get a _PAGE_SPECIAL for pmds I'm not in favor of pursuing it. Actually, Dave says they aren't that hard to come by for pmds, so we could go add _PMD_SPECIAL if we really wanted to support the limited page-less DAX-pmd case. But I'm still of the opinion that we run away from the page-less case until it can be made a full class citizen with O_DIRECT for pfn support.
On Wed, Dec 2, 2015 at 11:26 AM, Toshi Kani <toshi.kani@hpe.com> wrote: > On Wed, 2015-12-02 at 10:06 -0800, Dan Williams wrote: >> On Wed, Dec 2, 2015 at 9:01 AM, Dan Williams <dan.j.williams@intel.com> wrote: >> > On Wed, Dec 2, 2015 at 9:43 AM, Toshi Kani <toshi.kani@hpe.com> wrote: >> > > Oh, I see. I will setup the memmap array and run the tests again. >> > > >> > > But, why does the PMD mapping depend on the memmap array? We have >> > > observed major performance improvement with PMD. This feature should >> > > always be enabled with DAX regardless of the option to allocate the memmap >> > > array. >> > > >> > >> > Several factors drove this decision, I'm open to considering >> > alternatives but here's the reasoning: >> > >> > 1/ DAX pmd mappings caused crashes in the get_user_pages path leading >> > to commit e82c9ed41e8 "dax: disable pmd mappings". The reason pte >> > mappings don't crash and instead trigger -EFAULT is due to the >> > _PAGE_SPECIAL pte bit. >> > >> > 2/ To enable get_user_pages for DAX, in both the page and huge-page >> > case, we need a new pte bit _PAGE_DEVMAP. >> > >> > 3/ Given the pte bits are hard to come I'm assuming we won't get two, >> > i.e. both _PAGE_DEVMAP and a new _PAGE_SPECIAL for pmds. Even if we >> > could get a _PAGE_SPECIAL for pmds I'm not in favor of pursuing it. >> >> Actually, Dave says they aren't that hard to come by for pmds, so we >> could go add _PMD_SPECIAL if we really wanted to support the limited >> page-less DAX-pmd case. >> >> But I'm still of the opinion that we run away from the page-less case >> until it can be made a full class citizen with O_DIRECT for pfn >> support. > > I may be missing something, but per vm_normal_page(), I think _PAGE_SPECIAL can > be substituted by the following check when we do not have the memmap. > > if ((vma->vm_flags & VM_PFNMAP) || > ((vma->vm_flags & VM_MIXEDMAP) && (!pfn_valid(pfn)))) { > > This is what I did in this patch for follow_trans_huge_pmd(), although I missed > the pfn_valid() check. That works for __get_user_pages but not __get_user_pages_fast where we don't have access to the vma.
On Wed, 2015-12-02 at 10:06 -0800, Dan Williams wrote: > On Wed, Dec 2, 2015 at 9:01 AM, Dan Williams <dan.j.williams@intel.com> wrote: > > On Wed, Dec 2, 2015 at 9:43 AM, Toshi Kani <toshi.kani@hpe.com> wrote: > > > Oh, I see. I will setup the memmap array and run the tests again. > > > > > > But, why does the PMD mapping depend on the memmap array? We have > > > observed major performance improvement with PMD. This feature should > > > always be enabled with DAX regardless of the option to allocate the memmap > > > array. > > > > > > > Several factors drove this decision, I'm open to considering > > alternatives but here's the reasoning: > > > > 1/ DAX pmd mappings caused crashes in the get_user_pages path leading > > to commit e82c9ed41e8 "dax: disable pmd mappings". The reason pte > > mappings don't crash and instead trigger -EFAULT is due to the > > _PAGE_SPECIAL pte bit. > > > > 2/ To enable get_user_pages for DAX, in both the page and huge-page > > case, we need a new pte bit _PAGE_DEVMAP. > > > > 3/ Given the pte bits are hard to come I'm assuming we won't get two, > > i.e. both _PAGE_DEVMAP and a new _PAGE_SPECIAL for pmds. Even if we > > could get a _PAGE_SPECIAL for pmds I'm not in favor of pursuing it. > > Actually, Dave says they aren't that hard to come by for pmds, so we > could go add _PMD_SPECIAL if we really wanted to support the limited > page-less DAX-pmd case. > > But I'm still of the opinion that we run away from the page-less case > until it can be made a full class citizen with O_DIRECT for pfn > support. I may be missing something, but per vm_normal_page(), I think _PAGE_SPECIAL can be substituted by the following check when we do not have the memmap. if ((vma->vm_flags & VM_PFNMAP) || ((vma->vm_flags & VM_MIXEDMAP) && (!pfn_valid(pfn)))) { This is what I did in this patch for follow_trans_huge_pmd(), although I missed the pfn_valid() check. Thanks, -Toshi
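To make the proposed substitution concrete, here is a hypothetical helper (not part of the posted patch) expressing the check above with the pfn_valid() test included; it follows the spirit of the vma-flag checks that vm_normal_page() applies when a "special" bit is not available:

```c
#include <linux/mm.h>

/*
 * Hypothetical helper, for illustration only: classify a huge-pmd
 * mapping as having no struct page behind it, based on the vma flags.
 */
static bool pmd_is_pfn_mapping(struct vm_area_struct *vma, unsigned long pfn)
{
	/* Raw pfn mappings are handled without struct pages */
	if (vma->vm_flags & VM_PFNMAP)
		return true;

	/* Mixed mappings: treat as a pfn mapping when this pfn has no memmap */
	if ((vma->vm_flags & VM_MIXEDMAP) && !pfn_valid(pfn))
		return true;

	return false;
}
```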
On Wed, Dec 2, 2015 at 12:12 PM, Toshi Kani <toshi.kani@hpe.com> wrote: > On Wed, 2015-12-02 at 13:02 -0700, Toshi Kani wrote: >> On Wed, 2015-12-02 at 11:00 -0800, Dan Williams wrote: >> > On Wed, Dec 2, 2015 at 11:26 AM, Toshi Kani <toshi.kani@hpe.com> wrote: >> > > On Wed, 2015-12-02 at 10:06 -0800, Dan Williams wrote: >> > > > On Wed, Dec 2, 2015 at 9:01 AM, Dan Williams <dan.j.williams@intel.com> >> > > > wrote: >> > > > > On Wed, Dec 2, 2015 at 9:43 AM, Toshi Kani <toshi.kani@hpe.com> wrote: >> > > > > > Oh, I see. I will setup the memmap array and run the tests again. >> > > > > > >> > > > > > But, why does the PMD mapping depend on the memmap array? We have >> > > > > > observed major performance improvement with PMD. This feature >> > > > > > should always be enabled with DAX regardless of the option to >> > > > > > allocate the memmap array. >> > > > > > >> > > > > >> > > > > Several factors drove this decision, I'm open to considering >> > > > > alternatives but here's the reasoning: >> > > > > >> > > > > 1/ DAX pmd mappings caused crashes in the get_user_pages path leading >> > > > > to commit e82c9ed41e8 "dax: disable pmd mappings". The reason pte >> > > > > mappings don't crash and instead trigger -EFAULT is due to the >> > > > > _PAGE_SPECIAL pte bit. >> > > > > >> > > > > 2/ To enable get_user_pages for DAX, in both the page and huge-page >> > > > > case, we need a new pte bit _PAGE_DEVMAP. >> > > > > >> > > > > 3/ Given the pte bits are hard to come I'm assuming we won't get two, >> > > > > i.e. both _PAGE_DEVMAP and a new _PAGE_SPECIAL for pmds. Even if we >> > > > > could get a _PAGE_SPECIAL for pmds I'm not in favor of pursuing it. >> > > > >> > > > Actually, Dave says they aren't that hard to come by for pmds, so we >> > > > could go add _PMD_SPECIAL if we really wanted to support the limited >> > > > page-less DAX-pmd case. >> > > > >> > > > But I'm still of the opinion that we run away from the page-less case >> > > > until it can be made a full class citizen with O_DIRECT for pfn >> > > > support. >> > > >> > > I may be missing something, but per vm_normal_page(), I think >> > > _PAGE_SPECIAL can be substituted by the following check when we do not >> > > have the memmap. >> > > >> > > if ((vma->vm_flags & VM_PFNMAP) || >> > > ((vma->vm_flags & VM_MIXEDMAP) && (!pfn_valid(pfn)))) { >> > > >> > > This is what I did in this patch for follow_trans_huge_pmd(), although I >> > > missed the pfn_valid() check. >> > >> > That works for __get_user_pages but not __get_user_pages_fast where we >> > don't have access to the vma. >> >> __get_user_page_fast already refers current->mm, so we should be able to get >> the vma, and pass it down to gup_pud_range(). > > Alternatively, we can obtain the vma from current->mm in gup_huge_pmd() when the > !pfn_valid() condition is met, so that we do not add the code to the main path > of __get_user_pages_fast. The whole point of __get_user_page_fast() is to avoid the overhead of taking the mm semaphore to access the vma. _PAGE_SPECIAL simply tells __get_user_pages_fast that it needs to fallback to the __get_user_pages slow path.
On Wed, 2015-12-02 at 11:00 -0800, Dan Williams wrote: > On Wed, Dec 2, 2015 at 11:26 AM, Toshi Kani <toshi.kani@hpe.com> wrote: > > On Wed, 2015-12-02 at 10:06 -0800, Dan Williams wrote: > > > On Wed, Dec 2, 2015 at 9:01 AM, Dan Williams <dan.j.williams@intel.com> > > > wrote: > > > > On Wed, Dec 2, 2015 at 9:43 AM, Toshi Kani <toshi.kani@hpe.com> wrote: > > > > > Oh, I see. I will setup the memmap array and run the tests again. > > > > > > > > > > But, why does the PMD mapping depend on the memmap array? We have > > > > > observed major performance improvement with PMD. This feature should > > > > > always be enabled with DAX regardless of the option to allocate the > > > > > memmap > > > > > array. > > > > > > > > > > > > > Several factors drove this decision, I'm open to considering > > > > alternatives but here's the reasoning: > > > > > > > > 1/ DAX pmd mappings caused crashes in the get_user_pages path leading > > > > to commit e82c9ed41e8 "dax: disable pmd mappings". The reason pte > > > > mappings don't crash and instead trigger -EFAULT is due to the > > > > _PAGE_SPECIAL pte bit. > > > > > > > > 2/ To enable get_user_pages for DAX, in both the page and huge-page > > > > case, we need a new pte bit _PAGE_DEVMAP. > > > > > > > > 3/ Given the pte bits are hard to come I'm assuming we won't get two, > > > > i.e. both _PAGE_DEVMAP and a new _PAGE_SPECIAL for pmds. Even if we > > > > could get a _PAGE_SPECIAL for pmds I'm not in favor of pursuing it. > > > > > > Actually, Dave says they aren't that hard to come by for pmds, so we > > > could go add _PMD_SPECIAL if we really wanted to support the limited > > > page-less DAX-pmd case. > > > > > > But I'm still of the opinion that we run away from the page-less case > > > until it can be made a full class citizen with O_DIRECT for pfn > > > support. > > > > I may be missing something, but per vm_normal_page(), I think _PAGE_SPECIAL > > can > > be substituted by the following check when we do not have the memmap. > > > > if ((vma->vm_flags & VM_PFNMAP) || > > ((vma->vm_flags & VM_MIXEDMAP) && (!pfn_valid(pfn)))) { > > > > This is what I did in this patch for follow_trans_huge_pmd(), although I > > missed > > the pfn_valid() check. > > That works for __get_user_pages but not __get_user_pages_fast where we > don't have access to the vma. __get_user_page_fast already refers current->mm, so we should be able to get the vma, and pass it down to gup_pud_range(). Thanks, -Toshi
On Wed, 2015-12-02 at 13:02 -0700, Toshi Kani wrote: > On Wed, 2015-12-02 at 11:00 -0800, Dan Williams wrote: > > On Wed, Dec 2, 2015 at 11:26 AM, Toshi Kani <toshi.kani@hpe.com> wrote: > > > On Wed, 2015-12-02 at 10:06 -0800, Dan Williams wrote: > > > > On Wed, Dec 2, 2015 at 9:01 AM, Dan Williams <dan.j.williams@intel.com> > > > > wrote: > > > > > On Wed, Dec 2, 2015 at 9:43 AM, Toshi Kani <toshi.kani@hpe.com> wrote: > > > > > > Oh, I see. I will setup the memmap array and run the tests again. > > > > > > > > > > > > But, why does the PMD mapping depend on the memmap array? We have > > > > > > observed major performance improvement with PMD. This feature > > > > > > should always be enabled with DAX regardless of the option to > > > > > > allocate the memmap array. > > > > > > > > > > > > > > > > Several factors drove this decision, I'm open to considering > > > > > alternatives but here's the reasoning: > > > > > > > > > > 1/ DAX pmd mappings caused crashes in the get_user_pages path leading > > > > > to commit e82c9ed41e8 "dax: disable pmd mappings". The reason pte > > > > > mappings don't crash and instead trigger -EFAULT is due to the > > > > > _PAGE_SPECIAL pte bit. > > > > > > > > > > 2/ To enable get_user_pages for DAX, in both the page and huge-page > > > > > case, we need a new pte bit _PAGE_DEVMAP. > > > > > > > > > > 3/ Given the pte bits are hard to come I'm assuming we won't get two, > > > > > i.e. both _PAGE_DEVMAP and a new _PAGE_SPECIAL for pmds. Even if we > > > > > could get a _PAGE_SPECIAL for pmds I'm not in favor of pursuing it. > > > > > > > > Actually, Dave says they aren't that hard to come by for pmds, so we > > > > could go add _PMD_SPECIAL if we really wanted to support the limited > > > > page-less DAX-pmd case. > > > > > > > > But I'm still of the opinion that we run away from the page-less case > > > > until it can be made a full class citizen with O_DIRECT for pfn > > > > support. > > > > > > I may be missing something, but per vm_normal_page(), I think > > > _PAGE_SPECIAL can be substituted by the following check when we do not > > > have the memmap. > > > > > > if ((vma->vm_flags & VM_PFNMAP) || > > > ((vma->vm_flags & VM_MIXEDMAP) && (!pfn_valid(pfn)))) { > > > > > > This is what I did in this patch for follow_trans_huge_pmd(), although I > > > missed the pfn_valid() check. > > > > That works for __get_user_pages but not __get_user_pages_fast where we > > don't have access to the vma. > > __get_user_page_fast already refers current->mm, so we should be able to get > the vma, and pass it down to gup_pud_range(). Alternatively, we can obtain the vma from current->mm in gup_huge_pmd() when the !pfn_valid() condition is met, so that we do not add the code to the main path of __get_user_pages_fast. Thanks, -Toshi
On Wed, Dec 2, 2015 at 1:37 PM, Toshi Kani <toshi.kani@hpe.com> wrote: > On Wed, 2015-12-02 at 11:57 -0800, Dan Williams wrote: [..] >> The whole point of __get_user_page_fast() is to avoid the overhead of >> taking the mm semaphore to access the vma. _PAGE_SPECIAL simply tells >> __get_user_pages_fast that it needs to fallback to the >> __get_user_pages slow path. > > I see. Then, I think gup_huge_pmd() can simply return 0 when !pfn_valid(), > instead of VM_BUG_ON. Is pfn_valid() a reliable check? It seems to be based on a max_pfn per node... what happens when pmem is located below that point. I haven't been able to convince myself that we won't get false positives, but maybe I'm missing something.
On Wed, 2015-12-02 at 11:57 -0800, Dan Williams wrote: > On Wed, Dec 2, 2015 at 12:12 PM, Toshi Kani <toshi.kani@hpe.com> wrote: > > On Wed, 2015-12-02 at 13:02 -0700, Toshi Kani wrote: > > > On Wed, 2015-12-02 at 11:00 -0800, Dan Williams wrote: > > > > On Wed, Dec 2, 2015 at 11:26 AM, Toshi Kani <toshi.kani@hpe.com> wrote: > > > > > On Wed, 2015-12-02 at 10:06 -0800, Dan Williams wrote: > > > > > > On Wed, Dec 2, 2015 at 9:01 AM, Dan Williams < > > > > > > dan.j.williams@intel.com> > > > > > > wrote: > > > > > > > On Wed, Dec 2, 2015 at 9:43 AM, Toshi Kani <toshi.kani@hpe.com> > > > > > > > wrote: > > > > > > > > Oh, I see. I will setup the memmap array and run the tests > > > > > > > > again. > > > > > > > > > > > > > > > > But, why does the PMD mapping depend on the memmap array? We > > > > > > > > have observed major performance improvement with PMD. This > > > > > > > > feature should always be enabled with DAX regardless of the > > > > > > > > option to allocate the memmap array. > > > > > > > > > > > > > > > > > > > > > > Several factors drove this decision, I'm open to considering > > > > > > > alternatives but here's the reasoning: > > > > > > > > > > > > > > 1/ DAX pmd mappings caused crashes in the get_user_pages path > > > > > > > leading to commit e82c9ed41e8 "dax: disable pmd mappings". The > > > > > > > reason pte mappings don't crash and instead trigger -EFAULT is due > > > > > > > to the _PAGE_SPECIAL pte bit. > > > > > > > > > > > > > > 2/ To enable get_user_pages for DAX, in both the page and huge > > > > > > > -page case, we need a new pte bit _PAGE_DEVMAP. > > > > > > > > > > > > > > 3/ Given the pte bits are hard to come I'm assuming we won't get > > > > > > > two, i.e. both _PAGE_DEVMAP and a new _PAGE_SPECIAL for pmds. > > > > > > > Even if we could get a _PAGE_SPECIAL for pmds I'm not in favor of > > > > > > > pursuing it. > > > > > > > > > > > > Actually, Dave says they aren't that hard to come by for pmds, so we > > > > > > could go add _PMD_SPECIAL if we really wanted to support the limited > > > > > > page-less DAX-pmd case. > > > > > > > > > > > > But I'm still of the opinion that we run away from the page-less > > > > > > case until it can be made a full class citizen with O_DIRECT for pfn > > > > > > support. > > > > > > > > > > I may be missing something, but per vm_normal_page(), I think > > > > > _PAGE_SPECIAL can be substituted by the following check when we do not > > > > > have the memmap. > > > > > > > > > > if ((vma->vm_flags & VM_PFNMAP) || > > > > > ((vma->vm_flags & VM_MIXEDMAP) && (!pfn_valid(pfn)))) { > > > > > > > > > > This is what I did in this patch for follow_trans_huge_pmd(), although > > > > > I missed the pfn_valid() check. > > > > > > > > That works for __get_user_pages but not __get_user_pages_fast where we > > > > don't have access to the vma. > > > > > > __get_user_page_fast already refers current->mm, so we should be able to > > > get the vma, and pass it down to gup_pud_range(). > > > > Alternatively, we can obtain the vma from current->mm in gup_huge_pmd() when > > the !pfn_valid() condition is met, so that we do not add the code to the > > main path of __get_user_pages_fast. > > The whole point of __get_user_page_fast() is to avoid the overhead of > taking the mm semaphore to access the vma. _PAGE_SPECIAL simply tells > __get_user_pages_fast that it needs to fallback to the > __get_user_pages slow path. I see. Then, I think gup_huge_pmd() can simply return 0 when !pfn_valid(), instead of VM_BUG_ON. Thanks, -Toshi
On Wed, 2015-12-02 at 12:54 -0800, Dan Williams wrote: > On Wed, Dec 2, 2015 at 1:37 PM, Toshi Kani <toshi.kani@hpe.com> wrote: > > On Wed, 2015-12-02 at 11:57 -0800, Dan Williams wrote: > [..] > > > The whole point of __get_user_page_fast() is to avoid the overhead of > > > taking the mm semaphore to access the vma. _PAGE_SPECIAL simply tells > > > __get_user_pages_fast that it needs to fallback to the > > > __get_user_pages slow path. > > > > I see. Then, I think gup_huge_pmd() can simply return 0 when !pfn_valid(), > > instead of VM_BUG_ON. > > Is pfn_valid() a reliable check? It seems to be based on a max_pfn > per node... what happens when pmem is located below that point. I > haven't been able to convince myself that we won't get false > positives, but maybe I'm missing something. I believe we use the version of pfn_valid() in linux/mmzone.h. Thanks, -Toshi
On 12/02/2015 12:54 PM, Dan Williams wrote: > On Wed, Dec 2, 2015 at 1:37 PM, Toshi Kani <toshi.kani@hpe.com> wrote: >> > On Wed, 2015-12-02 at 11:57 -0800, Dan Williams wrote: > [..] >>> >> The whole point of __get_user_page_fast() is to avoid the overhead of >>> >> taking the mm semaphore to access the vma. _PAGE_SPECIAL simply tells >>> >> __get_user_pages_fast that it needs to fallback to the >>> >> __get_user_pages slow path. >> > >> > I see. Then, I think gup_huge_pmd() can simply return 0 when !pfn_valid(), >> > instead of VM_BUG_ON. > Is pfn_valid() a reliable check? It seems to be based on a max_pfn > per node... what happens when pmem is located below that point. I > haven't been able to convince myself that we won't get false > positives, but maybe I'm missing something. With sparsemem at least, it makes sure that you're looking at a valid _section_. See the pfn_valid() at ~include/linux/mmzone.h:1222.
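For reference, the generic SPARSEMEM pfn_valid() being cited is a per-section test, roughly as follows (reproduced approximately, not verbatim, from include/linux/mmzone.h for configurations where the arch does not supply its own pfn_valid()):

```c
/* Approximate form of the generic SPARSEMEM pfn_valid(): validity is
 * decided per memory section, not against a per-node max_pfn. */
static inline int pfn_valid(unsigned long pfn)
{
	if (pfn_to_section_nr(pfn) >= NR_MEM_SECTIONS)
		return 0;
	return valid_section(__nr_to_section(pfn_to_section_nr(pfn)));
}
```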
On Wed, Dec 2, 2015 at 2:00 PM, Dave Hansen <dave.hansen@intel.com> wrote: > On 12/02/2015 12:54 PM, Dan Williams wrote: >> On Wed, Dec 2, 2015 at 1:37 PM, Toshi Kani <toshi.kani@hpe.com> wrote: >>> > On Wed, 2015-12-02 at 11:57 -0800, Dan Williams wrote: >> [..] >>>> >> The whole point of __get_user_page_fast() is to avoid the overhead of >>>> >> taking the mm semaphore to access the vma. _PAGE_SPECIAL simply tells >>>> >> __get_user_pages_fast that it needs to fallback to the >>>> >> __get_user_pages slow path. >>> > >>> > I see. Then, I think gup_huge_pmd() can simply return 0 when !pfn_valid(), >>> > instead of VM_BUG_ON. >> Is pfn_valid() a reliable check? It seems to be based on a max_pfn >> per node... what happens when pmem is located below that point. I >> haven't been able to convince myself that we won't get false >> positives, but maybe I'm missing something. > > With sparsemem at least, it makes sure that you're looking at a valid > _section_. See the pfn_valid() at ~include/linux/mmzone.h:1222. At a minimum we would need to add "depends on SPARSEMEM" to "config FS_DAX_PMD".
On 12/02/2015 02:03 PM, Dan Williams wrote: >>> >> Is pfn_valid() a reliable check? It seems to be based on a max_pfn >>> >> per node... what happens when pmem is located below that point. I >>> >> haven't been able to convince myself that we won't get false >>> >> positives, but maybe I'm missing something. >> > >> > With sparsemem at least, it makes sure that you're looking at a valid >> > _section_. See the pfn_valid() at ~include/linux/mmzone.h:1222. > At a minimum we would need to add "depends on SPARSEMEM" to "config FS_DAX_PMD". Yeah, it seems like an awful layering violation. But, sparsemem is turned on everywhere (all the distros/users) that we care about, as far as I know.
On Wed, Dec 2, 2015 at 4:21 PM, Toshi Kani <toshi.kani@hpe.com> wrote: > On Wed, 2015-12-02 at 10:43 -0700, Toshi Kani wrote: >> On Tue, 2015-12-01 at 19:45 -0800, Dan Williams wrote: >> > On Tue, Dec 1, 2015 at 6:19 PM, Toshi Kani <toshi.kani@hpe.com> wrote: >> > > On Mon, 2015-11-30 at 14:08 -0800, Dan Williams wrote: > : >> > > > >> > > > Hey Toshi, >> > > > >> > > > I ended up fixing this differently with follow_pmd_devmap() introduced >> > > > in this series: >> > > > >> > > > https://lists.01.org/pipermail/linux-nvdimm/2015-November/003033.html >> > > > >> > > > Does the latest libnvdimm-pending branch [1] pass your test case? >> > > >> > > Hi Dan, >> > > >> > > I ran several test cases, and they all hit the case "pfn not in memmap" in >> > > __dax_pmd_fault() during mmap(MAP_POPULATE). Looking at the dax.pfn, >> > > PFN_DEV is set but PFN_MAP is not. I have not looked into why, but I >> > > thought I let you know first. I've also seen the test thread got hung up >> > > at the end sometime. >> > >> > That PFN_MAP flag will not be set by default for NFIT-defined >> > persistent memory. See pmem_should_map_pages() for pmem namespaces >> > that will have it set by default, currently only e820 type-12 memory >> > ranges. >> > >> > NFIT-defined persistent memory can have a memmap array dynamically >> > allocated by setting up a pfn device (similar to setting up a btt). >> > We don't map it by default because the NFIT may describe hundreds of >> > gigabytes of persistent and the overhead of the memmap may be too >> > large to locate the memmap in ram. >> >> Oh, I see. I will setup the memmap array and run the tests again. > > I setup a pfn device, and ran a few test cases again. Yes, it solved the > PFN_MAP issue. However, I am no longer able to allocate FS blocks aligned by > 2MB, so PMD faults fall back to PTE. They are off by 2 pages, which I suspect > due to the pfn metadata.If I pass a 2MB-aligned+2pages virtual address to > mmap(MAP_POPULATE), the mmap() call gets hung up. Ok, I need to switch over from my memmap=ss!nn config. We just need to pad the info block reservation to 2M. As for the MAP_POPULATE hang, I'll take a look. Right now I'm in the process of rebasing the whole set on top of -mm which has a pending THP re-works from Kirill.
On Wed, 2015-12-02 at 10:43 -0700, Toshi Kani wrote: > On Tue, 2015-12-01 at 19:45 -0800, Dan Williams wrote: > > On Tue, Dec 1, 2015 at 6:19 PM, Toshi Kani <toshi.kani@hpe.com> wrote: > > > On Mon, 2015-11-30 at 14:08 -0800, Dan Williams wrote: : > > > > > > > > Hey Toshi, > > > > > > > > I ended up fixing this differently with follow_pmd_devmap() introduced > > > > in this series: > > > > > > > > https://lists.01.org/pipermail/linux-nvdimm/2015-November/003033.html > > > > > > > > Does the latest libnvdimm-pending branch [1] pass your test case? > > > > > > Hi Dan, > > > > > > I ran several test cases, and they all hit the case "pfn not in memmap" in > > > __dax_pmd_fault() during mmap(MAP_POPULATE). Looking at the dax.pfn, > > > PFN_DEV is set but PFN_MAP is not. I have not looked into why, but I > > > thought I let you know first. I've also seen the test thread got hung up > > > at the end sometime. > > > > That PFN_MAP flag will not be set by default for NFIT-defined > > persistent memory. See pmem_should_map_pages() for pmem namespaces > > that will have it set by default, currently only e820 type-12 memory > > ranges. > > > > NFIT-defined persistent memory can have a memmap array dynamically > > allocated by setting up a pfn device (similar to setting up a btt). > > We don't map it by default because the NFIT may describe hundreds of > > gigabytes of persistent and the overhead of the memmap may be too > > large to locate the memmap in ram. > > Oh, I see. I will setup the memmap array and run the tests again. I setup a pfn device, and ran a few test cases again. Yes, it solved the PFN_MAP issue. However, I am no longer able to allocate FS blocks aligned by 2MB, so PMD faults fall back to PTE. They are off by 2 pages, which I suspect due to the pfn metadata. If I pass a 2MB-aligned+2pages virtual address to mmap(MAP_POPULATE), the mmap() call gets hung up. Thanks, -Toshi
On Wed, Dec 2, 2015 at 1:55 PM, Toshi Kani <toshi.kani@hpe.com> wrote: > On Wed, 2015-12-02 at 12:54 -0800, Dan Williams wrote: >> On Wed, Dec 2, 2015 at 1:37 PM, Toshi Kani <toshi.kani@hpe.com> wrote: >> > On Wed, 2015-12-02 at 11:57 -0800, Dan Williams wrote: >> [..] >> > > The whole point of __get_user_page_fast() is to avoid the overhead of >> > > taking the mm semaphore to access the vma. _PAGE_SPECIAL simply tells >> > > __get_user_pages_fast that it needs to fallback to the >> > > __get_user_pages slow path. >> > >> > I see. Then, I think gup_huge_pmd() can simply return 0 when !pfn_valid(), >> > instead of VM_BUG_ON. >> >> Is pfn_valid() a reliable check? It seems to be based on a max_pfn >> per node... what happens when pmem is located below that point. I >> haven't been able to convince myself that we won't get false >> positives, but maybe I'm missing something. > > I believe we use the version of pfn_valid() in linux/mmzone.h. Talking this over with Dave we came to the conclusion that it would be safer to be explicit about the pmd not being mapped. He points out that unless a platform can guarantee that persistent memory is always section aligned we might get false positive pfn_valid() indications. Given the get_user_pages_fast() path is arch specific we can simply have an arch specific pmd bit and not worry about generically enabling a "pmd special" bit for now.
On Thu, 2015-12-03 at 15:43 -0800, Dan Williams wrote: > On Wed, Dec 2, 2015 at 1:55 PM, Toshi Kani <toshi.kani@hpe.com> wrote: > > On Wed, 2015-12-02 at 12:54 -0800, Dan Williams wrote: > > > On Wed, Dec 2, 2015 at 1:37 PM, Toshi Kani <toshi.kani@hpe.com> > > > wrote: > > > > On Wed, 2015-12-02 at 11:57 -0800, Dan Williams wrote: > > > [..] > > > > > The whole point of __get_user_page_fast() is to avoid the > > > > > overhead of taking the mm semaphore to access the vma. > > > > > _PAGE_SPECIAL simply tells > > > > > __get_user_pages_fast that it needs to fallback to the > > > > > __get_user_pages slow path. > > > > > > > > I see. Then, I think gup_huge_pmd() can simply return 0 when > > > > !pfn_valid(), instead of VM_BUG_ON. > > > > > > Is pfn_valid() a reliable check? It seems to be based on a max_pfn > > > per node... what happens when pmem is located below that point. I > > > haven't been able to convince myself that we won't get false > > > positives, but maybe I'm missing something. > > > > I believe we use the version of pfn_valid() in linux/mmzone.h. > > Talking this over with Dave we came to the conclusion that it would be > safer to be explicit about the pmd not being mapped. He points out > that unless a platform can guarantee that persistent memory is always > section aligned we might get false positive pfn_valid() indications. > Given the get_user_pages_fast() path is arch specific we can simply > have an arch specific pmd bit and not worry about generically enabling > a "pmd special" bit for now. Sounds good to me. Thanks! -Toshi
```diff
diff --git a/mm/huge_memory.c b/mm/huge_memory.c
index d5b8920..f56e034 100644
--- a/mm/huge_memory.c
+++ b/mm/huge_memory.c
@@ -1267,6 +1267,32 @@ out_unlock:
 	return ret;
 }
 
+/*
+ * Follow a pmd inserted by vmf_insert_pfn_pmd(). See follow_pfn_pte() for pte.
+ */
+static int follow_pfn_pmd(struct vm_area_struct *vma, unsigned long address,
+		pmd_t *pmd, unsigned int flags)
+{
+	/* No page to get reference */
+	if (flags & FOLL_GET)
+		return -EFAULT;
+
+	if (flags & FOLL_TOUCH) {
+		pmd_t entry = *pmd;
+
+		/* Set the dirty bit per follow_trans_huge_pmd() */
+		entry = pmd_mkyoung(pmd_mkdirty(entry));
+
+		if (!pmd_same(*pmd, entry)) {
+			set_pmd_at(vma->vm_mm, address, pmd, entry);
+			update_mmu_cache_pmd(vma, address, pmd);
+		}
+	}
+
+	/* Proper page table entry exists, but no corresponding struct page */
+	return -EEXIST;
+}
+
 struct page *follow_trans_huge_pmd(struct vm_area_struct *vma,
 				   unsigned long addr,
 				   pmd_t *pmd,
@@ -1274,6 +1300,7 @@ struct page *follow_trans_huge_pmd(struct vm_area_struct *vma,
 {
 	struct mm_struct *mm = vma->vm_mm;
 	struct page *page = NULL;
+	int ret;
 
 	assert_spin_locked(pmd_lockptr(mm, pmd));
 
@@ -1288,6 +1315,13 @@ struct page *follow_trans_huge_pmd(struct vm_area_struct *vma,
 	if ((flags & FOLL_NUMA) && pmd_protnone(*pmd))
 		goto out;
 
+	/* pfn map does not have a struct page */
+	if (vma->vm_flags & (VM_PFNMAP | VM_MIXEDMAP)) {
+		ret = follow_pfn_pmd(vma, addr, pmd, flags);
+		page = ERR_PTR(ret);
+		goto out;
+	}
+
 	page = pmd_page(*pmd);
 	VM_BUG_ON_PAGE(!PageHead(page), page);
 	if (flags & FOLL_TOUCH) {
```
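For context on how the -EEXIST return is consumed: the intent is that __get_user_pages() treats it the same way as the pte-level follow_pfn_pte() case from the change referenced as [1] earlier in the thread. A simplified, paraphrased sketch of that caller-side handling (not verbatim mm/gup.c):

```c
	/* Simplified sketch of the __get_user_pages() caller side: an
	 * -EEXIST result means a valid page table entry is already in
	 * place but there is no struct page to pin, so just move on. */
	page = follow_page_mask(vma, start, foll_flags, &page_mask);
	if (IS_ERR(page) && PTR_ERR(page) == -EEXIST)
		goto next_page;
```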
The following oops was observed when mmap() with MAP_POPULATE
pre-faulted pmd mappings of a DAX file. follow_trans_huge_pmd()
expects that a target address has a struct page.

  BUG: unable to handle kernel paging request at ffffea0012220000
  follow_trans_huge_pmd+0xba/0x390
  follow_page_mask+0x33d/0x420
  __get_user_pages+0xdc/0x800
  populate_vma_page_range+0xb5/0xe0
  __mm_populate+0xc5/0x150
  vm_mmap_pgoff+0xd5/0xe0
  SyS_mmap_pgoff+0x1c1/0x290
  SyS_mmap+0x1b/0x30

Fix it by making the PMD pre-fault handling consistent with PTE.
After pre-faulted in faultin_page(), follow_page_mask() calls
follow_trans_huge_pmd(), which is changed to call follow_pfn_pmd()
for VM_PFNMAP or VM_MIXEDMAP. follow_pfn_pmd() handles FOLL_TOUCH
and returns with -EEXIST.

Reported-by: Mauricio Porto <mauricio.porto@hpe.com>
Signed-off-by: Toshi Kani <toshi.kani@hpe.com>
Cc: Andrew Morton <akpm@linux-foundation.org>
Cc: Kirill A. Shutemov <kirill.shutemov@linux.intel.com>
Cc: Matthew Wilcox <willy@linux.intel.com>
Cc: Dan Williams <dan.j.williams@intel.com>
Cc: Ross Zwisler <ross.zwisler@linux.intel.com>
---
 mm/huge_memory.c | 34 ++++++++++++++++++++++++++++++++++
 1 file changed, 34 insertions(+)