Message ID | 1459951089-14911-1-git-send-email-toshi.kani@hpe.com (mailing list archive) |
---|---|
State | New, archived |
On Wed, Apr 06, 2016 at 07:58:09AM -0600, Toshi Kani wrote:
> When CONFIG_FS_DAX_PMD is set, DAX supports mmap() using PMD page
> size. This feature relies on both mmap virtual address and FS
> block data (i.e. physical address) to be aligned by the PMD page
> size. Users can use mkfs options to specify FS to align block
> allocations. However, aligning mmap() address requires application
> changes to mmap() calls, such as:
>
>  -  /* let the kernel to assign a mmap addr */
>  -  mptr = mmap(NULL, fsize, PROT_READ|PROT_WRITE, FLAGS, fd, 0);
>
>  +  /* 1. obtain a PMD-aligned virtual address */
>  +  ret = posix_memalign(&mptr, PMD_SIZE, fsize);
>  +  if (!ret)
>  +    free(mptr);  /* 2. release the virt addr */
>  +
>  +  /* 3. then pass the PMD-aligned virt addr to mmap() */
>  +  mptr = mmap(mptr, fsize, PROT_READ|PROT_WRITE, FLAGS, fd, 0);
>
> These changes add unnecessary dependency to DAX and PMD page size
> into application code. The kernel should assign a mmap address
> appropriate for the operation.

I question the need for this patch. Choosing an appropriate base address
is the least of the changes needed for an application to take advantage
of DAX. The NVML chooses appropriate addresses and gets a properly aligned
address without any kernel code.

> Change arch_get_unmapped_area() and arch_get_unmapped_area_topdown()
> to request PMD_SIZE alignment when the request is for a DAX file and
> its mapping range is large enough for using a PMD page.

I think this is the wrong place for it, if we decide that this is the
right thing to do. The filesystem has a get_unmapped_area() which
should be used instead.

> @@ -157,6 +157,13 @@ arch_get_unmapped_area(struct file *filp, unsigned long addr,
>  		info.align_mask = get_align_mask();
>  		info.align_offset += get_align_bits();
>  	}
> +	if (filp && IS_ENABLED(CONFIG_FS_DAX_PMD) && IS_DAX(file_inode(filp))) {

And there's never a need for the IS_ENABLED. IS_DAX() compiles to '0' if
CONFIG_FS_DAX is disabled.

And where would this end? Would you also change this code to look for
1GB entries if CONFIG_FS_DAX_PUD is enabled? Far better to have this
in the individual filesystem (probably calling a common helper in the DAX
code).
On Wed, 2016-04-06 at 12:50 -0400, Matthew Wilcox wrote:
> On Wed, Apr 06, 2016 at 07:58:09AM -0600, Toshi Kani wrote:
> >
> > When CONFIG_FS_DAX_PMD is set, DAX supports mmap() using PMD page
> > size. This feature relies on both mmap virtual address and FS
> > block data (i.e. physical address) to be aligned by the PMD page
> > size. Users can use mkfs options to specify FS to align block
> > allocations. However, aligning mmap() address requires application
> > changes to mmap() calls, such as:
> >
> >  -  /* let the kernel to assign a mmap addr */
> >  -  mptr = mmap(NULL, fsize, PROT_READ|PROT_WRITE, FLAGS, fd, 0);
> >
> >  +  /* 1. obtain a PMD-aligned virtual address */
> >  +  ret = posix_memalign(&mptr, PMD_SIZE, fsize);
> >  +  if (!ret)
> >  +    free(mptr);  /* 2. release the virt addr */
> >  +
> >  +  /* 3. then pass the PMD-aligned virt addr to mmap() */
> >  +  mptr = mmap(mptr, fsize, PROT_READ|PROT_WRITE, FLAGS, fd, 0);
> >
> > These changes add unnecessary dependency to DAX and PMD page size
> > into application code. The kernel should assign a mmap address
> > appropriate for the operation.
>
> I question the need for this patch. Choosing an appropriate base address
> is the least of the changes needed for an application to take advantage
> of DAX.

An application also needs to make sure that a given range [base -
base+size] is free in VMA. The above example uses posix_memalign() to find
such a range, which in turn calls mmap() with size as (fsize + PMD_SIZE) in
this case.

> The NVML chooses appropriate addresses and gets a properly aligned
> address without any kernel code.

An application like NVML can continue to specify a specific address to
mmap(). Most existing applications, however, do not specify an address to
mmap(). With this patch, specifying an address will remain optional.

> > Change arch_get_unmapped_area() and arch_get_unmapped_area_topdown()
> > to request PMD_SIZE alignment when the request is for a DAX file and
> > its mapping range is large enough for using a PMD page.
>
> I think this is the wrong place for it, if we decide that this is the
> right thing to do. The filesystem has a get_unmapped_area() which
> should be used instead.

Yes, I considered adding a filesystem entry point, but decided going this
way because:
 - arch_get_unmapped_area() and arch_get_unmapped_area_topdown() are
   arch-specific code. Therefore, this filesystem entry point will need
   arch-specific implementation.
 - There is nothing filesystem specific about requesting PMD alignment.

> >
> > @@ -157,6 +157,13 @@ arch_get_unmapped_area(struct file *filp, unsigned long addr,
> >  		info.align_mask = get_align_mask();
> >  		info.align_offset += get_align_bits();
> >  	}
> > +	if (filp && IS_ENABLED(CONFIG_FS_DAX_PMD) && IS_DAX(file_inode(filp))) {
>
> And there's never a need for the IS_ENABLED. IS_DAX() compiles to '0' if
> CONFIG_FS_DAX is disabled.

CONFIG_FS_DAX_PMD can be disabled while CONFIG_FS_DAX is enabled.

> And where would this end? Would you also change this code to look for
> 1GB entries if CONFIG_FS_DAX_PUD is enabled? Far better to have this
> in the individual filesystem (probably calling a common helper in the DAX
> code).

Yes, it can be easily extended to support PUD. This avoids another round of
application changes to align with the PUD size. If the PUD support turns
out to be filesystem specific, we may need a capability bit in addition to
CONFIG_FS_DAX_PUD.

Thanks,
-Toshi
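For readers following the userspace side of the argument: the posix_memalign()/free()/mmap() sequence quoted above leaves a window in which another thread can take the released range. A common alternative is to reserve a padded range with mmap() first and then map the file over it with MAP_FIXED. The following is only a sketch under stated assumptions, not code from this thread or from NVML: `mmap_pmd_aligned()`, `align_up()` and `PMD_SIZE_ASSUMED` (2 MiB, the x86-64 PMD size with 4 KiB base pages) are hypothetical names, and the file offset is fixed at 0 as in the example above.

```c
#ifndef _DEFAULT_SOURCE
#define _DEFAULT_SOURCE /* for MAP_ANONYMOUS on glibc */
#endif
#include <assert.h>
#include <errno.h>
#include <stddef.h>
#include <stdint.h>
#include <sys/mman.h>
#include <unistd.h>

/* Assumption: 2 MiB PMD size (x86-64 with 4 KiB base pages). */
#define PMD_SIZE_ASSUMED ((size_t)2 << 20)

/* Round value up to a multiple of align; align must be a power of two. */
uintptr_t align_up(uintptr_t value, uintptr_t align)
{
	assert(align != 0 && (align & (align - 1)) == 0);
	return (value + align - 1) & ~(align - 1);
}

/*
 * Hypothetical helper: map len bytes of fd (file offset 0) at a
 * PMD-aligned address. Returns MAP_FAILED on error, like mmap().
 */
void *mmap_pmd_aligned(int fd, size_t len, int prot, int flags)
{
	size_t page = (size_t)sysconf(_SC_PAGESIZE);
	size_t map_len, padded, head, tail;
	char *resv, *aligned;

	if (len == 0 || len > SIZE_MAX - PMD_SIZE_ASSUMED - page) {
		errno = EINVAL;
		return MAP_FAILED;
	}
	map_len = align_up(len, page);       /* the kernel maps whole pages */
	padded = map_len + PMD_SIZE_ASSUMED; /* room for an aligned window */

	/* 1. Reserve address space only; nothing is accessible yet. */
	resv = mmap(NULL, padded, PROT_NONE,
		    MAP_PRIVATE | MAP_ANONYMOUS, -1, 0);
	if (resv == MAP_FAILED)
		return MAP_FAILED;

	/* 2. First PMD-aligned address inside the reservation. */
	aligned = (char *)align_up((uintptr_t)resv, PMD_SIZE_ASSUMED);
	head = (size_t)(aligned - resv);
	tail = padded - head - map_len;

	/* 3. Map the file over the reservation; MAP_FIXED replaces it. */
	if (mmap(aligned, len, prot, flags | MAP_FIXED, fd, 0) == MAP_FAILED) {
		int saved = errno;

		munmap(resv, padded);
		errno = saved;
		return MAP_FAILED;
	}

	/* 4. Release the unused parts before and after the mapping. */
	if (head)
		munmap(resv, head);
	if (tail)
		munmap(aligned + map_len, tail);
	return aligned;
}
```

A caller would replace `mmap(NULL, fsize, PROT_READ|PROT_WRITE, FLAGS, fd, 0)` with `mmap_pmd_aligned(fd, fsize, PROT_READ|PROT_WRITE, FLAGS)`. This still hard-codes the PMD size in the application, which is the dependency the patch description objects to.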
On Wed, Apr 06, 2016 at 11:44:32AM -0600, Toshi Kani wrote:
> > The NVML chooses appropriate addresses and gets a properly aligned
> > address without any kernel code.
>
> An application like NVML can continue to specify a specific address to
> mmap(). Most existing applications, however, do not specify an address to
> mmap(). With this patch, specifying an address will remain optional.

The point is that this *can* be done in userspace. You need to sell us
on the advantages of doing it in the kernel.

> > I think this is the wrong place for it, if we decide that this is the
> > right thing to do. The filesystem has a get_unmapped_area() which
> > should be used instead.
>
> Yes, I considered adding a filesystem entry point, but decided going this
> way because:
>  - arch_get_unmapped_area() and arch_get_unmapped_area_topdown() are
>    arch-specific code. Therefore, this filesystem entry point will need
>    arch-specific implementation.
>  - There is nothing filesystem specific about requesting PMD alignment.

See http://article.gmane.org/gmane.linux.kernel.mm/149227 for Hugh's
approach for shmem. I strongly believe that if we're going to do this
in the kernel, we should build on this approach, and not hack something
into each architecture's generic get_unmapped_area.
On Thu, 2016-04-07 at 13:41 -0400, Matthew Wilcox wrote:
> On Wed, Apr 06, 2016 at 11:44:32AM -0600, Toshi Kani wrote:
> > >
> > > The NVML chooses appropriate addresses and gets a properly aligned
> > > address without any kernel code.
> >
> > An application like NVML can continue to specify a specific address to
> > mmap(). Most existing applications, however, do not specify an address
> > to mmap(). With this patch, specifying an address will remain
> > optional.
>
> The point is that this *can* be done in userspace. You need to sell us
> on the advantages of doing it in the kernel.

Sure. As I said, the point is that we do not need to modify existing
applications for using DAX PMD mappings.

For instance, fio with "ioengine=mmap" performs I/Os with mmap().
https://github.com/caius/fio/blob/master/engines/mmap.c

With this change, unmodified fio can be used for testing with DAX PMD
mappings. There are many examples like this, and I do not think we want to
modify all applications that we want to evaluate/test with.

> > > I think this is the wrong place for it, if we decide that this is the
> > > right thing to do. The filesystem has a get_unmapped_area() which
> > > should be used instead.
> >
> > Yes, I considered adding a filesystem entry point, but decided going
> > this way because:
> >  - arch_get_unmapped_area() and arch_get_unmapped_area_topdown() are
> >    arch-specific code. Therefore, this filesystem entry point will need
> >    arch-specific implementation.
> >  - There is nothing filesystem specific about requesting PMD alignment.
>
> See http://article.gmane.org/gmane.linux.kernel.mm/149227 for Hugh's
> approach for shmem. I strongly believe that if we're going to do this
> in the kernel, we should build on this approach, and not hack something
> into each architecture's generic get_unmapped_area.

Thanks for the pointer. Yes, we can call current->mm->get_unmapped_area()
with size + PMD_SIZE, and adjust with the alignment in a filesystem entry
point. I will update the patch with this approach.

-Toshi
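The "search with size + PMD_SIZE, then adjust" idea comes down to a small piece of address arithmetic. The sketch below is a userspace illustration of one way that adjustment could be computed, not the updated patch itself: `adjust_to_alignment()` is a hypothetical name, `start` stands for the address returned by a search for `len + align` bytes, and `off` is the byte offset into the file.

```c
#include <assert.h>
#include <stdint.h>

/*
 * Hypothetical helper. Given the start of a free range that is at least
 * len + align bytes long, return the first address >= start that has the
 * same position within an align-sized block as the file offset off.
 * With off == 0 this is simply the next align-aligned address.
 * align must be a power of two.
 */
uint64_t adjust_to_alignment(uint64_t start, uint64_t off, uint64_t align)
{
	assert(align != 0 && (align & (align - 1)) == 0);
	/* Unsigned wraparound is intended: only the low bits are kept. */
	return start + ((off - start) & (align - 1));
}
```

Because the adjustment is always smaller than `align`, the range `[result, result + len)` stays inside the padded range that the search returned. Matching the address to the file offset modulo the PMD size is what lets the virtual address and the file block be PMD-aligned at the same points.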
diff --git a/arch/x86/kernel/sys_x86_64.c b/arch/x86/kernel/sys_x86_64.c
index 10e0272..a294c66 100644
--- a/arch/x86/kernel/sys_x86_64.c
+++ b/arch/x86/kernel/sys_x86_64.c
@@ -157,6 +157,13 @@ arch_get_unmapped_area(struct file *filp, unsigned long addr,
 		info.align_mask = get_align_mask();
 		info.align_offset += get_align_bits();
 	}
+	if (filp && IS_ENABLED(CONFIG_FS_DAX_PMD) && IS_DAX(file_inode(filp))) {
+		unsigned long off_end = info.align_offset + len;
+		unsigned long off_pmd = round_up(info.align_offset, PMD_SIZE);
+
+		if ((off_end > off_pmd) && ((off_end - off_pmd) >= PMD_SIZE))
+			info.align_mask |= (PMD_SIZE - 1);
+	}
 	return vm_unmapped_area(&info);
 }
 
@@ -200,6 +207,13 @@ arch_get_unmapped_area_topdown(struct file *filp, const unsigned long addr0,
 		info.align_mask = get_align_mask();
 		info.align_offset += get_align_bits();
 	}
+	if (filp && IS_ENABLED(CONFIG_FS_DAX_PMD) && IS_DAX(file_inode(filp))) {
+		unsigned long off_end = info.align_offset + len;
+		unsigned long off_pmd = round_up(info.align_offset, PMD_SIZE);
+
+		if ((off_end > off_pmd) && ((off_end - off_pmd) >= PMD_SIZE))
+			info.align_mask |= (PMD_SIZE - 1);
+	}
 	addr = vm_unmapped_area(&info);
 	if (!(addr & ~PAGE_MASK))
 		return addr;
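The condition added in both hunks can be checked in isolation. The following is a userspace restatement of that test, written as a sketch: `range_covers_pmd()` and `round_up_pow2()` are hypothetical names, and `offset` is assumed to play the role of `info.align_offset` (taken here to be the byte offset of the mapping within the file).

```c
#include <assert.h>
#include <stdbool.h>
#include <stdint.h>

/* Round value up to a multiple of align; align must be a power of two. */
uint64_t round_up_pow2(uint64_t value, uint64_t align)
{
	assert(align != 0 && (align & (align - 1)) == 0);
	return (value + align - 1) & ~(align - 1);
}

/*
 * Hypothetical helper mirroring the test in the patch: does the range
 * [offset, offset + len) contain at least one whole, pmd_size-aligned
 * block of the file?
 */
bool range_covers_pmd(uint64_t offset, uint64_t len, uint64_t pmd_size)
{
	uint64_t off_end = offset + len;                     /* end of the range */
	uint64_t off_pmd = round_up_pow2(offset, pmd_size);  /* first PMD boundary */

	/* The first comparison guards the unsigned subtraction in the second. */
	return off_end > off_pmd && off_end - off_pmd >= pmd_size;
}
```

With a 2 MiB PMD, a 2 MiB mapping at file offset 0 qualifies, while the same length starting at offset 4096 does not, because the first PMD boundary at 2 MiB leaves only 4096 bytes before the range ends.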
When CONFIG_FS_DAX_PMD is set, DAX supports mmap() using PMD page
size. This feature relies on both mmap virtual address and FS
block data (i.e. physical address) to be aligned by the PMD page
size. Users can use mkfs options to specify FS to align block
allocations. However, aligning mmap() address requires application
changes to mmap() calls, such as:

 -  /* let the kernel to assign a mmap addr */
 -  mptr = mmap(NULL, fsize, PROT_READ|PROT_WRITE, FLAGS, fd, 0);

 +  /* 1. obtain a PMD-aligned virtual address */
 +  ret = posix_memalign(&mptr, PMD_SIZE, fsize);
 +  if (!ret)
 +    free(mptr);  /* 2. release the virt addr */
 +
 +  /* 3. then pass the PMD-aligned virt addr to mmap() */
 +  mptr = mmap(mptr, fsize, PROT_READ|PROT_WRITE, FLAGS, fd, 0);

These changes add unnecessary dependency to DAX and PMD page size
into application code. The kernel should assign a mmap address
appropriate for the operation.

Change arch_get_unmapped_area() and arch_get_unmapped_area_topdown()
to request PMD_SIZE alignment when the request is for a DAX file and
its mapping range is large enough for using a PMD page.

Signed-off-by: Toshi Kani <toshi.kani@hpe.com>
Cc: Thomas Gleixner <tglx@linutronix.de>
Cc: Ingo Molnar <mingo@kernel.org>
Cc: H. Peter Anvin <hpa@zytor.com>
Cc: Borislav Petkov <bp@suse.de>
Cc: Dan Williams <dan.j.williams@intel.com>
Cc: Matthew Wilcox <willy@linux.intel.com>
Cc: Kirill A. Shutemov <kirill.shutemov@linux.intel.com>
---
 arch/x86/kernel/sys_x86_64.c | 14 ++++++++++++++
 1 file changed, 14 insertions(+)