[RFC] mm, THP: Map read-only text segments using large THP pages

Message ID 5BB682E1-DD52-4AA9-83E9-DEF091E0C709@oracle.com (mailing list archive)
State New, archived

Commit Message

William Kucharski May 14, 2018, 1:12 p.m. UTC
One of the downsides of THP as currently implemented is that it only supports
large page mappings for anonymous pages.

I embarked upon this prototype on the theory that it would be advantageous to 
be able to map large ranges of read-only text pages using THP as well.

The idea is that the kernel will attempt to allocate and map the range using a 
PMD sized THP page upon first fault; if the allocation is successful the page 
will be populated (at present using a call to kernel_read()) and the page will 
be mapped at the PMD level. If memory allocation fails, the page fault routines 
will drop through to the conventional PAGESIZE-oriented routines for mapping 
the faulting page.
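
As a rough illustration of the decision described above (this is not the patch's code), the following userspace C sketch models when a fault would be served with a PMD mapping and when it would drop through to ordinary pages. The names choose_mapping, MAP_WITH_PMD and MAP_WITH_PTES are hypothetical, the 2M PMD size assumes x86-64, and the huge_alloc_ok flag merely stands in for the outcome of the kernel's huge page allocation.

```c
#include <assert.h>
#include <stdbool.h>
#include <stdint.h>

/* Assumption: x86-64 with 4K base pages and 2M PMD-sized pages. */
#define SKETCH_PMD_SIZE ((uint64_t)1 << 21)
#define SKETCH_PMD_MASK (~(SKETCH_PMD_SIZE - 1))

/* Hypothetical result type: how the faulting address ends up mapped. */
enum map_kind { MAP_WITH_PTES, MAP_WITH_PMD };

/*
 * Decide how to map the page containing fault_addr, which must lie in
 * the VMA [vma_start, vma_end).  huge_alloc_ok models whether the
 * PMD-sized allocation succeeded.
 */
static enum map_kind choose_mapping(uint64_t fault_addr, uint64_t vma_start,
                                    uint64_t vma_end, bool huge_alloc_ok)
{
	/* Start of the PMD-sized block that contains the fault. */
	uint64_t base = fault_addr & SKETCH_PMD_MASK;

	assert(vma_start <= fault_addr && fault_addr < vma_end);

	/* The whole PMD-sized block must sit inside the VMA. */
	if (base < vma_start || vma_end - base < SKETCH_PMD_SIZE)
		return MAP_WITH_PTES;

	/* Allocation failure falls through to the normal page-sized path. */
	if (!huge_alloc_ok)
		return MAP_WITH_PTES;

	return MAP_WITH_PMD;
}
```

With the test program's text segment (0x400000 to 0x1d0c19f0), a fault near the start of the segment qualifies for a PMD mapping, while a fault in the final partial 2M block does not.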

Since this approach will map a PMD size block of the memory map at a time, we 
should see a slight uptick in time spent in disk I/O but a substantial drop in 
page faults as well as a reduction in iTLB misses as address ranges will be 
mapped with the larger page. Analysis of a test program that consists of a very 
large text area (483,138,032 bytes in size) that thrashes D$ and I$ shows this 
does occur and there is a slight reduction in program execution time.

The text segment as seen from readelf:

LOAD           0x0000000000000000 0x0000000000400000 0x0000000000400000
               0x000000001ccc19f0 0x000000001ccc19f0  R E    0x200000

As currently implemented for test purposes, the prototype will only use large 
pages to map an executable with a particular filename ("testr"), enabling easy 
comparison of the same executable using 4K and 2M (x64) pages on the same 
kernel. It is understood that this is just a proof of concept implementation 
and much more work regarding enabling the feature and overall system usage of 
it would need to be done before it was submitted as a kernel patch. However, I 
felt it would be worthy to send it out as an RFC so I can find out whether 
there are huge objections from the community to doing this at all, or a better 
understanding of the major concerns that must be assuaged before it would even 
be considered. I currently hardcode CONFIG_TRANSPARENT_HUGEPAGE to the 
equivalent of "always" and bypass some checks for anonymous pages by simply 
#ifdefing the code out; obviously I would need to determine the right thing to 
do in those cases.

Current comparisons of 4K vs 2M pages as generated by "perf stat -d -d -d -r10" 
follow; the 4K pagesize program was named "foo" and the 2M pagesize program 
"testr" (as noted above) - please note that these numbers do vary from run to 
run, but the orders of magnitude of the differences between the two versions 
remain relatively constant:

4K Pages:

Comments

Christoph Lameter (Ampere) May 14, 2018, 3:19 p.m. UTC | #1
On Mon, 14 May 2018, William Kucharski wrote:

> The idea is that the kernel will attempt to allocate and map the range using a
> PMD sized THP page upon first fault; if the allocation is successful the page
> will be populated (at present using a call to kernel_read()) and the page will
> be mapped at the PMD level. If memory allocation fails, the page fault routines
> will drop through to the conventional PAGESIZE-oriented routines for mapping
> the faulting page.

Cool. This could be controlled by the faultaround logic right? If we get
fault_around_bytes up to huge page size then it is reasonable to use a
huge page directly.

fault_around_bytes can be set via sysfs so there is a natural way to
control this feature there I think.


> Since this approach will map a PMD size block of the memory map at a time, we
> should see a slight uptick in time spent in disk I/O but a substantial drop in
> page faults as well as a reduction in iTLB misses as address ranges will be
> mapped with the larger page. Analysis of a test program that consists of a very
> large text area (483,138,032 bytes in size) that thrashes D$ and I$ shows this
> does occur and there is a slight reduction in program execution time.

I think we would also want such a feature for regular writable pages as
soon as possible.

William Kucharski May 15, 2018, 6:59 a.m. UTC | #2
> On May 14, 2018, at 9:19 AM, Christopher Lameter <cl@linux.com> wrote:
> 
> Cool. This could be controlled by the faultaround logic right? If we get
> fault_around_bytes up to huge page size then it is reasonable to use a
> huge page directly.

It isn't presently but certainly could be; for the prototype it tries to
map a large page when needed and, should that fail, it will fall through
to the normal fault around code.

I would think we would want a separate parameter, as I can see the usefulness
of more fine-grained control. Many users may want to try mapping a large page
if possible, but would prefer a smaller number of bytes to be read in fault
around should we need to fall back to using PAGESIZE pages.

> fault_around_bytes can be set via sysfs so there is a natural way to
> control this feature there I think.

I agree; perhaps I could use "fault_around_thp_bytes" or something similar.
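
One way to picture the two-knob idea is the C sketch below. It is purely illustrative: fault_around_thp_bytes does not exist in the kernel, the struct and helper names are invented here, and the 4K/2M sizes assume x86-64. The point is only that the "try a large page" threshold and the fallback read-around size can be tuned independently.

```c
#include <assert.h>
#include <stdbool.h>
#include <stdint.h>

/* Assumption: x86-64 page geometry. */
#define SKETCH_PAGE_SHIFT 12
#define SKETCH_PMD_SIZE   ((uint64_t)1 << 21)

/* Hypothetical pair of tunables discussed in this thread. */
struct fault_around_policy {
	uint64_t fault_around_bytes;     /* existing knob, used on fallback */
	uint64_t fault_around_thp_bytes; /* proposed knob, not a real interface */
};

/* Try a PMD mapping only if the THP knob allows at least one PMD-sized block. */
static bool policy_wants_pmd(const struct fault_around_policy *p)
{
	assert(p);
	return p->fault_around_thp_bytes >= SKETCH_PMD_SIZE;
}

/* Pages to map around the fault when falling back to page-sized mappings. */
static uint64_t policy_fallback_pages(const struct fault_around_policy *p)
{
	uint64_t pages;

	assert(p);
	pages = p->fault_around_bytes >> SKETCH_PAGE_SHIFT;
	return pages ? pages : 1;	/* always map at least the faulting page */
}
```

A user could then set the THP knob to 2M while keeping the fallback at, say, 64K (16 pages), which is the kind of fine-grained control argued for above.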

>> Since this approach will map a PMD size block of the memory map at a time, we
>> should see a slight uptick in time spent in disk I/O but a substantial drop in
>> page faults as well as a reduction in iTLB misses as address ranges will be
>> mapped with the larger page. Analysis of a test program that consists of a very
>> large text area (483,138,032 bytes in size) that thrashes D$ and I$ shows this
>> does occur and there is a slight reduction in program execution time.
> 
> I think we would also want such a feature for regular writable pages as
> soon as possible.

That is my ultimate long-term goal for this project - full r/w support of large
THP pages; prototyping with read-only text pages seemed like the best first step
to get a sense of the possible benefits.

  -- Bill

Michal Hocko May 17, 2018, 7:57 a.m. UTC | #3
[CCing Kirill and fs-devel]

On Mon 14-05-18 07:12:13, William Kucharski wrote:
> One of the downsides of THP as currently implemented is that it only supports
> large page mappings for anonymous pages.

There is a support for shmem merged already. ext4 was next on the plan
AFAIR but I haven't seen any patches and Kirill was busy with other
stuff IIRC.

> I embarked upon this prototype on the theory that it would be advantageous to 
> be able to map large ranges of read-only text pages using THP as well.

Can the fs really support THP only for read mappings? What if those
pages are to be shared in a writable mapping as well? In other words
can this all work without a full THP support for a particular fs?

Keeping the rest of the email for new CC.

> The idea is that the kernel will attempt to allocate and map the range using a 
> PMD sized THP page upon first fault; if the allocation is successful the page 
> will be populated (at present using a call to kernel_read()) and the page will 
> be mapped at the PMD level. If memory allocation fails, the page fault routines 
> will drop through to the conventional PAGESIZE-oriented routines for mapping 
> the faulting page.
> 
> Since this approach will map a PMD size block of the memory map at a time, we 
> should see a slight uptick in time spent in disk I/O but a substantial drop in 
> page faults as well as a reduction in iTLB misses as address ranges will be 
> mapped with the larger page. Analysis of a test program that consists of a very 
> large text area (483,138,032 bytes in size) that thrashes D$ and I$ shows this 
> does occur and there is a slight reduction in program execution time.
> 
> The text segment as seen from readelf:
> 
> LOAD           0x0000000000000000 0x0000000000400000 0x0000000000400000
>                0x000000001ccc19f0 0x000000001ccc19f0  R E    0x200000
> 
> As currently implemented for test purposes, the prototype will only use large 
> pages to map an executable with a particular filename ("testr"), enabling easy 
> comparison of the same executable using 4K and 2M (x64) pages on the same 
> kernel. It is understood that this is just a proof of concept implementation 
> and much more work regarding enabling the feature and overall system usage of 
> it would need to be done before it was submitted as a kernel patch. However, I 
> felt it would be worthy to send it out as an RFC so I can find out whether 
> there are huge objections from the community to doing this at all, or a better 
> understanding of the major concerns that must be assuaged before it would even 
> be considered. I currently hardcode CONFIG_TRANSPARENT_HUGEPAGE to the 
> equivalent of "always" and bypass some checks for anonymous pages by simply 
> #ifdefing the code out; obviously I would need to determine the right thing to 
> do in those cases.
> 
> Current comparisons of 4K vs 2M pages as generated by "perf stat -d -d -d -r10" 
> follow; the 4K pagesize program was named "foo" and the 2M pagesize program 
> "testr" (as noted above) - please note that these numbers do vary from run to 
> run, but the orders of magnitude of the differences between the two versions 
> remain relatively constant:
> 
> 4K Pages:
> =========
> Performance counter stats for './foo' (10 runs):
> 
>     307054.450421      task-clock:u (msec)       #    1.000 CPUs utilized            ( +-  0.21% )
>                 0      context-switches:u        #    0.000 K/sec
>                 0      cpu-migrations:u          #    0.000 K/sec
>             7,728      page-faults:u             #    0.025 K/sec                    ( +-  0.00% )
> 1,401,295,823,265      cycles:u                  #    4.564 GHz                      ( +-  0.19% )  (30.77%)
>   562,704,668,718      instructions:u            #    0.40  insn per cycle           ( +-  0.00% )  (38.46%)
>    20,100,243,102      branches:u                #   65.461 M/sec                    ( +-  0.00% )  (38.46%)
>         2,628,944      branch-misses:u           #    0.01% of all branches          ( +-  3.32% )  (38.46%)
>   180,885,880,185      L1-dcache-loads:u         #  589.100 M/sec                    ( +-  0.00% )  (38.46%)
>    40,374,420,279      L1-dcache-load-misses:u   #   22.32% of all L1-dcache hits    ( +-  0.01% )  (38.46%)
>       232,184,583      LLC-loads:u               #    0.756 M/sec                    ( +-  1.48% )  (30.77%)
>        23,990,082      LLC-load-misses:u         #   10.33% of all LL-cache hits     ( +-  1.48% )  (30.77%)
>   <not supported>      L1-icache-loads:u
>    74,897,499,234      L1-icache-load-misses:u                                       ( +-  0.00% )  (30.77%)
>   180,990,026,447      dTLB-loads:u              #  589.440 M/sec                    ( +-  0.00% )  (30.77%)
>           707,373      dTLB-load-misses:u        #    0.00% of all dTLB cache hits   ( +-  4.62% )  (30.77%)
>         5,583,675      iTLB-loads:u              #    0.018 M/sec                    ( +-  0.31% )  (30.77%)
>     1,219,514,499      iTLB-load-misses:u        # 21840.71% of all iTLB cache hits  ( +-  0.01% )  (30.77%)
>   <not supported>      L1-dcache-prefetches:u
>   <not supported>      L1-dcache-prefetch-misses:u
> 
> 307.093088771 seconds time elapsed                                          ( +-  0.20% )
> 
> 2M Pages:
> =========
> Performance counter stats for './testr' (10 runs):
> 
>     289504.209769      task-clock:u (msec)       #    1.000 CPUs utilized            ( +-  0.19% )
>                 0      context-switches:u        #    0.000 K/sec
>                 0      cpu-migrations:u          #    0.000 K/sec
>               598      page-faults:u             #    0.002 K/sec                    ( +-  0.03% )
> 1,323,835,488,984      cycles:u                  #    4.573 GHz                      ( +-  0.19% )  (30.77%)
>   562,658,682,055      instructions:u            #    0.43  insn per cycle           ( +-  0.00% )  (38.46%)
>    20,099,662,528      branches:u                #   69.428 M/sec                    ( +-  0.00% )  (38.46%)
>         2,877,086      branch-misses:u           #    0.01% of all branches          ( +-  4.52% )  (38.46%)
>   180,899,297,017      L1-dcache-loads:u         #  624.859 M/sec                    ( +-  0.00% )  (38.46%)
>    40,209,140,089      L1-dcache-load-misses:u   #   22.23% of all L1-dcache hits    ( +-  0.00% )  (38.46%)
>       135,968,232      LLC-loads:u               #    0.470 M/sec                    ( +-  1.56% )  (30.77%)
>         6,704,890      LLC-load-misses:u         #    4.93% of all LL-cache hits     ( +-  1.92% )  (30.77%)
>   <not supported>      L1-icache-loads:u
>    74,955,673,747      L1-icache-load-misses:u                                       ( +-  0.00% )  (30.77%)
>   180,987,794,366      dTLB-loads:u              #  625.165 M/sec                    ( +-  0.00% )  (30.77%)
>               835      dTLB-load-misses:u        #    0.00% of all dTLB cache hits   ( +- 14.35% )  (30.77%)
>         6,386,207      iTLB-loads:u              #    0.022 M/sec                    ( +-  0.42% )  (30.77%)
>        51,929,869      iTLB-load-misses:u        #  813.16% of all iTLB cache hits   ( +-  1.61% )  (30.77%)
>   <not supported>      L1-dcache-prefetches:u
>   <not supported>      L1-dcache-prefetch-misses:u
> 
> 289.551551387 seconds time elapsed                                          ( +-  0.20% )
> 
> A check of /proc/meminfo with the test program running shows the large mappings:
> 
> ShmemPmdMapped:   471040 kB
> 
> FAQ:
> ====
> Q: What kernel is the prototype based on?
> A: 4.14.0-rc7
> 
> Q: What is the biggest issue you haven't addressed?
> A: Given this is a prototype, there are many. Aside from the fact that I 
>    only map large pages for an  executable of a specific name ("testr"), the 
>    code must be integrated with large page size support in the page cache 
>    as currently multiple iterations of an executable would each use their 
>    own individually allocated THP pages and those pages filled with data 
>    using kernel_read(), which  allows for performance characterization but 
>    would never be acceptable for a production kernel.
> 
>    A good example of the large page support required is the ext4 support
>    outlined in:
> 
>      https://www.mail-archive.com/linux-block@vger.kernel.org/msg04012.html
> 
>    There also need to be configuration options to enable this code at all, 
>    likely only for file systems that support large pages, and more 
>    reasonable fixes for the assumptions that all large THP pages are 
>    anonymous assertions in rmap.c (for the prototype I just "#if 0" them out.)
> 
> Q: Which processes get their text as large pages?
> A: At this point with this implementation it's any process with a read-only
>    text area of the proper size/alignment.
> 
>    An attempt is made to align the address for non-MAP_FIXED addresses.
> 
>    I do not make any attempt to move mappings that take up a majority of a 
>    large page to a large page; I only map a large page if the address 
>    aligns and the map size is larger than or equal to a large page.  
> 
> Q: Which architectures has this been tested on?
> A: At present, only x64.
> 
> Q: How about architectures (ARM, for instance) with multiple large page 
>    sizes that are reasonable for text mappings?
> A: At present a "large page" is just PMD size; it would be possible with
>    additional effort to allow for mapping using PUD-sized pages.
> 
> Q: What about the use of non-PMD large page sizes (on non-x86 architectures)?
> A: I haven't looked into that; I don't have an answer as to how to best 
>    map a page that wasn't sized to be a PMD or PUD.
> 
> Signed-off-by: William Kucharski <william.kucharski@oracle.com>
> 
> ===============================================================
> 
> diff --git a/fs/hugetlbfs/inode.c b/fs/hugetlbfs/inode.c
> index ed113ea..f4ac381 100644
> --- a/fs/hugetlbfs/inode.c
> +++ b/fs/hugetlbfs/inode.c
> @@ -146,8 +146,8 @@ static int hugetlbfs_file_mmap(struct file *file, struct vm_area_struct *vma)
> 	if (vma->vm_pgoff & (~huge_page_mask(h) >> PAGE_SHIFT))
> 		return -EINVAL;
> 
> -	vma_len = (loff_t)(vma->vm_end - vma->vm_start);
> -	len = vma_len + ((loff_t)vma->vm_pgoff << PAGE_SHIFT);
> +	vma_len = (loff_t)(vma->vm_end - vma->vm_start);	/* length of VMA */
> +	len = vma_len + ((loff_t)vma->vm_pgoff << PAGE_SHIFT);	/* add vma->vm_pgoff * PAGESIZE */
> 	/* check for overflow */
> 	if (len < vma_len)
> 		return -EINVAL;
> diff --git a/include/linux/huge_mm.h b/include/linux/huge_mm.h
> index 87067d2..353bec8 100644
> --- a/include/linux/huge_mm.h
> +++ b/include/linux/huge_mm.h
> @@ -80,13 +80,15 @@ extern struct kobj_attribute shmem_enabled_attr;
> #define HPAGE_PMD_NR (1<<HPAGE_PMD_ORDER)
> 
> #ifdef CONFIG_TRANSPARENT_HUGEPAGE
> -#define HPAGE_PMD_SHIFT PMD_SHIFT
> -#define HPAGE_PMD_SIZE	((1UL) << HPAGE_PMD_SHIFT)
> -#define HPAGE_PMD_MASK	(~(HPAGE_PMD_SIZE - 1))
> -
> -#define HPAGE_PUD_SHIFT PUD_SHIFT
> -#define HPAGE_PUD_SIZE	((1UL) << HPAGE_PUD_SHIFT)
> -#define HPAGE_PUD_MASK	(~(HPAGE_PUD_SIZE - 1))
> +#define HPAGE_PMD_SHIFT		PMD_SHIFT
> +#define HPAGE_PMD_SIZE		((1UL) << HPAGE_PMD_SHIFT)
> +#define	HPAGE_PMD_OFFSET	(HPAGE_PMD_SIZE - 1)
> +#define	HPAGE_PMD_MASK		(~(HPAGE_PMD_OFFSET))
> +
> +#define HPAGE_PUD_SHIFT		PUD_SHIFT
> +#define HPAGE_PUD_SIZE		((1UL) << HPAGE_PUD_SHIFT)
> +#define	HPAGE_PUD_OFFSET	(HPAGE_PUD_SIZE - 1)
> +#define HPAGE_PUD_MASK		(~(HPAGE_PUD_OFFSET))
> 
> extern bool is_vma_temporary_stack(struct vm_area_struct *vma);
> 
> diff --git a/mm/huge_memory.c b/mm/huge_memory.c
> index 1981ed6..7b61c92 100644
> --- a/mm/huge_memory.c
> +++ b/mm/huge_memory.c
> @@ -445,6 +445,14 @@ subsys_initcall(hugepage_init);
> 
> static int __init setup_transparent_hugepage(char *str)
> {
> +#if 1
> +	set_bit(TRANSPARENT_HUGEPAGE_FLAG,
> +		&transparent_hugepage_flags);
> +	clear_bit(TRANSPARENT_HUGEPAGE_REQ_MADV_FLAG,
> +		  &transparent_hugepage_flags);
> +	printk("THP permanently set ON\n");
> +	return 1;
> +#else
> 	int ret = 0;
> 	if (!str)
> 		goto out;
> @@ -471,6 +479,7 @@ static int __init setup_transparent_hugepage(char *str)
> 	if (!ret)
> 		pr_warn("transparent_hugepage= cannot parse, ignored\n");
> 	return ret;
> +#endif
> }
> __setup("transparent_hugepage=", setup_transparent_hugepage);
> 
> @@ -532,8 +541,11 @@ unsigned long thp_get_unmapped_area(struct file *filp, unsigned long addr,
> 
> 	if (addr)
> 		goto out;
> +
> +#if 0
> 	if (!IS_DAX(filp->f_mapping->host) || !IS_ENABLED(CONFIG_FS_DAX_PMD))
> 		goto out;
> +#endif
> 
> 	addr = __thp_get_unmapped_area(filp, len, off, flags, PMD_SIZE);
> 	if (addr)
> diff --git a/mm/memory.c b/mm/memory.c
> index a728bed..fc352d8 100644
> --- a/mm/memory.c
> +++ b/mm/memory.c
> @@ -3506,7 +3506,99 @@ late_initcall(fault_around_debugfs);
>  * fault_around_pages() value (and therefore to page order).  This way it's
>  * easier to guarantee that we don't cross page table boundaries.
>  */
> -static int do_fault_around(struct vm_fault *vmf)
> +
> +#ifdef CONFIG_TRANSPARENT_HUGEPAGE
> +static
> +int do_fault_around_thp(struct vm_fault *vmf)
> +{
> +        struct file *file = vmf->vma->vm_file;
> +	unsigned long address = vmf->address;
> +	pgoff_t start_pgoff = vmf->pgoff;
> +	pgoff_t end_pgoff;
> +	int ret = VM_FAULT_FALLBACK;
> +	int off;
> +
> +	/*
> +	 * vmf->address will be the higher of (fault address & HPAGE_PMD_MASK)
> +	 * or the start of the VMA.
> +	 */
> +	vmf->address = max((address & HPAGE_PMD_MASK), vmf->vma->vm_start);
> +
> +	/*
> +	 * Not a candidate if the start address calculated above isnt properly
> +	 * aligned
> +	 */
> +	if (vmf->address & HPAGE_PMD_OFFSET)
> +		goto dfa_thp_out;
> +
> +	off = ((address - vmf->address) >> PAGE_SHIFT) & (PTRS_PER_PTE - 1);
> +	start_pgoff -= off;
> +
> +	/*
> +	 *  end_pgoff is either end of page table or end of vma
> +	 *  or fault_around_pages() from start_pgoff, depending what is
> +	 *  smallest.
> +	 */
> +	end_pgoff = start_pgoff -
> +		((vmf->address >> PAGE_SHIFT) & (PTRS_PER_PTE - 1)) +
> +		PTRS_PER_PTE - 1;
> +	end_pgoff = min3(end_pgoff, vma_pages(vmf->vma) + vmf->vma->vm_pgoff - 1,
> +			start_pgoff + PTRS_PER_PTE - 1);
> +
> +	/*
> +	 * Check to see if we could map this request with a large THP page
> +	 * instead.
> +	 */
> +	if (((strncmp(file->f_path.dentry->d_name.name, "testr", 5) == 0)) &&
> +		pmd_none(*vmf->pmd) &&
> +		((end_pgoff - start_pgoff) >=
> +		((HPAGE_PMD_SIZE >> PAGE_SHIFT) - 1))) {
> +		struct page *page;
> +
> +		page = alloc_pages_vma(vmf->gfp_mask | __GFP_COMP |
> +			__GFP_NORETRY, HPAGE_PMD_ORDER, vmf->vma,
> +			vmf->address, numa_node_id(), 1);
> +
> +		if ((likely(page)) && (PageTransCompound(page))) {
> +			ssize_t bytes_read;
> +			void *pg_vaddr;
> +
> +			prep_transhuge_page(page);
> +			pg_vaddr = page_address(page);
> +
> +			if (likely(pg_vaddr)) {
> +				loff_t loff = (loff_t)
> +					(start_pgoff << PAGE_SHIFT);
> +				bytes_read = kernel_read(file, pg_vaddr,
> +					HPAGE_PMD_SIZE, &loff);
> +				VM_BUG_ON(bytes_read != HPAGE_PMD_SIZE);
> +
> +				smp_wmb(); /* See comment in __pte_alloc() */
> +				ret = alloc_set_pte(vmf, NULL, page);
> +
> +				if (likely(ret == 0)) {
> +					VM_BUG_ON_PAGE(pmd_none(*vmf->pmd),
> +						page);
> +					vmf->page = page;
> +					ret = VM_FAULT_NOPAGE;
> +					goto dfa_thp_out;
> +				}
> +			}
> +
> +			put_page(page);
> +		}
> +	}
> +
> +dfa_thp_out:
> +	vmf->address = address;
> +	VM_BUG_ON(vmf->pte != NULL);
> +	return ret;
> +}
> +#endif	/* CONFIG_TRANSPARENT_HUGEPAGE */
> +
> +
> +static
> +int do_fault_around(struct vm_fault *vmf)
> {
> 	unsigned long address = vmf->address, nr_pages, mask;
> 	pgoff_t start_pgoff = vmf->pgoff;
> @@ -3566,6 +3658,21 @@ static int do_read_fault(struct vm_fault *vmf)
> 	struct vm_area_struct *vma = vmf->vma;
> 	int ret = 0;
> 
> +#ifdef CONFIG_TRANSPARENT_HUGEPAGE
> +	/*
> +	 * Check to see if we could map this request with a large THP page
> +	 * instead.
> +	 */
> +	if ((vma_pages(vmf->vma) >= PTRS_PER_PMD) &&
> +		((strncmp(vmf->vma->vm_file->f_path.dentry->d_name.name,
> +		"testr", 5)) == 0)) {
> +			ret = do_fault_around_thp(vmf);
> +
> +			if (ret == VM_FAULT_NOPAGE)
> +				return ret;
> +	}
> +#endif	/* CONFIG_TRANSPARENT_HUGEPAGE */
> +
> 	/*
> 	 * Let's call ->map_pages() first and use ->fault() as fallback
> 	 * if page by the offset is not ready to be mapped (cold cache or
> diff --git a/mm/mmap.c b/mm/mmap.c
> index 680506f..1c281d7 100644
> --- a/mm/mmap.c
> +++ b/mm/mmap.c
> @@ -1327,6 +1327,10 @@ unsigned long do_mmap(struct file *file, unsigned long addr,
> 	struct mm_struct *mm = current->mm;
> 	int pkey = 0;
> 
> +#ifdef CONFIG_TRANSPARENT_HUGEPAGE
> +	unsigned long thp_maywrite = VM_MAYWRITE;
> +#endif
> +
> 	*populate = 0;
> 
> 	if (!len)
> @@ -1361,7 +1365,32 @@ unsigned long do_mmap(struct file *file, unsigned long addr,
> 	/* Obtain the address to map to. we verify (or select) it and ensure
> 	 * that it represents a valid section of the address space.
> 	 */
> -	addr = get_unmapped_area(file, addr, len, pgoff, flags);
> +
> +
> +#ifdef CONFIG_TRANSPARENT_HUGEPAGE
> +	/*
> +	 *
> +	 * If THP is enabled, and it's a read-only executable that is
> +	 * MAP_PRIVATE mapped, call the appropriate thp function to perhaps get a
> +	 * large page aligned virtual address, otherwise use the normal routine.
> +	 * 
> +	 * Note the THP routine will return a normal page size aligned start
> +	 * address in some cases.
> +	 */
> +	if ((prot & PROT_READ) && (prot & PROT_EXEC) && (!(prot & PROT_WRITE)) &&
> +		(len >= HPAGE_PMD_SIZE) && (flags & MAP_PRIVATE) &&
> +		((!(flags & MAP_FIXED)) || (!(addr & HPAGE_PMD_OFFSET)))) {
> +			addr = thp_get_unmapped_area(file, addr, len, pgoff,
> +				flags);
> +			if (addr && (!(addr & HPAGE_PMD_OFFSET)))
> +				thp_maywrite = 0;
> +	} else {
> +#endif
> +		addr = get_unmapped_area(file, addr, len, pgoff, flags);
> +#ifdef CONFIG_TRANSPARENT_HUGEPAGE
> +	}
> +#endif
> +
> 	if (offset_in_page(addr))
> 		return addr;
> 
> @@ -1376,7 +1405,11 @@ unsigned long do_mmap(struct file *file, unsigned long addr,
> 	 * of the memory object, so we don't do any here.
> 	 */
> 	vm_flags |= calc_vm_prot_bits(prot, pkey) | calc_vm_flag_bits(flags) |
> +#ifdef CONFIG_TRANSPARENT_HUGEPAGE
> +			mm->def_flags | VM_MAYREAD | thp_maywrite | VM_MAYEXEC;
> +#else
> 			mm->def_flags | VM_MAYREAD | VM_MAYWRITE | VM_MAYEXEC;
> +#endif
> 
> 	if (flags & MAP_LOCKED)
> 		if (!can_do_mlock())
> diff --git a/mm/rmap.c b/mm/rmap.c
> index b874c47..4fc24f8 100644
> --- a/mm/rmap.c
> +++ b/mm/rmap.c
> @@ -1184,7 +1184,9 @@ void page_add_file_rmap(struct page *page, bool compound)
> 		}
> 		if (!atomic_inc_and_test(compound_mapcount_ptr(page)))
> 			goto out;
> +#if 0
> 		VM_BUG_ON_PAGE(!PageSwapBacked(page), page);
> +#endif
> 		__inc_node_page_state(page, NR_SHMEM_PMDMAPPED);
> 	} else {
> 		if (PageTransCompound(page) && page_mapping(page)) {
> @@ -1224,7 +1226,9 @@ static void page_remove_file_rmap(struct page *page, bool compound)
> 		}
> 		if (!atomic_add_negative(-1, compound_mapcount_ptr(page)))
> 			goto out;
> +#if 0
> 		VM_BUG_ON_PAGE(!PageSwapBacked(page), page);
> +#endif
> 		__dec_node_page_state(page, NR_SHMEM_PMDMAPPED);
> 	} else {
> 		if (!atomic_add_negative(-1, &page->_mapcount))

William Kucharski May 17, 2018, 2:34 p.m. UTC | #4
> On May 17, 2018, at 1:57 AM, Michal Hocko <mhocko@kernel.org> wrote:
> 
> [CCing Kirill and fs-devel]
> 
> On Mon 14-05-18 07:12:13, William Kucharski wrote:
>> One of the downsides of THP as currently implemented is that it only supports
>> large page mappings for anonymous pages.
> 
> There is a support for shmem merged already. ext4 was next on the plan
> AFAIR but I haven't seen any patches and Kirill was busy with other
> stuff IIRC.

I couldn't find anything that would specifically map text pages with large pages,
so perhaps this could be integrated with that or I may have simply missed changes
that would ultimately provide that functionality.

> 
>> I embarked upon this prototype on the theory that it would be advantageous to 
>> be able to map large ranges of read-only text pages using THP as well.
> 
> Can the fs really support THP only for read mappings? What if those
> pages are to be shared in a writable mapping as well? In other words
> can this all work without a full THP support for a particular fs?

The integration with the page cache would indeed require filesystem support.

The end result I'd like to see is full R/W support for large THP pages; I
thought the RO text mapping proof of concept worthwhile to see what kind of
results we might see and what the thoughts of the community were.

Thanks for the feedback.

  -- Bill

Matthew Wilcox May 17, 2018, 3:23 p.m. UTC | #5
On Mon, May 14, 2018 at 07:12:13AM -0600, William Kucharski wrote:
> One of the downsides of THP as currently implemented is that it only supports
> large page mappings for anonymous pages.

It does also support shmem.

> I embarked upon this prototype on the theory that it would be advantageous to 
> be able to map large ranges of read-only text pages using THP as well.

I'm certain it is.  The other thing I believe is true that we should be
able to share page tables (my motivation is thousands of processes each
mapping the same ridiculously-sized file).  I was hoping this prototype
would have code that would be stealable for that purpose, but you've
gone in a different direction.  Which is fine for a prototype; you've
produced useful numbers.

> As currently implemented for test purposes, the prototype will only use large 
> pages to map an executable with a particular filename ("testr"), enabling easy 
> comparison of the same executable using 4K and 2M (x64) pages on the same 
> kernel. It is understood that this is just a proof of concept implementation 
> and much more work regarding enabling the feature and overall system usage of 
> it would need to be done before it was submitted as a kernel patch. However, I 
> felt it would be worthy to send it out as an RFC so I can find out whether 
> there are huge objections from the community to doing this at all, or a better 
> understanding of the major concerns that must be assuaged before it would even 
> be considered. I currently hardcode CONFIG_TRANSPARENT_HUGEPAGE to the 
> equivalent of "always" and bypass some checks for anonymous pages by simply 
> #ifdefing the code out; obviously I would need to determine the right thing to 
> do in those cases.

Understood that it's completely inappropriate for merging as it stands ;-)

I think the first step is to get variable sized pages in the page cache
working.  Then the map-around functionality can probably just notice if
they're big enough to map with a PMD and make that happen.  I don't immediately
see anything from this PoC that can be used, but it at least gives us a
good point of comparison for any future work.

> 4K Pages:
> =========
> 
>   180,990,026,447      dTLB-loads:u              #  589.440 M/sec                    ( +-  0.00% )  (30.77%)
>           707,373      dTLB-load-misses:u        #    0.00% of all dTLB cache hits   ( +-  4.62% )  (30.77%)
>         5,583,675      iTLB-loads:u              #    0.018 M/sec                    ( +-  0.31% )  (30.77%)
>     1,219,514,499      iTLB-load-misses:u        # 21840.71% of all iTLB cache hits  ( +-  0.01% )  (30.77%)
> 
> 307.093088771 seconds time elapsed                                          ( +-  0.20% )
> 
> 2M Pages:
> =========
> 
>   180,987,794,366      dTLB-loads:u              #  625.165 M/sec                    ( +-  0.00% )  (30.77%)
>               835      dTLB-load-misses:u        #    0.00% of all dTLB cache hits   ( +- 14.35% )  (30.77%)
>         6,386,207      iTLB-loads:u              #    0.022 M/sec                    ( +-  0.42% )  (30.77%)
>        51,929,869      iTLB-load-misses:u        #  813.16% of all iTLB cache hits   ( +-  1.61% )  (30.77%)
> 
> 289.551551387 seconds time elapsed                                          ( +-  0.20% )

I think that really tells the story.  We almost entirely eliminate
dTLB load misses (down to almost 0.1%) and iTLB load misses drop to 4%
of what they were.  Does this test represent any kind of real world load,
or is it designed to show the best possible improvement?
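
For reference, the ratios quoted above can be reproduced from the perf counters with a few lines of C. The helper name percent_of is just for illustration. Plugging in the numbers from the two runs gives roughly 0.12% for dTLB load misses (835 vs 707,373), roughly 4.26% for iTLB load misses (51,929,869 vs 1,219,514,499), and an elapsed time that is about 94.3% of the 4K run (289.55 s vs 307.09 s).

```c
#include <assert.h>
#include <stdint.h>

/* Value of 'after' expressed as a percentage of 'before'. */
static double percent_of(uint64_t after, uint64_t before)
{
	assert(before != 0);
	return 100.0 * (double)after / (double)before;
}

/* Same calculation for floating-point quantities such as elapsed seconds. */
static double percent_of_f(double after, double before)
{
	assert(before > 0.0);
	return 100.0 * after / before;
}
```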

> Q: How about architectures (ARM, for instance) with multiple large page 
>    sizes that are reasonable for text mappings?
> A: At present a "large page" is just PMD size; it would be possible with
>    additional effort to allow for mapping using PUD-sized pages.
> 
> Q: What about the use of non-PMD large page sizes (on non-x86 architectures)?
> A: I haven't looked into that; I don't have an answer as to how to best 
>    map a page that wasn't sized to be a PMD or PUD.

Yes, we really make no effort to support the kind of arbitrary page sizes
supported by IA64 or PA-RISC.  ARM might be interesting; I think you
can mix 64k and 4k pages fairly arbitrarily (judging from the A57 docs).
We don't have any generic interface for inserting TLB entries that are
intermediate in size between a single page and a PMD, so we'll have to
devise something like that.

I can't find any information on what page sizes SPARC supports.
Maybe you could point me at a reference?  All I've managed to find is
the architecture manuals for SPARC which believe it is not their purpose
to mandate an MMU.

Larry Bassel May 17, 2018, 3:40 p.m. UTC | #6
On 17 May 18 08:23, Matthew Wilcox wrote:
> 
> I can't find any information on what page sizes SPARC supports.
> Maybe you could point me at a reference?  All I've managed to find is
> the architecture manuals for SPARC which believe it is not their purpose
> to mandate an MMU.
> 

Page sizes of 8K, 64K, 512K, 4M, 32M, 256M, 2G, 16G are allowed
architecturally -- some of these aren't present in some
SPARC machines. Generally 8K, 64K, 4M, 256M, 2G, 16G are
present on modern machines. 

Also note that the SPARC THP page size is 8M (so that it is
PMD aligned).

Larry
William Kucharski May 17, 2018, 5:31 p.m. UTC | #7
> On May 17, 2018, at 9:23 AM, Matthew Wilcox <willy@infradead.org> wrote:
> 
> I'm certain it is.  The other thing I believe is true is that we should be
> able to share page tables (my motivation is thousands of processes each
> mapping the same ridiculously-sized file).  I was hoping this prototype
> would have code that would be stealable for that purpose, but you've
> gone in a different direction.  Which is fine for a prototype; you've
> produced useful numbers.

Definitely, and that's why I mentioned integration with the page cache
would be crucial. This prototype allocates pages for each invocation of
the executable, which would never fly on a real system.

> I think the first step is to get variable sized pages in the page cache
> working.  Then the map-around functionality can probably just notice if
> they're big enough to map with a PMD and make that happen.  I don't immediately
> see anything from this PoC that can be used, but it at least gives us a
> good point of comparison for any future work.

Yes, that's the first step to getting actual usable code designed and
working; this prototype was designed just to get something working and
to get a first swag at some performance numbers.

I do think that adding code to map larger pages as a fault_around variant
is a good start as the code is already going to potentially map in
fault_around_bytes from the file to satisfy the fault. It makes sense
to extend that paradigm to be able to tune when large pages might be
read in and/or mapped using large pages extant in the page cache.

Filesystem support becomes more important once writing to large pages
is allowed.

> I think that really tells the story.  We almost entirely eliminate
> dTLB load misses (down to almost 0.1%) and iTLB load misses drop to 4%
> of what they were.  Does this test represent any kind of real world load,
> or is it designed to show the best possible improvement?

It's admittedly designed to thrash the caches pretty hard and doesn't
represent any type of actual workload I'm aware of. It basically calls
various routines within a huge text area while scribbling to automatic
arrays declared at the top of each routine. It wasn't designed as a worst
case scenario, but rather as one that would hopefully show some obvious
degree of difference when large text pages were supported.

Thanks for your comments.

    -- Bill
Song Liu May 20, 2018, 6:26 a.m. UTC | #8
On Thu, May 17, 2018 at 10:31 AM, William Kucharski
<william.kucharski@oracle.com> wrote:
>
>
>> On May 17, 2018, at 9:23 AM, Matthew Wilcox <willy@infradead.org> wrote:
>>
>> I'm certain it is.  The other thing I believe is true is that we should be
>> able to share page tables (my motivation is thousands of processes each
>> mapping the same ridiculously-sized file).  I was hoping this prototype
>> would have code that would be stealable for that purpose, but you've
>> gone in a different direction.  Which is fine for a prototype; you've
>> produced useful numbers.
>
> Definitely, and that's why I mentioned integration with the page cache
> would be crucial. This prototype allocates pages for each invocation of
> the executable, which would never fly on a real system.
>
>> I think the first step is to get variable sized pages in the page cache
>> working.  Then the map-around functionality can probably just notice if
>> they're big enough to map with a PMD and make that happen.  I don't immediately
>> see anything from this PoC that can be used, but it at least gives us a
>> good point of comparison for any future work.
>
> Yes, that's the first step to getting actual usable code designed and
> working; this prototype was designed just to get something working and
> to get a first swag at some performance numbers.
>
> I do think that adding code to map larger pages as a fault_around variant
> is a good start as the code is already going to potentially map in
> fault_around_bytes from the file to satisfy the fault. It makes sense
> to extend that paradigm to be able to tune when large pages might be
> read in and/or mapped using large pages extant in the page cache.
>
> Filesystem support becomes more important once writing to large pages
> is allowed.
>
>> I think that really tells the story.  We almost entirely eliminate
>> dTLB load misses (down to almost 0.1%) and iTLB load misses drop to 4%
>> of what they were.  Does this test represent any kind of real world load,
>> or is it designed to show the best possible improvement?
>
> It's admittedly designed to thrash the caches pretty hard and doesn't
> represent any type of actual workload I'm aware of. It basically calls
> various routines within a huge text area while scribbling to automatic
> arrays declared at the top of each routine. It wasn't designed as a worst
> case scenario, but rather as one that would hopefully show some obvious
> degree of difference when large text pages were supported.
>
> Thanks for your comments.
>
>     -- Bill

We (Facebook) have quite a few real workloads that take advantage of
text on huge pages. For some of them, we can see savings close to the
number above.

Currently, we "hugify" the text region through some hack in user space.
We are very interested in supporting it natively in the kernel, because
the hack breaks other features.

We also tested enabling text on huge pages through shmem, and it does
work. The downside is that it requires putting the whole file in memory
(or at least in swap). This doesn't work very well for large binaries
with GBs of debugging data.

Song
Patch

=========
Performance counter stats for './foo' (10 runs):

    307054.450421      task-clock:u (msec)       #    1.000 CPUs utilized            ( +-  0.21% )
                0      context-switches:u        #    0.000 K/sec
                0      cpu-migrations:u          #    0.000 K/sec
            7,728      page-faults:u             #    0.025 K/sec                    ( +-  0.00% )
1,401,295,823,265      cycles:u                  #    4.564 GHz                      ( +-  0.19% )  (30.77%)
  562,704,668,718      instructions:u            #    0.40  insn per cycle           ( +-  0.00% )  (38.46%)
   20,100,243,102      branches:u                #   65.461 M/sec                    ( +-  0.00% )  (38.46%)
        2,628,944      branch-misses:u           #    0.01% of all branches          ( +-  3.32% )  (38.46%)
  180,885,880,185      L1-dcache-loads:u         #  589.100 M/sec                    ( +-  0.00% )  (38.46%)
   40,374,420,279      L1-dcache-load-misses:u   #   22.32% of all L1-dcache hits    ( +-  0.01% )  (38.46%)
      232,184,583      LLC-loads:u               #    0.756 M/sec                    ( +-  1.48% )  (30.77%)
       23,990,082      LLC-load-misses:u         #   10.33% of all LL-cache hits     ( +-  1.48% )  (30.77%)
  <not supported>      L1-icache-loads:u
   74,897,499,234      L1-icache-load-misses:u                                       ( +-  0.00% )  (30.77%)
  180,990,026,447      dTLB-loads:u              #  589.440 M/sec                    ( +-  0.00% )  (30.77%)
          707,373      dTLB-load-misses:u        #    0.00% of all dTLB cache hits   ( +-  4.62% )  (30.77%)
        5,583,675      iTLB-loads:u              #    0.018 M/sec                    ( +-  0.31% )  (30.77%)
    1,219,514,499      iTLB-load-misses:u        # 21840.71% of all iTLB cache hits  ( +-  0.01% )  (30.77%)
  <not supported>      L1-dcache-prefetches:u
  <not supported>      L1-dcache-prefetch-misses:u

307.093088771 seconds time elapsed                                          ( +-  0.20% )

2M Pages:
=========
Performance counter stats for './testr' (10 runs):

    289504.209769      task-clock:u (msec)       #    1.000 CPUs utilized            ( +-  0.19% )
                0      context-switches:u        #    0.000 K/sec
                0      cpu-migrations:u          #    0.000 K/sec
              598      page-faults:u             #    0.002 K/sec                    ( +-  0.03% )
1,323,835,488,984      cycles:u                  #    4.573 GHz                      ( +-  0.19% )  (30.77%)
  562,658,682,055      instructions:u            #    0.43  insn per cycle           ( +-  0.00% )  (38.46%)
   20,099,662,528      branches:u                #   69.428 M/sec                    ( +-  0.00% )  (38.46%)
        2,877,086      branch-misses:u           #    0.01% of all branches          ( +-  4.52% )  (38.46%)
  180,899,297,017      L1-dcache-loads:u         #  624.859 M/sec                    ( +-  0.00% )  (38.46%)
   40,209,140,089      L1-dcache-load-misses:u   #   22.23% of all L1-dcache hits    ( +-  0.00% )  (38.46%)
      135,968,232      LLC-loads:u               #    0.470 M/sec                    ( +-  1.56% )  (30.77%)
        6,704,890      LLC-load-misses:u         #    4.93% of all LL-cache hits     ( +-  1.92% )  (30.77%)
  <not supported>      L1-icache-loads:u
   74,955,673,747      L1-icache-load-misses:u                                       ( +-  0.00% )  (30.77%)
  180,987,794,366      dTLB-loads:u              #  625.165 M/sec                    ( +-  0.00% )  (30.77%)
              835      dTLB-load-misses:u        #    0.00% of all dTLB cache hits   ( +- 14.35% )  (30.77%)
        6,386,207      iTLB-loads:u              #    0.022 M/sec                    ( +-  0.42% )  (30.77%)
       51,929,869      iTLB-load-misses:u        #  813.16% of all iTLB cache hits   ( +-  1.61% )  (30.77%)
  <not supported>      L1-dcache-prefetches:u
  <not supported>      L1-dcache-prefetch-misses:u

289.551551387 seconds time elapsed                                          ( +-  0.20% )

A check of /proc/meminfo with the test program running shows the large mappings:

ShmemPmdMapped:   471040 kB

FAQ:
====
Q: What kernel is the prototype based on?
A: 4.14.0-rc7

Q: What is the biggest issue you haven't addressed?
A: Given this is a prototype, there are many. Aside from the fact that I
   only map large pages for an executable of a specific name ("testr"), the
   code must be integrated with large page size support in the page cache.
   Currently each invocation of an executable allocates its own THP pages
   and fills them with data using kernel_read(), which allows for
   performance characterization but would never be acceptable for a
   production kernel.

   A good example of the large page support required is the ext4 support
   outlined in:

     https://www.mail-archive.com/linux-block@vger.kernel.org/msg04012.html

   There also need to be configuration options to enable this code at all,
   likely only for file systems that support large pages, and more
   reasonable fixes for the assertions in rmap.c that assume all large THP
   pages are anonymous (for the prototype I just "#if 0" them out).

Q: Which processes get their text as large pages?
A: At this point with this implementation it's any process with a read-only
   text area of the proper size/alignment.

   An attempt is made to align the address for non-MAP_FIXED addresses.

   I do not make any attempt to move mappings that take up a majority of a 
   large page to a large page; I only map a large page if the address 
   aligns and the map size is larger than or equal to a large page.  

Q: Which architectures has this been tested on?
A: At present, only x64.

Q: How about architectures (ARM, for instance) with multiple large page 
   sizes that are reasonable for text mappings?
A: At present a "large page" is just PMD size; it would be possible with
   additional effort to allow for mapping using PUD-sized pages.

Q: What about the use of non-PMD large page sizes (on non-x86 architectures)?
A: I haven't looked into that; I don't have an answer as to how to best 
   map a page that wasn't sized to be a PMD or PUD.

Signed-off-by: William Kucharski <william.kucharski@oracle.com>

===============================================================

diff --git a/fs/hugetlbfs/inode.c b/fs/hugetlbfs/inode.c
index ed113ea..f4ac381 100644
--- a/fs/hugetlbfs/inode.c
+++ b/fs/hugetlbfs/inode.c
@@ -146,8 +146,8 @@  static int hugetlbfs_file_mmap(struct file *file, struct vm_area_struct *vma)
	if (vma->vm_pgoff & (~huge_page_mask(h) >> PAGE_SHIFT))
		return -EINVAL;

-	vma_len = (loff_t)(vma->vm_end - vma->vm_start);
-	len = vma_len + ((loff_t)vma->vm_pgoff << PAGE_SHIFT);
+	vma_len = (loff_t)(vma->vm_end - vma->vm_start);	/* length of VMA */
+	len = vma_len + ((loff_t)vma->vm_pgoff << PAGE_SHIFT);	/* add vma->vm_pgoff * PAGESIZE */
	/* check for overflow */
	if (len < vma_len)
		return -EINVAL;
diff --git a/include/linux/huge_mm.h b/include/linux/huge_mm.h
index 87067d2..353bec8 100644
--- a/include/linux/huge_mm.h
+++ b/include/linux/huge_mm.h
@@ -80,13 +80,15 @@  extern struct kobj_attribute shmem_enabled_attr;
#define HPAGE_PMD_NR (1<<HPAGE_PMD_ORDER)

#ifdef CONFIG_TRANSPARENT_HUGEPAGE
-#define HPAGE_PMD_SHIFT PMD_SHIFT
-#define HPAGE_PMD_SIZE	((1UL) << HPAGE_PMD_SHIFT)
-#define HPAGE_PMD_MASK	(~(HPAGE_PMD_SIZE - 1))
-
-#define HPAGE_PUD_SHIFT PUD_SHIFT
-#define HPAGE_PUD_SIZE	((1UL) << HPAGE_PUD_SHIFT)
-#define HPAGE_PUD_MASK	(~(HPAGE_PUD_SIZE - 1))
+#define HPAGE_PMD_SHIFT		PMD_SHIFT
+#define HPAGE_PMD_SIZE		((1UL) << HPAGE_PMD_SHIFT)
+#define	HPAGE_PMD_OFFSET	(HPAGE_PMD_SIZE - 1)
+#define	HPAGE_PMD_MASK		(~(HPAGE_PMD_OFFSET))
+
+#define HPAGE_PUD_SHIFT		PUD_SHIFT
+#define HPAGE_PUD_SIZE		((1UL) << HPAGE_PUD_SHIFT)
+#define	HPAGE_PUD_OFFSET	(HPAGE_PUD_SIZE - 1)
+#define HPAGE_PUD_MASK		(~(HPAGE_PUD_OFFSET))

extern bool is_vma_temporary_stack(struct vm_area_struct *vma);

diff --git a/mm/huge_memory.c b/mm/huge_memory.c
index 1981ed6..7b61c92 100644
--- a/mm/huge_memory.c
+++ b/mm/huge_memory.c
@@ -445,6 +445,14 @@  subsys_initcall(hugepage_init);

static int __init setup_transparent_hugepage(char *str)
{
+#if 1
+	set_bit(TRANSPARENT_HUGEPAGE_FLAG,
+		&transparent_hugepage_flags);
+	clear_bit(TRANSPARENT_HUGEPAGE_REQ_MADV_FLAG,
+		  &transparent_hugepage_flags);
+	printk("THP permanently set ON\n");
+	return 1;
+#else
	int ret = 0;
	if (!str)
		goto out;
@@ -471,6 +479,7 @@  static int __init setup_transparent_hugepage(char *str)
	if (!ret)
		pr_warn("transparent_hugepage= cannot parse, ignored\n");
	return ret;
+#endif
}
__setup("transparent_hugepage=", setup_transparent_hugepage);

@@ -532,8 +541,11 @@  unsigned long thp_get_unmapped_area(struct file *filp, unsigned long addr,

	if (addr)
		goto out;
+
+#if 0
	if (!IS_DAX(filp->f_mapping->host) || !IS_ENABLED(CONFIG_FS_DAX_PMD))
		goto out;
+#endif

	addr = __thp_get_unmapped_area(filp, len, off, flags, PMD_SIZE);
	if (addr)
diff --git a/mm/memory.c b/mm/memory.c
index a728bed..fc352d8 100644
--- a/mm/memory.c
+++ b/mm/memory.c
@@ -3506,7 +3506,99 @@  late_initcall(fault_around_debugfs);
 * fault_around_pages() value (and therefore to page order).  This way it's
 * easier to guarantee that we don't cross page table boundaries.
 */
-static int do_fault_around(struct vm_fault *vmf)
+
+#ifdef CONFIG_TRANSPARENT_HUGEPAGE
+static
+int do_fault_around_thp(struct vm_fault *vmf)
+{
+	struct file *file = vmf->vma->vm_file;
+	unsigned long address = vmf->address;
+	pgoff_t start_pgoff = vmf->pgoff;
+	pgoff_t end_pgoff;
+	int ret = VM_FAULT_FALLBACK;
+	int off;
+
+	/*
+	 * vmf->address will be the higher of (fault address & HPAGE_PMD_MASK)
+	 * or the start of the VMA.
+	 */
+	vmf->address = max((address & HPAGE_PMD_MASK), vmf->vma->vm_start);
+
+	/*
+	 * Not a candidate if the start address calculated above isn't properly
+	 * aligned
+	 */
+	if (vmf->address & HPAGE_PMD_OFFSET)
+		goto dfa_thp_out;
+
+	off = ((address - vmf->address) >> PAGE_SHIFT) & (PTRS_PER_PTE - 1);
+	start_pgoff -= off;
+
+	/*
+	 *  end_pgoff is either end of page table or end of vma
+	 *  or fault_around_pages() from start_pgoff, depending what is
+	 *  smallest.
+	 */
+	end_pgoff = start_pgoff -
+		((vmf->address >> PAGE_SHIFT) & (PTRS_PER_PTE - 1)) +
+		PTRS_PER_PTE - 1;
+	end_pgoff = min3(end_pgoff, vma_pages(vmf->vma) + vmf->vma->vm_pgoff - 1,
+			start_pgoff + PTRS_PER_PTE - 1);
+
+	/*
+	 * Check to see if we could map this request with a large THP page
+	 * instead.
+	 */
+	if (((strncmp(file->f_path.dentry->d_name.name, "testr", 5) == 0)) &&
+		pmd_none(*vmf->pmd) &&
+		((end_pgoff - start_pgoff) >=
+		((HPAGE_PMD_SIZE >> PAGE_SHIFT) - 1))) {
+		struct page *page;
+
+		page = alloc_pages_vma(vmf->gfp_mask | __GFP_COMP |
+			__GFP_NORETRY, HPAGE_PMD_ORDER, vmf->vma,
+			vmf->address, numa_node_id(), 1);
+
+		if ((likely(page)) && (PageTransCompound(page))) {
+			ssize_t bytes_read;
+			void *pg_vaddr;
+
+			prep_transhuge_page(page);
+			pg_vaddr = page_address(page);
+
+			if (likely(pg_vaddr)) {
+				loff_t loff = (loff_t)
+					(start_pgoff << PAGE_SHIFT);
+				bytes_read = kernel_read(file, pg_vaddr,
+					HPAGE_PMD_SIZE, &loff);
+				VM_BUG_ON(bytes_read != HPAGE_PMD_SIZE);
+
+				smp_wmb(); /* See comment in __pte_alloc() */
+				ret = alloc_set_pte(vmf, NULL, page);
+
+				if (likely(ret == 0)) {
+					VM_BUG_ON_PAGE(pmd_none(*vmf->pmd),
+						page);
+					vmf->page = page;
+					ret = VM_FAULT_NOPAGE;
+					goto dfa_thp_out;
+				}
+			}
+
+			put_page(page);
+		}
+	}
+
+dfa_thp_out:
+	vmf->address = address;
+	VM_BUG_ON(vmf->pte != NULL);
+	return ret;
+}
+#endif	/* CONFIG_TRANSPARENT_HUGEPAGE */
+
+
+static
+int do_fault_around(struct vm_fault *vmf)
{
	unsigned long address = vmf->address, nr_pages, mask;
	pgoff_t start_pgoff = vmf->pgoff;
@@ -3566,6 +3658,21 @@  static int do_read_fault(struct vm_fault *vmf)
	struct vm_area_struct *vma = vmf->vma;
	int ret = 0;

+#ifdef CONFIG_TRANSPARENT_HUGEPAGE
+	/*
+	 * Check to see if we could map this request with a large THP page
+	 * instead.
+	 */
+	if ((vma_pages(vmf->vma) >= PTRS_PER_PMD) &&
+		((strncmp(vmf->vma->vm_file->f_path.dentry->d_name.name,
+		"testr", 5)) == 0)) {
+			ret = do_fault_around_thp(vmf);
+
+			if (ret == VM_FAULT_NOPAGE)
+				return ret;
+	}
+#endif	/* CONFIG_TRANSPARENT_HUGEPAGE */
+
	/*
	 * Let's call ->map_pages() first and use ->fault() as fallback
	 * if page by the offset is not ready to be mapped (cold cache or
diff --git a/mm/mmap.c b/mm/mmap.c
index 680506f..1c281d7 100644
--- a/mm/mmap.c
+++ b/mm/mmap.c
@@ -1327,6 +1327,10 @@  unsigned long do_mmap(struct file *file, unsigned long addr,
	struct mm_struct *mm = current->mm;
	int pkey = 0;

+#ifdef CONFIG_TRANSPARENT_HUGEPAGE
+	unsigned long thp_maywrite = VM_MAYWRITE;
+#endif
+
	*populate = 0;

	if (!len)
@@ -1361,7 +1365,32 @@  unsigned long do_mmap(struct file *file, unsigned long addr,
	/* Obtain the address to map to. we verify (or select) it and ensure
	 * that it represents a valid section of the address space.
	 */
-	addr = get_unmapped_area(file, addr, len, pgoff, flags);
+
+
+#ifdef CONFIG_TRANSPARENT_HUGEPAGE
+	/*
+	 *
+	 * If THP is enabled, and it's a read-only executable that is
+	 * MAP_PRIVATE mapped, call the appropriate thp function to perhaps get a
+	 * large page aligned virtual address, otherwise use the normal routine.
+	 * 
+	 * Note the THP routine will return a normal page size aligned start
+	 * address in some cases.
+	 */
+	if ((prot & PROT_READ) && (prot & PROT_EXEC) && (!(prot & PROT_WRITE)) &&
+		(len >= HPAGE_PMD_SIZE) && (flags & MAP_PRIVATE) &&
+		((!(flags & MAP_FIXED)) || (!(addr & HPAGE_PMD_OFFSET)))) {
+			addr = thp_get_unmapped_area(file, addr, len, pgoff,
+				flags);
+			if (addr && (!(addr & HPAGE_PMD_OFFSET)))
+				thp_maywrite = 0;
+	} else {
+#endif
+		addr = get_unmapped_area(file, addr, len, pgoff, flags);
+#ifdef CONFIG_TRANSPARENT_HUGEPAGE
+	}
+#endif
+
	if (offset_in_page(addr))
		return addr;

@@ -1376,7 +1405,11 @@  unsigned long do_mmap(struct file *file, unsigned long addr,
	 * of the memory object, so we don't do any here.
	 */
	vm_flags |= calc_vm_prot_bits(prot, pkey) | calc_vm_flag_bits(flags) |
+#ifdef CONFIG_TRANSPARENT_HUGEPAGE
+			mm->def_flags | VM_MAYREAD | thp_maywrite | VM_MAYEXEC;
+#else
			mm->def_flags | VM_MAYREAD | VM_MAYWRITE | VM_MAYEXEC;
+#endif

	if (flags & MAP_LOCKED)
		if (!can_do_mlock())
diff --git a/mm/rmap.c b/mm/rmap.c
index b874c47..4fc24f8 100644
--- a/mm/rmap.c
+++ b/mm/rmap.c
@@ -1184,7 +1184,9 @@  void page_add_file_rmap(struct page *page, bool compound)
		}
		if (!atomic_inc_and_test(compound_mapcount_ptr(page)))
			goto out;
+#if 0
		VM_BUG_ON_PAGE(!PageSwapBacked(page), page);
+#endif
		__inc_node_page_state(page, NR_SHMEM_PMDMAPPED);
	} else {
		if (PageTransCompound(page) && page_mapping(page)) {
@@ -1224,7 +1226,9 @@  static void page_remove_file_rmap(struct page *page, bool compound)
		}
		if (!atomic_add_negative(-1, compound_mapcount_ptr(page)))
			goto out;
+#if 0
		VM_BUG_ON_PAGE(!PageSwapBacked(page), page);
+#endif
		__dec_node_page_state(page, NR_SHMEM_PMDMAPPED);
	} else {
		if (!atomic_add_negative(-1, &page->_mapcount))