Message ID: 5BB682E1-DD52-4AA9-83E9-DEF091E0C709@oracle.com (mailing list archive)
On Mon, 14 May 2018, William Kucharski wrote:

> The idea is that the kernel will attempt to allocate and map the range using a
> PMD sized THP page upon first fault; if the allocation is successful the page
> will be populated (at present using a call to kernel_read()) and the page will
> be mapped at the PMD level. If memory allocation fails, the page fault routines
> will drop through to the conventional PAGESIZE-oriented routines for mapping
> the faulting page.

Cool. This could be controlled by the faultaround logic, right? If we get
fault_around_bytes up to huge page size then it is reasonable to use a huge
page directly. fault_around_bytes can be set via sysfs, so there is a natural
way to control this feature there, I think.

> Since this approach will map a PMD size block of the memory map at a time, we
> should see a slight uptick in time spent in disk I/O but a substantial drop in
> page faults as well as a reduction in iTLB misses as address ranges will be
> mapped with the larger page. Analysis of a test program that consists of a very
> large text area (483,138,032 bytes in size) that thrashes D$ and I$ shows this
> does occur and there is a slight reduction in program execution time.

I think we would also want such a feature for regular writable pages as soon
as possible.
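[Aside: the fault_around_bytes knob actually lives in debugfs rather than
sysfs (note the fault_around_debugfs initcall touched by the patch below).
A minimal userspace sketch of the gate being suggested, assuming the x86-64
2 MB PMD size; this is illustrative code, not code from the thread:]

#include <stdio.h>

int main(void)
{
	const unsigned long pmd_size = 2UL << 20;	/* HPAGE_PMD_SIZE on x86-64 */
	unsigned long fab;
	/* reading the debugfs knob typically requires root */
	FILE *f = fopen("/sys/kernel/debug/fault_around_bytes", "r");

	if (!f)
		return 1;
	if (fscanf(f, "%lu", &fab) != 1) {
		fclose(f);
		return 1;
	}
	fclose(f);
	printf("fault_around_bytes=%lu: %s\n", fab,
	       fab >= pmd_size ? "window already PMD sized, could map a THP directly"
			       : "PAGESIZE fault-around");
	return 0;
}

[Raising the window is then a plain write, e.g.
"echo 0x200000 > /sys/kernel/debug/fault_around_bytes"; the kernel rounds
the value down to a power of two.]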
> On May 14, 2018, at 9:19 AM, Christopher Lameter <cl@linux.com> wrote:
>
> Cool. This could be controlled by the faultaround logic right? If we get
> fault_around_bytes up to huge page size then it is reasonable to use a
> huge page directly.

It isn't presently but certainly could be; for the prototype it tries to map
a large page when needed and, should that fail, it will fall through to the
normal fault around code.

I would think we would want a separate parameter, as I can see the usefulness
of more fine-grained control. Many users may want to try mapping a large page
if possible, but would prefer a smaller number of bytes to be read in fault
around should we need to fall back to using PAGESIZE pages.

> fault_around_bytes can be set via sysfs so there is a natural way to
> control this feature there I think.

I agree; perhaps I could use "fault_around_thp_bytes" or something similar.

>> Since this approach will map a PMD size block of the memory map at a time, we
>> should see a slight uptick in time spent in disk I/O but a substantial drop in
>> page faults as well as a reduction in iTLB misses as address ranges will be
>> mapped with the larger page. Analysis of a test program that consists of a very
>> large text area (483,138,032 bytes in size) that thrashes D$ and I$ shows this
>> does occur and there is a slight reduction in program execution time.
>
> I think we would also want such a feature for regular writable pages as
> soon as possible.

That is my ultimate long-term goal for this project - full r/w support of
large THP pages; prototyping with read-only text pages seemed like the best
first step to get a sense of the possible benefits.

--
Bill
[CCing Kirill and fs-devel]

On Mon 14-05-18 07:12:13, William Kucharski wrote:
> One of the downsides of THP as currently implemented is that it only supports
> large page mappings for anonymous pages.

There is a support for shmem merged already. ext4 was next on the plan AFAIR
but I haven't seen any patches and Kirill was busy with other stuff IIRC.

> I embarked upon this prototype on the theory that it would be advantageous to
> be able to map large ranges of read-only text pages using THP as well.

Can the fs really support THP only for read mappings? What if those pages
are to be shared in a writable mapping as well? In other words can this all
work without a full THP support for a particular fs?

Keeping the rest of the email for new CC.

> The idea is that the kernel will attempt to allocate and map the range using a
> PMD sized THP page upon first fault; if the allocation is successful the page
> will be populated (at present using a call to kernel_read()) and the page will
> be mapped at the PMD level. If memory allocation fails, the page fault routines
> will drop through to the conventional PAGESIZE-oriented routines for mapping
> the faulting page.
>
> Since this approach will map a PMD size block of the memory map at a time, we
> should see a slight uptick in time spent in disk I/O but a substantial drop in
> page faults as well as a reduction in iTLB misses as address ranges will be
> mapped with the larger page. Analysis of a test program that consists of a very
> large text area (483,138,032 bytes in size) that thrashes D$ and I$ shows this
> does occur and there is a slight reduction in program execution time.
>
> The text segment as seen from readelf:
>
>     LOAD 0x0000000000000000 0x0000000000400000 0x0000000000400000
>          0x000000001ccc19f0 0x000000001ccc19f0 R E 0x200000
>
> As currently implemented for test purposes, the prototype will only use large
> pages to map an executable with a particular filename ("testr"), enabling easy
> comparison of the same executable using 4K and 2M (x64) pages on the same
> kernel. It is understood that this is just a proof of concept implementation
> and much more work regarding enabling the feature and overall system usage of
> it would need to be done before it was submitted as a kernel patch. However, I
> felt it would be worthy to send it out as an RFC so I can find out whether
> there are huge objections from the community to doing this at all, or a better
> understanding of the major concerns that must be assuaged before it would even
> be considered. I currently hardcode CONFIG_TRANSPARENT_HUGEPAGE to the
> equivalent of "always" and bypass some checks for anonymous pages by simply
> #ifdefing the code out; obviously I would need to determine the right thing to
> do in those cases.
>
> Current comparisons of 4K vs 2M pages as generated by "perf stat -d -d -d -r10"
> follow; the 4K pagesize program was named "foo" and the 2M pagesize program
> "testr" (as noted above) - please note that these numbers do vary from run to
> run, but the orders of magnitude of the differences between the two versions
> remain relatively constant:
>
> 4K Pages:
> =========
> Performance counter stats for './foo' (10 runs):
>
>        307054.450421      task-clock:u (msec)        #  1.000 CPUs utilized             ( +-  0.21% )
>                    0      context-switches:u         #  0.000 K/sec
>                    0      cpu-migrations:u           #  0.000 K/sec
>                7,728      page-faults:u              #  0.025 K/sec                     ( +-  0.00% )
>    1,401,295,823,265      cycles:u                   #  4.564 GHz                       ( +-  0.19% )  (30.77%)
>      562,704,668,718      instructions:u             #  0.40  insn per cycle            ( +-  0.00% )  (38.46%)
>       20,100,243,102      branches:u                 # 65.461 M/sec                     ( +-  0.00% )  (38.46%)
>            2,628,944      branch-misses:u            #  0.01% of all branches           ( +-  3.32% )  (38.46%)
>      180,885,880,185      L1-dcache-loads:u          # 589.100 M/sec                    ( +-  0.00% )  (38.46%)
>       40,374,420,279      L1-dcache-load-misses:u    # 22.32% of all L1-dcache hits     ( +-  0.01% )  (38.46%)
>          232,184,583      LLC-loads:u                #  0.756 M/sec                     ( +-  1.48% )  (30.77%)
>           23,990,082      LLC-load-misses:u          # 10.33% of all LL-cache hits      ( +-  1.48% )  (30.77%)
>      <not supported>      L1-icache-loads:u
>       74,897,499,234      L1-icache-load-misses:u                                       ( +-  0.00% )  (30.77%)
>      180,990,026,447      dTLB-loads:u               # 589.440 M/sec                    ( +-  0.00% )  (30.77%)
>              707,373      dTLB-load-misses:u         #  0.00% of all dTLB cache hits    ( +-  4.62% )  (30.77%)
>            5,583,675      iTLB-loads:u               #  0.018 M/sec                     ( +-  0.31% )  (30.77%)
>        1,219,514,499      iTLB-load-misses:u         # 21840.71% of all iTLB cache hits ( +-  0.01% )  (30.77%)
>      <not supported>      L1-dcache-prefetches:u
>      <not supported>      L1-dcache-prefetch-misses:u
>
>        307.093088771 seconds time elapsed                                               ( +-  0.20% )
>
> 2M Pages:
> =========
> Performance counter stats for './testr' (10 runs):
>
>        289504.209769      task-clock:u (msec)        #  1.000 CPUs utilized             ( +-  0.19% )
>                    0      context-switches:u         #  0.000 K/sec
>                    0      cpu-migrations:u           #  0.000 K/sec
>                  598      page-faults:u              #  0.002 K/sec                     ( +-  0.03% )
>    1,323,835,488,984      cycles:u                   #  4.573 GHz                       ( +-  0.19% )  (30.77%)
>      562,658,682,055      instructions:u             #  0.43  insn per cycle            ( +-  0.00% )  (38.46%)
>       20,099,662,528      branches:u                 # 69.428 M/sec                     ( +-  0.00% )  (38.46%)
>            2,877,086      branch-misses:u            #  0.01% of all branches           ( +-  4.52% )  (38.46%)
>      180,899,297,017      L1-dcache-loads:u          # 624.859 M/sec                    ( +-  0.00% )  (38.46%)
>       40,209,140,089      L1-dcache-load-misses:u    # 22.23% of all L1-dcache hits     ( +-  0.00% )  (38.46%)
>          135,968,232      LLC-loads:u                #  0.470 M/sec                     ( +-  1.56% )  (30.77%)
>            6,704,890      LLC-load-misses:u          #  4.93% of all LL-cache hits      ( +-  1.92% )  (30.77%)
>      <not supported>      L1-icache-loads:u
>       74,955,673,747      L1-icache-load-misses:u                                       ( +-  0.00% )  (30.77%)
>      180,987,794,366      dTLB-loads:u               # 625.165 M/sec                    ( +-  0.00% )  (30.77%)
>                  835      dTLB-load-misses:u         #  0.00% of all dTLB cache hits    ( +- 14.35% )  (30.77%)
>            6,386,207      iTLB-loads:u               #  0.022 M/sec                     ( +-  0.42% )  (30.77%)
>           51,929,869      iTLB-load-misses:u         # 813.16% of all iTLB cache hits   ( +-  1.61% )  (30.77%)
>      <not supported>      L1-dcache-prefetches:u
>      <not supported>      L1-dcache-prefetch-misses:u
>
>        289.551551387 seconds time elapsed                                               ( +-  0.20% )
>
> A check of /proc/meminfo with the test program running shows the large mappings:
>
> ShmemPmdMapped:   471040 kB
>
> FAQ:
> ====
> Q: What kernel is the prototype based on?
> A: 4.14.0-rc7
>
> Q: What is the biggest issue you haven't addressed?
> A: Given this is a prototype, there are many.
>    Aside from the fact that I only map large pages for an executable of a
>    specific name ("testr"), the code must be integrated with large page size
>    support in the page cache as currently multiple iterations of an
>    executable would each use their own individually allocated THP pages and
>    those pages filled with data using kernel_read(), which allows for
>    performance characterization but would never be acceptable for a
>    production kernel.
>
>    A good example of the large page support required is the ext4 support
>    outlined in:
>
>    https://www.mail-archive.com/linux-block@vger.kernel.org/msg04012.html
>
>    There also need to be configuration options to enable this code at all,
>    likely only for file systems that support large pages, and more
>    reasonable fixes for the assumptions that all large THP pages are
>    anonymous assertions in rmap.c (for the prototype I just "#if 0" them out.)
>
> Q: Which processes get their text as large pages?
> A: At this point with this implementation it's any process with a read-only
>    text area of the proper size/alignment.
>
>    An attempt is made to align the address for non-MAP_FIXED addresses.
>
>    I do not make any attempt to move mappings that take up a majority of a
>    large page to a large page; I only map a large page if the address
>    aligns and the map size is larger than or equal to a large page.
>
> Q: Which architectures has this been tested on?
> A: At present, only x64.
>
> Q: How about architectures (ARM, for instance) with multiple large page
>    sizes that are reasonable for text mappings?
> A: At present a "large page" is just PMD size; it would be possible with
>    additional effort to allow for mapping using PUD-sized pages.
>
> Q: What about the use of non-PMD large page sizes (on non-x86 architectures)?
> A: I haven't looked into that; I don't have an answer as to how to best
>    map a page that wasn't sized to be a PMD or PUD.
>
> Signed-off-by: William Kucharski <william.kucharski@oracle.com>
>
> ===============================================================
>
> diff --git a/fs/hugetlbfs/inode.c b/fs/hugetlbfs/inode.c
> index ed113ea..f4ac381 100644
> --- a/fs/hugetlbfs/inode.c
> +++ b/fs/hugetlbfs/inode.c
> @@ -146,8 +146,8 @@ static int hugetlbfs_file_mmap(struct file *file, struct vm_area_struct *vma)
>  	if (vma->vm_pgoff & (~huge_page_mask(h) >> PAGE_SHIFT))
>  		return -EINVAL;
>
> -	vma_len = (loff_t)(vma->vm_end - vma->vm_start);
> -	len = vma_len + ((loff_t)vma->vm_pgoff << PAGE_SHIFT);
> +	vma_len = (loff_t)(vma->vm_end - vma->vm_start); /* length of VMA */
> +	len = vma_len + ((loff_t)vma->vm_pgoff << PAGE_SHIFT); /* add vma->vm_pgoff * PAGESIZE */
>  	/* check for overflow */
>  	if (len < vma_len)
>  		return -EINVAL;
> diff --git a/include/linux/huge_mm.h b/include/linux/huge_mm.h
> index 87067d2..353bec8 100644
> --- a/include/linux/huge_mm.h
> +++ b/include/linux/huge_mm.h
> @@ -80,13 +80,15 @@ extern struct kobj_attribute shmem_enabled_attr;
>  #define HPAGE_PMD_NR (1<<HPAGE_PMD_ORDER)
>
>  #ifdef CONFIG_TRANSPARENT_HUGEPAGE
> -#define HPAGE_PMD_SHIFT	PMD_SHIFT
> -#define HPAGE_PMD_SIZE	((1UL) << HPAGE_PMD_SHIFT)
> -#define HPAGE_PMD_MASK	(~(HPAGE_PMD_SIZE - 1))
> -
> -#define HPAGE_PUD_SHIFT	PUD_SHIFT
> -#define HPAGE_PUD_SIZE	((1UL) << HPAGE_PUD_SHIFT)
> -#define HPAGE_PUD_MASK	(~(HPAGE_PUD_SIZE - 1))
> +#define HPAGE_PMD_SHIFT		PMD_SHIFT
> +#define HPAGE_PMD_SIZE		((1UL) << HPAGE_PMD_SHIFT)
> +#define HPAGE_PMD_OFFSET	(HPAGE_PMD_SIZE - 1)
> +#define HPAGE_PMD_MASK		(~(HPAGE_PMD_OFFSET))
> +
> +#define HPAGE_PUD_SHIFT		PUD_SHIFT
> +#define HPAGE_PUD_SIZE		((1UL) << HPAGE_PUD_SHIFT)
> +#define HPAGE_PUD_OFFSET	(HPAGE_PUD_SIZE - 1)
> +#define HPAGE_PUD_MASK		(~(HPAGE_PUD_OFFSET))
>
>  extern bool is_vma_temporary_stack(struct vm_area_struct *vma);
>
> diff --git a/mm/huge_memory.c b/mm/huge_memory.c
> index 1981ed6..7b61c92 100644
> --- a/mm/huge_memory.c
> +++ b/mm/huge_memory.c
> @@ -445,6 +445,14 @@ subsys_initcall(hugepage_init);
>
>  static int __init setup_transparent_hugepage(char *str)
>  {
> +#if 1
> +	set_bit(TRANSPARENT_HUGEPAGE_FLAG,
> +		&transparent_hugepage_flags);
> +	clear_bit(TRANSPARENT_HUGEPAGE_REQ_MADV_FLAG,
> +		&transparent_hugepage_flags);
> +	printk("THP permanently set ON\n");
> +	return 1;
> +#else
>  	int ret = 0;
>  	if (!str)
>  		goto out;
> @@ -471,6 +479,7 @@ static int __init setup_transparent_hugepage(char *str)
>  	if (!ret)
>  		pr_warn("transparent_hugepage= cannot parse, ignored\n");
>  	return ret;
> +#endif
>  }
>  __setup("transparent_hugepage=", setup_transparent_hugepage);
>
> @@ -532,8 +541,11 @@ unsigned long thp_get_unmapped_area(struct file *filp, unsigned long addr,
>
>  	if (addr)
>  		goto out;
> +
> +#if 0
>  	if (!IS_DAX(filp->f_mapping->host) || !IS_ENABLED(CONFIG_FS_DAX_PMD))
>  		goto out;
> +#endif
>
>  	addr = __thp_get_unmapped_area(filp, len, off, flags, PMD_SIZE);
>  	if (addr)
> diff --git a/mm/memory.c b/mm/memory.c
> index a728bed..fc352d8 100644
> --- a/mm/memory.c
> +++ b/mm/memory.c
> @@ -3506,7 +3506,99 @@ late_initcall(fault_around_debugfs);
>   * fault_around_pages() value (and therefore to page order). This way it's
>   * easier to guarantee that we don't cross page table boundaries.
>   */
> -static int do_fault_around(struct vm_fault *vmf)
> +
> +#ifdef CONFIG_TRANSPARENT_HUGEPAGE
> +static
> +int do_fault_around_thp(struct vm_fault *vmf)
> +{
> +	struct file *file = vmf->vma->vm_file;
> +	unsigned long address = vmf->address;
> +	pgoff_t start_pgoff = vmf->pgoff;
> +	pgoff_t end_pgoff;
> +	int ret = VM_FAULT_FALLBACK;
> +	int off;
> +
> +	/*
> +	 * vmf->address will be the higher of (fault address & HPAGE_PMD_MASK)
> +	 * or the start of the VMA.
> +	 */
> +	vmf->address = max((address & HPAGE_PMD_MASK), vmf->vma->vm_start);
> +
> +	/*
> +	 * Not a candidate if the start address calculated above isn't
> +	 * properly aligned.
> +	 */
> +	if (vmf->address & HPAGE_PMD_OFFSET)
> +		goto dfa_thp_out;
> +
> +	off = ((address - vmf->address) >> PAGE_SHIFT) & (PTRS_PER_PTE - 1);
> +	start_pgoff -= off;
> +
> +	/*
> +	 * end_pgoff is either end of page table or end of vma
> +	 * or fault_around_pages() from start_pgoff, depending what is
> +	 * smallest.
> +	 */
> +	end_pgoff = start_pgoff -
> +		((vmf->address >> PAGE_SHIFT) & (PTRS_PER_PTE - 1)) +
> +		PTRS_PER_PTE - 1;
> +	end_pgoff = min3(end_pgoff, vma_pages(vmf->vma) + vmf->vma->vm_pgoff - 1,
> +		start_pgoff + PTRS_PER_PTE - 1);
> +
> +	/*
> +	 * Check to see if we could map this request with a large THP page
> +	 * instead.
> +	 */
> +	if (((strncmp(file->f_path.dentry->d_name.name, "testr", 5) == 0)) &&
> +	    pmd_none(*vmf->pmd) &&
> +	    ((end_pgoff - start_pgoff) >=
> +	    ((HPAGE_PMD_SIZE >> PAGE_SHIFT) - 1))) {
> +		struct page *page;
> +
> +		page = alloc_pages_vma(vmf->gfp_mask | __GFP_COMP |
> +			__GFP_NORETRY, HPAGE_PMD_ORDER, vmf->vma,
> +			vmf->address, numa_node_id(), 1);
> +
> +		if ((likely(page)) && (PageTransCompound(page))) {
> +			ssize_t bytes_read;
> +			void *pg_vaddr;
> +
> +			prep_transhuge_page(page);
> +			pg_vaddr = page_address(page);
> +
> +			if (likely(pg_vaddr)) {
> +				loff_t loff = (loff_t)
> +					(start_pgoff << PAGE_SHIFT);
> +				bytes_read = kernel_read(file, pg_vaddr,
> +					HPAGE_PMD_SIZE, &loff);
> +				VM_BUG_ON(bytes_read != HPAGE_PMD_SIZE);
> +
> +				smp_wmb(); /* See comment in __pte_alloc() */
> +				ret = alloc_set_pte(vmf, NULL, page);
> +
> +				if (likely(ret == 0)) {
> +					VM_BUG_ON_PAGE(pmd_none(*vmf->pmd),
> +						page);
> +					vmf->page = page;
> +					ret = VM_FAULT_NOPAGE;
> +					goto dfa_thp_out;
> +				}
> +			}
> +
> +			put_page(page);
> +		}
> +	}
> +
> +dfa_thp_out:
> +	vmf->address = address;
> +	VM_BUG_ON(vmf->pte != NULL);
> +	return ret;
> +}
> +#endif /* CONFIG_TRANSPARENT_HUGEPAGE */
> +
> +
> +static
> +int do_fault_around(struct vm_fault *vmf)
>  {
>  	unsigned long address = vmf->address, nr_pages, mask;
>  	pgoff_t start_pgoff = vmf->pgoff;
> @@ -3566,6 +3658,21 @@ static int do_read_fault(struct vm_fault *vmf)
>  	struct vm_area_struct *vma = vmf->vma;
>  	int ret = 0;
>
> +#ifdef CONFIG_TRANSPARENT_HUGEPAGE
> +	/*
> +	 * Check to see if we could map this request with a large THP page
> +	 * instead.
> +	 */
> +	if ((vma_pages(vmf->vma) >= PTRS_PER_PMD) &&
> +	    ((strncmp(vmf->vma->vm_file->f_path.dentry->d_name.name,
> +	    "testr", 5)) == 0)) {
> +		ret = do_fault_around_thp(vmf);
> +
> +		if (ret == VM_FAULT_NOPAGE)
> +			return ret;
> +	}
> +#endif /* CONFIG_TRANSPARENT_HUGEPAGE */
> +
>  	/*
>  	 * Let's call ->map_pages() first and use ->fault() as fallback
>  	 * if page by the offset is not ready to be mapped (cold cache or
> diff --git a/mm/mmap.c b/mm/mmap.c
> index 680506f..1c281d7 100644
> --- a/mm/mmap.c
> +++ b/mm/mmap.c
> @@ -1327,6 +1327,10 @@ unsigned long do_mmap(struct file *file, unsigned long addr,
>  	struct mm_struct *mm = current->mm;
>  	int pkey = 0;
>
> +#ifdef CONFIG_TRANSPARENT_HUGEPAGE
> +	unsigned long thp_maywrite = VM_MAYWRITE;
> +#endif
> +
>  	*populate = 0;
>
>  	if (!len)
> @@ -1361,7 +1365,32 @@ unsigned long do_mmap(struct file *file, unsigned long addr,
>  	/* Obtain the address to map to. we verify (or select) it and ensure
>  	 * that it represents a valid section of the address space.
>  	 */
> -	addr = get_unmapped_area(file, addr, len, pgoff, flags);
> +
> +
> +#ifdef CONFIG_TRANSPARENT_HUGEPAGE
> +	/*
> +	 *
> +	 * If THP is enabled, and it's a read-only executable that is
> +	 * MAP_PRIVATE mapped, call the appropriate thp function to perhaps get a
> +	 * large page aligned virtual address, otherwise use the normal routine.
> +	 *
> +	 * Note the THP routine will return a normal page size aligned start
> +	 * address in some cases.
> +	 */
> +	if ((prot & PROT_READ) && (prot & PROT_EXEC) && (!(prot & PROT_WRITE)) &&
> +	    (len >= HPAGE_PMD_SIZE) && (flags & MAP_PRIVATE) &&
> +	    ((!(flags & MAP_FIXED)) || (!(addr & HPAGE_PMD_OFFSET)))) {
> +		addr = thp_get_unmapped_area(file, addr, len, pgoff,
> +			flags);
> +		if (addr && (!(addr & HPAGE_PMD_OFFSET)))
> +			thp_maywrite = 0;
> +	} else {
> +#endif
> +		addr = get_unmapped_area(file, addr, len, pgoff, flags);
> +#ifdef CONFIG_TRANSPARENT_HUGEPAGE
> +	}
> +#endif
> +
>  	if (offset_in_page(addr))
>  		return addr;
>
> @@ -1376,7 +1405,11 @@ unsigned long do_mmap(struct file *file, unsigned long addr,
>  	 * of the memory object, so we don't do any here.
>  	 */
>  	vm_flags |= calc_vm_prot_bits(prot, pkey) | calc_vm_flag_bits(flags) |
> +#ifdef CONFIG_TRANSPARENT_HUGEPAGE
> +			mm->def_flags | VM_MAYREAD | thp_maywrite | VM_MAYEXEC;
> +#else
>  			mm->def_flags | VM_MAYREAD | VM_MAYWRITE | VM_MAYEXEC;
> +#endif
>
>  	if (flags & MAP_LOCKED)
>  		if (!can_do_mlock())
> diff --git a/mm/rmap.c b/mm/rmap.c
> index b874c47..4fc24f8 100644
> --- a/mm/rmap.c
> +++ b/mm/rmap.c
> @@ -1184,7 +1184,9 @@ void page_add_file_rmap(struct page *page, bool compound)
>  		}
>  		if (!atomic_inc_and_test(compound_mapcount_ptr(page)))
>  			goto out;
> +#if 0
>  		VM_BUG_ON_PAGE(!PageSwapBacked(page), page);
> +#endif
>  		__inc_node_page_state(page, NR_SHMEM_PMDMAPPED);
>  	} else {
>  		if (PageTransCompound(page) && page_mapping(page)) {
> @@ -1224,7 +1226,9 @@ static void page_remove_file_rmap(struct page *page, bool compound)
>  		}
>  		if (!atomic_add_negative(-1, compound_mapcount_ptr(page)))
>  			goto out;
> +#if 0
>  		VM_BUG_ON_PAGE(!PageSwapBacked(page), page);
> +#endif
>  		__dec_node_page_state(page, NR_SHMEM_PMDMAPPED);
>  	} else {
>  		if (!atomic_add_negative(-1, &page->_mapcount))
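[Aside: the HPAGE_PMD_OFFSET/HPAGE_PMD_MASK pair the patch adds to
huge_mm.h is the usual power-of-two alignment idiom. A standalone
illustration of the test do_mmap() now applies to candidate addresses;
the 2 MB constant assumes x86-64 and the sample addresses are arbitrary:]

#include <stdio.h>
#include <stdint.h>

#define HPAGE_PMD_SIZE		(2UL << 20)		/* 2 MB on x86-64 */
#define HPAGE_PMD_OFFSET	(HPAGE_PMD_SIZE - 1)
#define HPAGE_PMD_MASK		(~HPAGE_PMD_OFFSET)

int main(void)
{
	uintptr_t addrs[] = { 0x400000, 0x600000, 0x601000 };

	for (int i = 0; i < 3; i++) {
		uintptr_t a = addrs[i];

		/*
		 * The do_mmap() hunk rejects MAP_FIXED addresses with any
		 * bit of HPAGE_PMD_OFFSET set; the mask rounds down.
		 */
		printf("%#lx: PMD-aligned=%d, rounded down=%#lx\n",
		       (unsigned long)a, !(a & HPAGE_PMD_OFFSET),
		       (unsigned long)(a & HPAGE_PMD_MASK));
	}
	return 0;
}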
> On May 17, 2018, at 1:57 AM, Michal Hocko <mhocko@kernel.org> wrote:
>
> [CCing Kirill and fs-devel]
>
> On Mon 14-05-18 07:12:13, William Kucharski wrote:
>> One of the downsides of THP as currently implemented is that it only supports
>> large page mappings for anonymous pages.
>
> There is a support for shmem merged already. ext4 was next on the plan
> AFAIR but I haven't seen any patches and Kirill was busy with other
> stuff IIRC.

I couldn't find anything that would specifically map text pages with large
pages, so perhaps this could be integrated with that, or I may have simply
missed changes that would ultimately provide that functionality.

>> I embarked upon this prototype on the theory that it would be advantageous to
>> be able to map large ranges of read-only text pages using THP as well.
>
> Can the fs really support THP only for read mappings? What if those
> pages are to be shared in a writable mapping as well? In other words
> can this all work without a full THP support for a particular fs?

The integration with the page cache would indeed require filesystem support.

The end result I'd like to see is full R/W support for large THP pages; I
thought the RO text mapping proof of concept worthwhile to see what kind of
results we might see and what the thoughts of the community were.

Thanks for the feedback.

--
Bill
On Mon, May 14, 2018 at 07:12:13AM -0600, William Kucharski wrote:
> One of the downsides of THP as currently implemented is that it only supports
> large page mappings for anonymous pages.

It does also support shmem.

> I embarked upon this prototype on the theory that it would be advantageous to
> be able to map large ranges of read-only text pages using THP as well.

I'm certain it is. The other thing I believe is true that we should be
able to share page tables (my motivation is thousands of processes each
mapping the same ridiculously-sized file). I was hoping this prototype
would have code that would be stealable for that purpose, but you've
gone in a different direction. Which is fine for a prototype; you've
produced useful numbers.

> As currently implemented for test purposes, the prototype will only use large
> pages to map an executable with a particular filename ("testr"), enabling easy
> comparison of the same executable using 4K and 2M (x64) pages on the same
> kernel. It is understood that this is just a proof of concept implementation
> and much more work regarding enabling the feature and overall system usage of
> it would need to be done before it was submitted as a kernel patch. However, I
> felt it would be worthy to send it out as an RFC so I can find out whether
> there are huge objections from the community to doing this at all, or a better
> understanding of the major concerns that must be assuaged before it would even
> be considered. I currently hardcode CONFIG_TRANSPARENT_HUGEPAGE to the
> equivalent of "always" and bypass some checks for anonymous pages by simply
> #ifdefing the code out; obviously I would need to determine the right thing to
> do in those cases.

Understood that it's completely inappropriate for merging as it stands ;-)

I think the first step is to get variable sized pages in the page cache
working. Then the map-around functionality can probably just notice if
they're big enough to map with a PMD and make that happen. I don't
immediately see anything from this PoC that can be used, but it at least
gives us a good point of comparison for any future work.

> 4K Pages:
> =========
>
>   180,990,026,447      dTLB-loads:u          # 589.440 M/sec                  ( +-  0.00% )  (30.77%)
>           707,373      dTLB-load-misses:u    #  0.00% of all dTLB cache hits  ( +-  4.62% )  (30.77%)
>         5,583,675      iTLB-loads:u          #  0.018 M/sec                   ( +-  0.31% )  (30.77%)
>     1,219,514,499      iTLB-load-misses:u    # 21840.71% of all iTLB cache hits ( +-  0.01% )  (30.77%)
>
>     307.093088771 seconds time elapsed                                        ( +-  0.20% )
>
> 2M Pages:
> =========
>
>   180,987,794,366      dTLB-loads:u          # 625.165 M/sec                  ( +-  0.00% )  (30.77%)
>               835      dTLB-load-misses:u    #  0.00% of all dTLB cache hits  ( +- 14.35% )  (30.77%)
>         6,386,207      iTLB-loads:u          #  0.022 M/sec                   ( +-  0.42% )  (30.77%)
>        51,929,869      iTLB-load-misses:u    # 813.16% of all iTLB cache hits ( +-  1.61% )  (30.77%)
>
>     289.551551387 seconds time elapsed                                        ( +-  0.20% )

I think that really tells the story. We almost entirely eliminate dTLB
load misses (down to almost 0.1%) and iTLB load misses drop to 4% of what
they were. Does this test represent any kind of real world load, or is it
designed to show the best possible improvement?

> Q: How about architectures (ARM, for instance) with multiple large page
>    sizes that are reasonable for text mappings?
> A: At present a "large page" is just PMD size; it would be possible with
>    additional effort to allow for mapping using PUD-sized pages.
>
> Q: What about the use of non-PMD large page sizes (on non-x86 architectures)?
> A: I haven't looked into that; I don't have an answer as to how to best
>    map a page that wasn't sized to be a PMD or PUD.

Yes, we really make no effort to support the kind of arbitrary page sizes
supported by IA64 or PA-RISC. ARM might be interesting; I think you can
mix 64k and 4k pages fairly arbitrarily (judging from the A57 docs). We
don't have any generic interface for inserting TLB entries that are
intermediate in size between a single page and a PMD, so we'll have to
devise something like that.

I can't find any information on what page sizes SPARC supports. Maybe you
could point me at a reference? All I've managed to find is the
architecture manuals for SPARC which believe it is not their purpose to
mandate an MMU.
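[Aside: the ratios Matthew cites fall straight out of the counters quoted
above; a throwaway check:]

#include <stdio.h>

int main(void)
{
	/* dTLB misses: 835 in the 2M run vs. 707,373 in the 4K run */
	printf("dTLB-load-misses, 2M vs 4K: %.2f%%\n",
	       100.0 * 835 / 707373);
	/* iTLB misses: 51,929,869 vs. 1,219,514,499 */
	printf("iTLB-load-misses, 2M vs 4K: %.2f%%\n",
	       100.0 * 51929869 / 1219514499);
	return 0;
}

[This prints 0.12% and 4.26%, matching "almost 0.1%" and "4%".]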
On 17 May 18 08:23, Matthew Wilcox wrote:
>
> I can't find any information on what page sizes SPARC supports.
> Maybe you could point me at a reference? All I've managed to find is
> the architecture manuals for SPARC which believe it is not their purpose
> to mandate an MMU.
>

Page sizes of 8K, 64K, 512K, 4M, 32M, 256M, 2G, 16G are allowed
architecturally -- some of these aren't present in some SPARC machines.
Generally 8K, 64K, 4M, 256M, 2G, 16G are present on modern machines.

Also note that the SPARC THP page size is 8M (so that it is PMD aligned).

Larry
> On May 17, 2018, at 9:23 AM, Matthew Wilcox <willy@infradead.org> wrote:
>
> I'm certain it is. The other thing I believe is true that we should be
> able to share page tables (my motivation is thousands of processes each
> mapping the same ridiculously-sized file). I was hoping this prototype
> would have code that would be stealable for that purpose, but you've
> gone in a different direction. Which is fine for a prototype; you've
> produced useful numbers.

Definitely, and that's why I mentioned integration with the page cache
would be crucial. This prototype allocates pages for each invocation of
the executable, which would never fly on a real system.

> I think the first step is to get variable sized pages in the page cache
> working. Then the map-around functionality can probably just notice if
> they're big enough to map with a PMD and make that happen. I don't immediately
> see anything from this PoC that can be used, but it at least gives us a
> good point of comparison for any future work.

Yes, that's the first step to getting actual usable code designed and
working; this prototype was designed just to get something working and
to get a first swag at some performance numbers.

I do think that adding code to map larger pages as a fault_around variant
is a good start, as the code is already going to potentially map in
fault_around_bytes from the file to satisfy the fault. It makes sense
to extend that paradigm to be able to tune when large pages might be
read in and/or mapped using large pages extant in the page cache.

Filesystem support becomes more important once writing to large pages
is allowed.

> I think that really tells the story. We almost entirely eliminate
> dTLB load misses (down to almost 0.1%) and iTLB load misses drop to 4%
> of what they were. Does this test represent any kind of real world load,
> or is it designed to show the best possible improvement?

It's admittedly designed to thrash the caches pretty hard and doesn't
represent any type of actual workload I'm aware of. It basically calls
various routines within a huge text area while scribbling to automatic
arrays declared at the top of each routine. It wasn't designed as a worst
case scenario, but rather as one that would hopefully show some obvious
degree of difference when large text pages were supported.

Thanks for your comments.

--
Bill
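[Aside: from that description, a toy analog of the "testr" microbenchmark
might look like the sketch below. Purely illustrative: the real program's
~460 MB text segment is generated code, and no hand-written example
reproduces its I$ and iTLB footprint.]

#include <stdio.h>

/*
 * Stand-in for one of the very many routines spread across the huge
 * text area; each one "scribbles to automatic arrays".
 */
static void routine(void)
{
	volatile char scratch[4096];

	for (unsigned i = 0; i < sizeof(scratch); i++)
		scratch[i] = (char)i;
}

int main(void)
{
	/*
	 * The real test jumps between many distinct routines; a single
	 * routine exercises the D$ side but not the huge text.
	 */
	for (long i = 0; i < 50000000L; i++)
		routine();
	puts("done");
	return 0;
}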
On Thu, May 17, 2018 at 10:31 AM, William Kucharski
<william.kucharski@oracle.com> wrote:
>
>> On May 17, 2018, at 9:23 AM, Matthew Wilcox <willy@infradead.org> wrote:
>>
>> I'm certain it is. The other thing I believe is true that we should be
>> able to share page tables (my motivation is thousands of processes each
>> mapping the same ridiculously-sized file). I was hoping this prototype
>> would have code that would be stealable for that purpose, but you've
>> gone in a different direction. Which is fine for a prototype; you've
>> produced useful numbers.
>
> Definitely, and that's why I mentioned integration with the page cache
> would be crucial. This prototype allocates pages for each invocation of
> the executable, which would never fly on a real system.
>
>> I think the first step is to get variable sized pages in the page cache
>> working. Then the map-around functionality can probably just notice if
>> they're big enough to map with a PMD and make that happen. I don't immediately
>> see anything from this PoC that can be used, but it at least gives us a
>> good point of comparison for any future work.
>
> Yes, that's the first step to getting actual usable code designed and
> working; this prototype was designed just to get something working and
> to get a first swag at some performance numbers.
>
> I do think that adding code to map larger pages as a fault_around variant
> is a good start as the code is already going to potentially map in
> fault_around_bytes from the file to satisfy the fault. It makes sense
> to extend that paradigm to be able to tune when large pages might be
> read in and/or mapped using large pages extant in the page cache.
>
> Filesystem support becomes more important once writing to large pages
> is allowed.
>
>> I think that really tells the story. We almost entirely eliminate
>> dTLB load misses (down to almost 0.1%) and iTLB load misses drop to 4%
>> of what they were. Does this test represent any kind of real world load,
>> or is it designed to show the best possible improvement?
>
> It's admittedly designed to thrash the caches pretty hard and doesn't
> represent any type of actual workload I'm aware of. It basically calls
> various routines within a huge text area while scribbling to automatic
> arrays declared at the top of each routine. It wasn't designed as a worst
> case scenario, but rather as one that would hopefully show some obvious
> degree of difference when large text pages were supported.
>
> Thanks for your comments.
>
> -- Bill

We (Facebook) have quite a few real workloads that take advantage of text
on huge pages. For some of them, we can see savings close to the number
above. Currently, we "hugify" the text region through some hack in user
space. We are very interested in supporting it natively in the kernel,
because the hack breaks other features.

We also tested enabling text on huge pages through shmem, and it does work.
The downside is that it requires putting the whole file in memory (or at
least in swap). This doesn't work very well for large binaries with GBs of
debugging data.

Song
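[Aside: the shmem route Song describes can be demonstrated from userspace.
A hedged sketch only: it assumes memfd_create(2) is available, that
/sys/kernel/mm/transparent_hugepage/shmem_enabled permits madvise-driven
huge pages, and the "text" name and 4 MB size are arbitrary.]

#include <stdio.h>
#include <unistd.h>
#include <sys/mman.h>
#include <sys/syscall.h>

int main(void)
{
	size_t len = 4UL << 20;		/* 4 MB: a PMD-size multiple */
	long sum = 0;
	int fd = syscall(SYS_memfd_create, "text", 0);

	if (fd < 0 || ftruncate(fd, len) < 0)
		return 1;

	/* back a read/exec region with shmem, as a "hugified" text stand-in */
	char *p = mmap(NULL, len, PROT_READ | PROT_EXEC, MAP_SHARED, fd, 0);
	if (p == MAP_FAILED)
		return 1;

	/* ask for THP on the shmem range, then fault it in */
	madvise(p, len, MADV_HUGEPAGE);
	for (size_t off = 0; off < len; off += 4096)
		sum += p[off];

	printf("mapped %zu bytes at %p (sum %ld); ShmemPmdMapped in\n"
	       "/proc/meminfo should grow if shmem THP kicked in\n",
	       len, (void *)p, sum);
	return 0;
}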