
[v3,1/2] mm: migration: fix migration of huge PMD shared pages

Message ID 20180821205902.21223-2-mike.kravetz@oracle.com (mailing list archive)
State New, archived
Series huge_pmd_unshare migration and flushing

Commit Message

Mike Kravetz Aug. 21, 2018, 8:59 p.m. UTC
The page migration code employs try_to_unmap() to try and unmap the
source page.  This is accomplished by using rmap_walk to find all
vmas where the page is mapped.  This search stops when page mapcount
is zero.  For shared PMD huge pages, the page map count is always 1
no matter the number of mappings.  Shared mappings are tracked via
the reference count of the PMD page.  Therefore, try_to_unmap stops
prematurely and does not completely unmap all mappings of the source
page.

This problem can result in data corruption as writes to the original
source page can happen after contents of the page are copied to the
target page.  Hence, data is lost.

This problem was originally seen as DB corruption of shared global
areas after a huge page was soft offlined due to ECC memory errors.
DB developers noticed they could reproduce the issue by (hotplug)
offlining memory used to back huge pages.  A simple testcase can
reproduce the problem by creating a shared PMD mapping (note that
this must be at least PUD_SIZE in size and PUD_SIZE aligned (1GB on
x86)), and using migrate_pages() to migrate process pages between
nodes while continually writing to the huge pages being migrated.
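
A rough sketch of such a reproducer is below.  This is illustrative only
and not part of the patch: it assumes x86-64 with 2MB default huge pages
(so PUD_SIZE is 1GB and CONFIG_ARCH_WANT_HUGE_PMD_SHARE is set), at least
512 huge pages available on each of two NUMA nodes, and the libnuma
wrapper for migrate_pages() (link with -lnuma).  The extra mmap work is
only there to get a shared hugetlb mapping that is PUD_SIZE in size and
PUD_SIZE aligned, as described above.

#define _GNU_SOURCE
#include <numaif.h>
#include <signal.h>
#include <stdint.h>
#include <stdio.h>
#include <string.h>
#include <sys/mman.h>
#include <sys/wait.h>
#include <unistd.h>

#define MAP_SZ	(1UL << 30)	/* PUD_SIZE on x86-64 */

int main(void)
{
	unsigned long node0 = 0x1, node1 = 0x2;	/* nodemasks for nodes 0, 1 */
	uintptr_t aligned;
	void *reserve;
	char *area;
	pid_t child;
	int i;

	/* Reserve extra address space so we can pick a 1GB aligned start. */
	reserve = mmap(NULL, 2 * MAP_SZ, PROT_NONE,
		       MAP_PRIVATE | MAP_ANONYMOUS | MAP_NORESERVE, -1, 0);
	if (reserve == MAP_FAILED)
		return 1;
	aligned = ((uintptr_t)reserve + MAP_SZ - 1) & ~(MAP_SZ - 1);

	/* Shared hugetlb mapping, PUD_SIZE in size and PUD_SIZE aligned. */
	area = mmap((void *)aligned, MAP_SZ, PROT_READ | PROT_WRITE,
		    MAP_SHARED | MAP_ANONYMOUS | MAP_HUGETLB | MAP_FIXED,
		    -1, 0);
	if (area == MAP_FAILED)
		return 1;
	memset(area, 0, MAP_SZ);	/* fault in the huge pages */

	child = fork();
	if (child == 0) {
		/* Child: continually write to the huge pages being migrated. */
		for (;;)
			memset(area, 0xab, MAP_SZ);
	}

	/* Parent: bounce the child's pages between nodes 0 and 1. */
	for (i = 0; i < 1000; i++) {
		if (migrate_pages(child, sizeof(node0) * 8,
				  (i & 1) ? &node1 : &node0,
				  (i & 1) ? &node0 : &node1) < 0)
			perror("migrate_pages");
	}

	kill(child, SIGKILL);
	waitpid(child, NULL, 0);
	return 0;
}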

To fix, have the try_to_unmap_one routine check for huge PMD sharing
by calling huge_pmd_unshare for hugetlbfs huge pages.  If it is a
shared mapping it will be 'unshared' which removes the page table
entry and drops the reference on the PMD page.  After this, flush
caches and TLB.

mmu notifiers are called before locking page tables, but we can not
be sure of PMD sharing until page tables are locked.  Therefore,
check for the possibility of PMD sharing before locking so that
notifiers can prepare for the worst possible case.

Fixes: 39dde65c9940 ("shared page table for hugetlb page")
Cc: stable@vger.kernel.org
Signed-off-by: Mike Kravetz <mike.kravetz@oracle.com>
---
 include/linux/hugetlb.h | 14 ++++++++++++++
 mm/hugetlb.c            | 40 +++++++++++++++++++++++++++++++++++++++
 mm/rmap.c               | 42 ++++++++++++++++++++++++++++++++++++++---
 3 files changed, 93 insertions(+), 3 deletions(-)

Comments

kernel test robot Aug. 21, 2018, 10:03 p.m. UTC | #1
Hi Mike,

I love your patch! Yet something to improve:

[auto build test ERROR on linus/master]
[also build test ERROR on v4.18 next-20180821]
[if your patch is applied to the wrong git tree, please drop us a note to help improve the system]

url:    https://github.com/0day-ci/linux/commits/Mike-Kravetz/huge_pmd_unshare-migration-and-flushing/20180822-050255
config: i386-tinyconfig (attached as .config)
compiler: gcc-7 (Debian 7.3.0-16) 7.3.0
reproduce:
        # save the attached .config to linux build tree
        make ARCH=i386 

All errors (new ones prefixed by >>):

   arch/x86/mm/fault.o: In function `huge_pmd_sharing_possible':
>> fault.c:(.text+0xa06): multiple definition of `huge_pmd_sharing_possible'
   arch/x86/mm/init_32.o:init_32.c:(.text+0x0): first defined here
   arch/x86/mm/pgtable.o: In function `huge_pmd_sharing_possible':
   pgtable.c:(.text+0x4): multiple definition of `huge_pmd_sharing_possible'
   arch/x86/mm/init_32.o:init_32.c:(.text+0x0): first defined here
   kernel/fork.o: In function `huge_pmd_sharing_possible':
   fork.c:(.text+0x309): multiple definition of `huge_pmd_sharing_possible'
   arch/x86/mm/init_32.o:init_32.c:(.text+0x0): first defined here
   kernel/sysctl.o: In function `huge_pmd_sharing_possible':
   sysctl.c:(.text+0x0): multiple definition of `huge_pmd_sharing_possible'
   arch/x86/mm/init_32.o:init_32.c:(.text+0x0): first defined here
   kernel/sched/core.o: In function `huge_pmd_sharing_possible':
   core.c:(.text+0x299): multiple definition of `huge_pmd_sharing_possible'
   arch/x86/mm/init_32.o:init_32.c:(.text+0x0): first defined here
   kernel/sched/loadavg.o: In function `huge_pmd_sharing_possible':
   loadavg.c:(.text+0x0): multiple definition of `huge_pmd_sharing_possible'
   arch/x86/mm/init_32.o:init_32.c:(.text+0x0): first defined here
   kernel/sched/clock.o: In function `huge_pmd_sharing_possible':
   clock.c:(.text+0x0): multiple definition of `huge_pmd_sharing_possible'
   arch/x86/mm/init_32.o:init_32.c:(.text+0x0): first defined here
   kernel/sched/cputime.o: In function `huge_pmd_sharing_possible':
   cputime.c:(.text+0x0): multiple definition of `huge_pmd_sharing_possible'
   arch/x86/mm/init_32.o:init_32.c:(.text+0x0): first defined here
   kernel/sched/idle.o: In function `huge_pmd_sharing_possible':
   idle.c:(.text+0x36): multiple definition of `huge_pmd_sharing_possible'
   arch/x86/mm/init_32.o:init_32.c:(.text+0x0): first defined here
   kernel/sched/fair.o: In function `huge_pmd_sharing_possible':
   fair.c:(.text+0x864): multiple definition of `huge_pmd_sharing_possible'
   arch/x86/mm/init_32.o:init_32.c:(.text+0x0): first defined here
   kernel/sched/rt.o: In function `huge_pmd_sharing_possible':
   rt.c:(.text+0x72b): multiple definition of `huge_pmd_sharing_possible'
   arch/x86/mm/init_32.o:init_32.c:(.text+0x0): first defined here
   kernel/sched/deadline.o: In function `huge_pmd_sharing_possible':
   deadline.c:(.text+0xac7): multiple definition of `huge_pmd_sharing_possible'
   arch/x86/mm/init_32.o:init_32.c:(.text+0x0): first defined here
   kernel/sched/wait.o: In function `huge_pmd_sharing_possible':
   wait.c:(.text+0x16e): multiple definition of `huge_pmd_sharing_possible'
   arch/x86/mm/init_32.o:init_32.c:(.text+0x0): first defined here
   kernel/sched/wait_bit.o: In function `huge_pmd_sharing_possible':
   wait_bit.c:(.text+0x7b): multiple definition of `huge_pmd_sharing_possible'
   arch/x86/mm/init_32.o:init_32.c:(.text+0x0): first defined here
   kernel/sched/swait.o: In function `huge_pmd_sharing_possible':
   swait.c:(.text+0x4): multiple definition of `huge_pmd_sharing_possible'
   arch/x86/mm/init_32.o:init_32.c:(.text+0x0): first defined here
   kernel/sched/completion.o: In function `huge_pmd_sharing_possible':
   completion.c:(.text+0x4): multiple definition of `huge_pmd_sharing_possible'
   arch/x86/mm/init_32.o:init_32.c:(.text+0x0): first defined here
   mm/filemap.o: In function `huge_pmd_sharing_possible':
   filemap.c:(.text+0x3ca): multiple definition of `huge_pmd_sharing_possible'
   arch/x86/mm/init_32.o:init_32.c:(.text+0x0): first defined here
   mm/page_alloc.o: In function `huge_pmd_sharing_possible':
   page_alloc.c:(.text+0xa95): multiple definition of `huge_pmd_sharing_possible'
   arch/x86/mm/init_32.o:init_32.c:(.text+0x0): first defined here
   mm/swap.o: In function `huge_pmd_sharing_possible':
   swap.c:(.text+0x551): multiple definition of `huge_pmd_sharing_possible'
   arch/x86/mm/init_32.o:init_32.c:(.text+0x0): first defined here
   mm/vmscan.o: In function `huge_pmd_sharing_possible':
   vmscan.c:(.text+0x5bb): multiple definition of `huge_pmd_sharing_possible'
   arch/x86/mm/init_32.o:init_32.c:(.text+0x0): first defined here
   mm/shmem.o: In function `huge_pmd_sharing_possible':
   shmem.c:(.text+0x6d): multiple definition of `huge_pmd_sharing_possible'
   arch/x86/mm/init_32.o:init_32.c:(.text+0x0): first defined here
   mm/util.o: In function `huge_pmd_sharing_possible':
   util.c:(.text+0xc): multiple definition of `huge_pmd_sharing_possible'
   arch/x86/mm/init_32.o:init_32.c:(.text+0x0): first defined here
   mm/compaction.o: In function `huge_pmd_sharing_possible':
   compaction.c:(.text+0x0): multiple definition of `huge_pmd_sharing_possible'
   arch/x86/mm/init_32.o:init_32.c:(.text+0x0): first defined here
   mm/debug.o: In function `huge_pmd_sharing_possible':
   debug.c:(.text+0x0): multiple definition of `huge_pmd_sharing_possible'
   arch/x86/mm/init_32.o:init_32.c:(.text+0x0): first defined here
   mm/gup.o: In function `huge_pmd_sharing_possible':
   gup.c:(.text+0x17c): multiple definition of `huge_pmd_sharing_possible'
   arch/x86/mm/init_32.o:init_32.c:(.text+0x0): first defined here
   mm/memory.o: In function `huge_pmd_sharing_possible':
   memory.c:(.text+0x5f9): multiple definition of `huge_pmd_sharing_possible'
   arch/x86/mm/init_32.o:init_32.c:(.text+0x0): first defined here
   mm/mincore.o: In function `huge_pmd_sharing_possible':
   mincore.c:(.text+0x150): multiple definition of `huge_pmd_sharing_possible'
   arch/x86/mm/init_32.o:init_32.c:(.text+0x0): first defined here
   mm/mlock.o: In function `huge_pmd_sharing_possible':
   mlock.c:(.text+0x245): multiple definition of `huge_pmd_sharing_possible'
   arch/x86/mm/init_32.o:init_32.c:(.text+0x0): first defined here
   mm/mmap.o: In function `huge_pmd_sharing_possible':
   mmap.c:(.text+0x565): multiple definition of `huge_pmd_sharing_possible'
   arch/x86/mm/init_32.o:init_32.c:(.text+0x0): first defined here
   mm/mprotect.o: In function `huge_pmd_sharing_possible':
   mprotect.c:(.text+0x39): multiple definition of `huge_pmd_sharing_possible'
   arch/x86/mm/init_32.o:init_32.c:(.text+0x0): first defined here
   mm/mremap.o: In function `huge_pmd_sharing_possible':
   mremap.c:(.text+0xf2): multiple definition of `huge_pmd_sharing_possible'
   arch/x86/mm/init_32.o:init_32.c:(.text+0x0): first defined here
   mm/page_vma_mapped.o: In function `huge_pmd_sharing_possible':
   page_vma_mapped.c:(.text+0x0): multiple definition of `huge_pmd_sharing_possible'
   arch/x86/mm/init_32.o:init_32.c:(.text+0x0): first defined here
   mm/pagewalk.o: In function `huge_pmd_sharing_possible':
   pagewalk.c:(.text+0x13d): multiple definition of `huge_pmd_sharing_possible'
   arch/x86/mm/init_32.o:init_32.c:(.text+0x0): first defined here
   mm/rmap.o: In function `huge_pmd_sharing_possible':
   rmap.c:(.text+0x3bb): multiple definition of `huge_pmd_sharing_possible'
   arch/x86/mm/init_32.o:init_32.c:(.text+0x0): first defined here

---
0-DAY kernel test infrastructure                Open Source Technology Center
https://lists.01.org/pipermail/kbuild-all                   Intel Corporation
Mike Kravetz Aug. 21, 2018, 11:06 p.m. UTC | #2
On 08/21/2018 03:03 PM, kbuild test robot wrote:
> Hi Mike,
> 
> I love your patch! Yet something to improve:
> 
> [auto build test ERROR on linus/master]
> [also build test ERROR on v4.18 next-20180821]
> [if your patch is applied to the wrong git tree, please drop us a note to help improve the system]
> 
> url:    https://github.com/0day-ci/linux/commits/Mike-Kravetz/huge_pmd_unshare-migration-and-flushing/20180822-050255
> config: i386-tinyconfig (attached as .config)
> compiler: gcc-7 (Debian 7.3.0-16) 7.3.0
> reproduce:
>         # save the attached .config to linux build tree
>         make ARCH=i386 

Oops, simple build fix addressed in updated patch below.

From 57b7d63d88f65b1c385a5f15c7cd5fa07513e955 Mon Sep 17 00:00:00 2001
From: Mike Kravetz <mike.kravetz@oracle.com>
Date: Tue, 21 Aug 2018 10:50:17 -0700
Subject: [PATCH v4 1/2] mm: migration: fix migration of huge PMD shared pages

The page migration code employs try_to_unmap() to try and unmap the
source page.  This is accomplished by using rmap_walk to find all
vmas where the page is mapped.  This search stops when page mapcount
is zero.  For shared PMD huge pages, the page map count is always 1
no matter the number of mappings.  Shared mappings are tracked via
the reference count of the PMD page.  Therefore, try_to_unmap stops
prematurely and does not completely unmap all mappings of the source
page.

This problem can result in data corruption as writes to the original
source page can happen after contents of the page are copied to the
target page.  Hence, data is lost.

This problem was originally seen as DB corruption of shared global
areas after a huge page was soft offlined due to ECC memory errors.
DB developers noticed they could reproduce the issue by (hotplug)
offlining memory used to back huge pages.  A simple testcase can
reproduce the problem by creating a shared PMD mapping (note that
this must be at least PUD_SIZE in size and PUD_SIZE aligned (1GB on
x86)), and using migrate_pages() to migrate process pages between
nodes while continually writing to the huge pages being migrated.

To fix, have the try_to_unmap_one routine check for huge PMD sharing
by calling huge_pmd_unshare for hugetlbfs huge pages.  If it is a
shared mapping it will be 'unshared' which removes the page table
entry and drops the reference on the PMD page.  After this, flush
caches and TLB.

mmu notifiers are called before locking page tables, but we can not
be sure of PMD sharing until page tables are locked.  Therefore,
check for the possibility of PMD sharing before locking so that
notifiers can prepare for the worst possible case.

Fixes: 39dde65c9940 ("shared page table for hugetlb page")
Cc: stable@vger.kernel.org
Signed-off-by: Mike Kravetz <mike.kravetz@oracle.com>
---
 include/linux/hugetlb.h | 14 ++++++++++++++
 mm/hugetlb.c            | 40 +++++++++++++++++++++++++++++++++++++++
 mm/rmap.c               | 42 ++++++++++++++++++++++++++++++++++++++---
 3 files changed, 93 insertions(+), 3 deletions(-)

diff --git a/include/linux/hugetlb.h b/include/linux/hugetlb.h
index 36fa6a2a82e3..840b43a085cc 100644
--- a/include/linux/hugetlb.h
+++ b/include/linux/hugetlb.h
@@ -140,6 +140,8 @@ pte_t *huge_pte_alloc(struct mm_struct *mm,
 pte_t *huge_pte_offset(struct mm_struct *mm,
 		       unsigned long addr, unsigned long sz);
 int huge_pmd_unshare(struct mm_struct *mm, unsigned long *addr, pte_t *ptep);
+bool huge_pmd_sharing_possible(struct vm_area_struct *vma,
+				unsigned long *start, unsigned long *end);
 struct page *follow_huge_addr(struct mm_struct *mm, unsigned long address,
 			      int write);
 struct page *follow_huge_pd(struct vm_area_struct *vma,
@@ -170,6 +172,18 @@ static inline unsigned long hugetlb_total_pages(void)
 	return 0;
 }
 
+static inline int huge_pmd_unshare(struct mm_struct *mm, unsigned long *addr,
+					pte_t *ptep)
+{
+	return 0;
+}
+
+static inline bool huge_pmd_sharing_possible(struct vm_area_struct *vma,
+				unsigned long *start, unsigned long *end)
+{
+	return false;
+}
+
 #define follow_hugetlb_page(m,v,p,vs,a,b,i,w,n)	({ BUG(); 0; })
 #define follow_huge_addr(mm, addr, write)	ERR_PTR(-EINVAL)
 #define copy_hugetlb_page_range(src, dst, vma)	({ BUG(); 0; })
diff --git a/mm/hugetlb.c b/mm/hugetlb.c
index 3103099f64fd..fd155dc52117 100644
--- a/mm/hugetlb.c
+++ b/mm/hugetlb.c
@@ -4555,6 +4555,9 @@ static bool vma_shareable(struct vm_area_struct *vma, unsigned long addr)
 
 	/*
 	 * check on proper vm_flags and page table alignment
+	 *
+	 * Note that this is the same check used in huge_pmd_sharing_possible.
+	 * If you change one, consider changing both.
 	 */
 	if (vma->vm_flags & VM_MAYSHARE &&
 	    vma->vm_start <= base && end <= vma->vm_end)
@@ -4562,6 +4565,43 @@ static bool vma_shareable(struct vm_area_struct *vma, unsigned long addr)
 	return false;
 }
 
+/*
+ * Determine if start,end range within vma could be mapped by shared pmd.
+ * If yes, adjust start and end to cover range associated with possible
+ * shared pmd mappings.
+ */
+bool huge_pmd_sharing_possible(struct vm_area_struct *vma,
+				unsigned long *start, unsigned long *end)
+{
+	unsigned long check_addr = *start;
+	bool ret = false;
+
+	if (!(vma->vm_flags & VM_MAYSHARE))
+		return ret;
+
+	for (check_addr = *start; check_addr < *end; check_addr += PUD_SIZE) {
+		unsigned long a_start = check_addr & PUD_MASK;
+		unsigned long a_end = a_start + PUD_SIZE;
+
+		/*
+		 * If sharing is possible, adjust start/end if necessary.
+		 *
+		 * Note that this is the same check used in vma_shareable.  If
+		 * you change one, consider changing both.
+		 */
+		if (vma->vm_start <= a_start && a_end <= vma->vm_end) {
+			if (a_start < *start)
+				*start = a_start;
+			if (a_end > *end)
+				*end = a_end;
+
+			ret = true;
+		}
+	}
+
+	return ret;
+}
+
 /*
  * Search for a shareable pmd page for hugetlb. In any case calls pmd_alloc()
  * and returns the corresponding pte. While this is not necessary for the
diff --git a/mm/rmap.c b/mm/rmap.c
index eb477809a5c0..8cf853a4b093 100644
--- a/mm/rmap.c
+++ b/mm/rmap.c
@@ -1362,11 +1362,21 @@ static bool try_to_unmap_one(struct page *page, struct vm_area_struct *vma,
 	}
 
 	/*
-	 * We have to assume the worse case ie pmd for invalidation. Note that
-	 * the page can not be free in this function as call of try_to_unmap()
-	 * must hold a reference on the page.
+	 * For THP, we have to assume the worse case ie pmd for invalidation.
+	 * For hugetlb, it could be much worse if we need to do pud
+	 * invalidation in the case of pmd sharing.
+	 *
+	 * Note that the page can not be free in this function as call of
+	 * try_to_unmap() must hold a reference on the page.
 	 */
 	end = min(vma->vm_end, start + (PAGE_SIZE << compound_order(page)));
+	if (PageHuge(page)) {
+		/*
+		 * If sharing is possible, start and end will be adjusted
+		 * accordingly.
+		 */
+		(void)huge_pmd_sharing_possible(vma, &start, &end);
+	}
 	mmu_notifier_invalidate_range_start(vma->vm_mm, start, end);
 
 	while (page_vma_mapped_walk(&pvmw)) {
@@ -1409,6 +1419,32 @@ static bool try_to_unmap_one(struct page *page, struct vm_area_struct *vma,
 		subpage = page - page_to_pfn(page) + pte_pfn(*pvmw.pte);
 		address = pvmw.address;
 
+		if (PageHuge(page)) {
+			if (huge_pmd_unshare(mm, &address, pvmw.pte)) {
+				/*
+				 * huge_pmd_unshare unmapped an entire PMD
+				 * page.  There is no way of knowing exactly
+				 * which PMDs may be cached for this mm, so
+				 * we must flush them all.  start/end were
+				 * already adjusted above to cover this range.
+				 */
+				flush_cache_range(vma, start, end);
+				flush_tlb_range(vma, start, end);
+				mmu_notifier_invalidate_range(mm, start, end);
+
+				/*
+				 * The ref count of the PMD page was dropped
+				 * which is part of the way map counting
+				 * is done for shared PMDs.  Return 'true'
+				 * here.  When there is no other sharing,
+				 * huge_pmd_unshare returns false and we will
+				 * unmap the actual page and drop map count
+				 * to zero.
+				 */
+				page_vma_mapped_walk_done(&pvmw);
+				break;
+			}
+		}
 
 		if (IS_ENABLED(CONFIG_MIGRATION) &&
 		    (flags & TTU_MIGRATION) &&
kernel test robot Aug. 22, 2018, 12:51 a.m. UTC | #3
Hi Mike,

I love your patch! Yet something to improve:

[auto build test ERROR on linus/master]
[also build test ERROR on v4.18 next-20180821]
[if your patch is applied to the wrong git tree, please drop us a note to help improve the system]

url:    https://github.com/0day-ci/linux/commits/Mike-Kravetz/huge_pmd_unshare-migration-and-flushing/20180822-050255
config: sparc64-allyesconfig (attached as .config)
compiler: sparc64-linux-gnu-gcc (Debian 7.2.0-11) 7.2.0
reproduce:
        wget https://raw.githubusercontent.com/intel/lkp-tests/master/sbin/make.cross -O ~/bin/make.cross
        chmod +x ~/bin/make.cross
        # save the attached .config to linux build tree
        GCC_VERSION=7.2.0 make.cross ARCH=sparc64 

All errors (new ones prefixed by >>):

   mm/rmap.o: In function `try_to_unmap_one':
>> rmap.c:(.text+0x2708): undefined reference to `huge_pmd_sharing_possible'
   `.exit.data' referenced in section `.exit.text' of drivers/tty/n_hdlc.o: defined in discarded section `.exit.data' of drivers/tty/n_hdlc.o
   `.exit.data' referenced in section `.exit.text' of drivers/tty/n_hdlc.o: defined in discarded section `.exit.data' of drivers/tty/n_hdlc.o
   `.exit.data' referenced in section `.exit.text' of drivers/tty/n_hdlc.o: defined in discarded section `.exit.data' of drivers/tty/n_hdlc.o
   `.exit.data' referenced in section `.exit.text' of drivers/tty/n_hdlc.o: defined in discarded section `.exit.data' of drivers/tty/n_hdlc.o

---
0-DAY kernel test infrastructure                Open Source Technology Center
https://lists.01.org/pipermail/kbuild-all                   Intel Corporation
Mike Kravetz Aug. 22, 2018, 1:10 a.m. UTC | #4
On 08/21/2018 05:51 PM, kbuild test robot wrote:
> Hi Mike,
> 
> I love your patch! Yet something to improve:
> 
> [auto build test ERROR on linus/master]
> [also build test ERROR on v4.18 next-20180821]
> [if your patch is applied to the wrong git tree, please drop us a note to help improve the system]
> 
> url:    https://github.com/0day-ci/linux/commits/Mike-Kravetz/huge_pmd_unshare-migration-and-flushing/20180822-050255
> config: sparc64-allyesconfig (attached as .config)
> compiler: sparc64-linux-gnu-gcc (Debian 7.2.0-11) 7.2.0
> reproduce:
>         wget https://raw.githubusercontent.com/intel/lkp-tests/master/sbin/make.cross -O ~/bin/make.cross
>         chmod +x ~/bin/make.cross
>         # save the attached .config to linux build tree
>         GCC_VERSION=7.2.0 make.cross ARCH=sparc64 
> 

Ok, this should take care of all the build errors.  Needed to address
!CONFIG_HUGETLB_PAGE and !CONFIG_ARCH_WANT_HUGE_PMD_SHARE.

From 2868cea3ff33fccd03da9188dde2ae5b619bf300 Mon Sep 17 00:00:00 2001
From: Mike Kravetz <mike.kravetz@oracle.com>
Date: Tue, 21 Aug 2018 10:50:17 -0700
Subject: [PATCH v5 1/2] mm: migration: fix migration of huge PMD shared pages

The page migration code employs try_to_unmap() to try and unmap the
source page.  This is accomplished by using rmap_walk to find all
vmas where the page is mapped.  This search stops when page mapcount
is zero.  For shared PMD huge pages, the page map count is always 1
no matter the number of mappings.  Shared mappings are tracked via
the reference count of the PMD page.  Therefore, try_to_unmap stops
prematurely and does not completely unmap all mappings of the source
page.

This problem can result in data corruption as writes to the original
source page can happen after contents of the page are copied to the
target page.  Hence, data is lost.

This problem was originally seen as DB corruption of shared global
areas after a huge page was soft offlined due to ECC memory errors.
DB developers noticed they could reproduce the issue by (hotplug)
offlining memory used to back huge pages.  A simple testcase can
reproduce the problem by creating a shared PMD mapping (note that
this must be at least PUD_SIZE in size and PUD_SIZE aligned (1GB on
x86)), and using migrate_pages() to migrate process pages between
nodes while continually writing to the huge pages being migrated.

To fix, have the try_to_unmap_one routine check for huge PMD sharing
by calling huge_pmd_unshare for hugetlbfs huge pages.  If it is a
shared mapping it will be 'unshared' which removes the page table
entry and drops the reference on the PMD page.  After this, flush
caches and TLB.

mmu notifiers are called before locking page tables, but we can not
be sure of PMD sharing until page tables are locked.  Therefore,
check for the possibility of PMD sharing before locking so that
notifiers can prepare for the worst possible case.

Fixes: 39dde65c9940 ("shared page table for hugetlb page")
Cc: stable@vger.kernel.org
Signed-off-by: Mike Kravetz <mike.kravetz@oracle.com>
---
 include/linux/hugetlb.h | 14 +++++++++++++
 mm/hugetlb.c            | 46 +++++++++++++++++++++++++++++++++++++++++
 mm/rmap.c               | 42 ++++++++++++++++++++++++++++++++++---
 3 files changed, 99 insertions(+), 3 deletions(-)

diff --git a/include/linux/hugetlb.h b/include/linux/hugetlb.h
index 36fa6a2a82e3..840b43a085cc 100644
--- a/include/linux/hugetlb.h
+++ b/include/linux/hugetlb.h
@@ -140,6 +140,8 @@ pte_t *huge_pte_alloc(struct mm_struct *mm,
 pte_t *huge_pte_offset(struct mm_struct *mm,
 		       unsigned long addr, unsigned long sz);
 int huge_pmd_unshare(struct mm_struct *mm, unsigned long *addr, pte_t *ptep);
+bool huge_pmd_sharing_possible(struct vm_area_struct *vma,
+				unsigned long *start, unsigned long *end);
 struct page *follow_huge_addr(struct mm_struct *mm, unsigned long address,
 			      int write);
 struct page *follow_huge_pd(struct vm_area_struct *vma,
@@ -170,6 +172,18 @@ static inline unsigned long hugetlb_total_pages(void)
 	return 0;
 }
 
+static inline int huge_pmd_unshare(struct mm_struct *mm, unsigned long *addr,
+					pte_t *ptep)
+{
+	return 0;
+}
+
+static inline bool huge_pmd_sharing_possible(struct vm_area_struct *vma,
+				unsigned long *start, unsigned long *end)
+{
+	return false;
+}
+
 #define follow_hugetlb_page(m,v,p,vs,a,b,i,w,n)	({ BUG(); 0; })
 #define follow_huge_addr(mm, addr, write)	ERR_PTR(-EINVAL)
 #define copy_hugetlb_page_range(src, dst, vma)	({ BUG(); 0; })
diff --git a/mm/hugetlb.c b/mm/hugetlb.c
index 3103099f64fd..f085019a4724 100644
--- a/mm/hugetlb.c
+++ b/mm/hugetlb.c
@@ -4555,6 +4555,9 @@ static bool vma_shareable(struct vm_area_struct *vma, unsigned long addr)
 
 	/*
 	 * check on proper vm_flags and page table alignment
+	 *
+	 * Note that this is the same check used in huge_pmd_sharing_possible.
+	 * If you change one, consider changing both.
 	 */
 	if (vma->vm_flags & VM_MAYSHARE &&
 	    vma->vm_start <= base && end <= vma->vm_end)
@@ -4562,6 +4565,43 @@ static bool vma_shareable(struct vm_area_struct *vma, unsigned long addr)
 	return false;
 }
 
+/*
+ * Determine if start,end range within vma could be mapped by shared pmd.
+ * If yes, adjust start and end to cover range associated with possible
+ * shared pmd mappings.
+ */
+bool huge_pmd_sharing_possible(struct vm_area_struct *vma,
+				unsigned long *start, unsigned long *end)
+{
+	unsigned long check_addr = *start;
+	bool ret = false;
+
+	if (!(vma->vm_flags & VM_MAYSHARE))
+		return ret;
+
+	for (check_addr = *start; check_addr < *end; check_addr += PUD_SIZE) {
+		unsigned long a_start = check_addr & PUD_MASK;
+		unsigned long a_end = a_start + PUD_SIZE;
+
+		/*
+		 * If sharing is possible, adjust start/end if necessary.
+		 *
+		 * Note that this is the same check used in vma_shareable.  If
+		 * you change one, consider changing both.
+		 */
+		if (vma->vm_start <= a_start && a_end <= vma->vm_end) {
+			if (a_start < *start)
+				*start = a_start;
+			if (a_end > *end)
+				*end = a_end;
+
+			ret = true;
+		}
+	}
+
+	return ret;
+}
+
 /*
  * Search for a shareable pmd page for hugetlb. In any case calls pmd_alloc()
  * and returns the corresponding pte. While this is not necessary for the
@@ -4659,6 +4699,12 @@ int huge_pmd_unshare(struct mm_struct *mm, unsigned long *addr, pte_t *ptep)
 {
 	return 0;
 }
+
+bool huge_pmd_sharing_possible(struct vm_area_struct *vma,
+				unsigned long *start, unsigned long *end)
+{
+	return false;
+}
 #define want_pmd_share()	(0)
 #endif /* CONFIG_ARCH_WANT_HUGE_PMD_SHARE */
 
diff --git a/mm/rmap.c b/mm/rmap.c
index eb477809a5c0..8cf853a4b093 100644
--- a/mm/rmap.c
+++ b/mm/rmap.c
@@ -1362,11 +1362,21 @@ static bool try_to_unmap_one(struct page *page, struct vm_area_struct *vma,
 	}
 
 	/*
-	 * We have to assume the worse case ie pmd for invalidation. Note that
-	 * the page can not be free in this function as call of try_to_unmap()
-	 * must hold a reference on the page.
+	 * For THP, we have to assume the worse case ie pmd for invalidation.
+	 * For hugetlb, it could be much worse if we need to do pud
+	 * invalidation in the case of pmd sharing.
+	 *
+	 * Note that the page can not be free in this function as call of
+	 * try_to_unmap() must hold a reference on the page.
 	 */
 	end = min(vma->vm_end, start + (PAGE_SIZE << compound_order(page)));
+	if (PageHuge(page)) {
+		/*
+		 * If sharing is possible, start and end will be adjusted
+		 * accordingly.
+		 */
+		(void)huge_pmd_sharing_possible(vma, &start, &end);
+	}
 	mmu_notifier_invalidate_range_start(vma->vm_mm, start, end);
 
 	while (page_vma_mapped_walk(&pvmw)) {
@@ -1409,6 +1419,32 @@ static bool try_to_unmap_one(struct page *page, struct vm_area_struct *vma,
 		subpage = page - page_to_pfn(page) + pte_pfn(*pvmw.pte);
 		address = pvmw.address;
 
+		if (PageHuge(page)) {
+			if (huge_pmd_unshare(mm, &address, pvmw.pte)) {
+				/*
+				 * huge_pmd_unshare unmapped an entire PMD
+				 * page.  There is no way of knowing exactly
+				 * which PMDs may be cached for this mm, so
+				 * we must flush them all.  start/end were
+				 * already adjusted above to cover this range.
+				 */
+				flush_cache_range(vma, start, end);
+				flush_tlb_range(vma, start, end);
+				mmu_notifier_invalidate_range(mm, start, end);
+
+				/*
+				 * The ref count of the PMD page was dropped
+				 * which is part of the way map counting
+				 * is done for shared PMDs.  Return 'true'
+				 * here.  When there is no other sharing,
+				 * huge_pmd_unshare returns false and we will
+				 * unmap the actual page and drop map count
+				 * to zero.
+				 */
+				page_vma_mapped_walk_done(&pvmw);
+				break;
+			}
+		}
 
 		if (IS_ENABLED(CONFIG_MIGRATION) &&
 		    (flags & TTU_MIGRATION) &&
Michal Hocko Aug. 22, 2018, 12:28 p.m. UTC | #5
On Tue 21-08-18 18:10:42, Mike Kravetz wrote:
[...]
> diff --git a/mm/rmap.c b/mm/rmap.c
> index eb477809a5c0..8cf853a4b093 100644
> --- a/mm/rmap.c
> +++ b/mm/rmap.c
> @@ -1362,11 +1362,21 @@ static bool try_to_unmap_one(struct page *page, struct vm_area_struct *vma,
>  	}
>  
>  	/*
> -	 * We have to assume the worse case ie pmd for invalidation. Note that
> -	 * the page can not be free in this function as call of try_to_unmap()
> -	 * must hold a reference on the page.
> +	 * For THP, we have to assume the worse case ie pmd for invalidation.
> +	 * For hugetlb, it could be much worse if we need to do pud
> +	 * invalidation in the case of pmd sharing.
> +	 *
> +	 * Note that the page can not be free in this function as call of
> +	 * try_to_unmap() must hold a reference on the page.
>  	 */
>  	end = min(vma->vm_end, start + (PAGE_SIZE << compound_order(page)));
> +	if (PageHuge(page)) {
> +		/*
> +		 * If sharing is possible, start and end will be adjusted
> +		 * accordingly.
> +		 */
> +		(void)huge_pmd_sharing_possible(vma, &start, &end);
> +	}
>  	mmu_notifier_invalidate_range_start(vma->vm_mm, start, end);

I do not get this part. Why don't we simply unconditionally invalidate
the whole huge page range?

>  
>  	while (page_vma_mapped_walk(&pvmw)) {
> @@ -1409,6 +1419,32 @@ static bool try_to_unmap_one(struct page *page, struct vm_area_struct *vma,
>  		subpage = page - page_to_pfn(page) + pte_pfn(*pvmw.pte);
>  		address = pvmw.address;
>  
> +		if (PageHuge(page)) {
> +			if (huge_pmd_unshare(mm, &address, pvmw.pte)) {

huge_pmd_unshare is documented to require a pte lock. Where do we take
it?
Mike Kravetz Aug. 22, 2018, 4:48 p.m. UTC | #6
On 08/22/2018 05:28 AM, Michal Hocko wrote:
> On Tue 21-08-18 18:10:42, Mike Kravetz wrote:
> [...]
>> diff --git a/mm/rmap.c b/mm/rmap.c
>> index eb477809a5c0..8cf853a4b093 100644
>> --- a/mm/rmap.c
>> +++ b/mm/rmap.c
>> @@ -1362,11 +1362,21 @@ static bool try_to_unmap_one(struct page *page, struct vm_area_struct *vma,
>>  	}
>>  
>>  	/*
>> -	 * We have to assume the worse case ie pmd for invalidation. Note that
>> -	 * the page can not be free in this function as call of try_to_unmap()
>> -	 * must hold a reference on the page.
>> +	 * For THP, we have to assume the worse case ie pmd for invalidation.
>> +	 * For hugetlb, it could be much worse if we need to do pud
>> +	 * invalidation in the case of pmd sharing.
>> +	 *
>> +	 * Note that the page can not be free in this function as call of
>> +	 * try_to_unmap() must hold a reference on the page.
>>  	 */
>>  	end = min(vma->vm_end, start + (PAGE_SIZE << compound_order(page)));
>> +	if (PageHuge(page)) {
>> +		/*
>> +		 * If sharing is possible, start and end will be adjusted
>> +		 * accordingly.
>> +		 */
>> +		(void)huge_pmd_sharing_possible(vma, &start, &end);
>> +	}
>>  	mmu_notifier_invalidate_range_start(vma->vm_mm, start, end);
> 
> I do not get this part. Why don't we simply unconditionally invalidate
> the whole huge page range?

In this routine, we are only unmapping a single page.  The existing code
is limiting the invalidate range to that page size: 4K or 2M.  With shared
PMDs, we have the possibility of unmapping a PUD_SIZE area: 1G.  I don't
think we want to unconditionally invalidate 1G.  Is that what you are asking?

I do not know how often PMD sharing is exercised.  It certainly is used by
DBs for large shared areas.  I suspect it is less frequent than hugetlb pages
in general, and certainly less frequent than THP or base pages.

>>  
>>  	while (page_vma_mapped_walk(&pvmw)) {
>> @@ -1409,6 +1419,32 @@ static bool try_to_unmap_one(struct page *page, struct vm_area_struct *vma,
>>  		subpage = page - page_to_pfn(page) + pte_pfn(*pvmw.pte);
>>  		address = pvmw.address;
>>  
>> +		if (PageHuge(page)) {
>> +			if (huge_pmd_unshare(mm, &address, pvmw.pte)) {
> 
> huge_pmd_unshare is documented to require a pte lock. Where do we take
> it?

It is somewhat hidden, but we are in the loop:

	while (page_vma_mapped_walk(&pvmw)) {

The routine page_vma_mapped_walk will acquire the lock, and it correctly
checks for huge pages and uses huge_pte_lockptr().

page_vma_mapped_walk_done() will release the lock.
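
Schematically (a simplified sketch, not the literal code), the pte lock
that huge_pmd_unshare() relies on is pvmw.ptl:

	while (page_vma_mapped_walk(&pvmw)) {
		/*
		 * For PageHuge(page), page_vma_mapped_walk() looked up the
		 * pte with huge_pte_offset(), took the spinlock returned by
		 * huge_pte_lockptr() and only then returned true.
		 */
		if (PageHuge(page) && huge_pmd_unshare(mm, &address, pvmw.pte)) {
			/* pte lock is held here, as huge_pmd_unshare requires */
			page_vma_mapped_walk_done(&pvmw);	/* drops pvmw.ptl */
			break;
		}
		/* normal unmap path, still under pvmw.ptl */
	}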
Kirill A. Shutemov Aug. 22, 2018, 9:05 p.m. UTC | #7
On Tue, Aug 21, 2018 at 06:10:42PM -0700, Mike Kravetz wrote:
> diff --git a/mm/hugetlb.c b/mm/hugetlb.c
> index 3103099f64fd..f085019a4724 100644
> --- a/mm/hugetlb.c
> +++ b/mm/hugetlb.c
> @@ -4555,6 +4555,9 @@ static bool vma_shareable(struct vm_area_struct *vma, unsigned long addr)
>  
>  	/*
>  	 * check on proper vm_flags and page table alignment
> +	 *
> +	 * Note that this is the same check used in huge_pmd_sharing_possible.
> +	 * If you change one, consider changing both.

Should we have a helper to isolate the check in one place?

>  	 */
>  	if (vma->vm_flags & VM_MAYSHARE &&
>  	    vma->vm_start <= base && end <= vma->vm_end)
> @@ -4562,6 +4565,43 @@ static bool vma_shareable(struct vm_area_struct *vma, unsigned long addr)
>  	return false;
>  }
>  
> +/*
> + * Determine if start,end range within vma could be mapped by shared pmd.
> + * If yes, adjust start and end to cover range associated with possible
> + * shared pmd mappings.
> + */
> +bool huge_pmd_sharing_possible(struct vm_area_struct *vma,
> +				unsigned long *start, unsigned long *end)
> +{
> +	unsigned long check_addr = *start;
> +	bool ret = false;
> +
> +	if (!(vma->vm_flags & VM_MAYSHARE))
> +		return ret;

Do we ever use the return value? I don't see it.

And in this case the function name does not really work...

> +	for (check_addr = *start; check_addr < *end; check_addr += PUD_SIZE) {
> +		unsigned long a_start = check_addr & PUD_MASK;
> +		unsigned long a_end = a_start + PUD_SIZE;
> +
> +		/*
> +		 * If sharing is possible, adjust start/end if necessary.
> +		 *
> +		 * Note that this is the same check used in vma_shareable.  If
> +		 * you change one, consider changing both.
> +		 */
> +		if (vma->vm_start <= a_start && a_end <= vma->vm_end) {
> +			if (a_start < *start)
> +				*start = a_start;
> +			if (a_end > *end)
> +				*end = a_end;
> +
> +			ret = true;
> +		}
> +	}
> +
> +	return ret;
> +}
> +
Mike Kravetz Aug. 22, 2018, 9:48 p.m. UTC | #8
On 08/22/2018 02:05 PM, Kirill A. Shutemov wrote:
> On Tue, Aug 21, 2018 at 06:10:42PM -0700, Mike Kravetz wrote:
>> diff --git a/mm/hugetlb.c b/mm/hugetlb.c
>> index 3103099f64fd..f085019a4724 100644
>> --- a/mm/hugetlb.c
>> +++ b/mm/hugetlb.c
>> @@ -4555,6 +4555,9 @@ static bool vma_shareable(struct vm_area_struct *vma, unsigned long addr)
>>  
>>  	/*
>>  	 * check on proper vm_flags and page table alignment
>> +	 *
>> +	 * Note that this is the same check used in huge_pmd_sharing_possible.
>> +	 * If you change one, consider changing both.
> 
> Should we have a helper to isolate the check in one place?
> 

Yes, I will create one.  Most likely just a #define.
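
For illustration only, something along these lines (the name below is
just a placeholder):

#define pud_range_in_vma(vma, start, end)			\
	((vma)->vm_start <= (start) && (end) <= (vma)->vm_end)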

>>  	 */
>>  	if (vma->vm_flags & VM_MAYSHARE &&
>>  	    vma->vm_start <= base && end <= vma->vm_end)
>> @@ -4562,6 +4565,43 @@ static bool vma_shareable(struct vm_area_struct *vma, unsigned long addr)
>>  	return false;
>>  }
>>  
>> +/*
>> + * Determine if start,end range within vma could be mapped by shared pmd.
>> + * If yes, adjust start and end to cover range associated with possible
>> + * shared pmd mappings.
>> + */
>> +bool huge_pmd_sharing_possible(struct vm_area_struct *vma,
>> +				unsigned long *start, unsigned long *end)
>> +{
>> +	unsigned long check_addr = *start;
>> +	bool ret = false;
>> +
>> +	if (!(vma->vm_flags & VM_MAYSHARE))
>> +		return ret;
> 
> Do we ever use the return value? I don't see it.
> 
> And in this case the function name does not really work...

You are correct.  None of the code uses the return value.  I initially
thought some caller would use it.  But every caller wants/needs to
adjust the range if sharing is possible.  This is a really long name
but how about:

void adjust_range_if_pmd_sharing_possible(struct vm_area_struct *vma,
				unsigned long *start, unsigned long *end)

I'm open to other names and will update patch with suggestions.
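
For reference, a sketch of the void variant (same body as the function
posted above, just without the unused return value; the name is only the
one proposed here and may still change):

void adjust_range_if_pmd_sharing_possible(struct vm_area_struct *vma,
				unsigned long *start, unsigned long *end)
{
	unsigned long check_addr;

	if (!(vma->vm_flags & VM_MAYSHARE))
		return;

	for (check_addr = *start; check_addr < *end; check_addr += PUD_SIZE) {
		unsigned long a_start = check_addr & PUD_MASK;
		unsigned long a_end = a_start + PUD_SIZE;

		/* same check as in vma_shareable() */
		if (vma->vm_start <= a_start && a_end <= vma->vm_end) {
			if (a_start < *start)
				*start = a_start;
			if (a_end > *end)
				*end = a_end;
		}
	}
}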
Michal Hocko Aug. 23, 2018, 7:30 a.m. UTC | #9
On Wed 22-08-18 09:48:16, Mike Kravetz wrote:
> On 08/22/2018 05:28 AM, Michal Hocko wrote:
> > On Tue 21-08-18 18:10:42, Mike Kravetz wrote:
> > [...]
> >> diff --git a/mm/rmap.c b/mm/rmap.c
> >> index eb477809a5c0..8cf853a4b093 100644
> >> --- a/mm/rmap.c
> >> +++ b/mm/rmap.c
> >> @@ -1362,11 +1362,21 @@ static bool try_to_unmap_one(struct page *page, struct vm_area_struct *vma,
> >>  	}
> >>  
> >>  	/*
> >> -	 * We have to assume the worse case ie pmd for invalidation. Note that
> >> -	 * the page can not be free in this function as call of try_to_unmap()
> >> -	 * must hold a reference on the page.
> >> +	 * For THP, we have to assume the worse case ie pmd for invalidation.
> >> +	 * For hugetlb, it could be much worse if we need to do pud
> >> +	 * invalidation in the case of pmd sharing.
> >> +	 *
> >> +	 * Note that the page can not be free in this function as call of
> >> +	 * try_to_unmap() must hold a reference on the page.
> >>  	 */
> >>  	end = min(vma->vm_end, start + (PAGE_SIZE << compound_order(page)));
> >> +	if (PageHuge(page)) {
> >> +		/*
> >> +		 * If sharing is possible, start and end will be adjusted
> >> +		 * accordingly.
> >> +		 */
> >> +		(void)huge_pmd_sharing_possible(vma, &start, &end);
> >> +	}
> >>  	mmu_notifier_invalidate_range_start(vma->vm_mm, start, end);
> > 
> > I do not get this part. Why don't we simply unconditionally invalidate
> > the whole huge page range?
> 
> In this routine, we are only unmapping a single page.  The existing code
> is limiting the invalidate range to that page size: 4K or 2M.  With shared
> PMDs, we have the possibility of unmapping a PUD_SIZE area: 1G.  I don't
> think we want to unconditionally invalidate 1G.  Is that what you are asking?

But we know that huge_pmd_unshare unmapped a shared pte so we know when
to flush 2MB or 1GB. I really do not like how huge_pmd_sharing_possible
a) duplicates some checks and b) it updates start/stop out of line.

> I do not know how often PMD sharing is exercised.  It certainly is used by
> DBs for large shared areas.  I suspect it is less frequent than hugetlb pages
> in general, and certainly less frequent than THP or base pages.
> 
> >>  
> >>  	while (page_vma_mapped_walk(&pvmw)) {
> >> @@ -1409,6 +1419,32 @@ static bool try_to_unmap_one(struct page *page, struct vm_area_struct *vma,
> >>  		subpage = page - page_to_pfn(page) + pte_pfn(*pvmw.pte);
> >>  		address = pvmw.address;
> >>  
> >> +		if (PageHuge(page)) {
> >> +			if (huge_pmd_unshare(mm, &address, pvmw.pte)) {
> > 
> > huge_pmd_unshare is documented to require a pte lock. Where do we take
> > it?
> 
> It is somewhat hidden, but we are in the loop:
> 
> 	while (page_vma_mapped_walk(&pvmw)) {
> 
> The routine page_vma_mapped_walk will acquire the lock, and it correctly
> checks for huge pages and uses huge_pte_lockptr().
> 
> page_vma_mapped_walk_done() will release the lock.

OK, I can see it now. Thanks for the clarification. page_vma_mapped_walk
is quite hard to follow.
Kirill A. Shutemov Aug. 23, 2018, 8:21 a.m. UTC | #10
On Thu, Aug 23, 2018 at 09:30:35AM +0200, Michal Hocko wrote:
> On Wed 22-08-18 09:48:16, Mike Kravetz wrote:
> > On 08/22/2018 05:28 AM, Michal Hocko wrote:
> > > On Tue 21-08-18 18:10:42, Mike Kravetz wrote:
> > > [...]
> > >> diff --git a/mm/rmap.c b/mm/rmap.c
> > >> index eb477809a5c0..8cf853a4b093 100644
> > >> --- a/mm/rmap.c
> > >> +++ b/mm/rmap.c
> > >> @@ -1362,11 +1362,21 @@ static bool try_to_unmap_one(struct page *page, struct vm_area_struct *vma,
> > >>  	}
> > >>  
> > >>  	/*
> > >> -	 * We have to assume the worse case ie pmd for invalidation. Note that
> > >> -	 * the page can not be free in this function as call of try_to_unmap()
> > >> -	 * must hold a reference on the page.
> > >> +	 * For THP, we have to assume the worse case ie pmd for invalidation.
> > >> +	 * For hugetlb, it could be much worse if we need to do pud
> > >> +	 * invalidation in the case of pmd sharing.
> > >> +	 *
> > >> +	 * Note that the page can not be free in this function as call of
> > >> +	 * try_to_unmap() must hold a reference on the page.
> > >>  	 */
> > >>  	end = min(vma->vm_end, start + (PAGE_SIZE << compound_order(page)));
> > >> +	if (PageHuge(page)) {
> > >> +		/*
> > >> +		 * If sharing is possible, start and end will be adjusted
> > >> +		 * accordingly.
> > >> +		 */
> > >> +		(void)huge_pmd_sharing_possible(vma, &start, &end);
> > >> +	}
> > >>  	mmu_notifier_invalidate_range_start(vma->vm_mm, start, end);
> > > 
> > > I do not get this part. Why don't we simply unconditionally invalidate
> > > the whole huge page range?
> > 
> > In this routine, we are only unmapping a single page.  The existing code
> > is limiting the invalidate range to that page size: 4K or 2M.  With shared
> > PMDs, we have the possibility of unmapping a PUD_SIZE area: 1G.  I don't
> > think we want to unconditionally invalidate 1G.  Is that what you are asking?
> 
> But we know that huge_pmd_unshare unmapped a shared pte so we know when
> to flush 2MB or 1GB. I really do not like how huge_pmd_sharing_possible
> a) duplicates some checks and b) it updates start/stop out of line.

My reading on this is that mmu_notifier_invalidate_range_start() has to be
called from sleepable context on the full range that *can* be invalidated
before following mmu_notifier_invalidate_range_end().

In this case huge_pmd_unshare() may unmap aligned PUD_SIZE around the PMD
page that effectively enlarge range that has to be covered by
mmu_notifier_invalidate_range_start(). We cannot yet know if there's any
shared page tables in the range, so we need to go with worst case
scenario.

I don't see conceptually better solution than what is proposed.
Michal Hocko Aug. 23, 2018, 10:33 a.m. UTC | #11
On Thu 23-08-18 11:21:12, Kirill A. Shutemov wrote:
> On Thu, Aug 23, 2018 at 09:30:35AM +0200, Michal Hocko wrote:
> > On Wed 22-08-18 09:48:16, Mike Kravetz wrote:
> > > On 08/22/2018 05:28 AM, Michal Hocko wrote:
> > > > On Tue 21-08-18 18:10:42, Mike Kravetz wrote:
> > > > [...]
> > > >> diff --git a/mm/rmap.c b/mm/rmap.c
> > > >> index eb477809a5c0..8cf853a4b093 100644
> > > >> --- a/mm/rmap.c
> > > >> +++ b/mm/rmap.c
> > > >> @@ -1362,11 +1362,21 @@ static bool try_to_unmap_one(struct page *page, struct vm_area_struct *vma,
> > > >>  	}
> > > >>  
> > > >>  	/*
> > > >> -	 * We have to assume the worse case ie pmd for invalidation. Note that
> > > >> -	 * the page can not be free in this function as call of try_to_unmap()
> > > >> -	 * must hold a reference on the page.
> > > >> +	 * For THP, we have to assume the worse case ie pmd for invalidation.
> > > >> +	 * For hugetlb, it could be much worse if we need to do pud
> > > >> +	 * invalidation in the case of pmd sharing.
> > > >> +	 *
> > > >> +	 * Note that the page can not be free in this function as call of
> > > >> +	 * try_to_unmap() must hold a reference on the page.
> > > >>  	 */
> > > >>  	end = min(vma->vm_end, start + (PAGE_SIZE << compound_order(page)));
> > > >> +	if (PageHuge(page)) {
> > > >> +		/*
> > > >> +		 * If sharing is possible, start and end will be adjusted
> > > >> +		 * accordingly.
> > > >> +		 */
> > > >> +		(void)huge_pmd_sharing_possible(vma, &start, &end);
> > > >> +	}
> > > >>  	mmu_notifier_invalidate_range_start(vma->vm_mm, start, end);
> > > > 
> > > > I do not get this part. Why don't we simply unconditionally invalidate
> > > > the whole huge page range?
> > > 
> > > In this routine, we are only unmapping a single page.  The existing code
> > > is limiting the invalidate range to that page size: 4K or 2M.  With shared
> > > PMDs, we have the possibility of unmapping a PUD_SIZE area: 1G.  I don't
> > > think we want to unconditionally invalidate 1G.  Is that what you are asking?
> > 
> > But we know that huge_pmd_unshare unmapped a shared pte so we know when
> > to flush 2MB or 1GB. I really do not like how huge_pmd_sharing_possible
> > a) duplicates some checks and b) it updates start/stop out of line.
> 
> My reading on this is that mmu_notifier_invalidate_range_start() has to be
> called from sleepable context on the full range that *can* be invalidated
> before following mmu_notifier_invalidate_range_end().
> 
> In this case huge_pmd_unshare() may unmap aligned PUD_SIZE around the PMD
> page that effectively enlarge range that has to be covered by
> mmu_notifier_invalidate_range_start(). We cannot yet know if there's any
> shared page tables in the range, so we need to go with worst case
> scenario.
> 
> I don't see conceptually better solution than what is proposed.

I was thinking we would just pull PageHuge outside of the
page_vma_mapped_walk. I thought it would look much more straightforward
but I've tried to put something together and it grew into ugly code
as well. So going Mike's way might be a better option after all.
Michal Hocko Aug. 23, 2018, 12:48 p.m. UTC | #12
On Tue 21-08-18 18:10:42, Mike Kravetz wrote:
[...]

OK, after burning myself when trying to be clever here it seems like
your proposed solution is indeed simpler.

> +bool huge_pmd_sharing_possible(struct vm_area_struct *vma,
> +				unsigned long *start, unsigned long *end)
> +{
> +	unsigned long check_addr = *start;
> +	bool ret = false;
> +
> +	if (!(vma->vm_flags & VM_MAYSHARE))
> +		return ret;
> +
> +	for (check_addr = *start; check_addr < *end; check_addr += PUD_SIZE) {
> +		unsigned long a_start = check_addr & PUD_MASK;
> +		unsigned long a_end = a_start + PUD_SIZE;

I guess this should be rather in HPAGE_SIZE * PTRS_PER_PTE units as
huge_pmd_unshare does.
Mike Kravetz Aug. 23, 2018, 4:45 p.m. UTC | #13
On 08/23/2018 01:21 AM, Kirill A. Shutemov wrote:
> On Thu, Aug 23, 2018 at 09:30:35AM +0200, Michal Hocko wrote:
>> On Wed 22-08-18 09:48:16, Mike Kravetz wrote:
>>> On 08/22/2018 05:28 AM, Michal Hocko wrote:
>>>> On Tue 21-08-18 18:10:42, Mike Kravetz wrote:
>>>> [...]
>>>>> diff --git a/mm/rmap.c b/mm/rmap.c
>>>>> index eb477809a5c0..8cf853a4b093 100644
>>>>> --- a/mm/rmap.c
>>>>> +++ b/mm/rmap.c
>>>>> @@ -1362,11 +1362,21 @@ static bool try_to_unmap_one(struct page *page, struct vm_area_struct *vma,
>>>>>  	}
>>>>>  
>>>>>  	/*
>>>>> -	 * We have to assume the worse case ie pmd for invalidation. Note that
>>>>> -	 * the page can not be free in this function as call of try_to_unmap()
>>>>> -	 * must hold a reference on the page.
>>>>> +	 * For THP, we have to assume the worse case ie pmd for invalidation.
>>>>> +	 * For hugetlb, it could be much worse if we need to do pud
>>>>> +	 * invalidation in the case of pmd sharing.
>>>>> +	 *
>>>>> +	 * Note that the page can not be free in this function as call of
>>>>> +	 * try_to_unmap() must hold a reference on the page.
>>>>>  	 */
>>>>>  	end = min(vma->vm_end, start + (PAGE_SIZE << compound_order(page)));
>>>>> +	if (PageHuge(page)) {
>>>>> +		/*
>>>>> +		 * If sharing is possible, start and end will be adjusted
>>>>> +		 * accordingly.
>>>>> +		 */
>>>>> +		(void)huge_pmd_sharing_possible(vma, &start, &end);
>>>>> +	}
>>>>>  	mmu_notifier_invalidate_range_start(vma->vm_mm, start, end);
>>>>
>>>> I do not get this part. Why don't we simply unconditionally invalidate
>>>> the whole huge page range?
>>>
>>> In this routine, we are only unmapping a single page.  The existing code
>>> is limiting the invalidate range to that page size: 4K or 2M.  With shared
>>> PMDs, we have the possibility of unmapping a PUD_SIZE area: 1G.  I don't
>>> think we want to unconditionally invalidate 1G.  Is that what you are asking?
>>
>> But we know that huge_pmd_unshare unmapped a shared pte so we know when
>> to flush 2MB or 1GB. I really do not like how huge_pmd_sharing_possible
>> a) duplicates some checks and b) it updates start/stop out of line.
> 
> My reading on this is that mmu_notifier_invalidate_range_start() has to be
> called from sleepable context on the full range that *can* be invalidated
> before following mmu_notifier_invalidate_range_end().
> 
> In this case huge_pmd_unshare() may unmap aligned PUD_SIZE around the PMD
> page that effectively enlarge range that has to be covered by
> mmu_notifier_invalidate_range_start(). We cannot yet know if there's any
> shared page tables in the range, so we need to go with worst case
> scenario.

Yes, that is a good summary.  We can not know for sure if there is PMD
sharing until we hold the page table lock.  So, we don't know if we should
invalidate/flush 2M or 1G.  Yet, we need to call
mmu_notifier_invalidate_range_start before taking the lock.  And, the notifiers
need to know the range of the worst possible case.  The best approach I came up
with is to adjust the range if sharing is 'possible'.
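
To summarize the ordering (schematic only, not literal code):

	adjust start/end for *possible* pmd sharing      <- no locks held yet
	mmu_notifier_invalidate_range_start(mm, start, end)    <- may sleep
	while (page_vma_mapped_walk(&pvmw)) {            <- takes the pte lock
		huge_pmd_unshare()       <- only now do we know if sharing
					    actually happened
		flush_cache_range() / flush_tlb_range() /
		mmu_notifier_invalidate_range() on the adjusted range
		page_vma_mapped_walk_done(&pvmw)         <- drops the pte lock
	}
	mmu_notifier_invalidate_range_end(mm, start, end)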

> 
> I don't see conceptually better solution than what is proposed.
> 

I have updated the patches based on Kirill's previous comments and will send out
a new version later today.
Mike Kravetz Aug. 23, 2018, 5:01 p.m. UTC | #14
On 08/23/2018 05:48 AM, Michal Hocko wrote:
> On Tue 21-08-18 18:10:42, Mike Kravetz wrote:
> [...]
> 
> OK, after burning myself when trying to be clever here it seems like
> your proposed solution is indeed simpler.
> 
>> +bool huge_pmd_sharing_possible(struct vm_area_struct *vma,
>> +				unsigned long *start, unsigned long *end)
>> +{
>> +	unsigned long check_addr = *start;
>> +	bool ret = false;
>> +
>> +	if (!(vma->vm_flags & VM_MAYSHARE))
>> +		return ret;
>> +
>> +	for (check_addr = *start; check_addr < *end; check_addr += PUD_SIZE) {
>> +		unsigned long a_start = check_addr & PUD_MASK;
>> +		unsigned long a_end = a_start + PUD_SIZE;
> 
> I guess this should be rather in HPAGE_SIZE * PTRS_PER_PTE units as
> huge_pmd_unshare does.

Sure, I can do that.

However, I consider the statement making that calculation in huge_pmd_unshare
to be VERY ugly and confusing code.

	*addr = ALIGN(*addr, HPAGE_SIZE * PTRS_PER_PTE) - HPAGE_SIZE;

Note that it is adjusting the value of passed argument 'unsigned long *addr'.
This argument/value is part of a loop condition in all current callers of
huge_pmd_unshare.  For instance:

	for (; address < end; address += huge_page_size(h)) {

So, that calculation in huge_pmd_unshare adjusts the passed address so that,
after the caller's loop increment, the loop resumes at the first address past
the range that was just unmapped.  The ' - HPAGE_SIZE' at the end of the
calculation is there precisely to account for that loop increment
'huge_page_size(h)'.

ugly and confusing!  And on my list of things to clean up.
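
A concrete example of what that statement does (x86-64 with 2MB huge
pages, so HPAGE_SIZE * PTRS_PER_PTE == 1GB == PUD_SIZE): suppose the
caller's loop is at address 0x40200000 when huge_pmd_unshare succeeds,
meaning the whole shared range [0x40000000, 0x80000000) was just
unmapped.  Then

	*addr = ALIGN(0x40200000, 1GB) - HPAGE_SIZE
	      = 0x80000000 - 0x200000
	      = 0x7fe00000

and the caller's 'address += huge_page_size(h)' makes the next loop
iteration start at 0x80000000, the first address past the unshared range.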
Mike Kravetz Aug. 23, 2018, 5:56 p.m. UTC | #15
On 08/23/2018 10:01 AM, Mike Kravetz wrote:
> On 08/23/2018 05:48 AM, Michal Hocko wrote:
>> On Tue 21-08-18 18:10:42, Mike Kravetz wrote:
>> [...]
>>
>> OK, after burning myself when trying to be clever here it seems like
>> your proposed solution is indeed simpler.
>>
>>> +bool huge_pmd_sharing_possible(struct vm_area_struct *vma,
>>> +				unsigned long *start, unsigned long *end)
>>> +{
>>> +	unsigned long check_addr = *start;
>>> +	bool ret = false;
>>> +
>>> +	if (!(vma->vm_flags & VM_MAYSHARE))
>>> +		return ret;
>>> +
>>> +	for (check_addr = *start; check_addr < *end; check_addr += PUD_SIZE) {
>>> +		unsigned long a_start = check_addr & PUD_MASK;
>>> +		unsigned long a_end = a_start + PUD_SIZE;
>>
>> I guess this should be rather in HPAGE_SIZE * PTRS_PER_PTE units as
>> huge_pmd_unshare does.
> 
> Sure, I can do that.

On second thought, this is more similar to vma_shareable() which uses
PUD_MASK and PUD_SIZE.  In fact Kirill asked me to put in a common helper
for this and vma_shareable.  So, I would prefer to leave it as PUD* unless
you feel strongly.

IMO, it would make more sense to change this in huge_pmd_unshare as PMD
sharing is pretty explicitly tied to PUD_SIZE.  But, that is a future cleanup.
Michal Hocko Aug. 23, 2018, 7:36 p.m. UTC | #16
On Thu 23-08-18 10:56:30, Mike Kravetz wrote:
> On 08/23/2018 10:01 AM, Mike Kravetz wrote:
> > On 08/23/2018 05:48 AM, Michal Hocko wrote:
> >> On Tue 21-08-18 18:10:42, Mike Kravetz wrote:
> >> [...]
> >>
> >> OK, after burning myself when trying to be clever here it seems like
> >> your proposed solution is indeed simpler.
> >>
> >>> +bool huge_pmd_sharing_possible(struct vm_area_struct *vma,
> >>> +				unsigned long *start, unsigned long *end)
> >>> +{
> >>> +	unsigned long check_addr = *start;
> >>> +	bool ret = false;
> >>> +
> >>> +	if (!(vma->vm_flags & VM_MAYSHARE))
> >>> +		return ret;
> >>> +
> >>> +	for (check_addr = *start; check_addr < *end; check_addr += PUD_SIZE) {
> >>> +		unsigned long a_start = check_addr & PUD_MASK;
> >>> +		unsigned long a_end = a_start + PUD_SIZE;
> >>
> >> I guess this should be rather in HPAGE_SIZE * PTRS_PER_PTE units as
> >> huge_pmd_unshare does.
> > 
> > Sure, I can do that.
> 
> On second thought, this is more similar to vma_shareable() which uses
> PUD_MASK and PUD_SIZE.  In fact Kirill asked me to put in a common helper
> for this and vma_shareable.  So, I would prefer to leave it as PUD* unless
> you feel strongly.

I don't
 
> IMO, it would make more sense to change this in huge_pmd_unshare as PMD
> sharing is pretty explicitly tied to PUD_SIZE.  But, that is a future cleanup.

Fair enough. I didn't realize we are PUD explicit elsewhere. So if you
plan to update huge_pmd_unshare to be in sync then no objections from me
at all. I merely wanted to be in sync with this because it is a central
point to look at wrt pmd sharing.

Patch

diff --git a/include/linux/hugetlb.h b/include/linux/hugetlb.h
index 36fa6a2a82e3..1c6cde68487f 100644
--- a/include/linux/hugetlb.h
+++ b/include/linux/hugetlb.h
@@ -140,6 +140,8 @@  pte_t *huge_pte_alloc(struct mm_struct *mm,
 pte_t *huge_pte_offset(struct mm_struct *mm,
 		       unsigned long addr, unsigned long sz);
 int huge_pmd_unshare(struct mm_struct *mm, unsigned long *addr, pte_t *ptep);
+bool huge_pmd_sharing_possible(struct vm_area_struct *vma,
+				unsigned long *start, unsigned long *end);
 struct page *follow_huge_addr(struct mm_struct *mm, unsigned long address,
 			      int write);
 struct page *follow_huge_pd(struct vm_area_struct *vma,
@@ -170,6 +172,18 @@  static inline unsigned long hugetlb_total_pages(void)
 	return 0;
 }
 
+static inline int huge_pmd_unshare(struct mm_struct *mm, unsigned long *addr,
+					pte_t *ptep)
+{
+	return 0;
+}
+
+bool huge_pmd_sharing_possible(struct vm_area_struct *vma,
+				unsigned long *start, unsigned long *end)
+{
+	return false;
+}
+
 #define follow_hugetlb_page(m,v,p,vs,a,b,i,w,n)	({ BUG(); 0; })
 #define follow_huge_addr(mm, addr, write)	ERR_PTR(-EINVAL)
 #define copy_hugetlb_page_range(src, dst, vma)	({ BUG(); 0; })
diff --git a/mm/hugetlb.c b/mm/hugetlb.c
index 3103099f64fd..fd155dc52117 100644
--- a/mm/hugetlb.c
+++ b/mm/hugetlb.c
@@ -4555,6 +4555,9 @@  static bool vma_shareable(struct vm_area_struct *vma, unsigned long addr)
 
 	/*
 	 * check on proper vm_flags and page table alignment
+	 *
+	 * Note that this is the same check used in huge_pmd_sharing_possible.
+	 * If you change one, consider changing both.
 	 */
 	if (vma->vm_flags & VM_MAYSHARE &&
 	    vma->vm_start <= base && end <= vma->vm_end)
@@ -4562,6 +4565,43 @@  static bool vma_shareable(struct vm_area_struct *vma, unsigned long addr)
 	return false;
 }
 
+/*
+ * Determine if start,end range within vma could be mapped by shared pmd.
+ * If yes, adjust start and end to cover range associated with possible
+ * shared pmd mappings.
+ */
+bool huge_pmd_sharing_possible(struct vm_area_struct *vma,
+				unsigned long *start, unsigned long *end)
+{
+	unsigned long check_addr = *start;
+	bool ret = false;
+
+	if (!(vma->vm_flags & VM_MAYSHARE))
+		return ret;
+
+	for (check_addr = *start; check_addr < *end; check_addr += PUD_SIZE) {
+		unsigned long a_start = check_addr & PUD_MASK;
+		unsigned long a_end = a_start + PUD_SIZE;
+
+		/*
+		 * If sharing is possible, adjust start/end if necessary.
+		 *
+		 * Note that this is the same check used in vma_shareable.  If
+		 * you change one, consider changing both.
+		 */
+		if (vma->vm_start <= a_start && a_end <= vma->vm_end) {
+			if (a_start < *start)
+				*start = a_start;
+			if (a_end > *end)
+				*end = a_end;
+
+			ret = true;
+		}
+	}
+
+	return ret;
+}
+
 /*
  * Search for a shareable pmd page for hugetlb. In any case calls pmd_alloc()
  * and returns the corresponding pte. While this is not necessary for the
diff --git a/mm/rmap.c b/mm/rmap.c
index eb477809a5c0..8cf853a4b093 100644
--- a/mm/rmap.c
+++ b/mm/rmap.c
@@ -1362,11 +1362,21 @@  static bool try_to_unmap_one(struct page *page, struct vm_area_struct *vma,
 	}
 
 	/*
-	 * We have to assume the worse case ie pmd for invalidation. Note that
-	 * the page can not be free in this function as call of try_to_unmap()
-	 * must hold a reference on the page.
+	 * For THP, we have to assume the worse case ie pmd for invalidation.
+	 * For hugetlb, it could be much worse if we need to do pud
+	 * invalidation in the case of pmd sharing.
+	 *
+	 * Note that the page can not be free in this function as call of
+	 * try_to_unmap() must hold a reference on the page.
 	 */
 	end = min(vma->vm_end, start + (PAGE_SIZE << compound_order(page)));
+	if (PageHuge(page)) {
+		/*
+		 * If sharing is possible, start and end will be adjusted
+		 * accordingly.
+		 */
+		(void)huge_pmd_sharing_possible(vma, &start, &end);
+	}
 	mmu_notifier_invalidate_range_start(vma->vm_mm, start, end);
 
 	while (page_vma_mapped_walk(&pvmw)) {
@@ -1409,6 +1419,32 @@  static bool try_to_unmap_one(struct page *page, struct vm_area_struct *vma,
 		subpage = page - page_to_pfn(page) + pte_pfn(*pvmw.pte);
 		address = pvmw.address;
 
+		if (PageHuge(page)) {
+			if (huge_pmd_unshare(mm, &address, pvmw.pte)) {
+				/*
+				 * huge_pmd_unshare unmapped an entire PMD
+				 * page.  There is no way of knowing exactly
+				 * which PMDs may be cached for this mm, so
+				 * we must flush them all.  start/end were
+				 * already adjusted above to cover this range.
+				 */
+				flush_cache_range(vma, start, end);
+				flush_tlb_range(vma, start, end);
+				mmu_notifier_invalidate_range(mm, start, end);
+
+				/*
+				 * The ref count of the PMD page was dropped
+				 * which is part of the way map counting
+				 * is done for shared PMDs.  Return 'true'
+				 * here.  When there is no other sharing,
+				 * huge_pmd_unshare returns false and we will
+				 * unmap the actual page and drop map count
+				 * to zero.
+				 */
+				page_vma_mapped_walk_done(&pvmw);
+				break;
+			}
+		}
 
 		if (IS_ENABLED(CONFIG_MIGRATION) &&
 		    (flags & TTU_MIGRATION) &&