Message ID | 20180813034108.27269-1-mike.kravetz@oracle.com
---|---
State | New, archived
Series | mm: migration: fix migration of huge PMD shared pages
Hi Mike,

I love your patch! Yet something to improve:

[auto build test ERROR on linus/master]
[also build test ERROR on v4.18 next-20180810]
[if your patch is applied to the wrong git tree, please drop us a note to help improve the system]

url:    https://github.com/0day-ci/linux/commits/Mike-Kravetz/mm-migration-fix-migration-of-huge-PMD-shared-pages/20180813-114549
config: i386-randconfig-x003-201832 (attached as .config)
compiler: gcc-7 (Debian 7.3.0-16) 7.3.0
reproduce:
        # save the attached .config to linux build tree
        make ARCH=i386

All errors (new ones prefixed by >>):

   mm/rmap.c: In function 'try_to_unmap_one':
>> mm/rmap.c:1425:7: error: implicit declaration of function 'huge_pmd_unshare'; did you mean '__NR_unshare'? [-Werror=implicit-function-declaration]
         huge_pmd_unshare(mm, &address, pvmw.pte)) {
         ^~~~~~~~~~~~~~~~
         __NR_unshare
   cc1: some warnings being treated as errors

vim +1425 mm/rmap.c

  1382
  1383		/*
  1384		 * If the page is mlock()d, we cannot swap it out.
  1385		 * If it's recently referenced (perhaps page_referenced
  1386		 * skipped over this mm) then we should reactivate it.
  1387		 */
  1388		if (!(flags & TTU_IGNORE_MLOCK)) {
  1389			if (vma->vm_flags & VM_LOCKED) {
  1390				/* PTE-mapped THP are never mlocked */
  1391				if (!PageTransCompound(page)) {
  1392					/*
  1393					 * Holding pte lock, we do *not* need
  1394					 * mmap_sem here
  1395					 */
  1396					mlock_vma_page(page);
  1397				}
  1398				ret = false;
  1399				page_vma_mapped_walk_done(&pvmw);
  1400				break;
  1401			}
  1402			if (flags & TTU_MUNLOCK)
  1403				continue;
  1404		}
  1405
  1406		/* Unexpected PMD-mapped THP? */
  1407		VM_BUG_ON_PAGE(!pvmw.pte, page);
  1408
  1409		subpage = page - page_to_pfn(page) + pte_pfn(*pvmw.pte);
  1410		address = pvmw.address;
  1411
  1412		/*
  1413		 * PMDs for hugetlbfs pages could be shared. In this case,
  1414		 * pages with shared PMDs will have a mapcount of 1 no matter
  1415		 * how many times it is actually mapped. Map counting for
  1416		 * PMD sharing is mostly done via the reference count on the
  1417		 * PMD page itself. If the page we are trying to unmap is a
  1418		 * hugetlbfs page, attempt to 'unshare' at the PMD level.
  1419		 * huge_pmd_unshare takes care of clearing the PUD and
  1420		 * reference counting on the PMD page which effectively unmaps
  1421		 * the page. Take care of flushing cache and TLB for page in
  1422		 * this specific mapping here.
  1423		 */
  1424		if (PageHuge(page) &&
> 1425		    huge_pmd_unshare(mm, &address, pvmw.pte)) {
  1426			unsigned long end_add = address + vma_mmu_pagesize(vma);
  1427
  1428			flush_cache_range(vma, address, end_add);
  1429			flush_tlb_range(vma, address, end_add);
  1430			mmu_notifier_invalidate_range(mm, address, end_add);
  1431			continue;
  1432		}
  1433
  1434		if (IS_ENABLED(CONFIG_MIGRATION) &&
  1435		    (flags & TTU_MIGRATION) &&
  1436		    is_zone_device_page(page)) {
  1437			swp_entry_t entry;
  1438			pte_t swp_pte;
  1439
  1440			pteval = ptep_get_and_clear(mm, pvmw.address, pvmw.pte);
  1441
  1442			/*
  1443			 * Store the pfn of the page in a special migration
  1444			 * pte. do_swap_page() will wait until the migration
  1445			 * pte is removed and then restart fault handling.
  1446			 */
  1447			entry = make_migration_entry(page, 0);
  1448			swp_pte = swp_entry_to_pte(entry);
  1449			if (pte_soft_dirty(pteval))
  1450				swp_pte = pte_swp_mksoft_dirty(swp_pte);
  1451			set_pte_at(mm, pvmw.address, pvmw.pte, swp_pte);
  1452			/*
  1453			 * No need to invalidate here it will synchronize on
  1454			 * against the special swap migration pte.
  1455			 */
  1456			goto discard;
  1457		}
  1458
  1459		if (!(flags & TTU_IGNORE_ACCESS)) {
  1460			if (ptep_clear_flush_young_notify(vma, address,
  1461						pvmw.pte)) {
  1462				ret = false;
  1463				page_vma_mapped_walk_done(&pvmw);
  1464				break;
  1465			}
  1466		}
  1467
  1468		/* Nuke the page table entry. */
  1469		flush_cache_page(vma, address, pte_pfn(*pvmw.pte));
  1470		if (should_defer_flush(mm, flags)) {
  1471			/*
  1472			 * We clear the PTE but do not flush so potentially
  1473			 * a remote CPU could still be writing to the page.
  1474			 * If the entry was previously clean then the
  1475			 * architecture must guarantee that a clear->dirty
  1476			 * transition on a cached TLB entry is written through
  1477			 * and traps if the PTE is unmapped.
  1478			 */
  1479			pteval = ptep_get_and_clear(mm, address, pvmw.pte);
  1480
  1481			set_tlb_ubc_flush_pending(mm, pte_dirty(pteval));
  1482		} else {
  1483			pteval = ptep_clear_flush(vma, address, pvmw.pte);
  1484		}
  1485
  1486		/* Move the dirty bit to the page. Now the pte is gone. */
  1487		if (pte_dirty(pteval))
  1488			set_page_dirty(page);
  1489
  1490		/* Update high watermark before we lower rss */
  1491		update_hiwater_rss(mm);
  1492
  1493		if (PageHWPoison(page) && !(flags & TTU_IGNORE_HWPOISON)) {
  1494			pteval = swp_entry_to_pte(make_hwpoison_entry(subpage));
  1495			if (PageHuge(page)) {
  1496				int nr = 1 << compound_order(page);
  1497				hugetlb_count_sub(nr, mm);
  1498				set_huge_swap_pte_at(mm, address,
  1499						     pvmw.pte, pteval,
  1500						     vma_mmu_pagesize(vma));
  1501			} else {
  1502				dec_mm_counter(mm, mm_counter(page));
  1503				set_pte_at(mm, address, pvmw.pte, pteval);
  1504			}
  1505
  1506		} else if (pte_unused(pteval) && !userfaultfd_armed(vma)) {
  1507			/*
  1508			 * The guest indicated that the page content is of no
  1509			 * interest anymore. Simply discard the pte, vmscan
  1510			 * will take care of the rest.
  1511			 * A future reference will then fault in a new zero
  1512			 * page. When userfaultfd is active, we must not drop
  1513			 * this page though, as its main user (postcopy
  1514			 * migration) will not expect userfaults on already
  1515			 * copied pages.
  1516			 */
  1517			dec_mm_counter(mm, mm_counter(page));
  1518			/* We have to invalidate as we cleared the pte */
  1519			mmu_notifier_invalidate_range(mm, address,
  1520						      address + PAGE_SIZE);
  1521		} else if (IS_ENABLED(CONFIG_MIGRATION) &&
  1522				(flags & (TTU_MIGRATION|TTU_SPLIT_FREEZE))) {
  1523			swp_entry_t entry;
  1524			pte_t swp_pte;
  1525
  1526			if (arch_unmap_one(mm, vma, address, pteval) < 0) {
  1527				set_pte_at(mm, address, pvmw.pte, pteval);
  1528				ret = false;
  1529				page_vma_mapped_walk_done(&pvmw);
  1530				break;
  1531			}
  1532
  1533			/*
  1534			 * Store the pfn of the page in a special migration
  1535			 * pte. do_swap_page() will wait until the migration
  1536			 * pte is removed and then restart fault handling.
  1537			 */
  1538			entry = make_migration_entry(subpage,
  1539						     pte_write(pteval));
  1540			swp_pte = swp_entry_to_pte(entry);
  1541			if (pte_soft_dirty(pteval))
  1542				swp_pte = pte_swp_mksoft_dirty(swp_pte);
  1543			set_pte_at(mm, address, pvmw.pte, swp_pte);
  1544			/*
  1545			 * No need to invalidate here it will synchronize on
  1546			 * against the special swap migration pte.
  1547			 */
  1548		} else if (PageAnon(page)) {
  1549			swp_entry_t entry = { .val = page_private(subpage) };
  1550			pte_t swp_pte;
  1551			/*
  1552			 * Store the swap location in the pte.
  1553			 * See handle_pte_fault() ...
  1554			 */
  1555			if (unlikely(PageSwapBacked(page) != PageSwapCache(page))) {
  1556				WARN_ON_ONCE(1);
  1557				ret = false;
  1558				/* We have to invalidate as we cleared the pte */
  1559				mmu_notifier_invalidate_range(mm, address,
  1560							address + PAGE_SIZE);
  1561				page_vma_mapped_walk_done(&pvmw);
  1562				break;
  1563			}
  1564
  1565			/* MADV_FREE page check */
  1566			if (!PageSwapBacked(page)) {
  1567				if (!PageDirty(page)) {
  1568					/* Invalidate as we cleared the pte */
  1569					mmu_notifier_invalidate_range(mm,
  1570						address, address + PAGE_SIZE);
  1571					dec_mm_counter(mm, MM_ANONPAGES);
  1572					goto discard;
  1573				}
  1574
  1575				/*
  1576				 * If the page was redirtied, it cannot be
  1577				 * discarded. Remap the page to page table.
  1578				 */
  1579				set_pte_at(mm, address, pvmw.pte, pteval);
  1580				SetPageSwapBacked(page);
  1581				ret = false;
  1582				page_vma_mapped_walk_done(&pvmw);
  1583				break;
  1584			}
  1585
  1586			if (swap_duplicate(entry) < 0) {
  1587				set_pte_at(mm, address, pvmw.pte, pteval);
  1588				ret = false;
  1589				page_vma_mapped_walk_done(&pvmw);
  1590				break;
  1591			}
  1592			if (arch_unmap_one(mm, vma, address, pteval) < 0) {
  1593				set_pte_at(mm, address, pvmw.pte, pteval);
  1594				ret = false;
  1595				page_vma_mapped_walk_done(&pvmw);
  1596				break;
  1597			}
  1598			if (list_empty(&mm->mmlist)) {
  1599				spin_lock(&mmlist_lock);
  1600				if (list_empty(&mm->mmlist))
  1601					list_add(&mm->mmlist, &init_mm.mmlist);
  1602				spin_unlock(&mmlist_lock);
  1603			}
  1604			dec_mm_counter(mm, MM_ANONPAGES);
  1605			inc_mm_counter(mm, MM_SWAPENTS);
  1606			swp_pte = swp_entry_to_pte(entry);
  1607			if (pte_soft_dirty(pteval))
  1608				swp_pte = pte_swp_mksoft_dirty(swp_pte);
  1609			set_pte_at(mm, address, pvmw.pte, swp_pte);
  1610			/* Invalidate as we cleared the pte */
  1611			mmu_notifier_invalidate_range(mm, address,
  1612						      address + PAGE_SIZE);
  1613		} else {
  1614			/*
  1615			 * We should not need to notify here as we reach this
  1616			 * case only from freeze_page() itself only call from
  1617			 * split_huge_page_to_list() so everything below must
  1618			 * be true:
  1619			 * - page is not anonymous
  1620			 * - page is locked
  1621			 *
  1622			 * So as it is a locked file back page thus it can not
  1623			 * be remove from the page cache and replace by a new
  1624			 * page before mmu_notifier_invalidate_range_end so no
  1625			 * concurrent thread might update its page table to
  1626			 * point at new page while a device still is using this
  1627			 * page.
  1628			 *
  1629			 * See Documentation/vm/mmu_notifier.rst
  1630			 */
  1631			dec_mm_counter(mm, mm_counter_file(page));
  1632		}
  1633  discard:
  1634		/*
  1635		 * No need to call mmu_notifier_invalidate_range() it has be
  1636		 * done above for all cases requiring it to happen under page
  1637		 * table lock before mmu_notifier_invalidate_range_end()
  1638		 *
  1639		 * See Documentation/vm/mmu_notifier.rst
  1640		 */
  1641		page_remove_rmap(subpage, PageHuge(page));
  1642		put_page(page);
  1643	}
  1644
  1645	mmu_notifier_invalidate_range_end(vma->vm_mm, start, end);
  1646
  1647	return ret;
  1648  }
  1649

---
0-DAY kernel test infrastructure                Open Source Technology Center
https://lists.01.org/pipermail/kbuild-all                   Intel Corporation
Hi Mike,

I love your patch! Yet something to improve:

[auto build test ERROR on linus/master]
[also build test ERROR on v4.18 next-20180810]
[if your patch is applied to the wrong git tree, please drop us a note to help improve the system]

url:    https://github.com/0day-ci/linux/commits/Mike-Kravetz/mm-migration-fix-migration-of-huge-PMD-shared-pages/20180813-114549
config: i386-tinyconfig (attached as .config)
compiler: gcc-7 (Debian 7.3.0-16) 7.3.0
reproduce:
        # save the attached .config to linux build tree
        make ARCH=i386

All errors (new ones prefixed by >>):

   mm/rmap.c: In function 'try_to_unmap_one':
>> mm/rmap.c:1425:7: error: implicit declaration of function 'huge_pmd_unshare'; did you mean 'do_huge_pmd_wp_page'? [-Werror=implicit-function-declaration]
         huge_pmd_unshare(mm, &address, pvmw.pte)) {
         ^~~~~~~~~~~~~~~~
         do_huge_pmd_wp_page
   cc1: some warnings being treated as errors

---
0-DAY kernel test infrastructure                Open Source Technology Center
https://lists.01.org/pipermail/kbuild-all                   Intel Corporation
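Both robot reports point at the same root cause: in these configurations no declaration of huge_pmd_unshare() is visible to mm/rmap.c, so the new call trips -Werror=implicit-function-declaration. Below is a minimal sketch of the kind of no-op fallback a later revision could provide for such configurations; the exact location (include/linux/hugetlb.h) and form of the stub are assumptions for illustration, not something taken from this thread.

/*
 * Hypothetical sketch only: a no-op fallback so callers such as
 * try_to_unmap_one() still compile when hugetlb support (and with it
 * huge_pmd_unshare()) is not built in.  Returning 0 means "nothing was
 * unshared", so the caller simply continues with its normal unmap path.
 */
#ifndef CONFIG_HUGETLB_PAGE
static inline int huge_pmd_unshare(struct mm_struct *mm,
				   unsigned long *addr, pte_t *ptep)
{
	return 0;
}
#endif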
On Sun, Aug 12, 2018 at 08:41:08PM -0700, Mike Kravetz wrote:
> The page migration code employs try_to_unmap() to try and unmap the
> source page. This is accomplished by using rmap_walk to find all
> vmas where the page is mapped. This search stops when page mapcount
> is zero. For shared PMD huge pages, the page map count is always 1
> no matter the number of mappings. Shared mappings are tracked via
> the reference count of the PMD page. Therefore, try_to_unmap stops
> prematurely and does not completely unmap all mappings of the source
> page.
>
> This problem can result in data corruption as writes to the original
> source page can happen after contents of the page are copied to the
> target page. Hence, data is lost.
>
> This problem was originally seen as DB corruption of shared global
> areas after a huge page was soft offlined. DB developers noticed
> they could reproduce the issue by (hotplug) offlining memory used
> to back huge pages. A simple testcase can reproduce the problem by
> creating a shared PMD mapping (note that this must be at least
> PUD_SIZE in size and PUD_SIZE aligned (1GB on x86)), and using
> migrate_pages() to migrate process pages between nodes.
>
> To fix, have the try_to_unmap_one routine check for huge PMD sharing
> by calling huge_pmd_unshare for hugetlbfs huge pages. If it is a
> shared mapping it will be 'unshared' which removes the page table
> entry and drops reference on PMD page. After this, flush caches and
> TLB.
>
> Signed-off-by: Mike Kravetz <mike.kravetz@oracle.com>
> ---
> I am not 100% sure on the required flushing, so suggestions would be
> appreciated. This also should go to stable. It has been around for
> a long time so still looking for an appropriate 'fixes:'.

I believe we need flushing. And huge_pmd_unshare() usage in
__unmap_hugepage_range() looks suspicious: I don't see how we flush TLB
in that case.
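For readers following the flushing discussion, the sketch below is a simplified, paraphrased rendering of what huge_pmd_unshare() does at this point in the tree (not a verbatim quote of the kernel source): when the PMD page is shared it clears the PUD entry and drops one reference, tearing down a PUD_SIZE worth of translations for this mm, and it leaves all cache and TLB flushing to the caller.

/*
 * Simplified sketch (paraphrased, not verbatim) of huge_pmd_unshare():
 * returns 1 and clears the PUD entry when the PMD page was shared,
 * 0 when there was nothing to unshare.  Note that no cache or TLB
 * flushing happens here; that is left entirely to the caller.
 */
int huge_pmd_unshare(struct mm_struct *mm, unsigned long *addr, pte_t *ptep)
{
	pgd_t *pgd = pgd_offset(mm, *addr);
	p4d_t *p4d = p4d_offset(pgd, *addr);
	pud_t *pud = pud_offset(p4d, *addr);

	if (page_count(virt_to_page(ptep)) == 1)
		return 0;			/* not shared, nothing to do */

	pud_clear(pud);				/* drops PUD_SIZE of mappings for this mm */
	put_page(virt_to_page(ptep));		/* this mm's reference on the PMD page */
	mm_dec_nr_pmds(mm);
	*addr = ALIGN(*addr, HPAGE_SIZE * PTRS_PER_PTE) - HPAGE_SIZE;
	return 1;
}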
On 08/13/2018 03:58 AM, Kirill A. Shutemov wrote:
> On Sun, Aug 12, 2018 at 08:41:08PM -0700, Mike Kravetz wrote:
>> The page migration code employs try_to_unmap() to try and unmap the
>> source page. This is accomplished by using rmap_walk to find all
>> vmas where the page is mapped. This search stops when page mapcount
>> is zero. For shared PMD huge pages, the page map count is always 1
>> no matter the number of mappings. Shared mappings are tracked via
>> the reference count of the PMD page. Therefore, try_to_unmap stops
>> prematurely and does not completely unmap all mappings of the source
>> page.
>>
>> This problem can result in data corruption as writes to the original
>> source page can happen after contents of the page are copied to the
>> target page. Hence, data is lost.
>>
>> This problem was originally seen as DB corruption of shared global
>> areas after a huge page was soft offlined. DB developers noticed
>> they could reproduce the issue by (hotplug) offlining memory used
>> to back huge pages. A simple testcase can reproduce the problem by
>> creating a shared PMD mapping (note that this must be at least
>> PUD_SIZE in size and PUD_SIZE aligned (1GB on x86)), and using
>> migrate_pages() to migrate process pages between nodes.
>>
>> To fix, have the try_to_unmap_one routine check for huge PMD sharing
>> by calling huge_pmd_unshare for hugetlbfs huge pages. If it is a
>> shared mapping it will be 'unshared' which removes the page table
>> entry and drops reference on PMD page. After this, flush caches and
>> TLB.
>>
>> Signed-off-by: Mike Kravetz <mike.kravetz@oracle.com>
>> ---
>> I am not 100% sure on the required flushing, so suggestions would be
>> appreciated. This also should go to stable. It has been around for
>> a long time so still looking for an appropriate 'fixes:'.
>
> I believe we need flushing. And huge_pmd_unshare() usage in
> __unmap_hugepage_range() looks suspicious: I don't see how we flush TLB in
> that case.

Thanks Kirill,

__unmap_hugepage_range() has two callers:
1) unmap_hugepage_range, which wraps the call with tlb_gather_mmu and
   tlb_finish_mmu on the range. IIUC, this should cause an appropriate
   TLB flush.
2) __unmap_hugepage_range_final via unmap_single_vma. unmap_single_vma
   has three callers:
   - unmap_vmas which assumes the caller will flush the whole range after
     return.
   - zap_page_range wraps the call with tlb_gather_mmu/tlb_finish_mmu
   - zap_page_range_single wraps the call with tlb_gather_mmu/tlb_finish_mmu

So, it appears we are covered. But, I could be missing something.

My primary reason for asking the question was with respect to the code
added to try_to_unmap_one. In my testing, the changes I added appeared
to be required. Just wanted to make sure.

I need to fix a build issue and will send another version.
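For reference, the wrapping pattern described above looks roughly like the sketch below in this era of the tree (paraphrased, not quoted): the mmu_gather records what is being torn down and tlb_finish_mmu() performs the deferred TLB flush for the gathered range.

/*
 * Rough sketch of the tlb_gather_mmu()/tlb_finish_mmu() wrapping used
 * by callers such as zap_page_range_single(); paraphrased, not a
 * verbatim copy of the kernel source.
 */
static void zap_range_sketch(struct vm_area_struct *vma,
			     unsigned long start, unsigned long end)
{
	struct mm_struct *mm = vma->vm_mm;
	struct mmu_gather tlb;

	tlb_gather_mmu(&tlb, mm, start, end);		/* start a deferred-flush batch */
	mmu_notifier_invalidate_range_start(mm, start, end);
	unmap_single_vma(&tlb, vma, start, end, NULL);	/* tears down the PTEs */
	mmu_notifier_invalidate_range_end(mm, start, end);
	tlb_finish_mmu(&tlb, start, end);		/* TLB flush for the gathered range */
}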
On Mon, Aug 13, 2018 at 11:21:41PM +0000, Mike Kravetz wrote:
> On 08/13/2018 03:58 AM, Kirill A. Shutemov wrote:
> > On Sun, Aug 12, 2018 at 08:41:08PM -0700, Mike Kravetz wrote:
> >> The page migration code employs try_to_unmap() to try and unmap the
> >> source page. This is accomplished by using rmap_walk to find all
> >> vmas where the page is mapped. This search stops when page mapcount
> >> is zero. For shared PMD huge pages, the page map count is always 1
> >> no matter the number of mappings. Shared mappings are tracked via
> >> the reference count of the PMD page. Therefore, try_to_unmap stops
> >> prematurely and does not completely unmap all mappings of the source
> >> page.
> >>
> >> This problem can result in data corruption as writes to the original
> >> source page can happen after contents of the page are copied to the
> >> target page. Hence, data is lost.
> >>
> >> This problem was originally seen as DB corruption of shared global
> >> areas after a huge page was soft offlined. DB developers noticed
> >> they could reproduce the issue by (hotplug) offlining memory used
> >> to back huge pages. A simple testcase can reproduce the problem by
> >> creating a shared PMD mapping (note that this must be at least
> >> PUD_SIZE in size and PUD_SIZE aligned (1GB on x86)), and using
> >> migrate_pages() to migrate process pages between nodes.
> >>
> >> To fix, have the try_to_unmap_one routine check for huge PMD sharing
> >> by calling huge_pmd_unshare for hugetlbfs huge pages. If it is a
> >> shared mapping it will be 'unshared' which removes the page table
> >> entry and drops reference on PMD page. After this, flush caches and
> >> TLB.
> >>
> >> Signed-off-by: Mike Kravetz <mike.kravetz@oracle.com>
> >> ---
> >> I am not 100% sure on the required flushing, so suggestions would be
> >> appreciated. This also should go to stable. It has been around for
> >> a long time so still looking for an appropriate 'fixes:'.
> >
> > I believe we need flushing. And huge_pmd_unshare() usage in
> > __unmap_hugepage_range() looks suspicious: I don't see how we flush TLB in
> > that case.
>
> Thanks Kirill,
>
> __unmap_hugepage_range() has two callers:
> 1) unmap_hugepage_range, which wraps the call with tlb_gather_mmu and
>    tlb_finish_mmu on the range. IIUC, this should cause an appropriate
>    TLB flush.
> 2) __unmap_hugepage_range_final via unmap_single_vma. unmap_single_vma
>    has three callers:
>    - unmap_vmas which assumes the caller will flush the whole range after
>      return.
>    - zap_page_range wraps the call with tlb_gather_mmu/tlb_finish_mmu
>    - zap_page_range_single wraps the call with tlb_gather_mmu/tlb_finish_mmu
>
> So, it appears we are covered. But, I could be missing something.

My problem here is that the mapping that is removed by huge_pmd_unshare() is
not accounted into the mmu_gather and can be missed on tlb_finish_mmu().
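To make the concern concrete, the loop in __unmap_hugepage_range() handles a successful unshare roughly like this (simplified sketch, not a verbatim quote): the iteration just moves on, and nothing about the PUD_SIZE region that disappeared is recorded in the mmu_gather.

	/* Simplified sketch of the __unmap_hugepage_range() loop body. */
	for (address = start; address < end; address += sz) {
		ptep = huge_pte_offset(mm, address, sz);
		if (!ptep)
			continue;

		ptl = huge_pte_lock(h, mm, ptep);
		if (huge_pmd_unshare(mm, &address, ptep)) {
			/*
			 * A PUD_SIZE worth of mappings just went away, but
			 * nothing is fed into the mmu_gather for it, so the
			 * final tlb_finish_mmu() does not know this range
			 * needs flushing.
			 */
			spin_unlock(ptl);
			continue;
		}
		/* ... the normal per-page teardown below does feed the gather ... */
	}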
On 08/14/2018 01:48 AM, Kirill A. Shutemov wrote:
> On Mon, Aug 13, 2018 at 11:21:41PM +0000, Mike Kravetz wrote:
>> On 08/13/2018 03:58 AM, Kirill A. Shutemov wrote:
>>> On Sun, Aug 12, 2018 at 08:41:08PM -0700, Mike Kravetz wrote:
>>>> I am not 100% sure on the required flushing, so suggestions would be
>>>> appreciated. This also should go to stable. It has been around for
>>>> a long time so still looking for an appropriate 'fixes:'.
>>>
>>> I believe we need flushing. And huge_pmd_unshare() usage in
>>> __unmap_hugepage_range() looks suspicious: I don't see how we flush TLB in
>>> that case.
>>
>> Thanks Kirill,
>>
>> __unmap_hugepage_range() has two callers:
>> 1) unmap_hugepage_range, which wraps the call with tlb_gather_mmu and
>>    tlb_finish_mmu on the range. IIUC, this should cause an appropriate
>>    TLB flush.
>> 2) __unmap_hugepage_range_final via unmap_single_vma. unmap_single_vma
>>    has three callers:
>>    - unmap_vmas which assumes the caller will flush the whole range after
>>      return.
>>    - zap_page_range wraps the call with tlb_gather_mmu/tlb_finish_mmu
>>    - zap_page_range_single wraps the call with tlb_gather_mmu/tlb_finish_mmu
>>
>> So, it appears we are covered. But, I could be missing something.
>
> My problem here is that the mapping that is removed by huge_pmd_unshare() is
> not accounted into the mmu_gather and can be missed on tlb_finish_mmu().

Ah, I think I now see the issue you are concerned with.

When huge_pmd_unshare succeeds we effectively unmap a PUD_SIZE area.
The routine __unmap_hugepage_range may only have been passed a range
that is a subset of PUD_SIZE. In the case I was trying to address,
try_to_unmap_one(), the 'range' will certainly be less than PUD_SIZE.
Upon further thought, I think that even in the case of try_to_unmap_one
we should flush a PUD_SIZE range.

My first thought would be to embed this flushing within huge_pmd_unshare
itself. Perhaps, whenever huge_pmd_unshare succeeds we should do an
explicit:
	flush_cache_range(PUD_SIZE)
	flush_tlb_range(PUD_SIZE)
	mmu_notifier_invalidate_range(PUD_SIZE)
That would take some of the burden off the callers of huge_pmd_unshare.
However, I am not sure if the flushing calls above play nice in all the
calling environments. I'll look into it some more, but would appreciate
additional comments.
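Sketched out, the idea floated above would look something like the following. This is purely hypothetical: the helper name, the vma parameter and the PUD_MASK rounding are assumptions for illustration, and the reply below points out a problem with calling flush_tlb_range() in this context.

/*
 * Purely hypothetical sketch of embedding the flushing inside the
 * unshare step itself, as proposed above.  Name, parameters and the
 * PUD_MASK rounding are assumptions, not code from this thread.
 */
static int huge_pmd_unshare_and_flush(struct vm_area_struct *vma,
				      unsigned long *addr, pte_t *ptep)
{
	struct mm_struct *mm = vma->vm_mm;
	unsigned long start, end;

	if (!huge_pmd_unshare(mm, addr, ptep))
		return 0;

	/* The whole shared region is gone, so cover the full PUD_SIZE area. */
	start = *addr & PUD_MASK;
	end = start + PUD_SIZE;
	flush_cache_range(vma, start, end);
	flush_tlb_range(vma, start, end);
	mmu_notifier_invalidate_range(mm, start, end);
	return 1;
}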
On Tue, Aug 14, 2018 at 05:15:57PM -0700, Mike Kravetz wrote:
> On 08/14/2018 01:48 AM, Kirill A. Shutemov wrote:
> > On Mon, Aug 13, 2018 at 11:21:41PM +0000, Mike Kravetz wrote:
> >> On 08/13/2018 03:58 AM, Kirill A. Shutemov wrote:
> >>> On Sun, Aug 12, 2018 at 08:41:08PM -0700, Mike Kravetz wrote:
> >>>> I am not 100% sure on the required flushing, so suggestions would be
> >>>> appreciated. This also should go to stable. It has been around for
> >>>> a long time so still looking for an appropriate 'fixes:'.
> >>>
> >>> I believe we need flushing. And huge_pmd_unshare() usage in
> >>> __unmap_hugepage_range() looks suspicious: I don't see how we flush TLB in
> >>> that case.
> >>
> >> Thanks Kirill,
> >>
> >> __unmap_hugepage_range() has two callers:
> >> 1) unmap_hugepage_range, which wraps the call with tlb_gather_mmu and
> >>    tlb_finish_mmu on the range. IIUC, this should cause an appropriate
> >>    TLB flush.
> >> 2) __unmap_hugepage_range_final via unmap_single_vma. unmap_single_vma
> >>    has three callers:
> >>    - unmap_vmas which assumes the caller will flush the whole range after
> >>      return.
> >>    - zap_page_range wraps the call with tlb_gather_mmu/tlb_finish_mmu
> >>    - zap_page_range_single wraps the call with tlb_gather_mmu/tlb_finish_mmu
> >>
> >> So, it appears we are covered. But, I could be missing something.
> >
> > My problem here is that the mapping that is removed by huge_pmd_unshare() is
> > not accounted into the mmu_gather and can be missed on tlb_finish_mmu().
>
> Ah, I think I now see the issue you are concerned with.
>
> When huge_pmd_unshare succeeds we effectively unmap a PUD_SIZE area.
> The routine __unmap_hugepage_range may only have been passed a range
> that is a subset of PUD_SIZE. In the case I was trying to address,
> try_to_unmap_one(), the 'range' will certainly be less than PUD_SIZE.
> Upon further thought, I think that even in the case of try_to_unmap_one
> we should flush a PUD_SIZE range.
>
> My first thought would be to embed this flushing within huge_pmd_unshare
> itself. Perhaps, whenever huge_pmd_unshare succeeds we should do an
> explicit:
>	flush_cache_range(PUD_SIZE)
>	flush_tlb_range(PUD_SIZE)
>	mmu_notifier_invalidate_range(PUD_SIZE)
> That would take some of the burden off the callers of huge_pmd_unshare.
> However, I am not sure if the flushing calls above play nice in all the
> calling environments. I'll look into it some more, but would appreciate
> additional comments.

I don't think it would work: flush_tlb_range() does IPIs and calling it
under a spinlock will not go well.

I think we need to find a way to account it properly in the mmu_gather.
It's not obvious to me how.
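One way to picture what "accounting it into the mmu_gather" could mean, purely as a speculative illustration and not something proposed in this thread, is to widen the gathered range whenever an unshare succeeds in a path that already has a gather:

	/*
	 * Speculative illustration only: widen the mmu_gather range so the
	 * eventual tlb_finish_mmu() covers the PUD_SIZE region torn down by
	 * huge_pmd_unshare().  Whether this is sufficient or correct is
	 * exactly what is being debated above.
	 */
	if (huge_pmd_unshare(mm, &address, ptep)) {
		spin_unlock(ptl);
		tlb->start = min(tlb->start, address & PUD_MASK);
		tlb->end = max(tlb->end, (address & PUD_MASK) + PUD_SIZE);
		continue;
	}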
diff --git a/mm/rmap.c b/mm/rmap.c
index 09a799c9aebd..45583758bf16 100644
--- a/mm/rmap.c
+++ b/mm/rmap.c
@@ -1409,6 +1409,27 @@ static bool try_to_unmap_one(struct page *page, struct vm_area_struct *vma,
 		subpage = page - page_to_pfn(page) + pte_pfn(*pvmw.pte);
 		address = pvmw.address;
 
+		/*
+		 * PMDs for hugetlbfs pages could be shared. In this case,
+		 * pages with shared PMDs will have a mapcount of 1 no matter
+		 * how many times it is actually mapped. Map counting for
+		 * PMD sharing is mostly done via the reference count on the
+		 * PMD page itself. If the page we are trying to unmap is a
+		 * hugetlbfs page, attempt to 'unshare' at the PMD level.
+		 * huge_pmd_unshare takes care of clearing the PUD and
+		 * reference counting on the PMD page which effectively unmaps
+		 * the page. Take care of flushing cache and TLB for page in
+		 * this specific mapping here.
+		 */
+		if (PageHuge(page) &&
+		    huge_pmd_unshare(mm, &address, pvmw.pte)) {
+			unsigned long end_add = address + vma_mmu_pagesize(vma);
+
+			flush_cache_range(vma, address, end_add);
+			flush_tlb_range(vma, address, end_add);
+			mmu_notifier_invalidate_range(mm, address, end_add);
+			continue;
+		}
 
 		if (IS_ENABLED(CONFIG_MIGRATION) &&
 		    (flags & TTU_MIGRATION) &&
The page migration code employs try_to_unmap() to try and unmap the
source page. This is accomplished by using rmap_walk to find all
vmas where the page is mapped. This search stops when page mapcount
is zero. For shared PMD huge pages, the page map count is always 1
no matter the number of mappings. Shared mappings are tracked via
the reference count of the PMD page. Therefore, try_to_unmap stops
prematurely and does not completely unmap all mappings of the source
page.

This problem can result in data corruption as writes to the original
source page can happen after contents of the page are copied to the
target page. Hence, data is lost.

This problem was originally seen as DB corruption of shared global
areas after a huge page was soft offlined. DB developers noticed
they could reproduce the issue by (hotplug) offlining memory used
to back huge pages. A simple testcase can reproduce the problem by
creating a shared PMD mapping (note that this must be at least
PUD_SIZE in size and PUD_SIZE aligned (1GB on x86)), and using
migrate_pages() to migrate process pages between nodes.

To fix, have the try_to_unmap_one routine check for huge PMD sharing
by calling huge_pmd_unshare for hugetlbfs huge pages. If it is a
shared mapping it will be 'unshared' which removes the page table
entry and drops reference on PMD page. After this, flush caches and
TLB.

Signed-off-by: Mike Kravetz <mike.kravetz@oracle.com>
---
I am not 100% sure on the required flushing, so suggestions would be
appreciated. This also should go to stable. It has been around for
a long time so still looking for an appropriate 'fixes:'.

 mm/rmap.c | 21 +++++++++++++++++++++
 1 file changed, 21 insertions(+)
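For concreteness, below is a rough userspace sketch of the kind of testcase the log describes. Everything specific to it is an assumption for illustration: the alignment trick, the use of a fork()ed child as the second mapper of the shared region (assumed to be enough to trigger PMD sharing), the 2MB huge page size, and nodes 0 and 1 as the migration targets.

/*
 * Rough sketch of a reproducer along the lines described above: a
 * PUD_SIZE (1GB), PUD-aligned shared hugetlb mapping touched by two
 * processes while one of them migrates its pages between nodes and
 * watches for lost writes.  All specifics here are assumptions.
 */
#define _GNU_SOURCE
#include <signal.h>
#include <stdio.h>
#include <string.h>
#include <unistd.h>
#include <sys/mman.h>
#include <sys/syscall.h>
#include <sys/wait.h>

#define PUD_SZ		(1UL << 30)	/* PUD_SIZE on x86_64 */
#define HPAGE_SZ	(2UL << 20)	/* assumes 2MB huge pages */

int main(void)
{
	/* Over-reserve address space so the mapping can be PUD aligned. */
	char *probe = mmap(NULL, 2 * PUD_SZ, PROT_NONE,
			   MAP_PRIVATE | MAP_ANONYMOUS | MAP_NORESERVE, -1, 0);
	char *aligned = (char *)(((unsigned long)probe + PUD_SZ - 1) & ~(PUD_SZ - 1));
	char *map = mmap(aligned, PUD_SZ, PROT_READ | PROT_WRITE,
			 MAP_SHARED | MAP_ANONYMOUS | MAP_HUGETLB | MAP_FIXED, -1, 0);
	volatile unsigned long *word = (volatile unsigned long *)map;
	unsigned long nodes[2] = { 1UL << 0, 1UL << 1 };
	unsigned long last = 0;
	pid_t child;

	if (map == MAP_FAILED) {
		perror("mmap(MAP_HUGETLB)");	/* needs ~512 free 2MB hugepages */
		return 1;
	}
	memset(map, 0, PUD_SZ);			/* fault the whole region in */

	child = fork();
	if (child == 0) {
		/* Second mapper: keep writing rising sequence numbers. */
		for (unsigned long seq = 1; ; seq++)
			for (unsigned long off = 0; off < PUD_SZ; off += HPAGE_SZ)
				*(volatile unsigned long *)(map + off) = seq;
	}

	/* Bounce this process's pages between node 0 and node 1. */
	for (int round = 0; round < 200; round++) {
		syscall(SYS_migrate_pages, getpid(), 64,
			&nodes[round & 1], &nodes[(round & 1) ^ 1]);
		usleep(50000);
		if (*word <= last)	/* the child's writes stopped being visible */
			fprintf(stderr, "possible lost writes after round %d\n", round);
		last = *word;
	}

	kill(child, SIGKILL);
	wait(NULL);
	return 0;
}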