Message ID | 20180813034108.27269-1-mike.kravetz@oracle.com
---|---
State | New, archived
Series | mm: migration: fix migration of huge PMD shared pages
Hi Mike,

I love your patch! Yet something to improve:

[auto build test ERROR on linus/master]
[also build test ERROR on v4.18 next-20180810]
[if your patch is applied to the wrong git tree, please drop us a note to help improve the system]

url:    https://github.com/0day-ci/linux/commits/Mike-Kravetz/mm-migration-fix-migration-of-huge-PMD-shared-pages/20180813-114549
config: i386-randconfig-x003-201832 (attached as .config)
compiler: gcc-7 (Debian 7.3.0-16) 7.3.0
reproduce:
        # save the attached .config to linux build tree
        make ARCH=i386

All errors (new ones prefixed by >>):

   mm/rmap.c: In function 'try_to_unmap_one':
>> mm/rmap.c:1425:7: error: implicit declaration of function 'huge_pmd_unshare'; did you mean '__NR_unshare'? [-Werror=implicit-function-declaration]
         huge_pmd_unshare(mm, &address, pvmw.pte)) {
         ^~~~~~~~~~~~~~~~
         __NR_unshare
   cc1: some warnings being treated as errors

vim +1425 mm/rmap.c

  1382
  1383		/*
  1384		 * If the page is mlock()d, we cannot swap it out.
  1385		 * If it's recently referenced (perhaps page_referenced
  1386		 * skipped over this mm) then we should reactivate it.
  1387		 */
  1388		if (!(flags & TTU_IGNORE_MLOCK)) {
  1389			if (vma->vm_flags & VM_LOCKED) {
  1390				/* PTE-mapped THP are never mlocked */
  1391				if (!PageTransCompound(page)) {
  1392					/*
  1393					 * Holding pte lock, we do *not* need
  1394					 * mmap_sem here
  1395					 */
  1396					mlock_vma_page(page);
  1397				}
  1398				ret = false;
  1399				page_vma_mapped_walk_done(&pvmw);
  1400				break;
  1401			}
  1402			if (flags & TTU_MUNLOCK)
  1403				continue;
  1404		}
  1405
  1406		/* Unexpected PMD-mapped THP? */
  1407		VM_BUG_ON_PAGE(!pvmw.pte, page);
  1408
  1409		subpage = page - page_to_pfn(page) + pte_pfn(*pvmw.pte);
  1410		address = pvmw.address;
  1411
  1412		/*
  1413		 * PMDs for hugetlbfs pages could be shared. In this case,
  1414		 * pages with shared PMDs will have a mapcount of 1 no matter
  1415		 * how many times it is actually mapped. Map counting for
  1416		 * PMD sharing is mostly done via the reference count on the
  1417		 * PMD page itself. If the page we are trying to unmap is a
  1418		 * hugetlbfs page, attempt to 'unshare' at the PMD level.
  1419		 * huge_pmd_unshare takes care of clearing the PUD and
  1420		 * reference counting on the PMD page which effectively unmaps
  1421		 * the page. Take care of flushing cache and TLB for page in
  1422		 * this specific mapping here.
  1423		 */
  1424		if (PageHuge(page) &&
> 1425		    huge_pmd_unshare(mm, &address, pvmw.pte)) {
  1426			unsigned long end_add = address + vma_mmu_pagesize(vma);
  1427
  1428			flush_cache_range(vma, address, end_add);
  1429			flush_tlb_range(vma, address, end_add);
  1430			mmu_notifier_invalidate_range(mm, address, end_add);
  1431			continue;
  1432		}
  1433
  1434		if (IS_ENABLED(CONFIG_MIGRATION) &&
  1435		    (flags & TTU_MIGRATION) &&
  1436		    is_zone_device_page(page)) {
  1437			swp_entry_t entry;
  1438			pte_t swp_pte;
  1439
  1440			pteval = ptep_get_and_clear(mm, pvmw.address, pvmw.pte);
  1441
  1442			/*
  1443			 * Store the pfn of the page in a special migration
  1444			 * pte. do_swap_page() will wait until the migration
  1445			 * pte is removed and then restart fault handling.
  1446			 */
  1447			entry = make_migration_entry(page, 0);
  1448			swp_pte = swp_entry_to_pte(entry);
  1449			if (pte_soft_dirty(pteval))
  1450				swp_pte = pte_swp_mksoft_dirty(swp_pte);
  1451			set_pte_at(mm, pvmw.address, pvmw.pte, swp_pte);
  1452			/*
  1453			 * No need to invalidate here it will synchronize on
  1454			 * against the special swap migration pte.
  1455			 */
  1456			goto discard;
  1457		}
  1458
  1459		if (!(flags & TTU_IGNORE_ACCESS)) {
  1460			if (ptep_clear_flush_young_notify(vma, address,
  1461						pvmw.pte)) {
  1462				ret = false;
  1463				page_vma_mapped_walk_done(&pvmw);
  1464				break;
  1465			}
  1466		}
  1467
  1468		/* Nuke the page table entry. */
  1469		flush_cache_page(vma, address, pte_pfn(*pvmw.pte));
  1470		if (should_defer_flush(mm, flags)) {
  1471			/*
  1472			 * We clear the PTE but do not flush so potentially
  1473			 * a remote CPU could still be writing to the page.
  1474			 * If the entry was previously clean then the
  1475			 * architecture must guarantee that a clear->dirty
  1476			 * transition on a cached TLB entry is written through
  1477			 * and traps if the PTE is unmapped.
  1478			 */
  1479			pteval = ptep_get_and_clear(mm, address, pvmw.pte);
  1480
  1481			set_tlb_ubc_flush_pending(mm, pte_dirty(pteval));
  1482		} else {
  1483			pteval = ptep_clear_flush(vma, address, pvmw.pte);
  1484		}
  1485
  1486		/* Move the dirty bit to the page. Now the pte is gone. */
  1487		if (pte_dirty(pteval))
  1488			set_page_dirty(page);
  1489
  1490		/* Update high watermark before we lower rss */
  1491		update_hiwater_rss(mm);
  1492
  1493		if (PageHWPoison(page) && !(flags & TTU_IGNORE_HWPOISON)) {
  1494			pteval = swp_entry_to_pte(make_hwpoison_entry(subpage));
  1495			if (PageHuge(page)) {
  1496				int nr = 1 << compound_order(page);
  1497				hugetlb_count_sub(nr, mm);
  1498				set_huge_swap_pte_at(mm, address,
  1499						     pvmw.pte, pteval,
  1500						     vma_mmu_pagesize(vma));
  1501			} else {
  1502				dec_mm_counter(mm, mm_counter(page));
  1503				set_pte_at(mm, address, pvmw.pte, pteval);
  1504			}
  1505
  1506		} else if (pte_unused(pteval) && !userfaultfd_armed(vma)) {
  1507			/*
  1508			 * The guest indicated that the page content is of no
  1509			 * interest anymore. Simply discard the pte, vmscan
  1510			 * will take care of the rest.
  1511			 * A future reference will then fault in a new zero
  1512			 * page. When userfaultfd is active, we must not drop
  1513			 * this page though, as its main user (postcopy
  1514			 * migration) will not expect userfaults on already
  1515			 * copied pages.
  1516			 */
  1517			dec_mm_counter(mm, mm_counter(page));
  1518			/* We have to invalidate as we cleared the pte */
  1519			mmu_notifier_invalidate_range(mm, address,
  1520						      address + PAGE_SIZE);
  1521		} else if (IS_ENABLED(CONFIG_MIGRATION) &&
  1522				(flags & (TTU_MIGRATION|TTU_SPLIT_FREEZE))) {
  1523			swp_entry_t entry;
  1524			pte_t swp_pte;
  1525
  1526			if (arch_unmap_one(mm, vma, address, pteval) < 0) {
  1527				set_pte_at(mm, address, pvmw.pte, pteval);
  1528				ret = false;
  1529				page_vma_mapped_walk_done(&pvmw);
  1530				break;
  1531			}
  1532
  1533			/*
  1534			 * Store the pfn of the page in a special migration
  1535			 * pte. do_swap_page() will wait until the migration
  1536			 * pte is removed and then restart fault handling.
  1537			 */
  1538			entry = make_migration_entry(subpage,
  1539						     pte_write(pteval));
  1540			swp_pte = swp_entry_to_pte(entry);
  1541			if (pte_soft_dirty(pteval))
  1542				swp_pte = pte_swp_mksoft_dirty(swp_pte);
  1543			set_pte_at(mm, address, pvmw.pte, swp_pte);
  1544			/*
  1545			 * No need to invalidate here it will synchronize on
  1546			 * against the special swap migration pte.
  1547			 */
  1548		} else if (PageAnon(page)) {
  1549			swp_entry_t entry = { .val = page_private(subpage) };
  1550			pte_t swp_pte;
  1551			/*
  1552			 * Store the swap location in the pte.
  1553			 * See handle_pte_fault() ...
  1554			 */
  1555			if (unlikely(PageSwapBacked(page) != PageSwapCache(page))) {
  1556				WARN_ON_ONCE(1);
  1557				ret = false;
  1558				/* We have to invalidate as we cleared the pte */
  1559				mmu_notifier_invalidate_range(mm, address,
  1560							address + PAGE_SIZE);
  1561				page_vma_mapped_walk_done(&pvmw);
  1562				break;
  1563			}
  1564
  1565			/* MADV_FREE page check */
  1566			if (!PageSwapBacked(page)) {
  1567				if (!PageDirty(page)) {
  1568					/* Invalidate as we cleared the pte */
  1569					mmu_notifier_invalidate_range(mm,
  1570						address, address + PAGE_SIZE);
  1571					dec_mm_counter(mm, MM_ANONPAGES);
  1572					goto discard;
  1573				}
  1574
  1575				/*
  1576				 * If the page was redirtied, it cannot be
  1577				 * discarded. Remap the page to page table.
  1578				 */
  1579				set_pte_at(mm, address, pvmw.pte, pteval);
  1580				SetPageSwapBacked(page);
  1581				ret = false;
  1582				page_vma_mapped_walk_done(&pvmw);
  1583				break;
  1584			}
  1585
  1586			if (swap_duplicate(entry) < 0) {
  1587				set_pte_at(mm, address, pvmw.pte, pteval);
  1588				ret = false;
  1589				page_vma_mapped_walk_done(&pvmw);
  1590				break;
  1591			}
  1592			if (arch_unmap_one(mm, vma, address, pteval) < 0) {
  1593				set_pte_at(mm, address, pvmw.pte, pteval);
  1594				ret = false;
  1595				page_vma_mapped_walk_done(&pvmw);
  1596				break;
  1597			}
  1598			if (list_empty(&mm->mmlist)) {
  1599				spin_lock(&mmlist_lock);
  1600				if (list_empty(&mm->mmlist))
  1601					list_add(&mm->mmlist, &init_mm.mmlist);
  1602				spin_unlock(&mmlist_lock);
  1603			}
  1604			dec_mm_counter(mm, MM_ANONPAGES);
  1605			inc_mm_counter(mm, MM_SWAPENTS);
  1606			swp_pte = swp_entry_to_pte(entry);
  1607			if (pte_soft_dirty(pteval))
  1608				swp_pte = pte_swp_mksoft_dirty(swp_pte);
  1609			set_pte_at(mm, address, pvmw.pte, swp_pte);
  1610			/* Invalidate as we cleared the pte */
  1611			mmu_notifier_invalidate_range(mm, address,
  1612						      address + PAGE_SIZE);
  1613		} else {
  1614			/*
  1615			 * We should not need to notify here as we reach this
  1616			 * case only from freeze_page() itself only call from
  1617			 * split_huge_page_to_list() so everything below must
  1618			 * be true:
  1619			 * - page is not anonymous
  1620			 * - page is locked
  1621			 *
  1622			 * So as it is a locked file back page thus it can not
  1623			 * be remove from the page cache and replace by a new
  1624			 * page before mmu_notifier_invalidate_range_end so no
  1625			 * concurrent thread might update its page table to
  1626			 * point at new page while a device still is using this
  1627			 * page.
  1628			 *
  1629			 * See Documentation/vm/mmu_notifier.rst
  1630			 */
  1631			dec_mm_counter(mm, mm_counter_file(page));
  1632		}
  1633  discard:
  1634		/*
  1635		 * No need to call mmu_notifier_invalidate_range() it has be
  1636		 * done above for all cases requiring it to happen under page
  1637		 * table lock before mmu_notifier_invalidate_range_end()
  1638		 *
  1639		 * See Documentation/vm/mmu_notifier.rst
  1640		 */
  1641		page_remove_rmap(subpage, PageHuge(page));
  1642		put_page(page);
  1643	}
  1644
  1645	mmu_notifier_invalidate_range_end(vma->vm_mm, start, end);
  1646
  1647	return ret;
  1648  }
  1649

---
0-DAY kernel test infrastructure                Open Source Technology Center
https://lists.01.org/pipermail/kbuild-all                   Intel Corporation
Hi Mike,

I love your patch! Yet something to improve:

[auto build test ERROR on linus/master]
[also build test ERROR on v4.18 next-20180810]
[if your patch is applied to the wrong git tree, please drop us a note to help improve the system]

url:    https://github.com/0day-ci/linux/commits/Mike-Kravetz/mm-migration-fix-migration-of-huge-PMD-shared-pages/20180813-114549
config: i386-tinyconfig (attached as .config)
compiler: gcc-7 (Debian 7.3.0-16) 7.3.0
reproduce:
        # save the attached .config to linux build tree
        make ARCH=i386

All errors (new ones prefixed by >>):

   mm/rmap.c: In function 'try_to_unmap_one':
>> mm/rmap.c:1425:7: error: implicit declaration of function 'huge_pmd_unshare'; did you mean 'do_huge_pmd_wp_page'? [-Werror=implicit-function-declaration]
         huge_pmd_unshare(mm, &address, pvmw.pte)) {
         ^~~~~~~~~~~~~~~~
         do_huge_pmd_wp_page
   cc1: some warnings being treated as errors

---
0-DAY kernel test infrastructure                Open Source Technology Center
https://lists.01.org/pipermail/kbuild-all                   Intel Corporation
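Both robot reports point at the same root cause: in these configurations no declaration of huge_pmd_unshare() is visible to mm/rmap.c, so the new call trips -Werror=implicit-function-declaration. Below is a minimal sketch of the kind of no-op fallback a later revision could provide for such configurations; the exact location (include/linux/hugetlb.h) and form of the stub are assumptions for illustration, not something taken from this thread.

/*
 * Hypothetical sketch only: a no-op fallback so callers such as
 * try_to_unmap_one() still compile when hugetlb support (and with it
 * huge_pmd_unshare()) is not built in.  Returning 0 means "nothing was
 * unshared", so the caller simply continues with its normal unmap path.
 */
#ifndef CONFIG_HUGETLB_PAGE
static inline int huge_pmd_unshare(struct mm_struct *mm,
				   unsigned long *addr, pte_t *ptep)
{
	return 0;
}
#endif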
On Sun, Aug 12, 2018 at 08:41:08PM -0700, Mike Kravetz wrote:
> The page migration code employs try_to_unmap() to try and unmap the
> source page. This is accomplished by using rmap_walk to find all
> vmas where the page is mapped. This search stops when page mapcount
> is zero. For shared PMD huge pages, the page map count is always 1
> no matter the number of mappings. Shared mappings are tracked via
> the reference count of the PMD page. Therefore, try_to_unmap stops
> prematurely and does not completely unmap all mappings of the source
> page.
>
> This problem can result in data corruption as writes to the original
> source page can happen after contents of the page are copied to the
> target page. Hence, data is lost.
>
> This problem was originally seen as DB corruption of shared global
> areas after a huge page was soft offlined. DB developers noticed
> they could reproduce the issue by (hotplug) offlining memory used
> to back huge pages. A simple testcase can reproduce the problem by
> creating a shared PMD mapping (note that this must be at least
> PUD_SIZE in size and PUD_SIZE aligned (1GB on x86)), and using
> migrate_pages() to migrate process pages between nodes.
>
> To fix, have the try_to_unmap_one routine check for huge PMD sharing
> by calling huge_pmd_unshare for hugetlbfs huge pages. If it is a
> shared mapping it will be 'unshared' which removes the page table
> entry and drops reference on PMD page. After this, flush caches and
> TLB.
>
> Signed-off-by: Mike Kravetz <mike.kravetz@oracle.com>
> ---
> I am not 100% sure on the required flushing, so suggestions would be
> appreciated. This also should go to stable. It has been around for
> a long time so still looking for an appropriate 'fixes:'.

I believe we need flushing. And huge_pmd_unshare() usage in
__unmap_hugepage_range() looks suspicious: I don't see how we flush TLB
in that case.
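For readers following the flushing discussion, the sketch below is a simplified, paraphrased rendering of what huge_pmd_unshare() does at this point in the tree (not a verbatim quote of the kernel source): when the PMD page is shared it clears the PUD entry and drops one reference, tearing down a PUD_SIZE worth of translations for this mm, and it leaves all cache and TLB flushing to the caller.

/*
 * Simplified sketch (paraphrased, not verbatim) of huge_pmd_unshare():
 * returns 1 and clears the PUD entry when the PMD page was shared,
 * 0 when there was nothing to unshare.  Note that no cache or TLB
 * flushing happens here; that is left entirely to the caller.
 */
int huge_pmd_unshare(struct mm_struct *mm, unsigned long *addr, pte_t *ptep)
{
	pgd_t *pgd = pgd_offset(mm, *addr);
	p4d_t *p4d = p4d_offset(pgd, *addr);
	pud_t *pud = pud_offset(p4d, *addr);

	if (page_count(virt_to_page(ptep)) == 1)
		return 0;			/* not shared, nothing to do */

	pud_clear(pud);				/* drops PUD_SIZE of mappings for this mm */
	put_page(virt_to_page(ptep));		/* this mm's reference on the PMD page */
	mm_dec_nr_pmds(mm);
	*addr = ALIGN(*addr, HPAGE_SIZE * PTRS_PER_PTE) - HPAGE_SIZE;
	return 1;
}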
On 08/13/2018 03:58 AM, Kirill A. Shutemov wrote:
> On Sun, Aug 12, 2018 at 08:41:08PM -0700, Mike Kravetz wrote:
>> The page migration code employs try_to_unmap() to try and unmap the
>> source page. This is accomplished by using rmap_walk to find all
>> vmas where the page is mapped. This search stops when page mapcount
>> is zero. For shared PMD huge pages, the page map count is always 1
>> no matter the number of mappings. Shared mappings are tracked via
>> the reference count of the PMD page. Therefore, try_to_unmap stops
>> prematurely and does not completely unmap all mappings of the source
>> page.
>>
>> This problem can result in data corruption as writes to the original
>> source page can happen after contents of the page are copied to the
>> target page. Hence, data is lost.
>>
>> This problem was originally seen as DB corruption of shared global
>> areas after a huge page was soft offlined. DB developers noticed
>> they could reproduce the issue by (hotplug) offlining memory used
>> to back huge pages. A simple testcase can reproduce the problem by
>> creating a shared PMD mapping (note that this must be at least
>> PUD_SIZE in size and PUD_SIZE aligned (1GB on x86)), and using
>> migrate_pages() to migrate process pages between nodes.
>>
>> To fix, have the try_to_unmap_one routine check for huge PMD sharing
>> by calling huge_pmd_unshare for hugetlbfs huge pages. If it is a
>> shared mapping it will be 'unshared' which removes the page table
>> entry and drops reference on PMD page. After this, flush caches and
>> TLB.
>>
>> Signed-off-by: Mike Kravetz <mike.kravetz@oracle.com>
>> ---
>> I am not 100% sure on the required flushing, so suggestions would be
>> appreciated. This also should go to stable. It has been around for
>> a long time so still looking for an appropriate 'fixes:'.
>
> I believe we need flushing. And huge_pmd_unshare() usage in
> __unmap_hugepage_range() looks suspicious: I don't see how we flush TLB in
> that case.

Thanks Kirill,

__unmap_hugepage_range() has two callers:
1) unmap_hugepage_range, which wraps the call with tlb_gather_mmu and
   tlb_finish_mmu on the range. IIUC, this should cause an appropriate
   TLB flush.
2) __unmap_hugepage_range_final via unmap_single_vma. unmap_single_vma
   has three callers:
   - unmap_vmas which assumes the caller will flush the whole range after
     return.
   - zap_page_range wraps the call with tlb_gather_mmu/tlb_finish_mmu
   - zap_page_range_single wraps the call with tlb_gather_mmu/tlb_finish_mmu

So, it appears we are covered. But, I could be missing something.

My primary reason for asking the question was with respect to the code
added to try_to_unmap_one. In my testing, the changes I added appeared
to be required. Just wanted to make sure.

I need to fix a build issue and will send another version.
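For reference, the wrapping pattern described above looks roughly like the sketch below in this era of the tree (paraphrased, not quoted): the mmu_gather records what is being torn down and tlb_finish_mmu() performs the deferred TLB flush for the gathered range.

/*
 * Rough sketch of the tlb_gather_mmu()/tlb_finish_mmu() wrapping used
 * by callers such as zap_page_range_single(); paraphrased, not a
 * verbatim copy of the kernel source.
 */
static void zap_range_sketch(struct vm_area_struct *vma,
			     unsigned long start, unsigned long end)
{
	struct mm_struct *mm = vma->vm_mm;
	struct mmu_gather tlb;

	tlb_gather_mmu(&tlb, mm, start, end);		/* start a deferred-flush batch */
	mmu_notifier_invalidate_range_start(mm, start, end);
	unmap_single_vma(&tlb, vma, start, end, NULL);	/* tears down the PTEs */
	mmu_notifier_invalidate_range_end(mm, start, end);
	tlb_finish_mmu(&tlb, start, end);		/* TLB flush for the gathered range */
}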
On Mon, Aug 13, 2018 at 11:21:41PM +0000, Mike Kravetz wrote:
> On 08/13/2018 03:58 AM, Kirill A. Shutemov wrote:
> > On Sun, Aug 12, 2018 at 08:41:08PM -0700, Mike Kravetz wrote:
> >> The page migration code employs try_to_unmap() to try and unmap the
> >> source page. This is accomplished by using rmap_walk to find all
> >> vmas where the page is mapped. This search stops when page mapcount
> >> is zero. For shared PMD huge pages, the page map count is always 1
> >> no matter the number of mappings. Shared mappings are tracked via
> >> the reference count of the PMD page. Therefore, try_to_unmap stops
> >> prematurely and does not completely unmap all mappings of the source
> >> page.
> >>
> >> This problem can result in data corruption as writes to the original
> >> source page can happen after contents of the page are copied to the
> >> target page. Hence, data is lost.
> >>
> >> This problem was originally seen as DB corruption of shared global
> >> areas after a huge page was soft offlined. DB developers noticed
> >> they could reproduce the issue by (hotplug) offlining memory used
> >> to back huge pages. A simple testcase can reproduce the problem by
> >> creating a shared PMD mapping (note that this must be at least
> >> PUD_SIZE in size and PUD_SIZE aligned (1GB on x86)), and using
> >> migrate_pages() to migrate process pages between nodes.
> >>
> >> To fix, have the try_to_unmap_one routine check for huge PMD sharing
> >> by calling huge_pmd_unshare for hugetlbfs huge pages. If it is a
> >> shared mapping it will be 'unshared' which removes the page table
> >> entry and drops reference on PMD page. After this, flush caches and
> >> TLB.
> >>
> >> Signed-off-by: Mike Kravetz <mike.kravetz@oracle.com>
> >> ---
> >> I am not 100% sure on the required flushing, so suggestions would be
> >> appreciated. This also should go to stable. It has been around for
> >> a long time so still looking for an appropriate 'fixes:'.
> >
> > I believe we need flushing. And huge_pmd_unshare() usage in
> > __unmap_hugepage_range() looks suspicious: I don't see how we flush TLB in
> > that case.
>
> Thanks Kirill,
>
> __unmap_hugepage_range() has two callers:
> 1) unmap_hugepage_range, which wraps the call with tlb_gather_mmu and
>    tlb_finish_mmu on the range. IIUC, this should cause an appropriate
>    TLB flush.
> 2) __unmap_hugepage_range_final via unmap_single_vma. unmap_single_vma
>    has three callers:
>    - unmap_vmas which assumes the caller will flush the whole range after
>      return.
>    - zap_page_range wraps the call with tlb_gather_mmu/tlb_finish_mmu
>    - zap_page_range_single wraps the call with tlb_gather_mmu/tlb_finish_mmu
>
> So, it appears we are covered. But, I could be missing something.

My problem here is that the mapping that is removed by huge_pmd_unshare() is
not accounted into the mmu_gather and can be missed on tlb_finish_mmu().
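To make the concern concrete, the loop in __unmap_hugepage_range() handles a successful unshare roughly like this (simplified sketch, not a verbatim quote): the iteration just moves on, and nothing about the PUD_SIZE region that disappeared is recorded in the mmu_gather.

	/* Simplified sketch of the __unmap_hugepage_range() loop body. */
	for (address = start; address < end; address += sz) {
		ptep = huge_pte_offset(mm, address, sz);
		if (!ptep)
			continue;

		ptl = huge_pte_lock(h, mm, ptep);
		if (huge_pmd_unshare(mm, &address, ptep)) {
			/*
			 * A PUD_SIZE worth of mappings just went away, but
			 * nothing is fed into the mmu_gather for it, so the
			 * final tlb_finish_mmu() does not know this range
			 * needs flushing.
			 */
			spin_unlock(ptl);
			continue;
		}
		/* ... the normal per-page teardown below does feed the gather ... */
	}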
On 08/14/2018 01:48 AM, Kirill A. Shutemov wrote:
> On Mon, Aug 13, 2018 at 11:21:41PM +0000, Mike Kravetz wrote:
>> On 08/13/2018 03:58 AM, Kirill A. Shutemov wrote:
>>> On Sun, Aug 12, 2018 at 08:41:08PM -0700, Mike Kravetz wrote:
>>>> I am not 100% sure on the required flushing, so suggestions would be
>>>> appreciated. This also should go to stable. It has been around for
>>>> a long time so still looking for an appropriate 'fixes:'.
>>>
>>> I believe we need flushing. And huge_pmd_unshare() usage in
>>> __unmap_hugepage_range() looks suspicious: I don't see how we flush TLB in
>>> that case.
>>
>> Thanks Kirill,
>>
>> __unmap_hugepage_range() has two callers:
>> 1) unmap_hugepage_range, which wraps the call with tlb_gather_mmu and
>>    tlb_finish_mmu on the range. IIUC, this should cause an appropriate
>>    TLB flush.
>> 2) __unmap_hugepage_range_final via unmap_single_vma. unmap_single_vma
>>    has three callers:
>>    - unmap_vmas which assumes the caller will flush the whole range after
>>      return.
>>    - zap_page_range wraps the call with tlb_gather_mmu/tlb_finish_mmu
>>    - zap_page_range_single wraps the call with tlb_gather_mmu/tlb_finish_mmu
>>
>> So, it appears we are covered. But, I could be missing something.
>
> My problem here is that the mapping that is removed by huge_pmd_unshare() is
> not accounted into the mmu_gather and can be missed on tlb_finish_mmu().

Ah, I think I now see the issue you are concerned with.

When huge_pmd_unshare succeeds we effectively unmap a PUD_SIZE area.
The routine __unmap_hugepage_range may only have been passed a range
that is a subset of PUD_SIZE. In the case I was trying to address,
try_to_unmap_one(), the 'range' will certainly be less than PUD_SIZE.
Upon further thought, I think that even in the case of try_to_unmap_one
we should flush a PUD_SIZE range.

My first thought would be to embed this flushing within huge_pmd_unshare
itself. Perhaps, whenever huge_pmd_unshare succeeds we should do an
explicit:
	flush_cache_range(PUD_SIZE)
	flush_tlb_range(PUD_SIZE)
	mmu_notifier_invalidate_range(PUD_SIZE)
That would take some of the burden off the callers of huge_pmd_unshare.
However, I am not sure if the flushing calls above play nice in all the
calling environments. I'll look into it some more, but would appreciate
additional comments.
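Sketched out, the idea floated above would look something like the following. This is purely hypothetical: the helper name, the vma parameter and the PUD_MASK rounding are assumptions for illustration, and the reply below points out a problem with calling flush_tlb_range() in this context.

/*
 * Purely hypothetical sketch of embedding the flushing inside the
 * unshare step itself, as proposed above.  Name, parameters and the
 * PUD_MASK rounding are assumptions, not code from this thread.
 */
static int huge_pmd_unshare_and_flush(struct vm_area_struct *vma,
				      unsigned long *addr, pte_t *ptep)
{
	struct mm_struct *mm = vma->vm_mm;
	unsigned long start, end;

	if (!huge_pmd_unshare(mm, addr, ptep))
		return 0;

	/* The whole shared region is gone, so cover the full PUD_SIZE area. */
	start = *addr & PUD_MASK;
	end = start + PUD_SIZE;
	flush_cache_range(vma, start, end);
	flush_tlb_range(vma, start, end);
	mmu_notifier_invalidate_range(mm, start, end);
	return 1;
}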
On Tue, Aug 14, 2018 at 05:15:57PM -0700, Mike Kravetz wrote:
> On 08/14/2018 01:48 AM, Kirill A. Shutemov wrote:
> > On Mon, Aug 13, 2018 at 11:21:41PM +0000, Mike Kravetz wrote:
> >> On 08/13/2018 03:58 AM, Kirill A. Shutemov wrote:
> >>> On Sun, Aug 12, 2018 at 08:41:08PM -0700, Mike Kravetz wrote:
> >>>> I am not 100% sure on the required flushing, so suggestions would be
> >>>> appreciated. This also should go to stable. It has been around for
> >>>> a long time so still looking for an appropriate 'fixes:'.
> >>>
> >>> I believe we need flushing. And huge_pmd_unshare() usage in
> >>> __unmap_hugepage_range() looks suspicious: I don't see how we flush TLB in
> >>> that case.
> >>
> >> Thanks Kirill,
> >>
> >> __unmap_hugepage_range() has two callers:
> >> 1) unmap_hugepage_range, which wraps the call with tlb_gather_mmu and
> >>    tlb_finish_mmu on the range. IIUC, this should cause an appropriate
> >>    TLB flush.
> >> 2) __unmap_hugepage_range_final via unmap_single_vma. unmap_single_vma
> >>    has three callers:
> >>    - unmap_vmas which assumes the caller will flush the whole range after
> >>      return.
> >>    - zap_page_range wraps the call with tlb_gather_mmu/tlb_finish_mmu
> >>    - zap_page_range_single wraps the call with tlb_gather_mmu/tlb_finish_mmu
> >>
> >> So, it appears we are covered. But, I could be missing something.
> >
> > My problem here is that the mapping that is removed by huge_pmd_unshare() is
> > not accounted into the mmu_gather and can be missed on tlb_finish_mmu().
>
> Ah, I think I now see the issue you are concerned with.
>
> When huge_pmd_unshare succeeds we effectively unmap a PUD_SIZE area.
> The routine __unmap_hugepage_range may only have been passed a range
> that is a subset of PUD_SIZE. In the case I was trying to address,
> try_to_unmap_one(), the 'range' will certainly be less than PUD_SIZE.
> Upon further thought, I think that even in the case of try_to_unmap_one
> we should flush a PUD_SIZE range.
>
> My first thought would be to embed this flushing within huge_pmd_unshare
> itself. Perhaps, whenever huge_pmd_unshare succeeds we should do an
> explicit:
>	flush_cache_range(PUD_SIZE)
>	flush_tlb_range(PUD_SIZE)
>	mmu_notifier_invalidate_range(PUD_SIZE)
> That would take some of the burden off the callers of huge_pmd_unshare.
> However, I am not sure if the flushing calls above play nice in all the
> calling environments. I'll look into it some more, but would appreciate
> additional comments.

I don't think it would work: flush_tlb_range() does IPIs and calling it
under a spinlock will not go well.

I think we need to find a way to account it properly in the mmu_gather.
It's not obvious to me how.
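One way to picture what "accounting it into the mmu_gather" could mean, purely as a speculative illustration and not something proposed in this thread, is to widen the gathered range whenever an unshare succeeds in a path that already has a gather:

	/*
	 * Speculative illustration only: widen the mmu_gather range so the
	 * eventual tlb_finish_mmu() covers the PUD_SIZE region torn down by
	 * huge_pmd_unshare().  Whether this is sufficient or correct is
	 * exactly what is being debated above.
	 */
	if (huge_pmd_unshare(mm, &address, ptep)) {
		spin_unlock(ptl);
		tlb->start = min(tlb->start, address & PUD_MASK);
		tlb->end = max(tlb->end, (address & PUD_MASK) + PUD_SIZE);
		continue;
	}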
diff --git a/mm/rmap.c b/mm/rmap.c
index 09a799c9aebd..45583758bf16 100644
--- a/mm/rmap.c
+++ b/mm/rmap.c
@@ -1409,6 +1409,27 @@ static bool try_to_unmap_one(struct page *page, struct vm_area_struct *vma,
 		subpage = page - page_to_pfn(page) + pte_pfn(*pvmw.pte);
 		address = pvmw.address;
 
+		/*
+		 * PMDs for hugetlbfs pages could be shared. In this case,
+		 * pages with shared PMDs will have a mapcount of 1 no matter
+		 * how many times it is actually mapped. Map counting for
+		 * PMD sharing is mostly done via the reference count on the
+		 * PMD page itself. If the page we are trying to unmap is a
+		 * hugetlbfs page, attempt to 'unshare' at the PMD level.
+		 * huge_pmd_unshare takes care of clearing the PUD and
+		 * reference counting on the PMD page which effectively unmaps
+		 * the page. Take care of flushing cache and TLB for page in
+		 * this specific mapping here.
+		 */
+		if (PageHuge(page) &&
+		    huge_pmd_unshare(mm, &address, pvmw.pte)) {
+			unsigned long end_add = address + vma_mmu_pagesize(vma);
+
+			flush_cache_range(vma, address, end_add);
+			flush_tlb_range(vma, address, end_add);
+			mmu_notifier_invalidate_range(mm, address, end_add);
+			continue;
+		}
 
 		if (IS_ENABLED(CONFIG_MIGRATION) &&
 		    (flags & TTU_MIGRATION) &&
The page migration code employs try_to_unmap() to try and unmap the
source page. This is accomplished by using rmap_walk to find all
vmas where the page is mapped. This search stops when page mapcount
is zero. For shared PMD huge pages, the page map count is always 1
no matter the number of mappings. Shared mappings are tracked via
the reference count of the PMD page. Therefore, try_to_unmap stops
prematurely and does not completely unmap all mappings of the source
page.

This problem can result in data corruption as writes to the original
source page can happen after contents of the page are copied to the
target page. Hence, data is lost.

This problem was originally seen as DB corruption of shared global
areas after a huge page was soft offlined. DB developers noticed
they could reproduce the issue by (hotplug) offlining memory used
to back huge pages. A simple testcase can reproduce the problem by
creating a shared PMD mapping (note that this must be at least
PUD_SIZE in size and PUD_SIZE aligned (1GB on x86)), and using
migrate_pages() to migrate process pages between nodes.

To fix, have the try_to_unmap_one routine check for huge PMD sharing
by calling huge_pmd_unshare for hugetlbfs huge pages. If it is a
shared mapping it will be 'unshared' which removes the page table
entry and drops reference on PMD page. After this, flush caches and
TLB.

Signed-off-by: Mike Kravetz <mike.kravetz@oracle.com>
---
I am not 100% sure on the required flushing, so suggestions would be
appreciated. This also should go to stable. It has been around for
a long time so still looking for an appropriate 'fixes:'.

 mm/rmap.c | 21 +++++++++++++++++++++
 1 file changed, 21 insertions(+)
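For concreteness, below is a rough userspace sketch of the kind of testcase the log describes. Everything specific to it is an assumption for illustration: the alignment trick, the use of a fork()ed child as the second mapper of the shared region (assumed to be enough to trigger PMD sharing), the 2MB huge page size, and nodes 0 and 1 as the migration targets.

/*
 * Rough sketch of a reproducer along the lines described above: a
 * PUD_SIZE (1GB), PUD-aligned shared hugetlb mapping touched by two
 * processes while one of them migrates its pages between nodes and
 * watches for lost writes.  All specifics here are assumptions.
 */
#define _GNU_SOURCE
#include <signal.h>
#include <stdio.h>
#include <string.h>
#include <unistd.h>
#include <sys/mman.h>
#include <sys/syscall.h>
#include <sys/wait.h>

#define PUD_SZ		(1UL << 30)	/* PUD_SIZE on x86_64 */
#define HPAGE_SZ	(2UL << 20)	/* assumes 2MB huge pages */

int main(void)
{
	/* Over-reserve address space so the mapping can be PUD aligned. */
	char *probe = mmap(NULL, 2 * PUD_SZ, PROT_NONE,
			   MAP_PRIVATE | MAP_ANONYMOUS | MAP_NORESERVE, -1, 0);
	char *aligned = (char *)(((unsigned long)probe + PUD_SZ - 1) & ~(PUD_SZ - 1));
	char *map = mmap(aligned, PUD_SZ, PROT_READ | PROT_WRITE,
			 MAP_SHARED | MAP_ANONYMOUS | MAP_HUGETLB | MAP_FIXED, -1, 0);
	volatile unsigned long *word = (volatile unsigned long *)map;
	unsigned long nodes[2] = { 1UL << 0, 1UL << 1 };
	unsigned long last = 0;
	pid_t child;

	if (map == MAP_FAILED) {
		perror("mmap(MAP_HUGETLB)");	/* needs ~512 free 2MB hugepages */
		return 1;
	}
	memset(map, 0, PUD_SZ);			/* fault the whole region in */

	child = fork();
	if (child == 0) {
		/* Second mapper: keep writing rising sequence numbers. */
		for (unsigned long seq = 1; ; seq++)
			for (unsigned long off = 0; off < PUD_SZ; off += HPAGE_SZ)
				*(volatile unsigned long *)(map + off) = seq;
	}

	/* Bounce this process's pages between node 0 and node 1. */
	for (int round = 0; round < 200; round++) {
		syscall(SYS_migrate_pages, getpid(), 64,
			&nodes[round & 1], &nodes[(round & 1) ^ 1]);
		usleep(50000);
		if (*word <= last)	/* the child's writes stopped being visible */
			fprintf(stderr, "possible lost writes after round %d\n", round);
		last = *word;
	}

	kill(child, SIGKILL);
	wait(NULL);
	return 0;
}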