Message ID | 20191223022435.30653-1-richardw.yang@linux.intel.com (mailing list archive) |
---|---|
State | New, archived |
Series | mm/rmap.c: split huge pmd when it really is
On Mon, Dec 23, 2019 at 10:24:35AM +0800, Wei Yang wrote:
> There are two cases where try_to_unmap_one() is called with
> TTU_SPLIT_HUGE_PMD set:
>
>   * unmap_page()
>   * shrink_page_list()
>
> In both cases, the page passed to try_to_unmap_one() is the PageHead() of
> the THP. If this page's mapping address in the process is not
> HPAGE_PMD_SIZE aligned, the THP is not mapped as a PMD THP in this
> process. This can happen when we mremap() a PMD-sized range to an
> unaligned address.
>
> Currently, this case only happens to be handled by the following check in
> __split_huge_pmd():
>
>     page != pmd_page(*pmd)
>
> This patch checks the address up front to skip that unnecessary work.

Do you see a measurable performance improvement? rmap is heavy enough that
I expect this kind of overhead to be within the noise level.

I don't have anything against the check, but it complicates the picture.

And if we are going down this path, it is also worth checking whether the
vma is long enough to hold a huge page.

And I don't see why the check cannot be done inside
split_huge_pmd_address().
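To make the two quick tests mentioned in the review concrete, a helper could combine them roughly as below. The name pmd_mapped_thp_possible() and its open-coded form are purely illustrative and not part of the thread or the kernel; helpers such as transhuge_vma_suitable() cover similar ground in-tree.

/*
 * Hypothetical helper, illustration only: a PMD-mapped THP is possible at
 * this address only if the address is HPAGE_PMD_SIZE aligned and the vma
 * has room for a whole huge page starting there.
 */
static inline bool pmd_mapped_thp_possible(struct vm_area_struct *vma,
					   unsigned long address)
{
	/* The mapping address must be HPAGE_PMD_SIZE aligned ... */
	if (!IS_ALIGNED(address, HPAGE_PMD_SIZE))
		return false;

	/* ... and the vma must have room for a whole huge page there. */
	if (address < vma->vm_start ||
	    address + HPAGE_PMD_SIZE > vma->vm_end)
		return false;

	return true;
}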
On Mon, Dec 23, 2019 at 08:16:53PM +0300, Kirill A. Shutemov wrote:
>On Mon, Dec 23, 2019 at 10:24:35AM +0800, Wei Yang wrote:
>> There are two cases where try_to_unmap_one() is called with
>> TTU_SPLIT_HUGE_PMD set:
>>
>>   * unmap_page()
>>   * shrink_page_list()
>>
>> In both cases, the page passed to try_to_unmap_one() is the PageHead() of
>> the THP. If this page's mapping address in the process is not
>> HPAGE_PMD_SIZE aligned, the THP is not mapped as a PMD THP in this
>> process. This can happen when we mremap() a PMD-sized range to an
>> unaligned address.
>>
>> Currently, this case only happens to be handled by the following check in
>> __split_huge_pmd():
>>
>>     page != pmd_page(*pmd)
>>
>> This patch checks the address up front to skip that unnecessary work.
>
>Do you see a measurable performance improvement? rmap is heavy enough that
>I expect this kind of overhead to be within the noise level.
>
>I don't have anything against the check, but it complicates the picture.
>
>And if we are going down this path, it is also worth checking whether the
>vma is long enough to hold a huge page.
>
>And I don't see why the check cannot be done inside
>split_huge_pmd_address().
>

Ok, let me put the check into split_huge_pmd_address().

>--
> Kirill A. Shutemov
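A rough sketch of what folding the check into split_huge_pmd_address() could look like follows. The page-table walk mirrors the existing function; the early IS_ALIGNED() bail-out is the assumed addition, not the patch as eventually posted.

/*
 * Sketch only: early bail-out in split_huge_pmd_address(), as discussed
 * above. Treat this as an illustration of the idea, not the final patch.
 */
void split_huge_pmd_address(struct vm_area_struct *vma, unsigned long address,
			    bool freeze, struct page *page)
{
	pgd_t *pgd;
	p4d_t *p4d;
	pud_t *pud;
	pmd_t *pmd;

	/*
	 * A THP mapped at an address that is not HPAGE_PMD_SIZE aligned
	 * (e.g. after mremap() to an unaligned address) cannot be mapped
	 * by a huge PMD in this vma, so there is nothing to split.
	 */
	if (!IS_ALIGNED(address, HPAGE_PMD_SIZE))
		return;

	pgd = pgd_offset(vma->vm_mm, address);
	if (!pgd_present(*pgd))
		return;

	p4d = p4d_offset(pgd, address);
	if (!p4d_present(*p4d))
		return;

	pud = pud_offset(p4d, address);
	if (!pud_present(*pud))
		return;

	pmd = pmd_offset(pud, address);

	__split_huge_pmd(vma, pmd, address, freeze, page);
}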
Hi Wei,

Thank you for the patch! Yet something to improve:

[auto build test ERROR on mmotm/master]
[also build test ERROR on linus/master v5.5-rc3 next-20191220]
[if your patch is applied to the wrong git tree, please drop us a note to help
improve the system. BTW, we also suggest to use '--base' option to specify the
base tree in git format-patch, please see https://stackoverflow.com/a/37406982]

url:    https://github.com/0day-ci/linux/commits/Wei-Yang/mm-rmap-c-split-huge-pmd-when-it-really-is/20191225-023217
base:   git://git.cmpxchg.org/linux-mmotm.git master
config: i386-tinyconfig (attached as .config)
compiler: gcc-7 (Debian 7.5.0-3) 7.5.0
reproduce:
        # save the attached .config to linux build tree
        make ARCH=i386

If you fix the issue, kindly add following tag
Reported-by: kbuild test robot <lkp@intel.com>

All error/warnings (new ones prefixed by >>):

   In file included from include/asm-generic/bug.h:19:0,
                    from arch/x86/include/asm/bug.h:83,
                    from include/linux/bug.h:5,
                    from include/linux/mmdebug.h:5,
                    from include/linux/mm.h:9,
                    from mm/rmap.c:48:
   mm/rmap.c: In function 'try_to_unmap_one':
>> include/linux/compiler.h:350:38: error: call to '__compiletime_assert_1375' declared with attribute error: BUILD_BUG failed
     _compiletime_assert(condition, msg, __compiletime_assert_, __LINE__)
                                         ^
   include/linux/kernel.h:37:48: note: in definition of macro 'IS_ALIGNED'
    #define IS_ALIGNED(x, a) (((x) & ((typeof(x))(a) - 1)) == 0)
                                                    ^
>> include/linux/compiler.h:338:2: note: in expansion of macro '__compiletime_assert'
     __compiletime_assert(condition, msg, prefix, suffix)
     ^~~~~~~~~~~~~~~~~~~~
   include/linux/compiler.h:350:2: note: in expansion of macro '_compiletime_assert'
     _compiletime_assert(condition, msg, __compiletime_assert_, __LINE__)
     ^~~~~~~~~~~~~~~~~~~
   include/linux/build_bug.h:39:37: note: in expansion of macro 'compiletime_assert'
    #define BUILD_BUG_ON_MSG(cond, msg) compiletime_assert(!(cond), msg)
                                        ^~~~~~~~~~~~~~~~~~
   include/linux/build_bug.h:59:21: note: in expansion of macro 'BUILD_BUG_ON_MSG'
    #define BUILD_BUG() BUILD_BUG_ON_MSG(1, "BUILD_BUG failed")
                        ^~~~~~~~~~~~~~~~
>> include/linux/huge_mm.h:282:27: note: in expansion of macro 'BUILD_BUG'
    #define HPAGE_PMD_SIZE ({ BUILD_BUG(); 0; })
                              ^~~~~~~~~
>> mm/rmap.c:1375:23: note: in expansion of macro 'HPAGE_PMD_SIZE'
       IS_ALIGNED(address, HPAGE_PMD_SIZE)) {
                           ^~~~~~~~~~~~~~

vim +/HPAGE_PMD_SIZE +1375 mm/rmap.c

  1336	
  1337	/*
  1338	 * @arg: enum ttu_flags will be passed to this argument
  1339	 */
  1340	static bool try_to_unmap_one(struct page *page, struct vm_area_struct *vma,
  1341			     unsigned long address, void *arg)
  1342	{
  1343		struct mm_struct *mm = vma->vm_mm;
  1344		struct page_vma_mapped_walk pvmw = {
  1345			.page = page,
  1346			.vma = vma,
  1347			.address = address,
  1348		};
  1349		pte_t pteval;
  1350		struct page *subpage;
  1351		bool ret = true;
  1352		struct mmu_notifier_range range;
  1353		enum ttu_flags flags = (enum ttu_flags)arg;
  1354	
  1355		/* munlock has nothing to gain from examining un-locked vmas */
  1356		if ((flags & TTU_MUNLOCK) && !(vma->vm_flags & VM_LOCKED))
  1357			return true;
  1358	
  1359		if (IS_ENABLED(CONFIG_MIGRATION) && (flags & TTU_MIGRATION) &&
  1360		    is_zone_device_page(page) && !is_device_private_page(page))
  1361			return true;
  1362	
  1363		/*
  1364		 * There are two places set TTU_SPLIT_HUGE_PMD
  1365		 *
  1366		 * unmap_page()
  1367		 * shrink_page_list()
  1368		 *
  1369		 * In both cases, the "page" here is the PageHead() of the THP.
  1370		 *
  1371		 * If the page is not a real PMD huge page, e.g. after mremap(), it is
  1372		 * not necessary to split.
  1373		 */
  1374		if ((flags & TTU_SPLIT_HUGE_PMD) &&
> 1375		    IS_ALIGNED(address, HPAGE_PMD_SIZE)) {
  1376			split_huge_pmd_address(vma, address,
  1377					flags & TTU_SPLIT_FREEZE, page);
  1378		}
  1379	
  1380		/*
  1381		 * For THP, we have to assume the worse case ie pmd for invalidation.
  1382		 * For hugetlb, it could be much worse if we need to do pud
  1383		 * invalidation in the case of pmd sharing.
  1384		 *
  1385		 * Note that the page can not be free in this function as call of
  1386		 * try_to_unmap() must hold a reference on the page.
  1387		 */
  1388		mmu_notifier_range_init(&range, MMU_NOTIFY_CLEAR, 0, vma, vma->vm_mm,
  1389					address,
  1390					min(vma->vm_end, address + page_size(page)));
  1391		if (PageHuge(page)) {
  1392			/*
  1393			 * If sharing is possible, start and end will be adjusted
  1394			 * accordingly.
  1395			 */
  1396			adjust_range_if_pmd_sharing_possible(vma, &range.start,
  1397							     &range.end);
  1398		}
  1399		mmu_notifier_invalidate_range_start(&range);
  1400	
  1401		while (page_vma_mapped_walk(&pvmw)) {
  1402	#ifdef CONFIG_ARCH_ENABLE_THP_MIGRATION
  1403			/* PMD-mapped THP migration entry */
  1404			if (!pvmw.pte && (flags & TTU_MIGRATION)) {
  1405				VM_BUG_ON_PAGE(PageHuge(page) || !PageTransCompound(page), page);
  1406	
  1407				set_pmd_migration_entry(&pvmw, page);
  1408				continue;
  1409			}
  1410	#endif
  1411	
  1412			/*
  1413			 * If the page is mlock()d, we cannot swap it out.
  1414			 * If it's recently referenced (perhaps page_referenced
  1415			 * skipped over this mm) then we should reactivate it.
  1416			 */
  1417			if (!(flags & TTU_IGNORE_MLOCK)) {
  1418				if (vma->vm_flags & VM_LOCKED) {
  1419					/* PTE-mapped THP are never mlocked */
  1420					if (!PageTransCompound(page)) {
  1421						/*
  1422						 * Holding pte lock, we do *not* need
  1423						 * mmap_sem here
  1424						 */
  1425						mlock_vma_page(page);
  1426					}
  1427					ret = false;
  1428					page_vma_mapped_walk_done(&pvmw);
  1429					break;
  1430				}
  1431				if (flags & TTU_MUNLOCK)
  1432					continue;
  1433			}
  1434	
  1435			/* Unexpected PMD-mapped THP? */
  1436			VM_BUG_ON_PAGE(!pvmw.pte, page);
  1437	
  1438			subpage = page - page_to_pfn(page) + pte_pfn(*pvmw.pte);
  1439			address = pvmw.address;
  1440	
  1441			if (PageHuge(page)) {
  1442				if (huge_pmd_unshare(mm, &address, pvmw.pte)) {
  1443					/*
  1444					 * huge_pmd_unshare unmapped an entire PMD
  1445					 * page. There is no way of knowing exactly
  1446					 * which PMDs may be cached for this mm, so
  1447					 * we must flush them all. start/end were
  1448					 * already adjusted above to cover this range.
  1449					 */
  1450					flush_cache_range(vma, range.start, range.end);
  1451					flush_tlb_range(vma, range.start, range.end);
  1452					mmu_notifier_invalidate_range(mm, range.start,
  1453								      range.end);
  1454	
  1455					/*
  1456					 * The ref count of the PMD page was dropped
  1457					 * which is part of the way map counting
  1458					 * is done for shared PMDs. Return 'true'
  1459					 * here. When there is no other sharing,
  1460					 * huge_pmd_unshare returns false and we will
  1461					 * unmap the actual page and drop map count
  1462					 * to zero.
  1463					 */
  1464					page_vma_mapped_walk_done(&pvmw);
  1465					break;
  1466				}
  1467			}
  1468	
  1469			if (IS_ENABLED(CONFIG_MIGRATION) &&
  1470			    (flags & TTU_MIGRATION) &&
  1471			    is_zone_device_page(page)) {
  1472				swp_entry_t entry;
  1473				pte_t swp_pte;
  1474	
  1475				pteval = ptep_get_and_clear(mm, pvmw.address, pvmw.pte);
  1476	
  1477				/*
  1478				 * Store the pfn of the page in a special migration
  1479				 * pte. do_swap_page() will wait until the migration
  1480				 * pte is removed and then restart fault handling.
  1481				 */
  1482				entry = make_migration_entry(page, 0);
  1483				swp_pte = swp_entry_to_pte(entry);
  1484				if (pte_soft_dirty(pteval))
  1485					swp_pte = pte_swp_mksoft_dirty(swp_pte);
  1486				set_pte_at(mm, pvmw.address, pvmw.pte, swp_pte);
  1487				/*
  1488				 * No need to invalidate here it will synchronize on
  1489				 * against the special swap migration pte.
  1490				 *
  1491				 * The assignment to subpage above was computed from a
  1492				 * swap PTE which results in an invalid pointer.
  1493				 * Since only PAGE_SIZE pages can currently be
  1494				 * migrated, just set it to page. This will need to be
  1495				 * changed when hugepage migrations to device private
  1496				 * memory are supported.
  1497				 */
  1498				subpage = page;
  1499				goto discard;
  1500			}
  1501	
  1502			if (!(flags & TTU_IGNORE_ACCESS)) {
  1503				if (ptep_clear_flush_young_notify(vma, address,
  1504							pvmw.pte)) {
  1505					ret = false;
  1506					page_vma_mapped_walk_done(&pvmw);
  1507					break;
  1508				}
  1509			}
  1510	
  1511			/* Nuke the page table entry. */
  1512			flush_cache_page(vma, address, pte_pfn(*pvmw.pte));
  1513			if (should_defer_flush(mm, flags)) {
  1514				/*
  1515				 * We clear the PTE but do not flush so potentially
  1516				 * a remote CPU could still be writing to the page.
  1517				 * If the entry was previously clean then the
  1518				 * architecture must guarantee that a clear->dirty
  1519				 * transition on a cached TLB entry is written through
  1520				 * and traps if the PTE is unmapped.
  1521				 */
  1522				pteval = ptep_get_and_clear(mm, address, pvmw.pte);
  1523	
  1524				set_tlb_ubc_flush_pending(mm, pte_dirty(pteval));
  1525			} else {
  1526				pteval = ptep_clear_flush(vma, address, pvmw.pte);
  1527			}
  1528	
  1529			/* Move the dirty bit to the page. Now the pte is gone. */
  1530			if (pte_dirty(pteval))
  1531				set_page_dirty(page);
  1532	
  1533			/* Update high watermark before we lower rss */
  1534			update_hiwater_rss(mm);
  1535	
  1536			if (PageHWPoison(page) && !(flags & TTU_IGNORE_HWPOISON)) {
  1537				pteval = swp_entry_to_pte(make_hwpoison_entry(subpage));
  1538				if (PageHuge(page)) {
  1539					hugetlb_count_sub(compound_nr(page), mm);
  1540					set_huge_swap_pte_at(mm, address,
  1541							     pvmw.pte, pteval,
  1542							     vma_mmu_pagesize(vma));
  1543				} else {
  1544					dec_mm_counter(mm, mm_counter(page));
  1545					set_pte_at(mm, address, pvmw.pte, pteval);
  1546				}
  1547	
  1548			} else if (pte_unused(pteval) && !userfaultfd_armed(vma)) {
  1549				/*
  1550				 * The guest indicated that the page content is of no
  1551				 * interest anymore. Simply discard the pte, vmscan
  1552				 * will take care of the rest.
  1553				 * A future reference will then fault in a new zero
  1554				 * page. When userfaultfd is active, we must not drop
  1555				 * this page though, as its main user (postcopy
  1556				 * migration) will not expect userfaults on already
  1557				 * copied pages.
  1558				 */
  1559				dec_mm_counter(mm, mm_counter(page));
  1560				/* We have to invalidate as we cleared the pte */
  1561				mmu_notifier_invalidate_range(mm, address,
  1562							      address + PAGE_SIZE);
  1563			} else if (IS_ENABLED(CONFIG_MIGRATION) &&
  1564					(flags & (TTU_MIGRATION|TTU_SPLIT_FREEZE))) {
  1565				swp_entry_t entry;
  1566				pte_t swp_pte;
  1567	
  1568				if (arch_unmap_one(mm, vma, address, pteval) < 0) {
  1569					set_pte_at(mm, address, pvmw.pte, pteval);
  1570					ret = false;
  1571					page_vma_mapped_walk_done(&pvmw);
  1572					break;
  1573				}
  1574	
  1575				/*
  1576				 * Store the pfn of the page in a special migration
  1577				 * pte. do_swap_page() will wait until the migration
  1578				 * pte is removed and then restart fault handling.
  1579				 */
  1580				entry = make_migration_entry(subpage,
  1581						pte_write(pteval));
  1582				swp_pte = swp_entry_to_pte(entry);
  1583				if (pte_soft_dirty(pteval))
  1584					swp_pte = pte_swp_mksoft_dirty(swp_pte);
  1585				set_pte_at(mm, address, pvmw.pte, swp_pte);
  1586				/*
  1587				 * No need to invalidate here it will synchronize on
  1588				 * against the special swap migration pte.
  1589				 */
  1590			} else if (PageAnon(page)) {
  1591				swp_entry_t entry = { .val = page_private(subpage) };
  1592				pte_t swp_pte;
  1593				/*
  1594				 * Store the swap location in the pte.
  1595				 * See handle_pte_fault() ...
  1596				 */
  1597				if (unlikely(PageSwapBacked(page) != PageSwapCache(page))) {
  1598					WARN_ON_ONCE(1);
  1599					ret = false;
  1600					/* We have to invalidate as we cleared the pte */
  1601					mmu_notifier_invalidate_range(mm, address,
  1602								address + PAGE_SIZE);
  1603					page_vma_mapped_walk_done(&pvmw);
  1604					break;
  1605				}
  1606	
  1607				/* MADV_FREE page check */
  1608				if (!PageSwapBacked(page)) {
  1609					if (!PageDirty(page)) {
  1610						/* Invalidate as we cleared the pte */
  1611						mmu_notifier_invalidate_range(mm,
  1612							address, address + PAGE_SIZE);
  1613						dec_mm_counter(mm, MM_ANONPAGES);
  1614						goto discard;
  1615					}
  1616	
  1617					/*
  1618					 * If the page was redirtied, it cannot be
  1619					 * discarded. Remap the page to page table.
  1620					 */
  1621					set_pte_at(mm, address, pvmw.pte, pteval);
  1622					SetPageSwapBacked(page);
  1623					ret = false;
  1624					page_vma_mapped_walk_done(&pvmw);
  1625					break;
  1626				}
  1627	
  1628				if (swap_duplicate(entry) < 0) {
  1629					set_pte_at(mm, address, pvmw.pte, pteval);
  1630					ret = false;
  1631					page_vma_mapped_walk_done(&pvmw);
  1632					break;
  1633				}
  1634				if (arch_unmap_one(mm, vma, address, pteval) < 0) {
  1635					set_pte_at(mm, address, pvmw.pte, pteval);
  1636					ret = false;
  1637					page_vma_mapped_walk_done(&pvmw);
  1638					break;
  1639				}
  1640				if (list_empty(&mm->mmlist)) {
  1641					spin_lock(&mmlist_lock);
  1642					if (list_empty(&mm->mmlist))
  1643						list_add(&mm->mmlist, &init_mm.mmlist);
  1644					spin_unlock(&mmlist_lock);
  1645				}
  1646				dec_mm_counter(mm, MM_ANONPAGES);
  1647				inc_mm_counter(mm, MM_SWAPENTS);
  1648				swp_pte = swp_entry_to_pte(entry);
  1649				if (pte_soft_dirty(pteval))
  1650					swp_pte = pte_swp_mksoft_dirty(swp_pte);
  1651				set_pte_at(mm, address, pvmw.pte, swp_pte);
  1652				/* Invalidate as we cleared the pte */
  1653				mmu_notifier_invalidate_range(mm, address,
  1654							      address + PAGE_SIZE);
  1655			} else {
  1656				/*
  1657				 * This is a locked file-backed page, thus it cannot
  1658				 * be removed from the page cache and replaced by a new
  1659				 * page before mmu_notifier_invalidate_range_end, so no
  1660				 * concurrent thread might update its page table to
  1661				 * point at new page while a device still is using this
  1662				 * page.
  1663				 *
  1664				 * See Documentation/vm/mmu_notifier.rst
  1665				 */
  1666				dec_mm_counter(mm, mm_counter_file(page));
  1667			}
  1668	discard:
  1669			/*
  1670			 * No need to call mmu_notifier_invalidate_range() it has be
  1671			 * done above for all cases requiring it to happen under page
  1672			 * table lock before mmu_notifier_invalidate_range_end()
  1673			 *
  1674			 * See Documentation/vm/mmu_notifier.rst
  1675			 */
  1676			page_remove_rmap(subpage, PageHuge(page));
  1677			put_page(page);
  1678		}
  1679	
  1680		mmu_notifier_invalidate_range_end(&range);
  1681	
  1682		return ret;
  1683	}
  1684	

---
0-DAY kernel test infrastructure                 Open Source Technology Center
https://lists.01.org/hyperkitty/list/kbuild-all@lists.01.org    Intel Corporation
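For context on the failure: the i386 tinyconfig build has CONFIG_TRANSPARENT_HUGEPAGE disabled, where HPAGE_PMD_SIZE is deliberately defined to trip BUILD_BUG() at compile time, while try_to_unmap_one() is built in every configuration, so the new IS_ALIGNED(address, HPAGE_PMD_SIZE) test breaks the build. Below is a paraphrased, approximate view of the relevant !THP definitions (the HPAGE_PMD_SIZE line matches the error output above; the rest is an illustration of the pattern), which also shows why doing the check inside split_huge_pmd_address(), as suggested in the review, avoids the problem.

/*
 * Approximate !CONFIG_TRANSPARENT_HUGEPAGE stubs, paraphrased for
 * illustration. HPAGE_PMD_SIZE is a compile-time trap, while
 * split_huge_pmd_address() is an empty inline, so an alignment check
 * placed inside split_huge_pmd_address() is never compiled in !THP builds.
 */
#ifndef CONFIG_TRANSPARENT_HUGEPAGE
#define HPAGE_PMD_SHIFT ({ BUILD_BUG(); 0; })
#define HPAGE_PMD_MASK	({ BUILD_BUG(); 0; })
#define HPAGE_PMD_SIZE	({ BUILD_BUG(); 0; })

static inline void split_huge_pmd_address(struct vm_area_struct *vma,
		unsigned long address, bool freeze, struct page *page) {}
#endif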
diff --git a/mm/rmap.c b/mm/rmap.c
index b3e381919835..79b9239f00e3 100644
--- a/mm/rmap.c
+++ b/mm/rmap.c
@@ -1386,7 +1386,19 @@ static bool try_to_unmap_one(struct page *page, struct vm_area_struct *vma,
 	    is_zone_device_page(page) && !is_device_private_page(page))
 		return true;
 
-	if (flags & TTU_SPLIT_HUGE_PMD) {
+	/*
+	 * There are two places set TTU_SPLIT_HUGE_PMD
+	 *
+	 * unmap_page()
+	 * shrink_page_list()
+	 *
+	 * In both cases, the "page" here is the PageHead() of the THP.
+	 *
+	 * If the page is not a real PMD huge page, e.g. after mremap(), it is
+	 * not necessary to split.
+	 */
+	if ((flags & TTU_SPLIT_HUGE_PMD) &&
+	    IS_ALIGNED(address, HPAGE_PMD_SIZE)) {
 		split_huge_pmd_address(vma, address,
 				flags & TTU_SPLIT_FREEZE, page);
 	}
There are two cases where try_to_unmap_one() is called with
TTU_SPLIT_HUGE_PMD set:

  * unmap_page()
  * shrink_page_list()

In both cases, the page passed to try_to_unmap_one() is the PageHead() of
the THP. If this page's mapping address in the process is not
HPAGE_PMD_SIZE aligned, the THP is not mapped as a PMD THP in this
process. This can happen when we mremap() a PMD-sized range to an
unaligned address.

Currently, this case only happens to be handled by the following check in
__split_huge_pmd():

    page != pmd_page(*pmd)

This patch checks the address up front to skip that unnecessary work.

Signed-off-by: Wei Yang <richardw.yang@linux.intel.com>
---
 mm/rmap.c | 14 +++++++++++++-
 1 file changed, 13 insertions(+), 1 deletion(-)
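For readers unfamiliar with the mremap() scenario described in the changelog, here is a minimal userspace sketch; it is a hypothetical illustration that assumes HPAGE_PMD_SIZE is 2MB and THP is enabled, with error handling kept to a minimum.

/*
 * Hypothetical illustration: fault in a PMD-aligned THP, then move it to a
 * page-aligned but not 2MB-aligned address. Afterwards the (still compound)
 * THP is PTE-mapped at an unaligned address, so try_to_unmap_one() on its
 * head page has no huge PMD to split in this process.
 */
#define _GNU_SOURCE
#include <stdint.h>
#include <stdio.h>
#include <string.h>
#include <sys/mman.h>

#define SZ_2M (2UL << 20)

int main(void)
{
	/* Over-allocate so a 2MB-aligned 2MB chunk exists inside. */
	char *buf = mmap(NULL, 4 * SZ_2M, PROT_READ | PROT_WRITE,
			 MAP_PRIVATE | MAP_ANONYMOUS, -1, 0);
	if (buf == MAP_FAILED)
		return 1;

	char *huge = (char *)(((uintptr_t)buf + SZ_2M - 1) & ~(SZ_2M - 1));
	madvise(huge, SZ_2M, MADV_HUGEPAGE);
	memset(huge, 1, SZ_2M);		/* fault in, ideally as one THP */

	/* Reserve a window and pick a page-aligned, non-2MB-aligned target. */
	char *area = mmap(NULL, 2 * SZ_2M, PROT_NONE,
			  MAP_PRIVATE | MAP_ANONYMOUS, -1, 0);
	char *target = area + 4096;

	char *dst = mremap(huge, SZ_2M, SZ_2M,
			   MREMAP_MAYMOVE | MREMAP_FIXED, target);
	printf("moved %p -> %p\n", (void *)huge, (void *)dst);
	return dst == MAP_FAILED;
}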