
[v1,06/14] mm, x86: support the access bit on non-leaf PMD entries

Message ID 20210313075747.3781593-7-yuzhao@google.com (mailing list archive)
State New, archived
Series Multigenerational LRU

Commit Message

Yu Zhao March 13, 2021, 7:57 a.m. UTC
Some architectures support the accessed bit on non-leaf PMD entries
(parents) in addition to leaf PTE entries (children) where pages are
mapped, e.g., x86_64 sets the accessed bit on a parent when using it
as part of linear-address translation [1]. Page table walkers who are
interested in the accessed bit on children can take advantage of this:
they do not need to search the children when the accessed bit is not
set on a parent, given that they have previously cleared the accessed
bit on this parent in addition to its children.

[1]: Intel 64 and IA-32 Architectures Software Developer's Manual
     Volume 3 (October 2019), section 4.8

Signed-off-by: Yu Zhao <yuzhao@google.com>
---
 arch/Kconfig                   | 8 ++++++++
 arch/x86/Kconfig               | 1 +
 arch/x86/include/asm/pgtable.h | 2 +-
 arch/x86/mm/pgtable.c          | 5 ++++-
 include/linux/pgtable.h        | 4 ++--
 5 files changed, 16 insertions(+), 4 deletions(-)

Comments

Zi Yan March 14, 2021, 10:12 p.m. UTC | #1
On 13 Mar 2021, at 2:57, Yu Zhao wrote:

> Some architectures support the accessed bit on non-leaf PMD entries
> (parents) in addition to leaf PTE entries (children) where pages are
> mapped, e.g., x86_64 sets the accessed bit on a parent when using it
> as part of linear-address translation [1]. Page table walkers who are
> interested in the accessed bit on children can take advantage of this:
> they do not need to search the children when the accessed bit is not
> set on a parent, given that they have previously cleared the accessed
> bit on this parent in addition to its children.
>
> [1]: Intel 64 and IA-32 Architectures Software Developer's Manual
>      Volume 3 (October 2019), section 4.8

Just curious. Does this also apply to non-leaf PUD entries? Do you
mind sharing which sentence from the manual gives the information?

Thanks.

—
Best Regards,
Yan Zi
Matthew Wilcox March 14, 2021, 10:51 p.m. UTC | #2
On Sun, Mar 14, 2021 at 06:12:42PM -0400, Zi Yan wrote:
> On 13 Mar 2021, at 2:57, Yu Zhao wrote:
> 
> > Some architectures support the accessed bit on non-leaf PMD entries
> > (parents) in addition to leaf PTE entries (children) where pages are
> > mapped, e.g., x86_64 sets the accessed bit on a parent when using it
> > as part of linear-address translation [1]. Page table walkers who are
> > interested in the accessed bit on children can take advantage of this:
> > they do not need to search the children when the accessed bit is not
> > set on a parent, given that they have previously cleared the accessed
> > bit on this parent in addition to its children.
> >
> > [1]: Intel 64 and IA-32 Architectures Software Developer's Manual
> >      Volume 3 (October 2019), section 4.8
> 
> Just curious. Does this also apply to non-leaf PUD entries? Do you
> mind sharing which sentence from the manual gives the information?

The first few sentences from 4.8:

: For any paging-structure entry that is used during linear-address
: translation, bit 5 is the accessed flag. For paging-structure
: entries that map a page (as opposed to referencing another paging
: structure), bit 6 is the dirty flag. These flags are provided for
: use by memory-management software to manage the transfer of pages and
: paging structures into and out of physical memory.

: Whenever the processor uses a paging-structure entry as part of
: linear-address translation, it sets the accessed flag in that entry
: (if it is not already set).

The way they differentiate between the A and D bits makes it clear to
me that the A bit is set at each level of the tree, but the D bit is
only set on leaf entries.
Dave Hansen March 14, 2021, 11:22 p.m. UTC | #3
On 3/12/21 11:57 PM, Yu Zhao wrote:
> Some architectures support the accessed bit on non-leaf PMD entries
> (parents) in addition to leaf PTE entries (children) where pages are
> mapped, e.g., x86_64 sets the accessed bit on a parent when using it
> as part of linear-address translation [1]. Page table walkers who are
> interested in the accessed bit on children can take advantage of this:
> they do not need to search the children when the accessed bit is not
> set on a parent, given that they have previously cleared the accessed
> bit on this parent in addition to its children.

I'd like to hear a *LOT* more about how this is going to be used.

The one part of this which is entirely missing is the interaction with
the TLB and mid-level paging structure caches.  The CPU is pretty
aggressive about setting non-leaf accessed bits when TLB entries are
created.  This *looks* to be depending on that behavior, but it would be
nice to spell it out explicitly.
Yu Zhao March 15, 2021, 12:03 a.m. UTC | #4
On Sun, Mar 14, 2021 at 10:51:03PM +0000, Matthew Wilcox wrote:
> On Sun, Mar 14, 2021 at 06:12:42PM -0400, Zi Yan wrote:
> > On 13 Mar 2021, at 2:57, Yu Zhao wrote:
> > 
> > > Some architectures support the accessed bit on non-leaf PMD entries
> > > (parents) in addition to leaf PTE entries (children) where pages are
> > > mapped, e.g., x86_64 sets the accessed bit on a parent when using it
> > > as part of linear-address translation [1]. Page table walkers who are
> > > interested in the accessed bit on children can take advantage of this:
> > > they do not need to search the children when the accessed bit is not
> > > set on a parent, given that they have previously cleared the accessed
> > > bit on this parent in addition to its children.
> > >
> > > [1]: Intel 64 and IA-32 Architectures Software Developer's Manual
> > >      Volume 3 (October 2019), section 4.8
> > 
> > Just curious. Does this also apply to non-leaf PUD entries? Do you
> > mind sharing which sentence from the manual gives the information?
> 
> The first few sentences from 4.8:
> 
> : For any paging-structure entry that is used during linear-address
> : translation, bit 5 is the accessed flag. For paging-structure
> : entries that map a page (as opposed to referencing another paging
> : structure), bit 6 is the dirty flag. These flags are provided for
> : use by memory-management software to manage the transfer of pages and
> : paging structures into and out of physical memory.
> 
> : Whenever the processor uses a paging-structure entry as part of
> : linear-address translation, it sets the accessed flag in that entry
> : (if it is not already set).

As far as I know, x86 is the only one that supports this.

> The way they differentiate between the A and D bits makes it clear to
> me that the A bit is set at each level of the tree, but the D bit is
> only set on leaf entries.

And the difference makes perfect sense (to me). Kudos to Intel.
Zi Yan March 15, 2021, 12:27 a.m. UTC | #5
On 14 Mar 2021, at 20:03, Yu Zhao wrote:

> On Sun, Mar 14, 2021 at 10:51:03PM +0000, Matthew Wilcox wrote:
>> On Sun, Mar 14, 2021 at 06:12:42PM -0400, Zi Yan wrote:
>>> On 13 Mar 2021, at 2:57, Yu Zhao wrote:
>>>
>>>> Some architectures support the accessed bit on non-leaf PMD entries
>>>> (parents) in addition to leaf PTE entries (children) where pages are
>>>> mapped, e.g., x86_64 sets the accessed bit on a parent when using it
>>>> as part of linear-address translation [1]. Page table walkers who are
>>>> interested in the accessed bit on children can take advantage of this:
>>>> they do not need to search the children when the accessed bit is not
>>>> set on a parent, given that they have previously cleared the accessed
>>>> bit on this parent in addition to its children.
>>>>
>>>> [1]: Intel 64 and IA-32 Architectures Software Developer's Manual
>>>>      Volume 3 (October 2019), section 4.8
>>>
>>> Just curious. Does this also apply to non-leaf PUD entries? Do you
>>> mind sharing which sentence from the manual gives the information?
>>
>> The first few sentences from 4.8:
>>
>> : For any paging-structure entry that is used during linear-address
>> : translation, bit 5 is the accessed flag. For paging-structure
>> : entries that map a page (as opposed to referencing another paging
>> : structure), bit 6 is the dirty flag. These flags are provided for
>> : use by memory-management software to manage the transfer of pages and
>> : paging structures into and out of physical memory.
>>
>> : Whenever the processor uses a paging-structure entry as part of
>> : linear-address translation, it sets the accessed flag in that entry
>> : (if it is not already set).

Matthew, thanks for the pointer.

>
> As far as I know x86 is the one that supports this.
>
>> The way they differentiate between the A and D bits makes it clear to
>> me that the A bit is set at each level of the tree, but the D bit is
>> only set on leaf entries.
>
> And the difference makes perfect sense (to me). Kudos to Intel.

Hi Yu,

You only introduced HAVE_ARCH_PARENT_PMD_YOUNG but no HAVE_ARCH_PARENT_PUD_YOUNG.
Is PUD granularity too large to be useful for the multigenerational LRU algorithm?

Thanks.

—
Best Regards,
Yan Zi
Yu Zhao March 15, 2021, 1:04 a.m. UTC | #6
On Sun, Mar 14, 2021 at 08:27:29PM -0400, Zi Yan wrote:
> On 14 Mar 2021, at 20:03, Yu Zhao wrote:
> 
> > On Sun, Mar 14, 2021 at 10:51:03PM +0000, Matthew Wilcox wrote:
> >> On Sun, Mar 14, 2021 at 06:12:42PM -0400, Zi Yan wrote:
> >>> On 13 Mar 2021, at 2:57, Yu Zhao wrote:
> >>>
> >>>> Some architectures support the accessed bit on non-leaf PMD entries
> >>>> (parents) in addition to leaf PTE entries (children) where pages are
> >>>> mapped, e.g., x86_64 sets the accessed bit on a parent when using it
> >>>> as part of linear-address translation [1]. Page table walkers who are
> >>>> interested in the accessed bit on children can take advantage of this:
> >>>> they do not need to search the children when the accessed bit is not
> >>>> set on a parent, given that they have previously cleared the accessed
> >>>> bit on this parent in addition to its children.
> >>>>
> >>>> [1]: Intel 64 and IA-32 Architectures Software Developer's Manual
> >>>>      Volume 3 (October 2019), section 4.8
> >>>
> >>> Just curious. Does this also apply to non-leaf PUD entries? Do you
> >>> mind sharing which sentence from the manual gives the information?
> >>
> >> The first few sentences from 4.8:
> >>
> >> : For any paging-structure entry that is used during linear-address
> >> : translation, bit 5 is the accessed flag. For paging-structure
> >> : entries that map a page (as opposed to referencing another paging
> >> : structure), bit 6 is the dirty flag. These flags are provided for
> >> : use by memory-management software to manage the transfer of pages and
> >> : paging structures into and out of physical memory.
> >>
> >> : Whenever the processor uses a paging-structure entry as part of
> >> : linear-address translation, it sets the accessed flag in that entry
> >> : (if it is not already set).
> 
> Matthew, thanks for the pointer.
> 
> >
> > As far as I know x86 is the one that supports this.
> >
> >> The way they differentiate between the A and D bits makes it clear to
> >> me that the A bit is set at each level of the tree, but the D bit is
> >> only set on leaf entries.
> >
> > And the difference makes perfect sense (to me). Kudos to Intel.
> 
> Hi Yu,
> 
> You only introduced HAVE_ARCH_PARENT_PMD_YOUNG but no HAVE_ARCH_PARENT_PUD_YOUNG.
> Is PUD granularity too large to be useful for the multigenerational LRU algorithm?

Oh, sorry. I overlooked this part of the question.

Yes, you are right. We found no measurable performance difference
between using and not using the accessed bit on non-leaf PUD entries.

For the PMD case, the difference is tiny but still measurable on small
systems, e.g., laptops with 4GB memory. It's clear (a few percent in
kswapd) on servers with tens of GBs of 4KB pages.
Yu Zhao March 15, 2021, 3:16 a.m. UTC | #7
On Sun, Mar 14, 2021 at 04:22:03PM -0700, Dave Hansen wrote:
> On 3/12/21 11:57 PM, Yu Zhao wrote:
> > Some architectures support the accessed bit on non-leaf PMD entries
> > (parents) in addition to leaf PTE entries (children) where pages are
> > mapped, e.g., x86_64 sets the accessed bit on a parent when using it
> > as part of linear-address translation [1]. Page table walkers who are
> > interested in the accessed bit on children can take advantage of this:
> > they do not need to search the children when the accessed bit is not
> > set on a parent, given that they have previously cleared the accessed
> > bit on this parent in addition to its children.
> 
> I'd like to hear a *LOT* more about how this is going to be used.
> 
> The one part of this which is entirely missing is the interaction with
> the TLB and mid-level paging structure caches.  The CPU is pretty
> aggressive about setting non-leaf accessed bits when TLB entries are
> created.  This *looks* to be depending on that behavior, but it would be
> nice to spell it out explicitly.

Good point. Let me start with a couple of observations we've made:
  1) some applications create very sparse address spaces, for various
  reasons. A notable example is those using the Scudo memory allocator:
  they usually have double-digit numbers of PTE entries for each PMD
  entry (and thousands of VMAs for just a few hundred MBs of memory
  usage, sigh...).
  2) scans of an address space (from the reclaim path) are much less
  frequent than context switches of it. Under our heaviest memory
  pressure (30%+ overcommitted; guess how much we've profited from
  it :) ), the two frequencies still differ by orders of magnitude.
  Specifically, on our smallest system (2GB, with PCID), we observed
  no difference between flushing and not flushing TLB in terms of page
  selections. We actually observed more TLB misses under heavier
  memory pressure, and our theory is that this is due to increased
  memory footprint that causes the pressure.

There are two use cases for the accessed bit on non-leaf PMD entries:
hot tracking and cold tracking. I'll focus on cold tracking, which is
what this series is about.

Since non-leaf entries are more likely to be cached, in theory, the
false negative rate is higher compared with leaf entries, as the CPU
won't set the accessed bit again until the next TLB miss. (Here a
false negative means the accessed bit isn't set on an entry that has
been used after we cleared the accessed bit. And IIRC, there are also
false positives, i.e., the accessed bit is set on entries used by
speculative execution only.) But this is not a problem because of the
second observation above.

Now let's consider the worst case scenario: what happens when we hit
a false negative on a non-leaf PMD entry? We think the pages mapped
by the PTE entries of this PMD entry are inactive and try to reclaim
them, until we see the accessed bit set on one of the PTE entries.
This will cost us one futile attempt for all the 512 PTE entries. A
glance at lru_gen_scan_around() in the 11th patch would explain
exactly why. If you are guessing that function embodies the same idea
as "fault around", you are right.

And there are two places that could benefit from this patch (and the
next) immediately, independent of this series. One is
clear_refs_test_walk() in fs/proc/task_mmu.c. The other is
madvise_pageout_page_range() and madvise_cold_page_range() in
mm/madvise.c. Both are page table walkers that clear the accessed bit.

I think I've covered a lot of ground but I'm sure there is a lot more.
So please feel free to add and I'll include everything we discuss here
in the next version.
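
The cold-tracking idea described above can be sketched in userspace C. This is a toy model under stated assumptions, not the walker from this series: struct toy_pmd, toy_touch() and toy_scan() are hypothetical names, one PMD entry is assumed to cover 512 PTE entries (4KB pages), and TLB caching (the source of the false negatives discussed above) is not modeled. The scan relies on the precondition from the commit message: the previous scan cleared the accessed bit on the parent as well as on its children.

```c
#include <assert.h>
#include <stdbool.h>
#include <stddef.h>

#define TOY_PTES_PER_PMD 512	/* assumes 4KB pages, as in the 512 figure above */

/* Hypothetical model: one non-leaf PMD entry and the PTE table below it. */
struct toy_pmd {
	bool young;				/* accessed bit on the parent */
	bool pte_young[TOY_PTES_PER_PMD];	/* accessed bits on the children */
};

/* Model the CPU: using a translation sets the accessed bit at both levels. */
static void toy_touch(struct toy_pmd *pmd, size_t idx)
{
	assert(idx < TOY_PTES_PER_PMD);
	pmd->young = true;
	pmd->pte_young[idx] = true;
}

/*
 * Test and clear the accessed bits under one PMD entry.
 * Returns the number of young PTE entries found and stores the number of
 * PTE entries examined in *scanned.
 */
static size_t toy_scan(struct toy_pmd *pmd, size_t *scanned)
{
	size_t young = 0;

	*scanned = 0;

	/* Parent not accessed since the last scan: skip all children. */
	if (!pmd->young)
		return 0;

	/* Clear the parent too, so the next scan can take the shortcut. */
	pmd->young = false;

	for (size_t i = 0; i < TOY_PTES_PER_PMD; i++) {
		(*scanned)++;
		if (pmd->pte_young[i]) {
			pmd->pte_young[i] = false;
			young++;
		}
	}

	return young;
}
```

In this model, scanning an untouched PMD entry examines zero PTE entries, while scanning after a single toy_touch() examines all 512 and finds one young entry. For sparse address spaces like the ones in the first observation, most scans would take the early return.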

Patch

diff --git a/arch/Kconfig b/arch/Kconfig
index 2bb30673d8e6..137446d17732 100644
--- a/arch/Kconfig
+++ b/arch/Kconfig
@@ -783,6 +783,14 @@  config HAVE_ARCH_TRANSPARENT_HUGEPAGE
 config HAVE_ARCH_TRANSPARENT_HUGEPAGE_PUD
 	bool
 
+config HAVE_ARCH_PARENT_PMD_YOUNG
+	bool
+	help
+	  Architectures that select this are able to set the accessed bit on
+	  non-leaf PMD entries in addition to leaf PTE entries where pages are
+	  mapped. For them, page table walkers that clear the accessed bit may
+	  stop at non-leaf PMD entries when they do not see the accessed bit.
+
 config HAVE_ARCH_HUGE_VMAP
 	bool
 
diff --git a/arch/x86/Kconfig b/arch/x86/Kconfig
index 2792879d398e..b5972eb82337 100644
--- a/arch/x86/Kconfig
+++ b/arch/x86/Kconfig
@@ -163,6 +163,7 @@  config X86
 	select HAVE_ARCH_TRACEHOOK
 	select HAVE_ARCH_TRANSPARENT_HUGEPAGE
 	select HAVE_ARCH_TRANSPARENT_HUGEPAGE_PUD if X86_64
+	select HAVE_ARCH_PARENT_PMD_YOUNG	if X86_64
 	select HAVE_ARCH_USERFAULTFD_WP         if X86_64 && USERFAULTFD
 	select HAVE_ARCH_VMAP_STACK		if X86_64
 	select HAVE_ARCH_WITHIN_STACK_FRAMES
diff --git a/arch/x86/include/asm/pgtable.h b/arch/x86/include/asm/pgtable.h
index a02c67291cfc..a6b5cfe1fc5a 100644
--- a/arch/x86/include/asm/pgtable.h
+++ b/arch/x86/include/asm/pgtable.h
@@ -846,7 +846,7 @@  static inline unsigned long pmd_page_vaddr(pmd_t pmd)
 
 static inline int pmd_bad(pmd_t pmd)
 {
-	return (pmd_flags(pmd) & ~_PAGE_USER) != _KERNPG_TABLE;
+	return ((pmd_flags(pmd) | _PAGE_ACCESSED) & ~_PAGE_USER) != _KERNPG_TABLE;
 }
 
 static inline unsigned long pages_to_mb(unsigned long npg)
diff --git a/arch/x86/mm/pgtable.c b/arch/x86/mm/pgtable.c
index f6a9e2e36642..1c27e6f43f80 100644
--- a/arch/x86/mm/pgtable.c
+++ b/arch/x86/mm/pgtable.c
@@ -550,7 +550,7 @@  int ptep_test_and_clear_young(struct vm_area_struct *vma,
 	return ret;
 }
 
-#ifdef CONFIG_TRANSPARENT_HUGEPAGE
+#if defined(CONFIG_TRANSPARENT_HUGEPAGE) || defined(CONFIG_HAVE_ARCH_PARENT_PMD_YOUNG)
 int pmdp_test_and_clear_young(struct vm_area_struct *vma,
 			      unsigned long addr, pmd_t *pmdp)
 {
@@ -562,6 +562,9 @@  int pmdp_test_and_clear_young(struct vm_area_struct *vma,
 
 	return ret;
 }
+#endif
+
+#ifdef CONFIG_TRANSPARENT_HUGEPAGE
 int pudp_test_and_clear_young(struct vm_area_struct *vma,
 			      unsigned long addr, pud_t *pudp)
 {
diff --git a/include/linux/pgtable.h b/include/linux/pgtable.h
index 5e772392a379..08dd9b8c055a 100644
--- a/include/linux/pgtable.h
+++ b/include/linux/pgtable.h
@@ -193,7 +193,7 @@  static inline int ptep_test_and_clear_young(struct vm_area_struct *vma,
 #endif
 
 #ifndef __HAVE_ARCH_PMDP_TEST_AND_CLEAR_YOUNG
-#ifdef CONFIG_TRANSPARENT_HUGEPAGE
+#if defined(CONFIG_TRANSPARENT_HUGEPAGE) || defined(CONFIG_HAVE_ARCH_PARENT_PMD_YOUNG)
 static inline int pmdp_test_and_clear_young(struct vm_area_struct *vma,
 					    unsigned long address,
 					    pmd_t *pmdp)
@@ -214,7 +214,7 @@  static inline int pmdp_test_and_clear_young(struct vm_area_struct *vma,
 	BUILD_BUG();
 	return 0;
 }
-#endif /* CONFIG_TRANSPARENT_HUGEPAGE */
+#endif /* CONFIG_TRANSPARENT_HUGEPAGE || CONFIG_HAVE_ARCH_PARENT_PMD_YOUNG */
 #endif
 
 #ifndef __HAVE_ARCH_PTEP_CLEAR_YOUNG_FLUSH
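
The pmd_bad() hunk above can be illustrated with a small userspace sketch. The TOY_* constants are assumptions that mirror the usual x86 flag layout (present, writable, user, accessed and dirty in bits 0, 1, 2, 5 and 6) and treat the kernel page-table flag set as present | writable | accessed | dirty, ignoring the memory-encryption bit. The point it tries to show: once a walker clears the accessed bit on a non-leaf PMD entry, the old check would report a valid entry as bad, whereas the new check ignores that bit.

```c
#include <assert.h>
#include <stdbool.h>
#include <stdint.h>

/* Assumed flag layout, modeled on x86; the names are illustrative only. */
#define TOY_PAGE_PRESENT  (UINT64_C(1) << 0)
#define TOY_PAGE_RW       (UINT64_C(1) << 1)
#define TOY_PAGE_USER     (UINT64_C(1) << 2)
#define TOY_PAGE_ACCESSED (UINT64_C(1) << 5)
#define TOY_PAGE_DIRTY    (UINT64_C(1) << 6)

/* Assumed stand-in for _KERNPG_TABLE, without the encryption bit. */
#define TOY_KERNPG_TABLE \
	(TOY_PAGE_PRESENT | TOY_PAGE_RW | TOY_PAGE_ACCESSED | TOY_PAGE_DIRTY)

/* The check before this patch: the accessed bit must be set. */
static bool toy_pmd_bad_old(uint64_t flags)
{
	return (flags & ~TOY_PAGE_USER) != TOY_KERNPG_TABLE;
}

/* The check after this patch: the accessed bit is ignored. */
static bool toy_pmd_bad_new(uint64_t flags)
{
	return ((flags | TOY_PAGE_ACCESSED) & ~TOY_PAGE_USER) != TOY_KERNPG_TABLE;
}
```

For a user page-table entry with the accessed bit cleared (TOY_KERNPG_TABLE | TOY_PAGE_USER with TOY_PAGE_ACCESSED masked off), toy_pmd_bad_old() returns true and toy_pmd_bad_new() returns false. An all-zero entry is still reported as bad by both.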