Message ID | 20210313075747.3781593-7-yuzhao@google.com (mailing list archive) |
---|---|
State | New, archived |
Series | Multigenerational LRU |
On 13 Mar 2021, at 2:57, Yu Zhao wrote:

> Some architectures support the accessed bit on non-leaf PMD entries
> (parents) in addition to leaf PTE entries (children) where pages are
> mapped, e.g., x86_64 sets the accessed bit on a parent when using it
> as part of linear-address translation [1]. Page table walkers who are
> interested in the accessed bit on children can take advantage of this:
> they do not need to search the children when the accessed bit is not
> set on a parent, given that they have previously cleared the accessed
> bit on this parent in addition to its children.
>
> [1]: Intel 64 and IA-32 Architectures Software Developer's Manual
> Volume 3 (October 2019), section 4.8

Just curious. Does this also apply to non-leaf PUD entries? Do you mind
sharing which sentence from the manual gives the information?

Thanks.

--
Best Regards,
Yan Zi
On Sun, Mar 14, 2021 at 06:12:42PM -0400, Zi Yan wrote:
> On 13 Mar 2021, at 2:57, Yu Zhao wrote:
>
> > Some architectures support the accessed bit on non-leaf PMD entries
> > (parents) in addition to leaf PTE entries (children) where pages are
> > mapped, e.g., x86_64 sets the accessed bit on a parent when using it
> > as part of linear-address translation [1]. Page table walkers who are
> > interested in the accessed bit on children can take advantage of this:
> > they do not need to search the children when the accessed bit is not
> > set on a parent, given that they have previously cleared the accessed
> > bit on this parent in addition to its children.
> >
> > [1]: Intel 64 and IA-32 Architectures Software Developer's Manual
> > Volume 3 (October 2019), section 4.8
>
> Just curious. Does this also apply to non-leaf PUD entries? Do you
> mind sharing which sentence from the manual gives the information?

The first few sentences from 4.8:

: For any paging-structure entry that is used during linear-address
: translation, bit 5 is the accessed flag. For paging-structure
: entries that map a page (as opposed to referencing another paging
: structure), bit 6 is the dirty flag. These flags are provided for
: use by memory-management software to manage the transfer of pages and
: paging structures into and out of physical memory.

: Whenever the processor uses a paging-structure entry as part of
: linear-address translation, it sets the accessed flag in that entry
: (if it is not already set).

The way they differentiate between the A and D bits makes it clear to
me that the A bit is set at each level of the tree, but the D bit is
only set on leaf entries.
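To make the A/D distinction above concrete, here is a minimal user-space sketch. It only models the rule quoted from section 4.8 (bit 5 is the accessed flag on every entry used in a translation, bit 6 is the dirty flag on entries that map a page). The names `ENTRY_ACCESSED`, `ENTRY_DIRTY` and `simulate_access()` are made up for illustration and are not kernel or hardware interfaces.

```c
#include <assert.h>
#include <stdbool.h>
#include <stdint.h>

/* Bit positions as quoted from SDM section 4.8; macro names are hypothetical. */
#define ENTRY_ACCESSED (UINT64_C(1) << 5) /* any entry used in a translation */
#define ENTRY_DIRTY    (UINT64_C(1) << 6) /* only entries that map a page */

static bool entry_accessed(uint64_t entry)
{
	return (entry & ENTRY_ACCESSED) != 0;
}

static bool entry_dirty(uint64_t entry)
{
	return (entry & ENTRY_DIRTY) != 0;
}

/*
 * Toy model of one translation: walk[0..n-2] are non-leaf entries,
 * walk[n-1] is the leaf. The accessed flag is set at every level,
 * the dirty flag only on the leaf and only for writes.
 */
static void simulate_access(uint64_t *walk[], int n, bool is_write)
{
	assert(n > 0);

	for (int i = 0; i < n; i++)
		*walk[i] |= ENTRY_ACCESSED;

	if (is_write)
		*walk[n - 1] |= ENTRY_DIRTY;
}
```

After a simulated write through a four-level walk, every level reports accessed, while only the leaf reports dirty.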
On 3/12/21 11:57 PM, Yu Zhao wrote:
> Some architectures support the accessed bit on non-leaf PMD entries
> (parents) in addition to leaf PTE entries (children) where pages are
> mapped, e.g., x86_64 sets the accessed bit on a parent when using it
> as part of linear-address translation [1]. Page table walkers who are
> interested in the accessed bit on children can take advantage of this:
> they do not need to search the children when the accessed bit is not
> set on a parent, given that they have previously cleared the accessed
> bit on this parent in addition to its children.

I'd like to hear a *LOT* more about how this is going to be used.

The one part of this which is entirely missing is the interaction with
the TLB and mid-level paging structure caches. The CPU is pretty
aggressive about setting non-leaf accessed bits when TLB entries are
created. This *looks* to be depending on that behavior, but it would be
nice to spell it out explicitly.
On Sun, Mar 14, 2021 at 10:51:03PM +0000, Matthew Wilcox wrote:
> On Sun, Mar 14, 2021 at 06:12:42PM -0400, Zi Yan wrote:
> > On 13 Mar 2021, at 2:57, Yu Zhao wrote:
> >
> > > Some architectures support the accessed bit on non-leaf PMD entries
> > > (parents) in addition to leaf PTE entries (children) where pages are
> > > mapped, e.g., x86_64 sets the accessed bit on a parent when using it
> > > as part of linear-address translation [1]. Page table walkers who are
> > > interested in the accessed bit on children can take advantage of this:
> > > they do not need to search the children when the accessed bit is not
> > > set on a parent, given that they have previously cleared the accessed
> > > bit on this parent in addition to its children.
> > >
> > > [1]: Intel 64 and IA-32 Architectures Software Developer's Manual
> > > Volume 3 (October 2019), section 4.8
> >
> > Just curious. Does this also apply to non-leaf PUD entries? Do you
> > mind sharing which sentence from the manual gives the information?
>
> The first few sentences from 4.8:
>
> : For any paging-structure entry that is used during linear-address
> : translation, bit 5 is the accessed flag. For paging-structure
> : entries that map a page (as opposed to referencing another paging
> : structure), bit 6 is the dirty flag. These flags are provided for
> : use by memory-management software to manage the transfer of pages and
> : paging structures into and out of physical memory.
>
> : Whenever the processor uses a paging-structure entry as part of
> : linear-address translation, it sets the accessed flag in that entry
> : (if it is not already set).

As far as I know x86 is the one that supports this.

> The way they differentiate between the A and D bits makes it clear to
> me that the A bit is set at each level of the tree, but the D bit is
> only set on leaf entries.

And the difference makes perfect sense (to me). Kudos to Intel.
On 14 Mar 2021, at 20:03, Yu Zhao wrote:

> On Sun, Mar 14, 2021 at 10:51:03PM +0000, Matthew Wilcox wrote:
>> On Sun, Mar 14, 2021 at 06:12:42PM -0400, Zi Yan wrote:
>>> On 13 Mar 2021, at 2:57, Yu Zhao wrote:
>>>
>>>> Some architectures support the accessed bit on non-leaf PMD entries
>>>> (parents) in addition to leaf PTE entries (children) where pages are
>>>> mapped, e.g., x86_64 sets the accessed bit on a parent when using it
>>>> as part of linear-address translation [1]. Page table walkers who are
>>>> interested in the accessed bit on children can take advantage of this:
>>>> they do not need to search the children when the accessed bit is not
>>>> set on a parent, given that they have previously cleared the accessed
>>>> bit on this parent in addition to its children.
>>>>
>>>> [1]: Intel 64 and IA-32 Architectures Software Developer's Manual
>>>> Volume 3 (October 2019), section 4.8
>>>
>>> Just curious. Does this also apply to non-leaf PUD entries? Do you
>>> mind sharing which sentence from the manual gives the information?
>>
>> The first few sentences from 4.8:
>>
>> : For any paging-structure entry that is used during linear-address
>> : translation, bit 5 is the accessed flag. For paging-structure
>> : entries that map a page (as opposed to referencing another paging
>> : structure), bit 6 is the dirty flag. These flags are provided for
>> : use by memory-management software to manage the transfer of pages and
>> : paging structures into and out of physical memory.
>>
>> : Whenever the processor uses a paging-structure entry as part of
>> : linear-address translation, it sets the accessed flag in that entry
>> : (if it is not already set).

Matthew, thanks for the pointer.

>
> As far as I know x86 is the one that supports this.
>
>> The way they differentiate between the A and D bits makes it clear to
>> me that the A bit is set at each level of the tree, but the D bit is
>> only set on leaf entries.
>
> And the difference makes perfect sense (to me). Kudos to Intel.

Hi Yu,

You only introduced HAVE_ARCH_PARENT_PMD_YOUNG but no HAVE_ARCH_PARENT_PUD_YOUNG.
Is PUD granularity too large to be useful for the multigenerational LRU algorithm?

Thanks.

--
Best Regards,
Yan Zi
On Sun, Mar 14, 2021 at 08:27:29PM -0400, Zi Yan wrote:
> On 14 Mar 2021, at 20:03, Yu Zhao wrote:
>
> > On Sun, Mar 14, 2021 at 10:51:03PM +0000, Matthew Wilcox wrote:
> >> On Sun, Mar 14, 2021 at 06:12:42PM -0400, Zi Yan wrote:
> >>> On 13 Mar 2021, at 2:57, Yu Zhao wrote:
> >>>
> >>>> Some architectures support the accessed bit on non-leaf PMD entries
> >>>> (parents) in addition to leaf PTE entries (children) where pages are
> >>>> mapped, e.g., x86_64 sets the accessed bit on a parent when using it
> >>>> as part of linear-address translation [1]. Page table walkers who are
> >>>> interested in the accessed bit on children can take advantage of this:
> >>>> they do not need to search the children when the accessed bit is not
> >>>> set on a parent, given that they have previously cleared the accessed
> >>>> bit on this parent in addition to its children.
> >>>>
> >>>> [1]: Intel 64 and IA-32 Architectures Software Developer's Manual
> >>>> Volume 3 (October 2019), section 4.8
> >>>
> >>> Just curious. Does this also apply to non-leaf PUD entries? Do you
> >>> mind sharing which sentence from the manual gives the information?
> >>
> >> The first few sentences from 4.8:
> >>
> >> : For any paging-structure entry that is used during linear-address
> >> : translation, bit 5 is the accessed flag. For paging-structure
> >> : entries that map a page (as opposed to referencing another paging
> >> : structure), bit 6 is the dirty flag. These flags are provided for
> >> : use by memory-management software to manage the transfer of pages and
> >> : paging structures into and out of physical memory.
> >>
> >> : Whenever the processor uses a paging-structure entry as part of
> >> : linear-address translation, it sets the accessed flag in that entry
> >> : (if it is not already set).
>
> Matthew, thanks for the pointer.
>
> >
> > As far as I know x86 is the one that supports this.
> >
> >> The way they differentiate between the A and D bits makes it clear to
> >> me that the A bit is set at each level of the tree, but the D bit is
> >> only set on leaf entries.
> >
> > And the difference makes perfect sense (to me). Kudos to Intel.
>
> Hi Yu,
>
> You only introduced HAVE_ARCH_PARENT_PMD_YOUNG but no HAVE_ARCH_PARENT_PUD_YOUNG.
> Is it PUD granularity too large to be useful for multigenerational LRU algorithm?

Oh, sorry. I overlooked this part of the question.

Yes, you are right. We found no measurable performance difference
between using and not using the accessed bit on non-leaf PUD entries.

For the PMD case, the difference is tiny but still measurable on small
systems, e.g., laptops with 4GB memory. It's clear (a few percent in
kswapd) on servers with tens of GBs of 4KB pages.
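For a sense of scale behind the granularity question, the arithmetic below shows how much address space one non-leaf entry covers on x86_64 with 4KB pages and 512-entry tables. This is an illustrative sketch only: the macro and function names are made up, and the thread reports measurements rather than this calculation, so treat it as one plausible intuition and not as the stated reason.

```c
#include <assert.h>
#include <stdint.h>

#define PAGE_SHIFT  12 /* 4KB base pages */
#define TABLE_SHIFT 9  /* 512 entries per paging structure */

/* Address range covered by one non-leaf entry at each level. */
#define PMD_ENTRY_SPAN (UINT64_C(1) << (PAGE_SHIFT + TABLE_SHIFT))
#define PUD_ENTRY_SPAN (UINT64_C(1) << (PAGE_SHIFT + 2 * TABLE_SHIFT))

static_assert(PMD_ENTRY_SPAN == (UINT64_C(2) << 20), "one PMD entry spans 2MB");
static_assert(PUD_ENTRY_SPAN == (UINT64_C(1) << 30), "one PUD entry spans 1GB");

/* Entries of the given span needed to cover mem_bytes, rounded up. */
static uint64_t entries_to_cover(uint64_t mem_bytes, uint64_t span)
{
	return (mem_bytes + span - 1) / span;
}
```

For a densely mapped 4GB range, `entries_to_cover()` gives 4 PUD entries versus 2048 PMD entries, so a single access anywhere in a 1GB region is enough to set the bit on its PUD entry.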
On Sun, Mar 14, 2021 at 04:22:03PM -0700, Dave Hansen wrote:
> On 3/12/21 11:57 PM, Yu Zhao wrote:
> > Some architectures support the accessed bit on non-leaf PMD entries
> > (parents) in addition to leaf PTE entries (children) where pages are
> > mapped, e.g., x86_64 sets the accessed bit on a parent when using it
> > as part of linear-address translation [1]. Page table walkers who are
> > interested in the accessed bit on children can take advantage of this:
> > they do not need to search the children when the accessed bit is not
> > set on a parent, given that they have previously cleared the accessed
> > bit on this parent in addition to its children.
>
> I'd like to hear a *LOT* more about how this is going to be used.
>
> The one part of this which is entirely missing is the interaction with
> the TLB and mid-level paging structure caches. The CPU is pretty
> aggressive about setting non-leaf accessed bits when TLB entries are
> created. This *looks* to be depending on that behavior, but it would be
> nice to spell it out explicitly.

Good point. Let me start with a couple of observations we've made:
1) some applications create very sparse address spaces, for various
   reasons. A notable example is those using Scudo memory
   allocations: they usually have double-digit numbers of PTE entries
   for each PMD entry (and thousands of VMAs for just a few hundred
   MBs of memory usage, sigh...).
2) scans of an address space (from the reclaim path) are much less
   frequent than context switches of it. Under our heaviest memory
   pressure (30%+ overcommitted; guess how much we've profited from
   it :) ), their magnitudes are still on different orders.
   Specifically, on our smallest system (2GB, with PCID), we observed
   no difference between flushing and not flushing TLB in terms of
   page selections. We actually observed more TLB misses under
   heavier memory pressure, and our theory is that this is due to
   increased memory footprint that causes the pressure.

There are two use cases for the accessed bit on non-leaf PMD entries:
the hot tracking and the cold tracking. I'll focus on the cold
tracking, which is what this series is about.

Since non-leaf entries are more likely to be cached, in theory, the
false negative rate is higher compared with leaf entries as the CPU
won't set the accessed bit again until the next TLB miss. (Here a
false negative means the accessed bit isn't set on an entry that has
been used, after we cleared the accessed bit. And IIRC, there are also
false positives, i.e., the accessed bit is set on entries used by
speculative execution only.) But this is not a problem because of the
second observation aforementioned.

Now let's consider the worst case scenario: what happens when we hit a
false negative on a non-leaf PMD entry? We think the pages mapped by
the PTE entries of this PMD entry are inactive and try to reclaim
them, until we see the accessed bit set on one of the PTE entries.
This will cost us one futile attempt for all the 512 PTE entries. A
glance at lru_gen_scan_around() in the 11th patch would explain
exactly why. If you are guessing that function embodies the same idea
of "fault around", you are right.

And there are two places that could benefit from this patch (and the
next) immediately, independent of this series. One is
clear_refs_test_walk() in fs/proc/task_mmu.c. The other is
madvise_pageout_page_range() and madvise_cold_page_range() in
mm/madvise.c. Both are page table walkers that clear the accessed bit.

I think I've covered a lot of ground but I'm sure there is a lot more.
So please feel free to add and I'll include everything we discuss here
in the next version.
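The cold-tracking idea above can be sketched in user space as follows. This is a toy model, not the series' implementation: `struct toy_pmd`, `test_and_clear_accessed()` and `scan_pmd()` are hypothetical names, and the accessed bit position (bit 5) is taken from the SDM text quoted earlier in the thread. The sketch assumes a previous pass already cleared the accessed bit on the parent and on all of its children.

```c
#include <assert.h>
#include <stdbool.h>
#include <stddef.h>
#include <stdint.h>

#define TOY_PTES_PER_PMD 512               /* PTE entries under one PMD entry */
#define TOY_ACCESSED     (UINT64_C(1) << 5)

/* Hypothetical model: one non-leaf PMD entry and the PTEs below it. */
struct toy_pmd {
	uint64_t entry;                   /* parent (non-leaf PMD entry) */
	uint64_t ptes[TOY_PTES_PER_PMD];  /* children (leaf PTE entries) */
};

/* Return whether the accessed bit was set, and clear it. */
static bool test_and_clear_accessed(uint64_t *entry)
{
	bool young = (*entry & TOY_ACCESSED) != 0;

	*entry &= ~TOY_ACCESSED;
	return young;
}

/*
 * Scan one PMD range and return the number of young PTEs found.
 * *examined is incremented once per PTE actually looked at. When the
 * parent's accessed bit is clear, all 512 children are skipped.
 */
static size_t scan_pmd(struct toy_pmd *pmd, size_t *examined)
{
	size_t young = 0;

	assert(pmd && examined);

	if (!test_and_clear_accessed(&pmd->entry))
		return 0; /* parent unused since the last pass */

	for (size_t i = 0; i < TOY_PTES_PER_PMD; i++) {
		(*examined)++;
		if (test_and_clear_accessed(&pmd->ptes[i]))
			young++;
	}
	return young;
}
```

In this model a false negative on the parent means `scan_pmd()` returns 0 for a range that was in fact used, so its pages are treated as cold until a later check of the PTEs themselves finds an accessed bit, which matches the single futile attempt described above.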
diff --git a/arch/Kconfig b/arch/Kconfig
index 2bb30673d8e6..137446d17732 100644
--- a/arch/Kconfig
+++ b/arch/Kconfig
@@ -783,6 +783,14 @@ config HAVE_ARCH_TRANSPARENT_HUGEPAGE
 config HAVE_ARCH_TRANSPARENT_HUGEPAGE_PUD
 	bool
 
+config HAVE_ARCH_PARENT_PMD_YOUNG
+	bool
+	help
+	  Architectures that select this are able to set the accessed bit on
+	  non-leaf PMD entries in addition to leaf PTE entries where pages are
+	  mapped. For them, page table walkers that clear the accessed bit may
+	  stop at non-leaf PMD entries when they do not see the accessed bit.
+
 config HAVE_ARCH_HUGE_VMAP
 	bool
 
diff --git a/arch/x86/Kconfig b/arch/x86/Kconfig
index 2792879d398e..b5972eb82337 100644
--- a/arch/x86/Kconfig
+++ b/arch/x86/Kconfig
@@ -163,6 +163,7 @@ config X86
 	select HAVE_ARCH_TRACEHOOK
 	select HAVE_ARCH_TRANSPARENT_HUGEPAGE
 	select HAVE_ARCH_TRANSPARENT_HUGEPAGE_PUD if X86_64
+	select HAVE_ARCH_PARENT_PMD_YOUNG if X86_64
 	select HAVE_ARCH_USERFAULTFD_WP if X86_64 && USERFAULTFD
 	select HAVE_ARCH_VMAP_STACK if X86_64
 	select HAVE_ARCH_WITHIN_STACK_FRAMES
diff --git a/arch/x86/include/asm/pgtable.h b/arch/x86/include/asm/pgtable.h
index a02c67291cfc..a6b5cfe1fc5a 100644
--- a/arch/x86/include/asm/pgtable.h
+++ b/arch/x86/include/asm/pgtable.h
@@ -846,7 +846,7 @@ static inline unsigned long pmd_page_vaddr(pmd_t pmd)
 
 static inline int pmd_bad(pmd_t pmd)
 {
-	return (pmd_flags(pmd) & ~_PAGE_USER) != _KERNPG_TABLE;
+	return ((pmd_flags(pmd) | _PAGE_ACCESSED) & ~_PAGE_USER) != _KERNPG_TABLE;
 }
 
 static inline unsigned long pages_to_mb(unsigned long npg)
diff --git a/arch/x86/mm/pgtable.c b/arch/x86/mm/pgtable.c
index f6a9e2e36642..1c27e6f43f80 100644
--- a/arch/x86/mm/pgtable.c
+++ b/arch/x86/mm/pgtable.c
@@ -550,7 +550,7 @@ int ptep_test_and_clear_young(struct vm_area_struct *vma,
 	return ret;
 }
 
-#ifdef CONFIG_TRANSPARENT_HUGEPAGE
+#if defined(CONFIG_TRANSPARENT_HUGEPAGE) || defined(CONFIG_HAVE_ARCH_PARENT_PMD_YOUNG)
 int pmdp_test_and_clear_young(struct vm_area_struct *vma,
 			      unsigned long addr, pmd_t *pmdp)
 {
@@ -562,6 +562,9 @@ int pmdp_test_and_clear_young(struct vm_area_struct *vma,
 
 	return ret;
 }
+#endif
+
+#ifdef CONFIG_TRANSPARENT_HUGEPAGE
 int pudp_test_and_clear_young(struct vm_area_struct *vma,
 			      unsigned long addr, pud_t *pudp)
 {
diff --git a/include/linux/pgtable.h b/include/linux/pgtable.h
index 5e772392a379..08dd9b8c055a 100644
--- a/include/linux/pgtable.h
+++ b/include/linux/pgtable.h
@@ -193,7 +193,7 @@ static inline int ptep_test_and_clear_young(struct vm_area_struct *vma,
 #endif
 
 #ifndef __HAVE_ARCH_PMDP_TEST_AND_CLEAR_YOUNG
-#ifdef CONFIG_TRANSPARENT_HUGEPAGE
+#if defined(CONFIG_TRANSPARENT_HUGEPAGE) || defined(CONFIG_HAVE_ARCH_PARENT_PMD_YOUNG)
 static inline int pmdp_test_and_clear_young(struct vm_area_struct *vma,
 					    unsigned long address,
 					    pmd_t *pmdp)
@@ -214,7 +214,7 @@ static inline int pmdp_test_and_clear_young(struct vm_area_struct *vma,
 	BUILD_BUG();
 	return 0;
 }
-#endif /* CONFIG_TRANSPARENT_HUGEPAGE */
+#endif /* CONFIG_TRANSPARENT_HUGEPAGE || CONFIG_HAVE_ARCH_PARENT_PMD_YOUNG */
 #endif
 
 #ifndef __HAVE_ARCH_PTEP_CLEAR_YOUNG_FLUSH
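The pmd_bad() hunk is not explained in the thread. A likely reading is that once a walker clears the accessed bit on a non-leaf PMD entry, the old comparison against _KERNPG_TABLE would start reporting a valid table entry as bad. The sketch below checks that reasoning in user space. It assumes _KERNPG_TABLE is present | RW | accessed | dirty (ignoring the memory-encryption bit) and uses the standard x86 flag bit positions; the unprefixed macro names and the `pmd_bad_old()`/`pmd_bad_new()` helpers are stand-ins, not the kernel definitions.

```c
#include <assert.h>
#include <stdbool.h>
#include <stdint.h>

/* Stand-ins for the x86 page flag bits (assumed layout, see lead-in). */
#define PAGE_PRESENT  (UINT64_C(1) << 0)
#define PAGE_RW       (UINT64_C(1) << 1)
#define PAGE_USER     (UINT64_C(1) << 2)
#define PAGE_ACCESSED (UINT64_C(1) << 5)
#define PAGE_DIRTY    (UINT64_C(1) << 6)

/* Assumed composition of _KERNPG_TABLE, without the encryption bit. */
#define KERNPG_TABLE (PAGE_PRESENT | PAGE_RW | PAGE_ACCESSED | PAGE_DIRTY)

static_assert((KERNPG_TABLE & PAGE_ACCESSED) != 0,
	      "the reference value includes the accessed bit");

/* The check before the patch: the accessed bit must be set. */
static bool pmd_bad_old(uint64_t flags)
{
	return (flags & ~PAGE_USER) != KERNPG_TABLE;
}

/* The check after the patch: the accessed bit is ignored. */
static bool pmd_bad_new(uint64_t flags)
{
	return ((flags | PAGE_ACCESSED) & ~PAGE_USER) != KERNPG_TABLE;
}
```

With these definitions, an entry equal to `KERNPG_TABLE | PAGE_USER` with the accessed bit cleared is rejected by the old check and accepted by the new one, while an entry that also lacks the RW bit is still rejected.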
Some architectures support the accessed bit on non-leaf PMD entries
(parents) in addition to leaf PTE entries (children) where pages are
mapped, e.g., x86_64 sets the accessed bit on a parent when using it
as part of linear-address translation [1]. Page table walkers who are
interested in the accessed bit on children can take advantage of this:
they do not need to search the children when the accessed bit is not
set on a parent, given that they have previously cleared the accessed
bit on this parent in addition to its children.

[1]: Intel 64 and IA-32 Architectures Software Developer's Manual
     Volume 3 (October 2019), section 4.8

Signed-off-by: Yu Zhao <yuzhao@google.com>
---
 arch/Kconfig                   | 8 ++++++++
 arch/x86/Kconfig               | 1 +
 arch/x86/include/asm/pgtable.h | 2 +-
 arch/x86/mm/pgtable.c          | 5 ++++-
 include/linux/pgtable.h        | 4 ++--
 5 files changed, 16 insertions(+), 4 deletions(-)