| Message ID | alpine.DEB.2.21.1807091749150.114630@chino.kir.corp.google.com (mailing list archive) |
|---|---|
| State | New, archived |
On Mon, 9 Jul 2018 17:50:03 -0700 (PDT) David Rientjes <rientjes@google.com> wrote:

> When perf profiling a wide variety of different workloads, it was found
> that vmacache_find() had higher than expected cost: up to 0.08% of cpu
> utilization in some cases. This was found to rival other core VM
> functions such as alloc_pages_vma() with thp enabled and default
> mempolicy, and the conditionals in __get_vma_policy().
>
> VMACACHE_HASH() determines which of the four per-task_struct slots a vma
> is cached for a particular address. This currently depends on the pfn,
> so pfn 5212 occupies a different vmacache slot than its neighboring
> pfn 5213.
>
> vmacache_find() iterates through all four of current's vmacache slots
> when looking up an address. Hashing based on pfn, an address has
> ~1/VMACACHE_SIZE chance of being cached in the first vmacache slot, or
> about 25%, *if* the vma is cached.
>
> This patch hashes an address by its pmd instead of pte to optimize for
> workloads with good spatial locality. This results in a higher
> probability of vmas being cached in the first slot that is checked:
> normally ~70% on the same workloads instead of 25%.

Was the improvement quantifiable?

Surprised. That little array will all be in CPU cache and that loop
should execute pretty quickly? If it's *that* sensitive then let's zap
the no-longer-needed WARN_ON. And we could hide all the event counting
behind some developer-only ifdef.

Did you consider LRU-sorting the array instead?
On Mon, 9 Jul 2018, Andrew Morton wrote:

> > When perf profiling a wide variety of different workloads, it was found
> > that vmacache_find() had higher than expected cost: up to 0.08% of cpu
> > utilization in some cases. This was found to rival other core VM
> > functions such as alloc_pages_vma() with thp enabled and default
> > mempolicy, and the conditionals in __get_vma_policy().
> >
> > VMACACHE_HASH() determines which of the four per-task_struct slots a vma
> > is cached for a particular address. This currently depends on the pfn,
> > so pfn 5212 occupies a different vmacache slot than its neighboring
> > pfn 5213.
> >
> > vmacache_find() iterates through all four of current's vmacache slots
> > when looking up an address. Hashing based on pfn, an address has
> > ~1/VMACACHE_SIZE chance of being cached in the first vmacache slot, or
> > about 25%, *if* the vma is cached.
> >
> > This patch hashes an address by its pmd instead of pte to optimize for
> > workloads with good spatial locality. This results in a higher
> > probability of vmas being cached in the first slot that is checked:
> > normally ~70% on the same workloads instead of 25%.
>
> Was the improvement quantifiable?
>

I've run page fault testing to answer this question on Haswell, since the
initial profiling was done over a wide variety of user-controlled
workloads and there's no guarantee that such profiling would be a fair
comparison either way. For page faulting, the overhead either falls below
our testing threshold of 0.02% or is right at 0.02%; without the patch it
is 0.05-0.06%.

> Surprised. That little array will all be in CPU cache and that loop
> should execute pretty quickly? If it's *that* sensitive then let's zap
> the no-longer-needed WARN_ON. And we could hide all the event counting
> behind some developer-only ifdef.
>

Those vmevents are only defined for CONFIG_DEBUG_VM_VMACACHE, so no change
is needed there. The WARN_ON() could be moved under the same config
option; I assume that if such a config option exists, at least somebody is
interested in debugging mm/vmacache.c once in a while.

> Did you consider LRU-sorting the array instead?
>

It adds 40 bytes to struct task_struct, but I'm not sure the least
recently used is the first preferred check. If I do
madvise(MADV_DONTNEED) from a malloc implementation where I don't control
what is free()'d and I'm constantly freeing back to the same hugepages,
for example, I may always get first slot cache hits with this patch as
opposed to the 25% chance that the current implementation has (and perhaps
an lru would as well).

I'm sure that I could construct a workload where LRU would be better and
could show that the added footprint were worthwhile, but I could also
construct a workload where the current implementation based on pfn would
outperform all of these. It simply turns out that on the user-controlled
workloads that I was profiling that hashing based on pmd was the win.
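For concreteness, here is a minimal sketch of what gating the sanity check behind the existing CONFIG_DEBUG_VM_VMACACHE option could look like in the current (pre-patch) vmacache_find() loop. This is illustrative only and was not part of the posted patch; the event counters shown are already no-ops when that option is disabled.

```c
	/*
	 * Illustrative sketch only, not part of the posted patch: the
	 * VMACACHE_FIND_* counters are compiled out unless
	 * CONFIG_DEBUG_VM_VMACACHE is set, and the same option could
	 * guard the cross-mm sanity check.
	 */
	for (i = 0; i < VMACACHE_SIZE; i++) {
		struct vm_area_struct *vma = current->vmacache.vmas[i];

		if (!vma)
			continue;
#ifdef CONFIG_DEBUG_VM_VMACACHE
		/* Debug-only: a cached vma must belong to this mm. */
		if (WARN_ON_ONCE(vma->vm_mm != mm))
			break;
#endif
		if (vma->vm_start <= addr && vma->vm_end > addr) {
			count_vm_vmacache_event(VMACACHE_FIND_HITS);
			return vma;
		}
	}
```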
On Mon, 9 Jul 2018 18:37:37 -0700 (PDT) David Rientjes <rientjes@google.com> wrote:

> > Did you consider LRU-sorting the array instead?
> >
>
> It adds 40 bytes to struct task_struct,

What does? LRU sort? It's a 4-entry array, just do it in place, like
bh_lru_install(). Confused.

> but I'm not sure the least
> recently used is the first preferred check. If I do
> madvise(MADV_DONTNEED) from a malloc implementation where I don't control
> what is free()'d and I'm constantly freeing back to the same hugepages,
> for example, I may always get first slot cache hits with this patch as
> opposed to the 25% chance that the current implementation has (and perhaps
> an lru would as well).
>
> I'm sure that I could construct a workload where LRU would be better and
> could show that the added footprint were worthwhile, but I could also
> construct a workload where the current implementation based on pfn would
> outperform all of these. It simply turns out that on the user-controlled
> workloads that I was profiling that hashing based on pmd was the win.

That leaves us nowhere to go. Zapping the WARN_ON seems a no-brainer
though?
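A rough sketch of the kind of in-place promotion being suggested, in the spirit of bh_lru_install(), might look like the following. The helper name is hypothetical and this is not a posted or tested patch; it only illustrates that a 4-entry move-to-front needs no extra storage in struct task_struct.

```c
/*
 * Hypothetical sketch, not a posted patch: on a hit in slot 'hit',
 * slide the preceding entries down and put the matching vma in slot 0,
 * so the most recently used entry is always probed first.
 */
static inline void vmacache_promote(int hit)
{
	struct vm_area_struct **vmas = current->vmacache.vmas;
	struct vm_area_struct *vma = vmas[hit];

	while (hit > 0) {
		vmas[hit] = vmas[hit - 1];
		hit--;
	}
	vmas[0] = vma;
}
```

A lookup would then scan from slot 0 and call something like this on a hit; whether the extra stores on every hit are worth it is exactly the trade-off being debated above.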
On Wed, 11 Jul 2018, Andrew Morton wrote:

> > > Did you consider LRU-sorting the array instead?
> > >
> >
> > It adds 40 bytes to struct task_struct,
>
> What does? LRU sort? It's a 4-entry array, just do it in place, like
> bh_lru_install(). Confused.
>

I was imagining an optimized sort rather than adding an iteration to
vmacache_update() of the same form that causes vmacache_find() to show up
on my perf reports in the first place.

> > but I'm not sure the least
> > recently used is the first preferred check. If I do
> > madvise(MADV_DONTNEED) from a malloc implementation where I don't control
> > what is free()'d and I'm constantly freeing back to the same hugepages,
> > for example, I may always get first slot cache hits with this patch as
> > opposed to the 25% chance that the current implementation has (and perhaps
> > an lru would as well).
> >
> > I'm sure that I could construct a workload where LRU would be better and
> > could show that the added footprint were worthwhile, but I could also
> > construct a workload where the current implementation based on pfn would
> > outperform all of these. It simply turns out that on the user-controlled
> > workloads that I was profiling that hashing based on pmd was the win.
>
> That leaves us nowhere to go. Zapping the WARN_ON seems a no-brainer
> though?
>

I would suggest it goes under CONFIG_DEBUG_VM_VMACACHE.

My implementation for the optimized vmacache_find() is based on the
premise that spatial locality matters, and in practice on random
user-controlled workloads this yields a faster lookup than the current
implementation. Of course, any caching technique can be defeated by
workloads, artificial or otherwise, but I suggest that, as a general
principle, caching based on PMD_SHIFT rather than pfn has a greater
likelihood of avoiding the iteration in vmacache_find(), because of
spatial locality, for anything that iterates over a range of memory.
```diff
diff --git a/include/linux/vmacache.h b/include/linux/vmacache.h
--- a/include/linux/vmacache.h
+++ b/include/linux/vmacache.h
@@ -5,12 +5,6 @@
 #include <linux/sched.h>
 #include <linux/mm.h>
 
-/*
- * Hash based on the page number. Provides a good hit rate for
- * workloads with good locality and those with random accesses as well.
- */
-#define VMACACHE_HASH(addr) ((addr >> PAGE_SHIFT) & VMACACHE_MASK)
-
 static inline void vmacache_flush(struct task_struct *tsk)
 {
 	memset(tsk->vmacache.vmas, 0, sizeof(tsk->vmacache.vmas));
diff --git a/mm/vmacache.c b/mm/vmacache.c
--- a/mm/vmacache.c
+++ b/mm/vmacache.c
@@ -7,6 +7,12 @@
 #include <linux/mm.h>
 #include <linux/vmacache.h>
 
+/*
+ * Hash based on the pmd of addr. Provides a good hit rate for workloads with
+ * spatial locality.
+ */
+#define VMACACHE_HASH(addr) ((addr >> PMD_SHIFT) & VMACACHE_MASK)
+
 /*
  * Flush vma caches for threads that share a given mm.
  *
@@ -87,6 +93,7 @@ static bool vmacache_valid(struct mm_struct *mm)
 
 struct vm_area_struct *vmacache_find(struct mm_struct *mm, unsigned long addr)
 {
+	int idx = VMACACHE_HASH(addr);
 	int i;
 
 	count_vm_vmacache_event(VMACACHE_FIND_CALLS);
@@ -95,16 +102,18 @@ struct vm_area_struct *vmacache_find(struct mm_struct *mm, unsigned long addr)
 		return NULL;
 
 	for (i = 0; i < VMACACHE_SIZE; i++) {
-		struct vm_area_struct *vma = current->vmacache.vmas[i];
-
-		if (!vma)
-			continue;
-		if (WARN_ON_ONCE(vma->vm_mm != mm))
-			break;
-		if (vma->vm_start <= addr && vma->vm_end > addr) {
-			count_vm_vmacache_event(VMACACHE_FIND_HITS);
-			return vma;
+		struct vm_area_struct *vma = current->vmacache.vmas[idx];
+
+		if (vma) {
+			if (WARN_ON_ONCE(vma->vm_mm != mm))
+				break;
+			if (vma->vm_start <= addr && vma->vm_end > addr) {
+				count_vm_vmacache_event(VMACACHE_FIND_HITS);
+				return vma;
+			}
 		}
+		if (++idx == VMACACHE_SIZE)
+			idx = 0;
 	}
 
 	return NULL;
@@ -115,6 +124,7 @@ struct vm_area_struct *vmacache_find_exact(struct mm_struct *mm,
 					   unsigned long start,
 					   unsigned long end)
 {
+	int idx = VMACACHE_HASH(start);
 	int i;
 
 	count_vm_vmacache_event(VMACACHE_FIND_CALLS);
@@ -123,12 +133,14 @@ struct vm_area_struct *vmacache_find_exact(struct mm_struct *mm,
 		return NULL;
 
 	for (i = 0; i < VMACACHE_SIZE; i++) {
-		struct vm_area_struct *vma = current->vmacache.vmas[i];
+		struct vm_area_struct *vma = current->vmacache.vmas[idx];
 
 		if (vma && vma->vm_start == start && vma->vm_end == end) {
 			count_vm_vmacache_event(VMACACHE_FIND_HITS);
 			return vma;
 		}
+		if (++idx == VMACACHE_SIZE)
+			idx = 0;
 	}
 
 	return NULL;
```
When perf profiling a wide variety of different workloads, it was found
that vmacache_find() had higher than expected cost: up to 0.08% of cpu
utilization in some cases. This was found to rival other core VM
functions such as alloc_pages_vma() with thp enabled and default
mempolicy, and the conditionals in __get_vma_policy().

VMACACHE_HASH() determines which of the four per-task_struct slots a vma
is cached for a particular address. This currently depends on the pfn,
so pfn 5212 occupies a different vmacache slot than its neighboring
pfn 5213.

vmacache_find() iterates through all four of current's vmacache slots
when looking up an address. Hashing based on pfn, an address has
~1/VMACACHE_SIZE chance of being cached in the first vmacache slot, or
about 25%, *if* the vma is cached.

This patch hashes an address by its pmd instead of pte to optimize for
workloads with good spatial locality. This results in a higher
probability of vmas being cached in the first slot that is checked:
normally ~70% on the same workloads instead of 25%.

Signed-off-by: David Rientjes <rientjes@google.com>

```
---
 include/linux/vmacache.h |  6 ------
 mm/vmacache.c            | 32 ++++++++++++++++++++++----------
 2 files changed, 22 insertions(+), 16 deletions(-)
```
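As a back-of-the-envelope illustration of the hashing change (not part of the patch itself), the following stand-alone program assumes x86_64 constants (PAGE_SHIFT = 12, PMD_SHIFT = 21) and VMACACHE_SIZE = 4, and prints the slot chosen by pfn-based versus pmd-based hashing for a run of adjacent pages.

```c
#include <stdio.h>

#define PAGE_SHIFT	12	/* assumed: x86_64 base page size */
#define PMD_SHIFT	21	/* assumed: x86_64, one pmd covers 2MB */
#define VMACACHE_SIZE	4
#define VMACACHE_MASK	(VMACACHE_SIZE - 1)

/* Old scheme: hash on the pfn, so adjacent pages land in different slots. */
static int hash_pfn(unsigned long addr)
{
	return (addr >> PAGE_SHIFT) & VMACACHE_MASK;
}

/* New scheme: hash on the pmd, so a 2MB region shares a single slot. */
static int hash_pmd(unsigned long addr)
{
	return (addr >> PMD_SHIFT) & VMACACHE_MASK;
}

int main(void)
{
	unsigned long base = 0x7f0000000000UL;	/* arbitrary mapped address */
	int i;

	for (i = 0; i < 8; i++) {
		unsigned long addr = base + i * 4096UL;

		printf("addr %#lx: pfn-hash slot %d, pmd-hash slot %d\n",
		       addr, hash_pfn(addr), hash_pmd(addr));
	}
	return 0;
}
```

With these assumed constants, the pfn hash cycles 0, 1, 2, 3, 0, ... across neighboring pages, while every address in the 2MB run lands in the same pmd-hash slot, which is the intuition behind the ~25% versus first-slot hit rates described in the changelog.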