Message ID | 20210718043034.76431-1-zhengqi.arch@bytedance.com (mailing list archive) |
---|---|
Series | Free user PTE page table pages |
On 18.07.21 06:30, Qi Zheng wrote:
> Hi,
>
> This patch series aims to free user PTE page table pages when all PTE entries
> are empty.
>
> The beginning of this story is that some malloc libraries (e.g. jemalloc or
> tcmalloc) usually allocate the amount of VAs by mmap() and do not unmap those VAs.
> They will use madvise(MADV_DONTNEED) to free physical memory if they want.
> But the page tables are not freed by madvise(), so it can produce many
> page tables when the process touches an enormous virtual address space.

... did you see that I am actually looking into this?

https://lkml.kernel.org/r/bae8b967-c206-819d-774c-f57b94c4b362@redhat.com

and have already spent a significant amount of time on it as part of my
research, which is *really* unfortunate and makes me quite frustrated at
the beginning of the week already ...

Ripping out page tables is quite difficult, as we have to stop all page
table walkers from touching them, including fast_gup, rmap and page
faults. This usually involves taking the mmap lock in write mode. My
approach does page table reclaim asynchronously from another thread and
does not rely on reference counts.
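To put the problem described in the quoted cover letter into numbers, here is a back-of-the-envelope sketch. It assumes x86-64 with 4 KiB pages and 8-byte PTEs, so one PTE page table page holds 512 entries and covers 2 MiB of virtual address space. The function name and the 1 TiB example are illustrative and do not come from the patch series; upper-level tables (PMD, PUD, ...) are ignored.

```python
PAGE_SIZE = 4096                           # assumption: 4 KiB base pages
PTE_SIZE = 8                               # assumption: 8-byte PTEs (x86-64)
PTES_PER_TABLE = PAGE_SIZE // PTE_SIZE     # 512 entries per PTE page
VA_PER_TABLE = PTES_PER_TABLE * PAGE_SIZE  # 2 MiB of VA per PTE page


def pte_table_bytes(touched_va_bytes: int) -> int:
    """Bytes held by PTE page table pages for a contiguous, 2 MiB-aligned,
    fully touched range of the given size (last level only)."""
    tables = -(-touched_va_bytes // VA_PER_TABLE)  # ceiling division
    return tables * PAGE_SIZE


if __name__ == "__main__":
    one_tib = 1 << 40
    # 2**40 / 2**21 = 2**19 PTE pages, each 4 KiB
    print(pte_table_bytes(one_tib) // (1 << 20), "MiB")  # prints "2048 MiB"
```

Under these assumptions, a process that touches 1 TiB and then releases the data pages with madvise(MADV_DONTNEED) would still hold about 2 GiB in PTE page table pages, which is the memory the series tries to give back.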
On 19.07.21 09:34, David Hildenbrand wrote:
> On 18.07.21 06:30, Qi Zheng wrote:
>> [...]
>
> Ripping out page tables is quite difficult, as we have to stop all page
> table walkers from touching them, including fast_gup, rmap and page
> faults. This usually involves taking the mmap lock in write mode. My
> approach does page table reclaim asynchronously from another thread and
> does not rely on reference counts.

FWIW, I had a quick peek and I like the simplistic approach using
reference counting, although it seems to come with a price. By hooking in
via pte_alloc_get_map_lock() instead of pte_alloc_map_lock(), we can
handle quite a few cases easily.

There are cases where we might immediately see a reuse after discarding
memory (especially with virtio-balloon free page reporting), in which
case it's suboptimal to discard immediately instead of waiting a bit to
see whether there is a reuse. However, the performance impact seems to be
comparatively small.

I do wonder if the 1% overhead you're seeing is actually because of
allocating/freeing or because of the reference count handling on some
hot paths.

I'm primarily looking into asynchronous reclaim, because it somewhat
makes sense to only reclaim (+ pay a cost) when there is really a need to
reclaim memory -- similar to our shrinker infrastructure.
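The reference-counting idea discussed above can be illustrated with a toy, single-threaded model. This is only a sketch of the general technique, not the code from the series: the classes `PteTable` and `Pmd` and their methods are hypothetical, and the locking that the real kernel needs against concurrent page table walkers is left out entirely. Each present entry holds one reference on its PTE page, each user holds a temporary one, and the page is freed when the count drops to zero.

```python
class PteTable:
    """Toy stand-in for one PTE page table page."""
    ENTRIES = 512

    def __init__(self):
        self.entries = {}   # slot index -> mapped page (any object)
        self.refcount = 0   # present entries + temporary users


class Pmd:
    """Toy PMD slot pointing at zero or one PteTable (hypothetical model)."""

    def __init__(self):
        self.table = None
        self.freed_tables = 0

    def get(self):
        """Allocate the table on demand and take a temporary reference."""
        if self.table is None:
            self.table = PteTable()
        self.table.refcount += 1
        return self.table

    def put(self):
        """Drop one reference; free the table when the last one is gone."""
        self.table.refcount -= 1
        if self.table.refcount == 0:
            self.table = None
            self.freed_tables += 1

    def map_page(self, index, page):
        table = self.get()              # temporary user reference
        if index not in table.entries:
            table.entries[index] = page
            table.refcount += 1         # each present entry pins the table
        self.put()                      # drop the temporary reference

    def zap_range(self, start, end):
        """Clear entries in [start, end), as MADV_DONTNEED would."""
        if self.table is None:
            return
        table = self.get()              # keeps the table alive during the loop
        for index in range(start, end):
            if index in table.entries:
                del table.entries[index]
                self.put()              # drop the entry's reference
        self.put()                      # may free the now-empty table


if __name__ == "__main__":
    pmd = Pmd()
    pmd.map_page(0, "page-a")
    pmd.map_page(1, "page-b")
    pmd.zap_range(0, PteTable.ENTRIES)
    print(pmd.table, pmd.freed_tables)
```

In this model the cost David asks about is visible directly: every map and zap touches `refcount`, even when the table is never freed.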
On Mon, Jul 19, 2021 at 7:28 PM David Hildenbrand <david@redhat.com> wrote:
>
> On 19.07.21 09:34, David Hildenbrand wrote:
> > [...]

Hi David,

> FWIW, I had a quick peek and I like the simplistic approach using
> reference counting, although it seems to come with a price. By hooking in
> via pte_alloc_get_map_lock() instead of pte_alloc_map_lock(), we can
> handle quite a few cases easily.

Totally agree.

> There are cases where we might immediately see a reuse after discarding
> memory (especially with virtio-balloon free page reporting), in which
> case it's suboptimal to discard immediately instead of waiting a bit to
> see whether there is a reuse. However, the performance impact seems to be
> comparatively small.
>
> I do wonder if the 1% overhead you're seeing is actually because of
> allocating/freeing or because of the reference count handling on some
> hot paths.

Qi Zheng has compared the results collected with the "perf top"
command. The LRU lock is more contended with this patchset applied.
I think the reason is that this patchset frees more pages (including
PTE page table pages). We don't see any overhead caused by reference
count handling.

Thanks,
Muchun

> I'm primarily looking into asynchronous reclaim, because it somewhat
> makes sense to only reclaim (+ pay a cost) when there is really a need to
> reclaim memory -- similar to our shrinker infrastructure.
>
> --
> Thanks,
>
> David / dhildenb
On Mon, Jul 19, 2021 at 8:42 PM Muchun Song <songmuchun@bytedance.com> wrote:
>
> On Mon, Jul 19, 2021 at 7:28 PM David Hildenbrand <david@redhat.com> wrote:
> > [...]
> > I do wonder if the 1% overhead you're seeing is actually because of
> > allocating/freeing or because of the reference count handling on some
> > hot paths.
>
> Qi Zheng has compared the results collected with the "perf top"
> command. The LRU lock is more contended with this patchset applied.
> I think the reason is that this patchset frees more pages (including
> PTE page table pages). We don't see any overhead caused by reference
> count handling.

Sorry for the confusion. I was wrong. PTE page table pages are not
added to the LRU list, so it should not be the LRU lock. We actually see
that _raw_spin_unlock_irqrestore is hotter than before. I guess it is
the zone lock.
On 7/19/21 7:28 PM, David Hildenbrand wrote:
> On 19.07.21 09:34, David Hildenbrand wrote:
>> [...]
>
> FWIW, I had a quick peek and I like the simplistic approach using
> reference counting, although it seems to come with a price. By hooking in
> via pte_alloc_get_map_lock() instead of pte_alloc_map_lock(), we can
> handle quite a few cases easily.
>
> There are cases where we might immediately see a reuse after discarding
> memory (especially with virtio-balloon free page reporting), in which
> case it's suboptimal to discard immediately instead of waiting a bit to
> see whether there is a reuse. However, the performance impact seems to be
> comparatively small.

Good point, maybe we can wait a bit in free_pte_table() in the added
optimization patch if the frequency of immediate reuse is high.

> I do wonder if the 1% overhead you're seeing is actually because of
> allocating/freeing or because of the reference count handling on some
> hot paths.
>
> I'm primarily looking into asynchronous reclaim, because it somewhat
> makes sense to only reclaim (+ pay a cost) when there is really a need to
> reclaim memory -- similar to our shrinker infrastructure.
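The "wait a bit before freeing" idea from the reply above could look roughly like the following toy model. It is an assumption-heavy sketch, not what free_pte_table() does in the series: the `DeferredFreer` class, its method names, and the tick-based grace period are all hypothetical. An empty table is parked instead of freed, a reuse within the grace period cancels the free, and anything still parked after the grace period is released.

```python
class DeferredFreer:
    """Toy grace-period policy for freeing empty page table pages."""

    def __init__(self, grace_ticks):
        self.grace_ticks = grace_ticks
        self.now = 0          # logical clock, advanced by tick()
        self.pending = {}     # table id -> tick at which it became empty
        self.freed = []       # table ids actually freed, in order
        self.reused = 0       # number of frees avoided by quick reuse

    def table_emptied(self, table_id):
        """Park an empty table instead of freeing it right away."""
        self.pending[table_id] = self.now

    def table_reused(self, table_id):
        """Cancel a parked free; return True if one was cancelled."""
        if table_id in self.pending:
            del self.pending[table_id]
            self.reused += 1
            return True
        return False

    def tick(self):
        """Advance time and free tables whose grace period has expired."""
        self.now += 1
        expired = [t for t, since in self.pending.items()
                   if self.now - since >= self.grace_ticks]
        for table_id in expired:
            del self.pending[table_id]
            self.freed.append(table_id)


if __name__ == "__main__":
    freer = DeferredFreer(grace_ticks=2)
    freer.table_emptied("A")
    freer.table_emptied("B")
    freer.tick()
    freer.table_reused("A")   # reuse within the grace period
    freer.tick()
    print(freer.freed, freer.reused)
```

The sketch always defers. Making the delay conditional on the observed reuse frequency, as the reply suggests, would need an extra counter or rate estimate that is not modeled here.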