Message ID | 20211120201230.920082-1-shakeelb@google.com (mailing list archive)
---|---
State | New
Series | mm: split thp synchronously on MADV_DONTNEED
On Sat, Nov 20, 2021 at 12:12:30PM -0800, Shakeel Butt wrote:
> This patch let such applications not worry about the low level handling
> of THPs in the kernel and splits the THPs synchronously on
> MADV_DONTNEED.

I don't mind the synchronous split, but I have concerns about the
implementation. I don't think it's worth another pointer in task_struct.
It's also the case that splitting is likely to succeed, so I think a
better implementation would try to split and then put the page on the
global deferred list if splitting fails.
On Sat, Nov 20, 2021 at 8:35 PM Matthew Wilcox <willy@infradead.org> wrote:
>
> On Sat, Nov 20, 2021 at 12:12:30PM -0800, Shakeel Butt wrote:
> > This patch let such applications not worry about the low level handling
> > of THPs in the kernel and splits the THPs synchronously on
> > MADV_DONTNEED.
>
> I don't mind the synchronous split, but I have concerns about the
> implementation. I don't think it's worth another pointer in task_struct.

Are you concerned about the size of task_struct? At least on my config
this additional pointer was just filling the holes and not increasing
the size. I can check a couple other configs as well.

> It's also the case that splitting is likely to succeed, so I think a
> better implementation would try to split and then put the page on the
> global deferred list if splitting fails.

Actually this is what this patch is doing. See the second loop in
split_local_deferred_list().
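The try-then-putback shape Shakeel points to can be sketched outside the kernel with a toy list. This is an illustrative userspace model only, not the kernel code: `try_split()` stands in for split_huge_page(), and `global_list` models the global deferred queue.

```c
/*
 * Illustrative userspace model (NOT kernel code) of the two-pass shape in
 * split_local_deferred_list(): a first pass tries to split each entry and
 * unlinks it on success; a second pass moves whatever failed back onto
 * the global deferred list.
 */
#include <assert.h>
#include <stdbool.h>
#include <stddef.h>

struct node {
	struct node *next;
	int id;
	bool splittable;	/* pretend result of split_huge_page() */
};

static struct node *global_list;	/* models the global deferred list */

static bool try_split(struct node *n)
{
	return n->splittable;
}

/* Returns the number of entries successfully split. */
static int process_local_list(struct node **local)
{
	struct node **pp = local;
	int split = 0;

	/* First pass: split what we can, unlinking successes. */
	while (*pp) {
		if (try_split(*pp)) {
			*pp = (*pp)->next;	/* drop from local list */
			split++;
		} else {
			pp = &(*pp)->next;	/* keep failure for pass two */
		}
	}

	/* Second pass: put the remaining failures back on the global list. */
	while (*local) {
		struct node *n = *local;

		*local = n->next;
		n->next = global_list;
		global_list = n;
	}
	return split;
}
```

The second pass mirrors the patch's second loop: only pages whose split failed survive the first pass, so everything still on the local list belongs back on the global queue.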
On Sat, Nov 20, 2021 at 12:12:30PM -0800, Shakeel Butt wrote:
> Many applications do sophisticated management of their heap memory for
> better performance but with low cost. We have a bunch of such
> applications running on our production and examples include caching and
> data storage services. These applications keep their hot data on the
> THPs for better performance and release the cold data through
> MADV_DONTNEED to keep the memory cost low.
>
> The kernel defers the split and release of THPs until there is memory
> pressure. This causes complicates the memory management of these
> sophisticated applications which then needs to look into low level
> kernel handling of THPs to better gauge their headroom for expansion. In
> addition these applications are very latency sensitive and would prefer
> to not face memory reclaim due to non-deterministic nature of reclaim.
>
> This patch let such applications not worry about the low level handling
> of THPs in the kernel and splits the THPs synchronously on
> MADV_DONTNEED.

Have you considered the impact on short-living tasks, where paying the
splitting tax would hurt performance without any benefit? Maybe a
separate madvise operation is needed? I donno.
> Signed-off-by: Shakeel Butt <shakeelb@google.com>
> ---
>  include/linux/mmzone.h   |  5 ++++
>  include/linux/sched.h    |  4 ++++
>  include/linux/sched/mm.h | 11 +++++++++
>  kernel/fork.c            |  3 +++
>  mm/huge_memory.c         | 50 ++++++++++++++++++++++++++++++++++++++++
>  mm/madvise.c             |  8 +++++++
>  6 files changed, 81 insertions(+)
>
> diff --git a/include/linux/mmzone.h b/include/linux/mmzone.h
> index 58e744b78c2c..7fa0035128b9 100644
> --- a/include/linux/mmzone.h
> +++ b/include/linux/mmzone.h
> @@ -795,6 +795,11 @@ struct deferred_split {
>  	struct list_head split_queue;
>  	unsigned long split_queue_len;
>  };
> +void split_local_deferred_list(struct list_head *defer_list);
> +#else
> +static inline void split_local_deferred_list(struct list_head *defer_list)
> +{
> +}
>  #endif
>
>  /*
> diff --git a/include/linux/sched.h b/include/linux/sched.h
> index 9d27fd0ce5df..a984bb6509d9 100644
> --- a/include/linux/sched.h
> +++ b/include/linux/sched.h
> @@ -1412,6 +1412,10 @@ struct task_struct {
>  	struct mem_cgroup *active_memcg;
>  #endif
>
> +#ifdef CONFIG_TRANSPARENT_HUGEPAGE
> +	struct list_head *deferred_split_list;
> +#endif
> +
>  #ifdef CONFIG_BLK_CGROUP
>  	struct request_queue *throttle_queue;
>  #endif

It looks dirty. We really don't have options to pass it down?

Maybe pass down the list via zap_details and call a new rmap remove
helper if the list is present?

> +void split_local_deferred_list(struct list_head *defer_list)
> +{
> +	struct list_head *pos, *next;
> +	struct page *page;
> +
> +	/* First iteration for split. */
> +	list_for_each_safe(pos, next, defer_list) {
> +		page = list_entry((void *)pos, struct page, deferred_list);
> +		page = compound_head(page);
> +
> +		if (!trylock_page(page))
> +			continue;
> +
> +		if (split_huge_page(page)) {
> +			unlock_page(page);
> +			continue;
> +		}
> +		/* split_huge_page() removes page from list on success */
> +		unlock_page(page);
> +
> +		/* corresponding get in deferred_split_huge_page. */
> +		put_page(page);
> +	}
> +
> +	/* Second iteration to putback failed pages. */
> +	list_for_each_safe(pos, next, defer_list) {
> +		struct deferred_split *ds_queue;
> +		unsigned long flags;
> +
> +		page = list_entry((void *)pos, struct page, deferred_list);
> +		page = compound_head(page);
> +		ds_queue = get_deferred_split_queue(page);
> +
> +		spin_lock_irqsave(&ds_queue->split_queue_lock, flags);
> +		list_move(page_deferred_list(page), &ds_queue->split_queue);
> +		spin_unlock_irqrestore(&ds_queue->split_queue_lock, flags);
> +
> +		/* corresponding get in deferred_split_huge_page. */
> +		put_page(page);
> +	}
> +}

Looks like a lot of copy-paste from deferred_split_scan(). Can we get
them consolidated?
On Sun, Nov 21, 2021 at 4:50 PM Kirill A. Shutemov <kirill@shutemov.name> wrote:
>
[...]
>
> Have you considered the impact on short-living tasks, where paying the
> splitting tax would hurt performance without any benefit? Maybe a
> separate madvise operation is needed? I donno.
>

Do you have a concrete example of such short-living applications doing
MADV_DONTNEED? I can try to get some numbers to measure the impact.

Regarding the new advice option, I did give some thought to it but
decided against it based on the reason that we should not be exposing
some low level kernel implementation detail to users through a stable
API.

[...]
> > diff --git a/include/linux/sched.h b/include/linux/sched.h
> > index 9d27fd0ce5df..a984bb6509d9 100644
> > --- a/include/linux/sched.h
> > +++ b/include/linux/sched.h
> > @@ -1412,6 +1412,10 @@ struct task_struct {
> >  	struct mem_cgroup *active_memcg;
> >  #endif
> >
> > +#ifdef CONFIG_TRANSPARENT_HUGEPAGE
> > +	struct list_head *deferred_split_list;
> > +#endif
> > +
> >  #ifdef CONFIG_BLK_CGROUP
> >  	struct request_queue *throttle_queue;
> >  #endif
>
> It looks dirty. We really don't have options to pass it down?
>
> Maybe pass down the list via zap_details and call a new rmap remove
> helper if the list is present?
>

We already have precedent for using this technique for other cases but
let me take a stab at passing the list through zap_details and see how
that looks.

> > +void split_local_deferred_list(struct list_head *defer_list)
[...]
> Looks like a lot of copy-paste from deferred_split_scan(). Can we get
> them consolidated?

I will see what I can do.

Thanks for the review.
Shakeel
On Sat, Nov 20, 2021 at 12:12:30PM -0800, Shakeel Butt wrote:
> Many applications do sophisticated management of their heap memory for
> better performance but with low cost. We have a bunch of such
> applications running on our production and examples include caching and
> data storage services. These applications keep their hot data on the
> THPs for better performance and release the cold data through
> MADV_DONTNEED to keep the memory cost low.
>
> The kernel defers the split and release of THPs until there is memory
> pressure. This causes complicates the memory management of these
> sophisticated applications which then needs to look into low level
> kernel handling of THPs to better gauge their headroom for expansion. In
> addition these applications are very latency sensitive and would prefer
> to not face memory reclaim due to non-deterministic nature of reclaim.
>
> This patch let such applications not worry about the low level handling
> of THPs in the kernel and splits the THPs synchronously on
> MADV_DONTNEED.

I've been wondering about whether this is really the right strategy
(and this goes wider than just this one new case).

We chose to use a 2MB page here, based on whatever heuristics are
currently in play. Now userspace is telling us we were wrong and should
have used smaller pages.

2MB pages are precious, and we currently have one. Surely it is better
to migrate the still-valid contents of this 2MB page to smaller pages,
and then free the 2MB page as a single unit, than it is to fragment this
2MB page into smaller chunks and keep using some of it, virtually
guaranteeing this particular 2MB page can't be reassembled without
significant work?
On 20.11.21 21:12, Shakeel Butt wrote:
> Many applications do sophisticated management of their heap memory for
> better performance but with low cost. We have a bunch of such
> applications running on our production and examples include caching and
> data storage services. These applications keep their hot data on the
> THPs for better performance and release the cold data through
> MADV_DONTNEED to keep the memory cost low.
>
> The kernel defers the split and release of THPs until there is memory
> pressure. This causes complicates the memory management of these
> sophisticated applications which then needs to look into low level
> kernel handling of THPs to better gauge their headroom for expansion.

Can you elaborate a bit on that point? What exactly does such an
application do? I would have assumed that it's mostly transparent for
applications.

> In
> addition these applications are very latency sensitive and would prefer
> to not face memory reclaim due to non-deterministic nature of reclaim.

That makes sense.

>
> This patch let such applications not worry about the low level handling
> of THPs in the kernel and splits the THPs synchronously on
> MADV_DONTNEED.

The main user I'm concerned about is virtio-balloon, which ends up
discarding VM memory via MADV_DONTNEED when inflating the balloon in the
guest in 4k granularity, but also during "free page reporting"
continuously when e.g., a 2MiB page becomes free in the guest. We want
both activities to be fast, and especially during "free page reporting",
to defer any heavy work.

Do we have a performance evaluation how much overhead is added e.g., for
a single 4k MADV_DONTNEED call on a THP or on a MADV_DONTNEED call that
covers the whole THP?
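A userspace sketch of the first measurement asked for here (timing a single 4 KiB MADV_DONTNEED inside a 2 MiB anonymous region) might look like the following. Whether the region is actually THP-backed depends on the running kernel's THP configuration, so treat any resulting number as indicative only:

```c
/*
 * Hedged micro-benchmark sketch: time one 4 KiB MADV_DONTNEED inside a
 * 2 MiB anonymous region that we *hope* is THP-backed (best-effort via
 * MADV_HUGEPAGE; depends on kernel THP settings).
 */
#define _GNU_SOURCE
#include <assert.h>
#include <stdint.h>
#include <string.h>
#include <sys/mman.h>
#include <time.h>

#define THP_SIZE (2UL << 20)

/* Returns elapsed nanoseconds for the madvise() call, or -1 on error. */
static long time_partial_dontneed(void)
{
	struct timespec t0, t1;
	size_t len = 2 * THP_SIZE;	/* over-allocate to align by hand */
	char *raw, *region;
	int rc;

	raw = mmap(NULL, len, PROT_READ | PROT_WRITE,
		   MAP_PRIVATE | MAP_ANONYMOUS, -1, 0);
	if (raw == MAP_FAILED)
		return -1;
	/* Carve out a 2 MiB-aligned window inside the mapping. */
	region = (char *)(((uintptr_t)raw + THP_SIZE - 1) & ~(THP_SIZE - 1));

	madvise(region, THP_SIZE, MADV_HUGEPAGE);	/* best effort */
	memset(region, 0xab, THP_SIZE);			/* fault everything in */

	clock_gettime(CLOCK_MONOTONIC, &t0);
	rc = madvise(region, 4096, MADV_DONTNEED);	/* zap one subpage */
	clock_gettime(CLOCK_MONOTONIC, &t1);

	munmap(raw, len);
	if (rc != 0)
		return -1;
	return (t1.tv_sec - t0.tv_sec) * 1000000000L +
	       (t1.tv_nsec - t0.tv_nsec);
}
```

Timing the whole-THP case is the same sketch with the madvise() length changed from 4096 to THP_SIZE.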
On 22.11.21 05:56, Matthew Wilcox wrote:
> On Sat, Nov 20, 2021 at 12:12:30PM -0800, Shakeel Butt wrote:
>> Many applications do sophisticated management of their heap memory for
>> better performance but with low cost. We have a bunch of such
>> applications running on our production and examples include caching and
>> data storage services. These applications keep their hot data on the
>> THPs for better performance and release the cold data through
>> MADV_DONTNEED to keep the memory cost low.
>>
>> The kernel defers the split and release of THPs until there is memory
>> pressure. This causes complicates the memory management of these
>> sophisticated applications which then needs to look into low level
>> kernel handling of THPs to better gauge their headroom for expansion. In
>> addition these applications are very latency sensitive and would prefer
>> to not face memory reclaim due to non-deterministic nature of reclaim.
>>
>> This patch let such applications not worry about the low level handling
>> of THPs in the kernel and splits the THPs synchronously on
>> MADV_DONTNEED.
>
> I've been wondering about whether this is really the right strategy
> (and this goes wider than just this one new case).
>
> We chose to use a 2MB page here, based on whatever heuristics are
> currently in play. Now userspace is telling us we were wrong and should
> have used smaller pages.

IIUC, not necessarily, unfortunately. User space might be discarding the
whole 2MB via a single call (MADV_DONTNEED on a 2MB range, as done by
virtio-balloon with "free page reporting" or by virtio-mem in QEMU). In
that case, there is nothing to migrate and we were not doing anything
wrong.

But more extreme, user space might be discarding the whole THP in small
pieces shortly over time. This for example happens when a VM inflates
the memory balloon via virtio-balloon. All inflation requests are 4k,
resulting in 4k MADV_DONTNEED calls.

If we end up inflating a THP range inside of the VM, mapping to a THP
range inside the hypervisor, we'll essentially free a THP in the
hypervisor piece by piece using individual MADV_DONTNEED calls -- this
happens frequently. Something similar can happen when de-fragmentation
inside the VM "moves around" inflated 4k pages piece by piece to
essentially form a huge inflated range -- this happens less frequently
as of now. In both cases, migration is counter-productive, as we're just
about to free the whole range either way.

(yes, there are ways to optimize, for example using hugepage ballooning
or merging MADV_DONTNEED calls in the hypervisor, but what I described
is what we currently implement in hypervisors like QEMU, because there
are corner cases for everything)

Long story short: it's hard to tell what will happen next based on a
single MADV_DONTNEED call. Page compaction, in comparison, doesn't care
and optimizes the layout as it observes it.
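The inflation pattern described here -- a 2 MiB region freed in 512 separate 4 KiB MADV_DONTNEED calls, the way 4k balloon inflation requests reach the hypervisor -- is easy to reproduce from user space. A minimal sketch (illustrative only; THP backing again depends on kernel settings):

```c
/*
 * Sketch of piecewise zapping: fault in a 2 MiB region, then free it one
 * 4 KiB page at a time with individual MADV_DONTNEED calls, mimicking 4k
 * balloon inflation requests arriving at the hypervisor.
 */
#define _GNU_SOURCE
#include <assert.h>
#include <stdint.h>
#include <string.h>
#include <sys/mman.h>

#define THP_SIZE	(2UL << 20)
#define SUBPAGE_SIZE	4096UL

/* Returns the number of successful 4 KiB MADV_DONTNEED calls, -1 on error. */
static int zap_piecewise(void)
{
	size_t len = 2 * THP_SIZE;	/* over-allocate to align by hand */
	char *raw, *region;
	size_t off;
	int zapped = 0;

	raw = mmap(NULL, len, PROT_READ | PROT_WRITE,
		   MAP_PRIVATE | MAP_ANONYMOUS, -1, 0);
	if (raw == MAP_FAILED)
		return -1;
	region = (char *)(((uintptr_t)raw + THP_SIZE - 1) & ~(THP_SIZE - 1));

	madvise(region, THP_SIZE, MADV_HUGEPAGE);	/* best effort */
	memset(region, 1, THP_SIZE);			/* fault everything in */

	/* Free the 2 MiB range one 4 KiB page at a time. */
	for (off = 0; off < THP_SIZE; off += SUBPAGE_SIZE)
		if (madvise(region + off, SUBPAGE_SIZE, MADV_DONTNEED) == 0)
			zapped++;

	munmap(raw, len);
	return zapped;
}
```

With deferred splitting, each of these 512 calls is a cheap zap; with synchronous splitting, one of them (the last partial unmap before the range is fully covered, depending on implementation) may pay the split cost.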
On Mon, Nov 22, 2021 at 12:32 AM David Hildenbrand <david@redhat.com> wrote:
>
> On 20.11.21 21:12, Shakeel Butt wrote:
> > Many applications do sophisticated management of their heap memory for
> > better performance but with low cost. We have a bunch of such
> > applications running on our production and examples include caching and
> > data storage services. These applications keep their hot data on the
> > THPs for better performance and release the cold data through
> > MADV_DONTNEED to keep the memory cost low.
> >
> > The kernel defers the split and release of THPs until there is memory
> > pressure. This causes complicates the memory management of these
> > sophisticated applications which then needs to look into low level
> > kernel handling of THPs to better gauge their headroom for expansion.
>
> Can you elaborate a bit on that point? What exactly does such an
> application do? I would have assumed that it's mostly transparent for
> applications.
>

The application monitors its cgroup usage to decide if it can expand the
memory footprint or release some (unneeded/cold) buffer. It uses
madvise(MADV_DONTNEED) to release the memory, which basically puts the
THP onto the deferred list. These deferred THPs are still charged to the
cgroup, which leads to bloated usage read by the application and wrong
decisions. Internally we added a cgroup interface to trigger the split
of deferred THPs for that cgroup, but this is hacky and exposes kernel
internals to users. I want to solve this problem in a more general way
for the users.

> > In
> > addition these applications are very latency sensitive and would prefer
> > to not face memory reclaim due to non-deterministic nature of reclaim.
>
> That makes sense.
>
> >
> > This patch let such applications not worry about the low level handling
> > of THPs in the kernel and splits the THPs synchronously on
> > MADV_DONTNEED.
>
> The main user I'm concerned about is virtio-balloon, which ends up
> discarding VM memory via MADV_DONTNEED when inflating the balloon in the
> guest in 4k granularity, but also during "free page reporting"
> continuously when e.g., a 2MiB page becomes free in the guest. We want
> both activities to be fast, and especially during "free page reporting",
> to defer any heavy work.

Thanks for the info. What is the source virtio-balloon uses for free
pages?

>
> Do we have a performance evaluation how much overhead is added e.g., for
> a single 4k MADV_DONTNEED call on a THP or on a MADV_DONTNEED call that
> covers the whole THP?

I did a simple benchmark of madvise(MADV_DONTNEED) on 10000 THPs on x86
for both settings you suggested. I don't see any statistically
significant difference with and without the patch. Let me know if you
want me to try something else.

Thanks for the review.
Shakeel
On 22.11.21 19:40, Shakeel Butt wrote:
> On Mon, Nov 22, 2021 at 12:32 AM David Hildenbrand <david@redhat.com> wrote:
>>
>> On 20.11.21 21:12, Shakeel Butt wrote:
>>> Many applications do sophisticated management of their heap memory for
>>> better performance but with low cost. We have a bunch of such
>>> applications running on our production and examples include caching and
>>> data storage services. These applications keep their hot data on the
>>> THPs for better performance and release the cold data through
>>> MADV_DONTNEED to keep the memory cost low.
>>>
>>> The kernel defers the split and release of THPs until there is memory
>>> pressure. This causes complicates the memory management of these
>>> sophisticated applications which then needs to look into low level
>>> kernel handling of THPs to better gauge their headroom for expansion.
>>
>> Can you elaborate a bit on that point? What exactly does such an
>> application do? I would have assumed that it's mostly transparent for
>> applications.
>>
>
> The application monitors its cgroup usage to decide if it can expand the
> memory footprint or release some (unneeded/cold) buffer. It uses
> madvise(MADV_DONTNEED) to release the memory, which basically puts the
> THP onto the deferred list. These deferred THPs are still charged to the
> cgroup, which leads to bloated usage read by the application and wrong
> decisions. Internally we added a cgroup interface to trigger the split
> of deferred THPs for that cgroup, but this is hacky and exposes kernel
> internals to users. I want to solve this problem in a more general way
> for the users.

Thanks for the details, that makes sense to me. It's essentially like
another kernel buffer charged to the process, only reclaimed on memory
reclaim.

(can we add that to the patch description?)

>
>>> In
>>> addition these applications are very latency sensitive and would prefer
>>> to not face memory reclaim due to non-deterministic nature of reclaim.
>>
>> That makes sense.
>>
>>>
>>> This patch let such applications not worry about the low level handling
>>> of THPs in the kernel and splits the THPs synchronously on
>>> MADV_DONTNEED.
>>
>> The main user I'm concerned about is virtio-balloon, which ends up
>> discarding VM memory via MADV_DONTNEED when inflating the balloon in the
>> guest in 4k granularity, but also during "free page reporting"
>> continuously when e.g., a 2MiB page becomes free in the guest. We want
>> both activities to be fast, and especially during "free page reporting",
>> to defer any heavy work.
>
> Thanks for the info. What is the source virtio-balloon uses for free
> pages?

So inside the guest (drivers/virtio/virtio_balloon.c), we can see:

1. Balloon inflation. We essentially allocate a 4k page and send the PFN
   to the hypervisor. The hypervisor will MADV_DONTNEED that memory.

2. Free page reporting. We pull some free higher-order (e.g., 2 MB on
   x86-64) pages from the page allocator and report the PFNs to the
   hypervisor. The hypervisor will MADV_DONTNEED the regions. Once
   reported, we put the free pages back on the free page list in the
   page allocator, from where they can be re-allocated immediately.

On some archs (especially aarch64), we can see free page reporting
report sub-THP granularity. But usually we're dealing with THP
granularity.

>
>>
>> Do we have a performance evaluation how much overhead is added e.g., for
>> a single 4k MADV_DONTNEED call on a THP or on a MADV_DONTNEED call that
>> covers the whole THP?
>
> I did a simple benchmark of madvise(MADV_DONTNEED) on 10000 THPs on x86
> for both settings you suggested. I don't see any statistically
> significant difference with and without the patch. Let me know if you
> want me to try something else.

Awesome, thanks for benchmarking. I did not check, but I assume on
re-access, we won't actually re-use pages from the underlying, partially
unmapped, THP, correct?

So after MADV_DONTNEED, the zapped sub-pages are essentially lost until
reclaimed by splitting the THP? If they could get reused, there would be
value in the deferred split when partially unmapping a THP.

I do wonder which purpose the deferred split serves nowadays at all.
Fortunately, there is documentation: Documentation/vm/transhuge.rst:

"
Unmapping part of THP (with munmap() or other way) is not going to free
memory immediately. Instead, we detect that a subpage of THP is not in
use in page_remove_rmap() and queue the THP for splitting if memory
pressure comes. Splitting will free up unused subpages.

Splitting the page right away is not an option due to locking context in
the place where we can detect partial unmap. It also might be
counterproductive since in many cases partial unmap happens during
exit(2) if a THP crosses a VMA boundary.

The function deferred_split_huge_page() is used to queue a page for
splitting. The splitting itself will happen when we get memory pressure
via shrinker interface.
"

I do wonder which these locking contexts are exactly, and if we could
also do the same thing on ordinary munmap -- because I assume it can be
similarly problematic for some applications. The "exit()" case might
indeed be interesting, but I really do wonder if this is even observable
in actual numbers: I'm not so sure about the "many cases" but I might be
wrong, of course.
On Mon, Nov 22, 2021 at 10:59 AM David Hildenbrand <david@redhat.com> wrote:
>
[...]
>
> Thanks for the details, that makes sense to me. It's essentially like
> another kernel buffer charged to the process, only reclaimed on memory
> reclaim.
>
> (can we add that to the patch description?)
>

Sure.

[...]
> >
> > I did a simple benchmark of madvise(MADV_DONTNEED) on 10000 THPs on
> > x86 for both settings you suggested. I don't see any statistically
> > significant difference with and without the patch. Let me know if you
> > want me to try something else.
>
> Awesome, thanks for benchmarking. I did not check, but I assume on
> re-access, we won't actually re-use pages from the underlying, partially
> unmapped, THP, correct?

Correct.

> So after MADV_DONTNEED, the zapped sub-pages are
> essentially lost until reclaimed by splitting the THP?

Yes.

> If they could get
> reused, there would be value in the deferred split when partially
> unmapping a THP.
>
> I do wonder which purpose the deferred split serves nowadays at all.
> Fortunately, there is documentation: Documentation/vm/transhuge.rst:
>
> "
> Unmapping part of THP (with munmap() or other way) is not going to free
> memory immediately. Instead, we detect that a subpage of THP is not in
> use in page_remove_rmap() and queue the THP for splitting if memory
> pressure comes. Splitting will free up unused subpages.
>
> Splitting the page right away is not an option due to locking context in
> the place where we can detect partial unmap. It also might be
> counterproductive since in many cases partial unmap happens during
> exit(2) if a THP crosses a VMA boundary.
>
> The function deferred_split_huge_page() is used to queue a page for
> splitting. The splitting itself will happen when we get memory pressure
> via shrinker interface.
> "
>
> I do wonder which these locking contexts are exactly, and if we could
> also do the same thing on ordinary munmap -- because I assume it can be
> similarly problematic for some applications.

This is a good question regarding munmap. One main difference is munmap
takes mmap_lock in write mode and usually performance critical
applications avoid such operations.

> The "exit()" case might
> indeed be interesting, but I really do wonder if this is even observable
> in actual numbers: I'm not so sure about the "many cases" but I might be
> wrong, of course.

I am not worried about the exit(). The whole THP will get freed and be
removed from the deferred list as well. Note that the deferred list does
not hold a reference to the THP and has a hook in the THP destructor.

thanks,
Shakeel
>> I do wonder which purpose the deferred split serves nowadays at all.
>> Fortunately, there is documentation: Documentation/vm/transhuge.rst:
>>
>> "
>> Unmapping part of THP (with munmap() or other way) is not going to free
>> memory immediately. Instead, we detect that a subpage of THP is not in
>> use in page_remove_rmap() and queue the THP for splitting if memory
>> pressure comes. Splitting will free up unused subpages.
>>
>> Splitting the page right away is not an option due to locking context in
>> the place where we can detect partial unmap. It also might be
>> counterproductive since in many cases partial unmap happens during
>> exit(2) if a THP crosses a VMA boundary.
>>
>> The function deferred_split_huge_page() is used to queue a page for
>> splitting. The splitting itself will happen when we get memory pressure
>> via shrinker interface.
>> "
>>
>> I do wonder which these locking contexts are exactly, and if we could
>> also do the same thing on ordinary munmap -- because I assume it can be
>> similarly problematic for some applications.
>
> This is a good question regarding munmap. One main difference is munmap
> takes mmap_lock in write mode and usually performance critical
> applications avoid such operations.

Maybe we can extend it to most page zapping, if that makes things
simpler.

>
>> The "exit()" case might
>> indeed be interesting, but I really do wonder if this is even observable
>> in actual numbers: I'm not so sure about the "many cases" but I might be
>> wrong, of course.
>
> I am not worried about the exit(). The whole THP will get freed and be
> removed from the deferred list as well. Note that the deferred list does
> not hold a reference to the THP and has a hook in the THP destructor.

Yes, you're right. We'll run into the destructor either way.
On Tue, Nov 23, 2021 at 8:57 AM David Hildenbrand <david@redhat.com> wrote:
>
[...]
> > > I do wonder which these locking contexts are exactly, and if we could
> > > also do the same thing on ordinary munmap -- because I assume it can be
> > > similarly problematic for some applications.
> >
> > This is a good question regarding munmap. One main difference is munmap
> > takes mmap_lock in write mode and usually performance critical
> > applications avoid such operations.
>
> Maybe we can extend it to most page zapping, if that makes things
> simpler.
>

Do you mean doing sync THP split for most of the page zapping functions
(but only if that makes things simpler)?
On 23.11.21 18:17, Shakeel Butt wrote:
> On Tue, Nov 23, 2021 at 8:57 AM David Hildenbrand <david@redhat.com> wrote:
>>
> [...]
>>>>
>>>> I do wonder which these locking contexts are exactly, and if we could
>>>> also do the same thing on ordinary munmap -- because I assume it can be
>>>> similarly problematic for some applications.
>>>
>>> This is a good question regarding munmap. One main difference is munmap
>>> takes mmap_lock in write mode and usually performance critical
>>> applications avoid such operations.
>>
>> Maybe we can extend it to most page zapping, if that makes things
>> simpler.
>>
>
> Do you mean doing sync THP split for most of the page zapping functions
> (but only if that makes things simpler)?
>

Yes -- if there are no downsides.
On Tue, Nov 23, 2021 at 9:20 AM David Hildenbrand <david@redhat.com> wrote:
>
> On 23.11.21 18:17, Shakeel Butt wrote:
> > On Tue, Nov 23, 2021 at 8:57 AM David Hildenbrand <david@redhat.com> wrote:
> >>
> > [...]
> >>>>
> >>>> I do wonder which these locking contexts are exactly, and if we could
> >>>> also do the same thing on ordinary munmap -- because I assume it can be
> >>>> similarly problematic for some applications.
> >>>
> >>> This is a good question regarding munmap. One main difference is munmap
> >>> takes mmap_lock in write mode and usually performance critical
> >>> applications avoid such operations.
> >>
> >> Maybe we can extend it to most page zapping, if that makes things
> >> simpler.
> >>
> >
> > Do you mean doing sync THP split for most of the page zapping functions
> > (but only if that makes things simpler)?
> >
>
> Yes -- if there are no downsides.
>

I will try. At the moment the assumption of "Not null zap_details
implies leave swap entries" is giving me a headache.

Thanks for the suggestions and your time.
On 23.11.21 18:24, Shakeel Butt wrote:
> On Tue, Nov 23, 2021 at 9:20 AM David Hildenbrand <david@redhat.com> wrote:
>>
>> On 23.11.21 18:17, Shakeel Butt wrote:
>>> On Tue, Nov 23, 2021 at 8:57 AM David Hildenbrand <david@redhat.com> wrote:
>>>>
>>> [...]
>>>>>>
>>>>>> I do wonder which these locking contexts are exactly, and if we could
>>>>>> also do the same thing on ordinary munmap -- because I assume it can be
>>>>>> similarly problematic for some applications.
>>>>>
>>>>> This is a good question regarding munmap. One main difference is munmap
>>>>> takes mmap_lock in write mode and usually performance critical
>>>>> applications avoid such operations.
>>>>
>>>> Maybe we can extend it to most page zapping, if that makes things
>>>> simpler.
>>>>
>>>
>>> Do you mean doing sync THP split for most of the page zapping functions
>>> (but only if that makes things simpler)?
>>>
>>
>> Yes -- if there are no downsides.
>>
>
> I will try. At the moment the assumption of "Not null zap_details
> implies leave swap entries" is giving me a headache.

Not only you, did you stumble over

https://lkml.kernel.org/r/20211115134951.85286-1-peterx@redhat.com

already?
On Tue, Nov 23, 2021 at 9:26 AM David Hildenbrand <david@redhat.com> wrote:
>
> On 23.11.21 18:24, Shakeel Butt wrote:
> > On Tue, Nov 23, 2021 at 9:20 AM David Hildenbrand <david@redhat.com> wrote:
> >>
> >> On 23.11.21 18:17, Shakeel Butt wrote:
> >>> On Tue, Nov 23, 2021 at 8:57 AM David Hildenbrand <david@redhat.com> wrote:
> >>>>
> >>> [...]
> >>>>>>
> >>>>>> I do wonder which these locking contexts are exactly, and if we could
> >>>>>> also do the same thing on ordinary munmap -- because I assume it can be
> >>>>>> similarly problematic for some applications.
> >>>>>
> >>>>> This is a good question regarding munmap. One main difference is munmap
> >>>>> takes mmap_lock in write mode and usually performance critical
> >>>>> applications avoid such operations.
> >>>>
> >>>> Maybe we can extend it to most page zapping, if that makes things
> >>>> simpler.
> >>>>
> >>>
> >>> Do you mean doing sync THP split for most of the page zapping functions
> >>> (but only if that makes things simpler)?
> >>>
> >>
> >> Yes -- if there are no downsides.
> >>
> >
> > I will try. At the moment the assumption of "Not null zap_details
> > implies leave swap entries" is giving me a headache.
>
> Not only you, did you stumble over
>
> https://lkml.kernel.org/r/20211115134951.85286-1-peterx@redhat.com
>
> already?
>

Oh thanks for the pointer. I missed that. I will take a look. Thanks again.
On Tue, Nov 23, 2021 at 09:28:34AM -0800, Shakeel Butt wrote:
> On Tue, Nov 23, 2021 at 9:26 AM David Hildenbrand <david@redhat.com> wrote:
> >
[...]
> > Not only you, did you stumble over
> >
> > https://lkml.kernel.org/r/20211115134951.85286-1-peterx@redhat.com

(thanks for raising this, David)

> >
> > already?
> >
>
> Oh thanks for the pointer. I missed that. I will take a look. Thanks again.

Hi, Shakeel,

I saw your v2 has started to pass in zap_details, so we need to know the
side effect on that skip-swap-entry thing, because with your v2 code
munmap() will start to skip swap entries too (while it did not before).
I saw that you didn't mention this in the v2 patch, either in the commit
message or in the code; I'm not sure whether you dug that up.

I think it needs some double checking (or feel free to start this
digging by reviewing my small patch above :).

Thanks,
On Mon, Nov 22, 2021 at 10:40:54AM -0800, Shakeel Butt wrote: > > Do we have a performance evaluation how much overhead is added e.g., for > > a single 4k MADV_DONTNEED call on a THP or on a MADV_DONTNEED call that > > covers the whole THP? > > I did a simple benchmark of madvise(MADV_DONTNEED) on 10000 THPs on > x86 for both settings you suggested. I don't see any statistically > significant difference with and without the patch. Let me know if you > want me to try something else. I'm a bit surprised that sync split thp didn't bring any extra overhead. "unmap whole thp" is understandable from that pov, because afaict that won't even trigger any thp split anyway even delayed, if this is the simplest case that only this process mapped this thp, and it mapped once. For "unmap 4k upon thp" IIUC that's the worst case and zapping 4k should be fast; while what I don't understand since thp split requires all hand-made work for copying thp flags into small pages and so on, so I thought there should at least be some overhead measured. Shakeel, could there be something overlooked in the test, or maybe it's me that overlooked? I had the same concern as what Kirill/Matthew raised in the other thread - I'm worried proactively splitting simply because any 4k page is zapped might quickly free up 2m thps in the system and I'm not sure whether it'll exaggerate the defragmentation of the system memory in general. I'm also not sure whether that's ideal for some very common workload that frequently uses DONTNEED to proactively drop some pages. To me, the old deferred-split has a point in that it'll only be done when at least the memory or cgroup is in low mem, that means we're in extreme cases so we'd better start to worry about page allocation failures rather than number of thps and memory performance. v2 even takes unmap() into account, so that'll further amplify that effect, imho.
I'm wondering whether MADV_SPLIT would make more sense so as to keep the old DONTNEED/unmap behaviors; however, before that I think I should understand the test results first, because besides the missing 2m pages, that'll be another important factor for "whether a new interface is more welcome" from a perf pov. Thanks,
On 25.11.21 11:24, Peter Xu wrote: > On Mon, Nov 22, 2021 at 10:40:54AM -0800, Shakeel Butt wrote: >>> Do we have a performance evaluation how much overhead is added e.g., for >>> a single 4k MADV_DONTNEED call on a THP or on a MADV_DONTNEED call that >>> covers the whole THP? >> >> I did a simple benchmark of madvise(MADV_DONTNEED) on 10000 THPs on >> x86 for both settings you suggested. I don't see any statistically >> significant difference with and without the patch. Let me know if you >> want me to try something else. > > I'm a bit surprised that sync split thp didn't bring any extra overhead. > > "unmap whole thp" is understandable from that pov, because afaict that won't > even trigger any thp split anyway even delayed, if this is the simplest case > that only this process mapped this thp, and it mapped once. > > For "unmap 4k upon thp" IIUC that's the worst case and zapping 4k should be > fast; while what I don't understand since thp split requires all hand-made work > for copying thp flags into small pages and so on, so I thought there should at > least be some overhead measured. Shakeel, could there be something overlooked > in the test, or maybe it's me that overlooked? > > I had the same concern as what Kirill/Matthew raised in the other thread - I'm > worried proactively splitting simply because any 4k page is zapped might > quickly free up 2m thps in the system and I'm not sure whether it'll exaggerate > the defragmentation of the system memory in general. I'm also not sure whether > that's ideal for some very common workload that frequently uses DONTNEED to > proactively drop some pages. The pageblock corresponding to the THP is movable. So (unless we start spilling unmovable allocations into movable pageblocks) we'd only place movable allocations in there. Compaction will be able to migrate to re-create a free THP. 
In contrast I think, compaction will happily skip over the THP and ignore it, because it has no clue that the THP could be repurposed by split+migrate (at least I am not aware of code that does it). Unless I am missing something, with the above in mind it could make sense to split as soon as possible, even before we're under memory pressure -- for example, for proactive compaction. [proactive compaction could try splitting first as well I think]
On Thu, Nov 25, 2021 at 2:09 AM Peter Xu <peterx@redhat.com> wrote: > [...] > > Hi, Shakeel, > > I saw your v2 has started to pass in zap_details, then we need know the side > effect on that skip-swap-entry thing because with your v2 code munmap() will > start to skip swap entry too (while it was not before). > > I saw that you didn't mention this in v2 patch either in commit message or > code, not sure whether you digged that up. I think it needs some double check > (or feel free to start this digging by reviewing my small patch above :). > I remember mentioning in the patch that this has dependency on your rfc patch. Just checked again and yeah I only mentioned in the 'changes since v1' section. I will add it more explicitly in the following versions. I will take a close look at your patch as well.
On Thu, Nov 25, 2021 at 09:14:30AM -0800, Shakeel Butt wrote: > On Thu, Nov 25, 2021 at 2:09 AM Peter Xu <peterx@redhat.com> wrote: > > > [...] > > > > Hi, Shakeel, > > > > I saw your v2 has started to pass in zap_details, then we need know the side > > effect on that skip-swap-entry thing because with your v2 code munmap() will > > start to skip swap entry too (while it was not before). > > > > I saw that you didn't mention this in v2 patch either in commit message or > > code, not sure whether you digged that up. I think it needs some double check > > (or feel free to start this digging by reviewing my small patch above :). > > > > I remember mentioning in the patch that this has dependency on your > rfc patch. Just checked again and yeah I only mentioned in the > 'changes since v1' section. I will add it more explicitly in the > following versions. Apologize for missing that section. Yeah depending on that patch should work. > > I will take a close look at your patch as well. Thanks a lot,
On Thu, Nov 25, 2021 at 11:32:08AM +0100, David Hildenbrand wrote: > On 25.11.21 11:24, Peter Xu wrote: > > On Mon, Nov 22, 2021 at 10:40:54AM -0800, Shakeel Butt wrote: > >>> Do we have a performance evaluation how much overhead is added e.g., for > >>> a single 4k MADV_DONTNEED call on a THP or on a MADV_DONTNEED call that > >>> covers the whole THP? > >> > >> I did a simple benchmark of madvise(MADV_DONTNEED) on 10000 THPs on > >> x86 for both settings you suggested. I don't see any statistically > >> significant difference with and without the patch. Let me know if you > >> want me to try something else. > > > > I'm a bit surprised that sync split thp didn't bring any extra overhead. > > > > "unmap whole thp" is understandable from that pov, because afaict that won't > > even trigger any thp split anyway even delayed, if this is the simplest case > > that only this process mapped this thp, and it mapped once. > > > > For "unmap 4k upon thp" IIUC that's the worst case and zapping 4k should be > > fast; while what I don't understand since thp split requires all hand-made work > > for copying thp flags into small pages and so on, so I thought there should at > > least be some overhead measured. Shakeel, could there be something overlooked > > in the test, or maybe it's me that overlooked? > > > > I had the same concern as what Kirill/Matthew raised in the other thread - I'm > > worried proactively splitting simply because any 4k page is zapped might > > quickly free up 2m thps in the system and I'm not sure whether it'll exaggerate > > the defragmentation of the system memory in general. I'm also not sure whether > > that's ideal for some very common workload that frequently uses DONTNEED to > > proactively drop some pages. > > The pageblock corresponding to the THP is movable. So (unless we start > spilling unmovable allocations into movable pageblocks) we'd only place > movable allocations in there. Compaction will be able to migrate to > re-create a free THP. 
> > In contrast I think, compaction will happily skip over the THP and > ignore it, because it has no clue that the THP could be repurposed by > split+migrate (at least I am not aware of code that does it). > > Unless I am missing something, with the above in mind it could make > sense to split as soon as possible, even before we're under memory > pressure -- for example, for proactive compaction. > > [proactive compaction could try splitting first as well I think] But we can't rely on proactive compaction for rapid operations, because it's still adding overhead to the overall system by split+merge, right? +compaction_proactiveness +======================== + ... +Note that compaction has a non-trivial system-wide impact as pages +belonging to different processes are moved around, which could also lead +to latency spikes in unsuspecting applications. The kernel employs +various heuristics to avoid wasting CPU cycles if it detects that +proactive compaction is not being effective. Delaying split makes sense to me because after all the kernel is not aware of the userspace's preference, so the best thing is to do nothing until necessary. Proactively split thps in dontneed/unmap added an assumption that the userspace wants to break the pages by default. It's 100% true for Shakeel's use case, but I'm not sure whether it may always be true. That's why I thought maybe a new interface is more proper, so we at least won't break anyone by accident.
On Thu, Nov 25, 2021 at 2:24 AM Peter Xu <peterx@redhat.com> wrote: > > On Mon, Nov 22, 2021 at 10:40:54AM -0800, Shakeel Butt wrote: > > > Do we have a performance evaluation how much overhead is added e.g., for > > > a single 4k MADV_DONTNEED call on a THP or on a MADV_DONTNEED call that > > > covers the whole THP? > > > > I did a simple benchmark of madvise(MADV_DONTNEED) on 10000 THPs on > > x86 for both settings you suggested. I don't see any statistically > > significant difference with and without the patch. Let me know if you > > want me to try something else. > > I'm a bit surprised that sync split thp didn't bring any extra overhead. > > "unmap whole thp" is understandable from that pov, because afaict that won't > even trigger any thp split anyway even delayed, if this is the simplest case > that only this process mapped this thp, and it mapped once. > > For "unmap 4k upon thp" IIUC that's the worst case and zapping 4k should be > fast; while what I don't understand since thp split requires all hand-made work > for copying thp flags into small pages and so on, so I thought there should at > least be some overhead measured. Shakeel, could there be something overlooked > in the test, or maybe it's me that overlooked? > Thanks for making me rerun this and yes indeed I had a very silly bug in the benchmark code (i.e. madvise the same page for the whole loop) and this is indeed several times slower than without the patch (sorry David for misleading you). To better understand what is happening, I profiled the benchmark:

- 31.27%  0.01%  dontneed  [kernel.kallsyms]  [k] zap_page_range_sync
   - 31.27% zap_page_range_sync
      - 30.25% split_local_deferred_list
         - 30.16% split_huge_page_to_list
            - 21.05% try_to_migrate
               + rmap_walk_anon
            - 7.47% remove_migration_ptes
               + 7.34% rmap_walk_locked
      + 1.02% zap_page_range_details

The overhead is not due to copying page flags but rather due to two rmap walks.
I don't think this much overhead is justified for current users of MADV_DONTNEED and munmap. I have to rethink the approach. thanks, Shakeel
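[For readers following along, here is a minimal sketch of the kind of micro-benchmark being discussed: one MADV_DONTNEED of a single 4k page inside a THP-backed region ("unmap 4k upon thp", the worst case for synchronous splitting). The helper names are made up for illustration, MADV_HUGEPAGE is only a hint, and a 4k page size is assumed:]

```c
#define _GNU_SOURCE
#include <assert.h>
#include <string.h>
#include <sys/mman.h>

#define THP_SIZE (2UL * 1024 * 1024)
#define PAGE_4K  4096UL

/* Map one anonymous region and fault it in; with MADV_HUGEPAGE set the
 * kernel may (but is not guaranteed to) back it with a 2MB THP. */
static void *map_thp_region(void)
{
    void *p = mmap(NULL, THP_SIZE, PROT_READ | PROT_WRITE,
                   MAP_PRIVATE | MAP_ANONYMOUS, -1, 0);
    if (p == MAP_FAILED)
        return NULL;
    madvise(p, THP_SIZE, MADV_HUGEPAGE);
    memset(p, 1, THP_SIZE);          /* fault the whole range in */
    return p;
}

/* Discard a single 4k page inside the (possibly huge) mapping.  With
 * the synchronous-split patch this forces a THP split before the 4k
 * page can be zapped; without it, the split is deferred. */
static int discard_one_4k_page(void *region)
{
    return madvise(region, PAGE_4K, MADV_DONTNEED);
}
```

[Timing a loop of map_thp_region() + discard_one_4k_page() over many distinct regions, with and without the patch, would reproduce the comparison; madvising the same region every iteration (the bug in the original run) measures only the cheap 4k zap after the first split.]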
On Thu, Nov 25, 2021 at 07:21:54PM -0800, Shakeel Butt wrote: > On Thu, Nov 25, 2021 at 2:24 AM Peter Xu <peterx@redhat.com> wrote: > > > > On Mon, Nov 22, 2021 at 10:40:54AM -0800, Shakeel Butt wrote: > > > > Do we have a performance evaluation how much overhead is added e.g., for > > > > a single 4k MADV_DONTNEED call on a THP or on a MADV_DONTNEED call that > > > > covers the whole THP? > > > > > > I did a simple benchmark of madvise(MADV_DONTNEED) on 10000 THPs on > > > x86 for both settings you suggested. I don't see any statistically > > > significant difference with and without the patch. Let me know if you > > > want me to try something else. > > > > I'm a bit surprised that sync split thp didn't bring any extra overhead. > > > > "unmap whole thp" is understandable from that pov, because afaict that won't > > even trigger any thp split anyway even delayed, if this is the simplest case > > that only this process mapped this thp, and it mapped once. > > > > For "unmap 4k upon thp" IIUC that's the worst case and zapping 4k should be > > fast; while what I don't understand since thp split requires all hand-made work > > for copying thp flags into small pages and so on, so I thought there should at > > least be some overhead measured. Shakeel, could there be something overlooked > > in the test, or maybe it's me that overlooked? > > > > Thanks for making me rerun this and yes indeed I had a very silly bug in the > benchmark code (i.e. madvise the same page for the whole loop) and this is > indeed several times slower than without the patch (sorry David for misleading > you). 
> > To better understand what is happening, I profiled the benchmark: > > - 31.27% 0.01% dontneed [kernel.kallsyms] [k] zap_page_range_sync > - 31.27% zap_page_range_sync > - 30.25% split_local_deferred_list > - 30.16% split_huge_page_to_list > - 21.05% try_to_migrate > + rmap_walk_anon > - 7.47% remove_migration_ptes > + 7.34% rmap_walk_locked > + 1.02% zap_page_range_details Makes sense, thanks for verifying it, Shakeel. I forgot it'll also walk itself. I believe this effect will be exaggerated when the mapping is shared, e.g. shmem file thp between processes. What's worse is that when one process DONTNEED one 4k page, all the rest mms will need to split the huge pmd without even being noticed, so that's a direct suffer from perf degrade. > > The overhead is not due to copying page flags but rather due to two rmap walks. > I don't think this much overhead is justified for current users of MADV_DONTNEED > and munmap. I have to rethink the approach. Some side notes: I digged out the old MADV_COLLAPSE proposal right after I thought about MADV_SPLIT (or any of its variance): https://lore.kernel.org/all/d098c392-273a-36a4-1a29-59731cdf5d3d@google.com/ My memory was that there's some issue to be solved so that was blocked, however when I read the thread it sounds like the list was mostly reaching a consensus on considering MADV_COLLAPSE being beneficial. Still copying DavidR in case I missed something important. If we think MADV_COLLAPSE can help to implement an userspace (and more importantly, data-aware) khugepaged, then MADV_SPLIT can be the other side of kcompactd, perhaps. That's probably a bit off topic of this specific discussion on the specific use case, but so far it seems all reasonable and discussable.
On 26.11.21 03:52, Peter Xu wrote: > On Thu, Nov 25, 2021 at 11:32:08AM +0100, David Hildenbrand wrote: >> On 25.11.21 11:24, Peter Xu wrote: >>> On Mon, Nov 22, 2021 at 10:40:54AM -0800, Shakeel Butt wrote: >>>>> Do we have a performance evaluation how much overhead is added e.g., for >>>>> a single 4k MADV_DONTNEED call on a THP or on a MADV_DONTNEED call that >>>>> covers the whole THP? >>>> >>>> I did a simple benchmark of madvise(MADV_DONTNEED) on 10000 THPs on >>>> x86 for both settings you suggested. I don't see any statistically >>>> significant difference with and without the patch. Let me know if you >>>> want me to try something else. >>> >>> I'm a bit surprised that sync split thp didn't bring any extra overhead. >>> >>> "unmap whole thp" is understandable from that pov, because afaict that won't >>> even trigger any thp split anyway even delayed, if this is the simplest case >>> that only this process mapped this thp, and it mapped once. >>> >>> For "unmap 4k upon thp" IIUC that's the worst case and zapping 4k should be >>> fast; while what I don't understand since thp split requires all hand-made work >>> for copying thp flags into small pages and so on, so I thought there should at >>> least be some overhead measured. Shakeel, could there be something overlooked >>> in the test, or maybe it's me that overlooked? >>> >>> I had the same concern as what Kirill/Matthew raised in the other thread - I'm >>> worried proactively splitting simply because any 4k page is zapped might >>> quickly free up 2m thps in the system and I'm not sure whether it'll exaggerate >>> the defragmentation of the system memory in general. I'm also not sure whether >>> that's ideal for some very common workload that frequently uses DONTNEED to >>> proactively drop some pages. >> >> The pageblock corresponding to the THP is movable. So (unless we start >> spilling unmovable allocations into movable pageblocks) we'd only place >> movable allocations in there. 
Compaction will be able to migrate to >> re-create a free THP. >> >> In contrast I think, compaction will happily skip over the THP and >> ignore it, because it has no clue that the THP could be repurposed by >> split+migrate (at least I am not aware of code that does it). >> >> Unless I am missing something, with the above in mind it could make >> sense to split as soon as possible, even before we're under memory >> pressure -- for example, for proactive compaction. >> >> [proactive compaction could try splitting first as well I think] > > But we can't rely on proactive compaction for rapid operations, because it's > still adding overhead to the overall system by split+merge, right? Yes, but there is also direct compaction that can be triggered without the shrinker getting involved. I think we can summarize as "there might not be a right or wrong when to split". An application that MADV_DONTNEEDs/munmap sub-THP memory told us that it doesn't want to consume memory, yet it looks like it's still consuming that memory. I do wonder how THP on the deferred split queue behave in respect to page migration -- memory offlining, alloc_contig_range(). I saw reports that there are some cases where THP can be problematic when stress-testing THP: https://lkml.kernel.org/r/20210903162102.GA10039@mam-gdavis-dt But not sure if that's related to deferred splitting. Most probably not. > > +compaction_proactiveness > +======================== > + ... > +Note that compaction has a non-trivial system-wide impact as pages > +belonging to different processes are moved around, which could also lead > +to latency spikes in unsuspecting applications. The kernel employs > +various heuristics to avoid wasting CPU cycles if it detects that > +proactive compaction is not being effective. > > Delaying split makes sense to me because after all the kernel is not aware of > the userspace's preference, so the best thing is to do nothing until necessary. 
> > Proactively split thps in dontneed/unmap added an assumption that the userspace > wants to break the pages by default. It's 100% true for Shakeel's use case, > but I'm not sure whether it may always be true. That's why I thought maybe a > new interface is more proper, so we at least won't break anyone by accident. Well, we already broke the PMD into PTEs. So the performance gain at least for that user is really gone until we "fix that" again via khugepaged -- which might just be configured to not "fix" if there are empty PTEs. It for sure is interesting if you have a COW huge page and only one party zaps/unmaps some part.
>> >> Thanks for making me rerun this and yes indeed I had a very silly bug in the >> benchmark code (i.e. madvise the same page for the whole loop) and this is >> indeed several times slower than without the patch (sorry David for misleading >> you). No worries, BUGs happen :) >> >> To better understand what is happening, I profiled the benchmark: >> >> - 31.27% 0.01% dontneed [kernel.kallsyms] [k] zap_page_range_sync >> - 31.27% zap_page_range_sync >> - 30.25% split_local_deferred_list >> - 30.16% split_huge_page_to_list >> - 21.05% try_to_migrate >> + rmap_walk_anon >> - 7.47% remove_migration_ptes >> + 7.34% rmap_walk_locked >> + 1.02% zap_page_range_details > > Makes sense, thanks for verifying it, Shakeel. I forgot it'll also walk > itself. > > I believe this effect will be exaggerated when the mapping is shared, > e.g. shmem file thp between processes. What's worse is that when one process > DONTNEED one 4k page, all the rest mms will need to split the huge pmd without > even being noticed, so that's a direct suffer from perf degrade. Would this really apply to MADV_DONTNEED on shmem, and would deferred splitting apply on shmem? I'm constantly confused about shmem vs. anon, but I would have assumed that shmem is fd-based and we wouldn't end up in rmap_walk_anon. For shmem, the pagecache would contain the THP which would stick around and deferred splits don't even apply. But again, I'm constantly confused so I'd love to be enlighted. > >> >> The overhead is not due to copying page flags but rather due to two rmap walks. >> I don't think this much overhead is justified for current users of MADV_DONTNEED >> and munmap. I have to rethink the approach. Most probably not. 
> > Some side notes: I digged out the old MADV_COLLAPSE proposal right after I > thought about MADV_SPLIT (or any of its variance): > > https://lore.kernel.org/all/d098c392-273a-36a4-1a29-59731cdf5d3d@google.com/ > > My memory was that there's some issue to be solved so that was blocked, however > when I read the thread it sounds like the list was mostly reaching a consensus > on considering MADV_COLLAPSE being beneficial. Still copying DavidR in case I > missed something important. > > If we think MADV_COLLAPSE can help to implement an userspace (and more > importantly, data-aware) khugepaged, then MADV_SPLIT can be the other side of > kcompactd, perhaps. > > That's probably a bit off topic of this specific discussion on the specific use > case, but so far it seems all reasonable and discussable. User space can trigger a split manually using some MADV hackery. But it can only be used for the use case here, where we actually want to zap a page. 1. MADV_FREE a single 4k page in the range. This will split the PMD->PTE and the compound page. 2. MADV_DONTNEED either the complete range or the single 4k page.
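[A sketch of the two-step user-space trick described above, assuming a Linux 4.5+ kernel (for MADV_FREE); the helper name is hypothetical, and the comments restate the claim made in the mail rather than an independently verified guarantee:]

```c
#define _GNU_SOURCE
#include <assert.h>
#include <stddef.h>
#include <string.h>
#include <sys/mman.h>

/* Step 1: MADV_FREE a single 4k page -- per the discussion above, this
 * splits the PMD mapping into PTEs and the compound page.
 * Step 2: MADV_DONTNEED then actually zaps the page (or the whole
 * range).  Only useful when the page is to be zapped anyway. */
static int split_then_zap(void *addr, size_t len)
{
    if (madvise(addr, len, MADV_FREE))
        return -1;
    return madvise(addr, len, MADV_DONTNEED);
}
```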
On Fri, Nov 26, 2021 at 10:16:58AM +0100, David Hildenbrand wrote: > Would this really apply to MADV_DONTNEED on shmem, and would deferred > splitting apply on shmem? I'm constantly confused about shmem vs. anon, > but I would have assumed that shmem is fd-based and we wouldn't end up > in rmap_walk_anon. For shmem, the pagecache would contain the THP which > would stick around and deferred splits don't even apply. Good point.. when split on shmem we just clear pmd, so yeah I don't think we'll ever add it into the deferred list. > User space can trigger a split manually using some MADV hackery. But it > can only be used for the use case here, where we actually want to zap a > page. > > 1. MADV_FREE a single 4k page in the range. This will split the PMD->PTE > and the compound page. Seems to be a very implicit but working solution indeed. > 2. MADV_DONTNEED either the complete range or the single 4k page. Is this what this patch is working on? Thanks,
On Fri, Nov 26, 2021 at 1:17 AM David Hildenbrand <david@redhat.com> wrote: > > >> > >> Thanks for making me rerun this and yes indeed I had a very silly bug in the > >> benchmark code (i.e. madvise the same page for the whole loop) and this is > >> indeed several times slower than without the patch (sorry David for misleading > >> you). > > No worries, BUGs happen :) > > >> > >> To better understand what is happening, I profiled the benchmark: > >> > >> - 31.27% 0.01% dontneed [kernel.kallsyms] [k] zap_page_range_sync > >> - 31.27% zap_page_range_sync > >> - 30.25% split_local_deferred_list > >> - 30.16% split_huge_page_to_list > >> - 21.05% try_to_migrate > >> + rmap_walk_anon > >> - 7.47% remove_migration_ptes > >> + 7.34% rmap_walk_locked > >> + 1.02% zap_page_range_details > > > > Makes sense, thanks for verifying it, Shakeel. I forgot it'll also walk > > itself. > > > > I believe this effect will be exaggerated when the mapping is shared, > > e.g. shmem file thp between processes. What's worse is that when one process > > DONTNEED one 4k page, all the rest mms will need to split the huge pmd without > > even being noticed, so that's a direct suffer from perf degrade. > > Would this really apply to MADV_DONTNEED on shmem, and would deferred > splitting apply on shmem? I'm constantly confused about shmem vs. anon, > but I would have assumed that shmem is fd-based and we wouldn't end up > in rmap_walk_anon. For shmem, the pagecache would contain the THP which > would stick around and deferred splits don't even apply. The deferred split is anon THP only, it doesn't apply to shmem THP. For the rmap walk, there are two rmap walks for anon THP, but just one for shmem THP. Both need one rmap walk to unmap the THP before doing actual split, but anon THP needs another rmap walk to reinstall PTEs for still mapped pages, however this is not needed for shmem pages since they could be reached via page cache.
> > But again, I'm constantly confused so I'd love to be enlighted. > > > > >> > >> The overhead is not due to copying page flags but rather due to two rmap walks. > >> I don't think this much overhead is justified for current users of MADV_DONTNEED > >> and munmap. I have to rethink the approach. > > Most probably not. > > > > > Some side notes: I digged out the old MADV_COLLAPSE proposal right after I > > thought about MADV_SPLIT (or any of its variance): > > > > https://lore.kernel.org/all/d098c392-273a-36a4-1a29-59731cdf5d3d@google.com/ > > > > My memory was that there's some issue to be solved so that was blocked, however > > when I read the thread it sounds like the list was mostly reaching a consensus > > on considering MADV_COLLAPSE being beneficial. Still copying DavidR in case I > > missed something important. > > > > If we think MADV_COLLAPSE can help to implement an userspace (and more > > importantly, data-aware) khugepaged, then MADV_SPLIT can be the other side of > > kcompactd, perhaps. > > > > That's probably a bit off topic of this specific discussion on the specific use > > case, but so far it seems all reasonable and discussable. > > User space can trigger a split manually using some MADV hackery. But it > can only be used for the use case here, where we actually want to zap a > page. > > 1. MADV_FREE a single 4k page in the range. This will split the PMD->PTE > and the compound page. > 2. MADV_DONTNEED either the complete range or the single 4k page. > > -- > Thanks, > > David / dhildenb >
On Fri, Nov 26, 2021 at 1:04 AM David Hildenbrand <david@redhat.com> wrote: > > On 26.11.21 03:52, Peter Xu wrote: > > On Thu, Nov 25, 2021 at 11:32:08AM +0100, David Hildenbrand wrote: > >> On 25.11.21 11:24, Peter Xu wrote: > >>> On Mon, Nov 22, 2021 at 10:40:54AM -0800, Shakeel Butt wrote: > >>>>> Do we have a performance evaluation how much overhead is added e.g., for > >>>>> a single 4k MADV_DONTNEED call on a THP or on a MADV_DONTNEED call that > >>>>> covers the whole THP? > >>>> > >>>> I did a simple benchmark of madvise(MADV_DONTNEED) on 10000 THPs on > >>>> x86 for both settings you suggested. I don't see any statistically > >>>> significant difference with and without the patch. Let me know if you > >>>> want me to try something else. > >>> > >>> I'm a bit surprised that sync split thp didn't bring any extra overhead. > >>> > >>> "unmap whole thp" is understandable from that pov, because afaict that won't > >>> even trigger any thp split anyway even delayed, if this is the simplest case > >>> that only this process mapped this thp, and it mapped once. > >>> > >>> For "unmap 4k upon thp" IIUC that's the worst case and zapping 4k should be > >>> fast; while what I don't understand since thp split requires all hand-made work > >>> for copying thp flags into small pages and so on, so I thought there should at > >>> least be some overhead measured. Shakeel, could there be something overlooked > >>> in the test, or maybe it's me that overlooked? > >>> > >>> I had the same concern as what Kirill/Matthew raised in the other thread - I'm > >>> worried proactively splitting simply because any 4k page is zapped might > >>> quickly free up 2m thps in the system and I'm not sure whether it'll exaggerate > >>> the defragmentation of the system memory in general. I'm also not sure whether > >>> that's ideal for some very common workload that frequently uses DONTNEED to > >>> proactively drop some pages. > >> > >> The pageblock corresponding to the THP is movable. 
So (unless we start > >> spilling unmovable allocations into movable pageblocks) we'd only place > >> movable allocations in there. Compaction will be able to migrate to > >> re-create a free THP. > >> > >> In contrast I think, compaction will happily skip over the THP and > >> ignore it, because it has no clue that the THP could be repurposed by > >> split+migrate (at least I am not aware of code that does it). > >> > >> Unless I am missing something, with the above in mind it could make > >> sense to split as soon as possible, even before we're under memory > >> pressure -- for example, for proactive compaction. > >> > >> [proactive compaction could try splitting first as well I think] > > > > But we can't rely on proactive compaction for rapid operations, because it's > > still adding overhead to the overall system by split+merge, right? > > Yes, but there is also direct compaction that can be triggered without > the shrinker getting involved. I think we can summarize as "there might > not be a right or wrong when to split". An application that > MADV_DONTNEEDs/munmap sub-THP memory told us that it doesn't want to > consume memory, yet it looks like it's still consuming that memory. > > I do wonder how THP on the deferred split queue behave in respect to > page migration -- memory offlining, alloc_contig_range(). I saw reports IIUC the old page that is on deferred split queue would be freed and off the list once the migration is done. But the new page is *NOT* added to the deferred split list at all. This would not cause any severe bugs but the partial unmapped pages can no longer be shrunk under memory pressure. > that there are some cases where THP can be problematic when > stress-testing THP: > https://lkml.kernel.org/r/20210903162102.GA10039@mam-gdavis-dt > > But not sure if that's related to deferred splitting. Most probably not. > > > > > +compaction_proactiveness > > +======================== > > + ... 
> > +Note that compaction has a non-trivial system-wide impact as pages > > +belonging to different processes are moved around, which could also lead > > +to latency spikes in unsuspecting applications. The kernel employs > > +various heuristics to avoid wasting CPU cycles if it detects that > > +proactive compaction is not being effective. > > > > Delaying split makes sense to me because after all the kernel is not aware of > > the userspace's preference, so the best thing is to do nothing until necessary. > > > > Proactively split thps in dontneed/unmap added an assumption that the userspace > > wants to break the pages by default. It's 100% true for Shakeel's use case, > > but I'm not sure whether it may always be true. That's why I thought maybe a > > new interface is more proper, so we at least won't break anyone by accident. > > Well, we already broke the PMD into PTEs. So the performance gain at > least for that user is really gone until we "fix that" again via > khugepaged -- which might just be configured to not "fix" if there are > empty PTEs. > > It for sure is interesting if you have a COW huge page and only one > party zaps/unmaps some part. > > > -- > Thanks, > > David / dhildenb >
> >> Many applications do sophisticated management of their heap memory for
> >> better performance but with low cost. We have a bunch of such
> >> applications running on our production and examples include caching and
> >> data storage services. These applications keep their hot data on the
> >> THPs for better performance and release the cold data through
> >> MADV_DONTNEED to keep the memory cost low.
> >>
> >> The kernel defers the split and release of THPs until there is memory
> >> pressure. This causes complicates the memory management of these
> >> sophisticated applications which then needs to look into low level
> >> kernel handling of THPs to better gauge their headroom for expansion. In
> >> addition these applications are very latency sensitive and would prefer
> >> to not face memory reclaim due to non-deterministic nature of reclaim.
> >>
> >> This patch let such applications not worry about the low level handling
> >> of THPs in the kernel and splits the THPs synchronously on
> >> MADV_DONTNEED.
> >
> > I've been wondering about whether this is really the right strategy
> > (and this goes wider than just this one, new case)
> >
> > We chose to use a 2MB page here, based on whatever heuristics are
> > currently in play. Now userspace is telling us we were wrong and should
> > have used smaller pages.
>
> IIUC, not necessarily, unfortunately.
>
> User space might be discarding the whole 2MB either via a single call
> (MADV_DONTNEED a 2MB range as done by virtio-balloon with "free page
> reporting" or by virtio-mem in QEMU). In that case, there is nothing to
> migrate and we were not doing anything wrong.
>
> But more extreme, user space might be discarding the whole THP in small
> pieces shortly over time. This for example happens when a VM inflates
> the memory balloon via virtio-balloon. All inflation requests are 4k,
> resulting in 4k MADV_DONTNEED calls.
> If we end up inflating a THP
> range inside of the VM, mapping to a THP range inside the hypervisor,
> we'll essentially free a THP in the hypervisor piece by piece using
> individual MADV_DONTNEED calls -- this happens frequently. Something
> similar can happen when de-fragmentation inside the VM "moves around"
> inflated 4k pages piece by piece to essentially form a huge inflated
> range -- this happens less frequently as of now. In both cases,
> migration is counter-productive, as we're just about to free the whole
> range either way.
>
> (yes, there are ways to optimize, for example using hugepage ballooning
> or merging MADV_DONTNEED calls in the hypervisor, but what I described
> is what we currently implement in hypervisors like QEMU, because there
> are corner cases for everything)

It seems this can happen when the guest is using huge pages or THP. If
we end up not freeing hypervisor memory (THP) until memory pressure
mounts, this can be a problem for "free page reporting" as well?

> Long story short: it's hard to tell what will happen next based on a
> single MADV_DONTNEED call. Page compaction, in comparison, doesn't care
> and optimized the layout as it observes it.
On Fri, 26 Nov 2021, Peter Xu wrote:

> Some side notes: I digged out the old MADV_COLLAPSE proposal right after I
> thought about MADV_SPLIT (or any of its variance):
>
> https://lore.kernel.org/all/d098c392-273a-36a4-1a29-59731cdf5d3d@google.com/
>
> My memory was that there's some issue to be solved so that was blocked, however
> when I read the thread it sounds like the list was mostly reaching a consensus
> on considering MADV_COLLAPSE being beneficial. Still copying DavidR in case I
> missed something important.
>
> If we think MADV_COLLAPSE can help to implement an userspace (and more
> importantly, data-aware) khugepaged, then MADV_SPLIT can be the other side of
> kcompactd, perhaps.
>
> That's probably a bit off topic of this specific discussion on the specific use
> case, but so far it seems all reasonable and discussable.
>

Hi Peter,

Providing a (late) update since we now have some better traction on this, I
think we'll be ready to post an RFC soon that introduces MADV_COLLAPSE. The
work is being driven by Zach, now cc'd.

Let's also include SeongJae Park <sj@kernel.org> as well and keep him in the
loop since DAMON could easily be extended with a DAMOS_COLLAPSE action to use
MADV_COLLAPSE for hot regions of memory.

Idea for initial approach:

 - MADV_COLLAPSE core code based on the proposal you cite above for anon
   memory as the inaugural support, collapse memory into thp in process
   context
 - Batching support to collapse ranges of memory into multiple THP
 - Wire this up for madvise(2) (and process_madvise(2))
 - Enlightenment for file-backed thp

I think Zach's RFC will cover the first three, it could be debated if the
initial patch series *must* support file-backed thp. We'll see based on the
feedback to the RFC.

There's also an extension where MADV_COLLAPSE could be potentially useful for
hugetlb backed memory.
We have another effort underway that we've been talking with Mike Kravetz
about that allows hugetlb memory to be mapped at multiple levels of the page
tables. There are several use cases, but one of the driving factors is the
performance of post-copy live migration; in this case, you'd be able to send
smaller sized pages over the wire rather than, say, a 1GB gigantic page.

In this case, MADV_COLLAPSE could be useful to map smaller pages by a larger
page table entry before all of the smaller pages have been live migrated.

That said, we have not invested time into an MADV_SPLIT yet. Do you (or
anybody else) have concerns about this approach? Ideas for extensions?

Thanks!
diff --git a/include/linux/mmzone.h b/include/linux/mmzone.h
index 58e744b78c2c..7fa0035128b9 100644
--- a/include/linux/mmzone.h
+++ b/include/linux/mmzone.h
@@ -795,6 +795,11 @@ struct deferred_split {
 	struct list_head split_queue;
 	unsigned long split_queue_len;
 };
+void split_local_deferred_list(struct list_head *defer_list);
+#else
+static inline void split_local_deferred_list(struct list_head *defer_list)
+{
+}
 #endif
 
 /*
diff --git a/include/linux/sched.h b/include/linux/sched.h
index 9d27fd0ce5df..a984bb6509d9 100644
--- a/include/linux/sched.h
+++ b/include/linux/sched.h
@@ -1412,6 +1412,10 @@ struct task_struct {
 	struct mem_cgroup		*active_memcg;
 #endif
 
+#ifdef CONFIG_TRANSPARENT_HUGEPAGE
+	struct list_head		*deferred_split_list;
+#endif
+
 #ifdef CONFIG_BLK_CGROUP
 	struct request_queue		*throttle_queue;
 #endif
diff --git a/include/linux/sched/mm.h b/include/linux/sched/mm.h
index fdf742e15e27..9b438c7e811e 100644
--- a/include/linux/sched/mm.h
+++ b/include/linux/sched/mm.h
@@ -374,6 +374,17 @@ set_active_memcg(struct mem_cgroup *memcg)
 }
 #endif
 
+#ifdef CONFIG_TRANSPARENT_HUGEPAGE
+static inline void set_local_deferred_list(struct list_head *list)
+{
+	current->deferred_split_list = list;
+}
+#else
+static inline void set_local_deferred_list(struct list_head *list)
+{
+}
+#endif
+
 #ifdef CONFIG_MEMBARRIER
 enum {
 	MEMBARRIER_STATE_PRIVATE_EXPEDITED_READY	= (1U << 0),
diff --git a/kernel/fork.c b/kernel/fork.c
index 01af6129aa38..8197b8ed4b7a 100644
--- a/kernel/fork.c
+++ b/kernel/fork.c
@@ -1019,6 +1019,9 @@ static struct task_struct *dup_task_struct(struct task_struct *orig, int node)
 
 #ifdef CONFIG_MEMCG
 	tsk->active_memcg = NULL;
+#endif
+#ifdef CONFIG_TRANSPARENT_HUGEPAGE
+	tsk->deferred_split_list = NULL;
 #endif
 	return tsk;
diff --git a/mm/huge_memory.c b/mm/huge_memory.c
index e5483347291c..2f73eeecb857 100644
--- a/mm/huge_memory.c
+++ b/mm/huge_memory.c
@@ -2754,6 +2754,7 @@ void free_transhuge_page(struct page *page)
 void deferred_split_huge_page(struct page *page)
 {
 	struct deferred_split *ds_queue = get_deferred_split_queue(page);
+	struct list_head *list = current->deferred_split_list;
 #ifdef CONFIG_MEMCG
 	struct mem_cgroup *memcg = page_memcg(compound_head(page));
 #endif
@@ -2774,7 +2775,14 @@ void deferred_split_huge_page(struct page *page)
 	if (PageSwapCache(page))
 		return;
 
+	if (list && list_empty(page_deferred_list(page))) {
+		/* corresponding put in split_local_deferred_list. */
+		get_page(page);
+		list_add(page_deferred_list(page), list);
+	}
+
 	spin_lock_irqsave(&ds_queue->split_queue_lock, flags);
+
 	if (list_empty(page_deferred_list(page))) {
 		count_vm_event(THP_DEFERRED_SPLIT_PAGE);
 		list_add_tail(page_deferred_list(page), &ds_queue->split_queue);
@@ -2801,6 +2809,48 @@ static unsigned long deferred_split_count(struct shrinker *shrink,
 	return READ_ONCE(ds_queue->split_queue_len);
 }
 
+void split_local_deferred_list(struct list_head *defer_list)
+{
+	struct list_head *pos, *next;
+	struct page *page;
+
+	/* First iteration for split. */
+	list_for_each_safe(pos, next, defer_list) {
+		page = list_entry((void *)pos, struct page, deferred_list);
+		page = compound_head(page);
+
+		if (!trylock_page(page))
+			continue;
+
+		if (split_huge_page(page)) {
+			unlock_page(page);
+			continue;
+		}
+		/* split_huge_page() removes page from list on success */
+		unlock_page(page);
+
+		/* corresponding get in deferred_split_huge_page. */
+		put_page(page);
+	}
+
+	/* Second iteration to putback failed pages. */
+	list_for_each_safe(pos, next, defer_list) {
+		struct deferred_split *ds_queue;
+		unsigned long flags;
+
+		page = list_entry((void *)pos, struct page, deferred_list);
+		page = compound_head(page);
+		ds_queue = get_deferred_split_queue(page);
+
+		spin_lock_irqsave(&ds_queue->split_queue_lock, flags);
+		list_move(page_deferred_list(page), &ds_queue->split_queue);
+		spin_unlock_irqrestore(&ds_queue->split_queue_lock, flags);
+
+		/* corresponding get in deferred_split_huge_page. */
+		put_page(page);
+	}
+}
+
 static unsigned long deferred_split_scan(struct shrinker *shrink,
 		struct shrink_control *sc)
 {
diff --git a/mm/madvise.c b/mm/madvise.c
index 8c927202bbe6..15614115e359 100644
--- a/mm/madvise.c
+++ b/mm/madvise.c
@@ -762,7 +762,15 @@ static int madvise_free_single_vma(struct vm_area_struct *vma,
 static long madvise_dontneed_single_vma(struct vm_area_struct *vma,
 					unsigned long start, unsigned long end)
 {
+	LIST_HEAD(list);
+
+	set_local_deferred_list(&list);
+
 	zap_page_range(vma, start, end - start);
+
+	set_local_deferred_list(NULL);
+	split_local_deferred_list(&list);
+
 	return 0;
 }
Many applications do sophisticated management of their heap memory for
better performance but with low cost. We have a bunch of such
applications running in our production environment; examples include
caching and data storage services. These applications keep their hot
data on THPs for better performance and release the cold data through
MADV_DONTNEED to keep the memory cost low.

The kernel defers the split and release of THPs until there is memory
pressure. This complicates the memory management of these sophisticated
applications, which then need to look into low level kernel handling of
THPs to better gauge their headroom for expansion. In addition, these
applications are very latency sensitive and would prefer not to face
memory reclaim due to the non-deterministic nature of reclaim.

This patch lets such applications not worry about the low level handling
of THPs in the kernel and splits the THPs synchronously on
MADV_DONTNEED.

Signed-off-by: Shakeel Butt <shakeelb@google.com>
---
 include/linux/mmzone.h   |  5 +++++
 include/linux/sched.h    |  4 ++++
 include/linux/sched/mm.h | 11 +++++++++
 kernel/fork.c            |  3 +++
 mm/huge_memory.c         | 50 ++++++++++++++++++++++++++++++++++++++++
 mm/madvise.c             |  8 +++++++
 6 files changed, 81 insertions(+)