
mm: split thp synchronously on MADV_DONTNEED

Message ID 20211120201230.920082-1-shakeelb@google.com (mailing list archive)
State New
Series: mm: split thp synchronously on MADV_DONTNEED

Commit Message

Shakeel Butt Nov. 20, 2021, 8:12 p.m. UTC
Many applications do sophisticated management of their heap memory for
better performance at low cost. We have a bunch of such applications
running in production; examples include caching and data storage
services. These applications keep their hot data on THPs for better
performance and release the cold data through MADV_DONTNEED to keep
the memory cost low.

The kernel defers the split and release of THPs until there is memory
pressure. This complicates the memory management of these
sophisticated applications, which then need to look into low level
kernel handling of THPs to better gauge their headroom for expansion.
In addition, these applications are very latency sensitive and would
prefer not to face memory reclaim due to the non-deterministic nature
of reclaim.

This patch lets such applications not worry about the low level
handling of THPs in the kernel and splits the THPs synchronously on
MADV_DONTNEED.
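
For reference, the release pattern described above boils down to plain
madvise(2) calls from userspace. A minimal sketch (the buffer size,
alignment and MADV_HUGEPAGE hint are illustrative assumptions, not
taken from the actual applications):

    #include <stdlib.h>
    #include <sys/mman.h>

    int main(void)
    {
        size_t thp = 2UL << 20;
        /* hot buffer, THP-aligned so it can be huge-mapped */
        char *buf = aligned_alloc(thp, thp);

        madvise(buf, thp, MADV_HUGEPAGE);
        /* ... use buf while the data is hot ... */

        /* the tail went cold: release it. The partially unmapped THP
         * used to sit on the deferred split queue until memory
         * pressure; with this patch it is split synchronously. */
        madvise(buf + 4096, thp - 4096, MADV_DONTNEED);
        return 0;
    }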

Signed-off-by: Shakeel Butt <shakeelb@google.com>
---
 include/linux/mmzone.h   |  5 ++++
 include/linux/sched.h    |  4 ++++
 include/linux/sched/mm.h | 11 +++++++++
 kernel/fork.c            |  3 +++
 mm/huge_memory.c         | 50 ++++++++++++++++++++++++++++++++++++++++
 mm/madvise.c             |  8 +++++++
 6 files changed, 81 insertions(+)
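
The mm/madvise.c hunk is not quoted in the replies below; judging from
the diffstat and the discussion, the MADV_DONTNEED path presumably does
something along these lines (a sketch, not the actual hunk):

    LIST_HEAD(defer_list);

    /* collect THPs that become partially unmapped during the zap */
    current->deferred_split_list = &defer_list;
    zap_page_range(vma, start, end - start);
    current->deferred_split_list = NULL;

    /* split them synchronously; split_local_deferred_list() is quoted
     * in full in Kirill's reply below */
    split_local_deferred_list(&defer_list);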

Comments

Matthew Wilcox Nov. 21, 2021, 4:35 a.m. UTC | #1
On Sat, Nov 20, 2021 at 12:12:30PM -0800, Shakeel Butt wrote:
> This patch lets such applications not worry about the low level
> handling of THPs in the kernel and splits the THPs synchronously on
> MADV_DONTNEED.

I don't mind the synchronous split, but I have concerns about the
implementation.  I don't think it's worth another pointer in task_struct.
It's also the case that splitting is likely to succeed, so I think a
better implementation would try to split and then put the page on the
global deferred list if splitting fails.
Shakeel Butt Nov. 21, 2021, 5:25 a.m. UTC | #2
On Sat, Nov 20, 2021 at 8:35 PM Matthew Wilcox <willy@infradead.org> wrote:
>
> On Sat, Nov 20, 2021 at 12:12:30PM -0800, Shakeel Butt wrote:
> > This patch lets such applications not worry about the low level
> > handling of THPs in the kernel and splits the THPs synchronously on
> > MADV_DONTNEED.
>
> I don't mind the synchronous split, but I have concerns about the
> implementation.  I don't think it's worth another pointer in task_struct.

Are you concerned about the size of task_struct? At least on my config
this additional pointer was just filling a hole and not increasing the
size. I can check a couple of other configs as well.
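
(For anyone who wants to verify this on their own config, pahole can
show the holes and the resulting struct size, assuming a vmlinux built
with debug info:

    pahole -C task_struct vmlinux | tail -n 5

The size/holes summary at the end of the output shows whether the new
pointer consumed an existing hole.)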

> It's also the case that splitting is likely to succeed, so I think a
> better implementation would try to split and then put the page on the
> global deferred list if splitting fails.

Actually this is what this patch is doing. See the second loop in
split_local_deferred_list().
Kirill A. Shutemov Nov. 22, 2021, 12:50 a.m. UTC | #3
On Sat, Nov 20, 2021 at 12:12:30PM -0800, Shakeel Butt wrote:
> Many applications do sophisticated management of their heap memory for
> better performance at low cost. We have a bunch of such applications
> running in production; examples include caching and data storage
> services. These applications keep their hot data on THPs for better
> performance and release the cold data through MADV_DONTNEED to keep
> the memory cost low.
>
> The kernel defers the split and release of THPs until there is memory
> pressure. This complicates the memory management of these
> sophisticated applications, which then need to look into low level
> kernel handling of THPs to better gauge their headroom for expansion.
> In addition, these applications are very latency sensitive and would
> prefer not to face memory reclaim due to the non-deterministic nature
> of reclaim.
>
> This patch lets such applications not worry about the low level
> handling of THPs in the kernel and splits the THPs synchronously on
> MADV_DONTNEED.

Have you considered the impact on short-lived tasks where paying the
splitting tax would hurt performance without any benefit? Maybe a
separate madvise operation is needed? I dunno.

> Signed-off-by: Shakeel Butt <shakeelb@google.com>
> ---
>  include/linux/mmzone.h   |  5 ++++
>  include/linux/sched.h    |  4 ++++
>  include/linux/sched/mm.h | 11 +++++++++
>  kernel/fork.c            |  3 +++
>  mm/huge_memory.c         | 50 ++++++++++++++++++++++++++++++++++++++++
>  mm/madvise.c             |  8 +++++++
>  6 files changed, 81 insertions(+)
> 
> diff --git a/include/linux/mmzone.h b/include/linux/mmzone.h
> index 58e744b78c2c..7fa0035128b9 100644
> --- a/include/linux/mmzone.h
> +++ b/include/linux/mmzone.h
> @@ -795,6 +795,11 @@ struct deferred_split {
>  	struct list_head split_queue;
>  	unsigned long split_queue_len;
>  };
> +void split_local_deferred_list(struct list_head *defer_list);
> +#else
> +static inline void split_local_deferred_list(struct list_head *defer_list)
> +{
> +}
>  #endif
>  
>  /*
> diff --git a/include/linux/sched.h b/include/linux/sched.h
> index 9d27fd0ce5df..a984bb6509d9 100644
> --- a/include/linux/sched.h
> +++ b/include/linux/sched.h
> @@ -1412,6 +1412,10 @@ struct task_struct {
>  	struct mem_cgroup		*active_memcg;
>  #endif
>  
> +#ifdef CONFIG_TRANSPARENT_HUGEPAGE
> +	struct list_head		*deferred_split_list;
> +#endif
> +
>  #ifdef CONFIG_BLK_CGROUP
>  	struct request_queue		*throttle_queue;
>  #endif

It looks dirty. Do we really have no other option to pass it down?

Maybe pass down the list via zap_details and call a new rmap remove
helper if the list is present?
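
A rough sketch of that direction (the split_list field and the helper
name below are made up for illustration, not code from the patch):

    /* include/linux/mm.h */
    struct zap_details {
        /* ... existing fields ... */
        struct list_head *split_list;  /* collect partially unmapped THPs */
    };

    /* mm/rmap.c: hypothetical helper the zap path would call instead of
     * page_remove_rmap() when details->split_list is set, queueing the
     * THP on the caller's local list rather than on the global deferred
     * split queue */
    void page_remove_rmap_defer_split(struct page *page,
                                      struct list_head *split_list)
    {
        page_remove_rmap(page, false);
        if (PageTransCompound(page))
            list_add_tail(page_deferred_list(compound_head(page)),
                          split_list);
    }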

>  
> +void split_local_deferred_list(struct list_head *defer_list)
> +{
> +	struct list_head *pos, *next;
> +	struct page *page;
> +
> +	/* First iteration for split. */
> +	list_for_each_safe(pos, next, defer_list) {
> +		page = list_entry((void *)pos, struct page, deferred_list);
> +		page = compound_head(page);
> +
> +		if (!trylock_page(page))
> +			continue;
> +
> +		if (split_huge_page(page)) {
> +			unlock_page(page);
> +			continue;
> +		}
> +		/* split_huge_page() removes page from list on success */
> +		unlock_page(page);
> +
> +		/* corresponding get in deferred_split_huge_page. */
> +		put_page(page);
> +	}
> +
> +	/* Second iteration to putback failed pages. */
> +	list_for_each_safe(pos, next, defer_list) {
> +		struct deferred_split *ds_queue;
> +		unsigned long flags;
> +
> +		page = list_entry((void *)pos, struct page, deferred_list);
> +		page = compound_head(page);
> +		ds_queue = get_deferred_split_queue(page);
> +
> +		spin_lock_irqsave(&ds_queue->split_queue_lock, flags);
> +		list_move(page_deferred_list(page), &ds_queue->split_queue);
> +		spin_unlock_irqrestore(&ds_queue->split_queue_lock, flags);
> +
> +		/* corresponding get in deferred_split_huge_page. */
> +		put_page(page);
> +	}
> +}

Looks like a lot of copy-paste from deferred_split_scan(). Can we get them
consolidated?
Shakeel Butt Nov. 22, 2021, 3:42 a.m. UTC | #4
On Sun, Nov 21, 2021 at 4:50 PM Kirill A. Shutemov <kirill@shutemov.name> wrote:
>
[...]
>
> Have you considered the impact on short-lived tasks where paying the
> splitting tax would hurt performance without any benefit? Maybe a
> separate madvise operation is needed? I dunno.
>

Do you have a concrete example of such short-lived applications doing
MADV_DONTNEED? I can try to get some numbers to measure the impact.

Regarding the new advice option, I did give it some thought but decided
against it, on the grounds that we should not expose low level kernel
implementation details to users through a stable API.

[...]
> > diff --git a/include/linux/sched.h b/include/linux/sched.h
> > index 9d27fd0ce5df..a984bb6509d9 100644
> > --- a/include/linux/sched.h
> > +++ b/include/linux/sched.h
> > @@ -1412,6 +1412,10 @@ struct task_struct {
> >       struct mem_cgroup               *active_memcg;
> >  #endif
> >
> > +#ifdef CONFIG_TRANSPARENT_HUGEPAGE
> > +     struct list_head                *deferred_split_list;
> > +#endif
> > +
> >  #ifdef CONFIG_BLK_CGROUP
> >       struct request_queue            *throttle_queue;
> >  #endif
>
> It looks dirty. Do we really have no other option to pass it down?
>
> Maybe pass down the list via zap_details and call a new rmap remove
> helper if the list is present?
>

We already have precedent for using this technique in other cases, but
let me take a stab at passing the list through zap_details and see how
that looks.

> >
> > +void split_local_deferred_list(struct list_head *defer_list)
[...]
> Looks like a lot of copy-paste from deferred_split_scan(). Can we get them
> consolidated?

I will see what I can do.

Thanks for the review.
Shakeel
Matthew Wilcox Nov. 22, 2021, 4:56 a.m. UTC | #5
On Sat, Nov 20, 2021 at 12:12:30PM -0800, Shakeel Butt wrote:
> Many applications do sophisticated management of their heap memory for
> better performance at low cost. We have a bunch of such applications
> running in production; examples include caching and data storage
> services. These applications keep their hot data on THPs for better
> performance and release the cold data through MADV_DONTNEED to keep
> the memory cost low.
>
> The kernel defers the split and release of THPs until there is memory
> pressure. This complicates the memory management of these
> sophisticated applications, which then need to look into low level
> kernel handling of THPs to better gauge their headroom for expansion.
> In addition, these applications are very latency sensitive and would
> prefer not to face memory reclaim due to the non-deterministic nature
> of reclaim.
>
> This patch lets such applications not worry about the low level
> handling of THPs in the kernel and splits the THPs synchronously on
> MADV_DONTNEED.

I've been wondering about whether this is really the right strategy
(and this goes wider than just this one new case).

We chose to use a 2MB page here, based on whatever heuristics are
currently in play.  Now userspace is telling us we were wrong and should
have used smaller pages.  2MB pages are precious, and we currently
have one.  Surely it is better to migrate the still-valid contents of
this 2MB page to smaller pages, and then free the 2MB page as a single
unit than it is to fragment this 2MB page into smaller chunks, and keep
using some of it, virtually guaranteeing this particular 2MB page can't
be reassembled without significant work?
David Hildenbrand Nov. 22, 2021, 8:32 a.m. UTC | #6
On 20.11.21 21:12, Shakeel Butt wrote:
> Many applications do sophisticated management of their heap memory for
> better performance at low cost. We have a bunch of such applications
> running in production; examples include caching and data storage
> services. These applications keep their hot data on THPs for better
> performance and release the cold data through MADV_DONTNEED to keep
> the memory cost low.
>
> The kernel defers the split and release of THPs until there is memory
> pressure. This complicates the memory management of these
> sophisticated applications, which then need to look into low level
> kernel handling of THPs to better gauge their headroom for expansion.

Can you elaborate a bit on that point? What exactly does such an
application do? I would have assumed that it's mostly transparent for
applications.

> In
> addition, these applications are very latency sensitive and would
> prefer not to face memory reclaim due to the non-deterministic nature
> of reclaim.

That makes sense.

> 
> This patch lets such applications not worry about the low level
> handling of THPs in the kernel and splits the THPs synchronously on
> MADV_DONTNEED.

The main user I'm concerned about is virtio-balloon, which ends up
discarding VM memory via MADV_DONTNEED when inflating the balloon in the
guest in 4k granularity, but also during "free page reporting"
continuously when e.g., a 2MiB page becomes free in the guest. We want
both activities to be fast, and especially during "free page reporting",
to defer any heavy work.

Do we have a performance evaluation of how much overhead is added,
e.g., for a single 4k MADV_DONTNEED call on a THP, or for a
MADV_DONTNEED call that covers the whole THP?
David Hildenbrand Nov. 22, 2021, 9:19 a.m. UTC | #7
On 22.11.21 05:56, Matthew Wilcox wrote:
> On Sat, Nov 20, 2021 at 12:12:30PM -0800, Shakeel Butt wrote:
>> Many applications do sophisticated management of their heap memory for
>> better performance at low cost. We have a bunch of such applications
>> running in production; examples include caching and data storage
>> services. These applications keep their hot data on THPs for better
>> performance and release the cold data through MADV_DONTNEED to keep
>> the memory cost low.
>>
>> The kernel defers the split and release of THPs until there is memory
>> pressure. This complicates the memory management of these
>> sophisticated applications, which then need to look into low level
>> kernel handling of THPs to better gauge their headroom for expansion.
>> In addition, these applications are very latency sensitive and would
>> prefer not to face memory reclaim due to the non-deterministic nature
>> of reclaim.
>>
>> This patch lets such applications not worry about the low level
>> handling of THPs in the kernel and splits the THPs synchronously on
>> MADV_DONTNEED.
> 
> I've been wondering about whether this is really the right strategy
> (and this goes wider than just this one new case).
> 
> We chose to use a 2MB page here, based on whatever heuristics are
> currently in play.  Now userspace is telling us we were wrong and should
> have used smaller pages.

IIUC, not necessarily, unfortunately.

User space might be discarding the whole 2MB either via a single call
(MADV_DONTNEED a 2MB range as done by virtio-balloon with "free page
reporting" or by virtio-mem in QEMU). In that case, there is nothing to
migrate and we were not doing anything wrong.

But more extreme, user space might be discarding the whole THP in small
pieces over a short period of time. This happens, for example, when a VM
inflates the memory balloon via virtio-balloon. All inflation requests
are 4k, resulting in 4k MADV_DONTNEED calls. If we end up inflating a THP
range inside of the VM, mapping to a THP range inside the hypervisor,
we'll essentially free a THP in the hypervisor piece by piece using
individual MADV_DONTNEED calls -- this happens frequently. Something
similar can happen when de-fragmentation inside the VM "moves around"
inflated 4k pages piece by piece to essentially form a huge inflated
range -- this happens less frequently as of now. In both cases,
migration is counter-productive, as we're just about to free the whole
range either way.

(yes, there are ways to optimize, for example using hugepage ballooning
or merging MADV_DONTNEED calls in the hypervisor, but what I described
is what we currently implement in hypervisors like QEMU, because there
are corner cases for everything)

Long story short: it's hard to tell what will happen next based on a
single MADV_DONTNEED call. Page compaction, in comparison, doesn't care
and optimizes the layout as it observes it.
Shakeel Butt Nov. 22, 2021, 6:40 p.m. UTC | #8
On Mon, Nov 22, 2021 at 12:32 AM David Hildenbrand <david@redhat.com> wrote:
>
> On 20.11.21 21:12, Shakeel Butt wrote:
> > Many applications do sophisticated management of their heap memory for
> > better performance at low cost. We have a bunch of such applications
> > running in production; examples include caching and data storage
> > services. These applications keep their hot data on THPs for better
> > performance and release the cold data through MADV_DONTNEED to keep
> > the memory cost low.
> >
> > The kernel defers the split and release of THPs until there is memory
> > pressure. This complicates the memory management of these
> > sophisticated applications, which then need to look into low level
> > kernel handling of THPs to better gauge their headroom for expansion.
>
> Can you elaborate a bit on that point? What exactly does such an
> application do? I would have assumed that it's mostly transparent for
> applications.
>

The application monitors its cgroup usage to decide if it can expand
its memory footprint or should release some (unneeded/cold) buffers. It
issues madvise(MADV_DONTNEED) to release the memory, which basically
puts the THP onto the deferred split list. These deferred THPs are
still charged to the cgroup, which leads to a bloated usage reading by
the application and thus to wrong decisions. Internally we added a
cgroup interface to trigger the split of deferred THPs for that cgroup,
but this is hacky and exposes kernel internals to users. I want to
solve this problem in a more general way for the users.

> > In
> > addition, these applications are very latency sensitive and would
> > prefer not to face memory reclaim due to the non-deterministic nature
> > of reclaim.
>
> That makes sense.
>
> >
> > This patch lets such applications not worry about the low level
> > handling of THPs in the kernel and splits the THPs synchronously on
> > MADV_DONTNEED.
>
> The main user I'm concerned about is virtio-balloon, which ends up
> discarding VM memory via MADV_DONTNEED when inflating the balloon in the
> guest in 4k granularity, but also during "free page reporting"
> continuously when e.g., a 2MiB page becomes free in the guest. We want
> both activities to be fast, and especially during "free page reporting",
> to defer any heavy work.

Thanks for the info. What source does virtio-balloon use for free pages?

>
> Do we have a performance evaluation of how much overhead is added,
> e.g., for a single 4k MADV_DONTNEED call on a THP, or for a
> MADV_DONTNEED call that covers the whole THP?

I did a simple benchmark of madvise(MADV_DONTNEED) on 10000 THPs on
x86 for both settings you suggested. I don't see any statistically
significant difference with and without the patch. Let me know if you
want me to try something else.

Thanks for the review.
Shakeel
David Hildenbrand Nov. 22, 2021, 6:59 p.m. UTC | #9
On 22.11.21 19:40, Shakeel Butt wrote:
> On Mon, Nov 22, 2021 at 12:32 AM David Hildenbrand <david@redhat.com> wrote:
>>
>> On 20.11.21 21:12, Shakeel Butt wrote:
>>> Many applications do sophisticated management of their heap memory for
>>> better performance at low cost. We have a bunch of such applications
>>> running in production; examples include caching and data storage
>>> services. These applications keep their hot data on THPs for better
>>> performance and release the cold data through MADV_DONTNEED to keep
>>> the memory cost low.
>>>
>>> The kernel defers the split and release of THPs until there is memory
>>> pressure. This complicates the memory management of these
>>> sophisticated applications, which then need to look into low level
>>> kernel handling of THPs to better gauge their headroom for expansion.
>>
>> Can you elaborate a bit on that point? What exactly does such an
>> application do? I would have assumed that it's mostly transparent for
>> applications.
>>
> 
> The application monitors its cgroup usage to decide if it can expand
> its memory footprint or should release some (unneeded/cold) buffers. It
> issues madvise(MADV_DONTNEED) to release the memory, which basically
> puts the THP onto the deferred split list. These deferred THPs are
> still charged to the cgroup, which leads to a bloated usage reading by
> the application and thus to wrong decisions. Internally we added a
> cgroup interface to trigger the split of deferred THPs for that cgroup,
> but this is hacky and exposes kernel internals to users. I want to
> solve this problem in a more general way for the users.

Thanks for the details, that makes sense to me. It's essentially like
another kernel buffer charged to the process, only reclaimed on memory
reclaim.

(can we add that to the patch description?)

> 
>>> In
>>> addition, these applications are very latency sensitive and would
>>> prefer not to face memory reclaim due to the non-deterministic nature
>>> of reclaim.
>>
>> That makes sense.
>>
>>>
>>> This patch lets such applications not worry about the low level
>>> handling of THPs in the kernel and splits the THPs synchronously on
>>> MADV_DONTNEED.
>>
>> The main user I'm concerned about is virtio-balloon, which ends up
>> discarding VM memory via MADV_DONTNEED when inflating the balloon in the
>> guest in 4k granularity, but also during "free page reporting"
>> continuously when e.g., a 2MiB page becomes free in the guest. We want
>> both activities to be fast, and especially during "free page reporting",
>> to defer any heavy work.
> 
> Thanks for the info. What source does virtio-balloon use for free pages?

So inside the guest (drivers/virtio/virtio_balloon.c), we can see:

1. Balloon inflation. We essentially allocate a 4k page and send the PFN
to the hypervisor. The hypervisor will MADV_DONTNEED that memory.

2. Free page reporting. We pull some free higher-order (e.g., 2 MB on
x86-64) pages from the page allocator and report the PFNs to the
hypervisor. The hypervisor will MADV_DONTNEED the regions. Once
reported, we put the free pages back on the free list in the page
allocator, from where they can be re-allocated, potentially immediately.

On some archs (especially aarch64), we can see free page reporting at
sub-THP granularity. But usually we're dealing with THP granularity.

> 
>>
>> Do we have a performance evaluation of how much overhead is added,
>> e.g., for a single 4k MADV_DONTNEED call on a THP, or for a
>> MADV_DONTNEED call that covers the whole THP?
> 
> I did a simple benchmark of madvise(MADV_DONTNEED) on 10000 THPs on
> x86 for both settings you suggested. I don't see any statistically
> significant difference with and without the patch. Let me know if you
> want me to try something else.

Awesome, thanks for benchmarking. I did not check, but I assume on
re-access, we won't actually re-use pages from the underlying, partially
unmapped, THP, correct? So after MADV_DONTNEED, the zapped sub-pages are
essentially lost until reclaimed by splitting the THP? If they could get
reused, there would be value in the deferred split when partially
unmapping a THP.


I do wonder which purpose the deferred split serves nowadays at all.
Fortunately, there is documentation: Documentation/vm/transhuge.rst:

"
Unmapping part of THP (with munmap() or other way) is not going to free
memory immediately. Instead, we detect that a subpage of THP is not in
use in page_remove_rmap() and queue the THP for splitting if memory
pressure comes. Splitting will free up unused subpages.

Splitting the page right away is not an option due to locking context in
the place where we can detect partial unmap. It also might be
counterproductive since in many cases partial unmap happens during
exit(2) if a THP crosses a VMA boundary.

The function deferred_split_huge_page() is used to queue a page for
splitting. The splitting itself will happen when we get memory pressure
via shrinker interface.
"

I do wonder what these locking contexts are exactly, and if we could
also do the same thing on ordinary munmap -- because I assume it can be
similarly problematic for some applications. The "exit()" case might
indeed be interesting, but I really do wonder if this is even observable
in actual numbers: I'm not so sure about the "many cases" but I might be
wrong, of course.
Shakeel Butt Nov. 23, 2021, 1:20 a.m. UTC | #10
On Mon, Nov 22, 2021 at 10:59 AM David Hildenbrand <david@redhat.com> wrote:
>
[...]
>
> Thanks for the details, that makes sense to me. It's essentially like
> another kernel buffer charged to the process, only reclaimed on memory
> reclaim.
>
> (can we add that to the patch description?)
>

Sure.

[...]
> >
> > I did a simple benchmark of madvise(MADV_DONTNEED) on 10000 THPs on
> > x86 for both settings you suggested. I don't see any statistically
> > significant difference with and without the patch. Let me know if you
> > want me to try something else.
>
> Awesome, thanks for benchmarking. I did not check, but I assume on
> re-access, we won't actually re-use pages from the underlying, partially
> unmapped, THP, correct?

Correct.

> So after MADV_DONTNEED, the zapped sub-pages are
> essentially lost until reclaimed by splitting the THP?

Yes.

> If they could get
> reused, there would be value in the deferred split when partially
> unmapping a THP.
>
>
> I do wonder which purpose the deferred split serves nowadays at all.
> Fortunately, there is documentation: Documentation/vm/transhuge.rst:
>
> "
> Unmapping part of THP (with munmap() or other way) is not going to free
> memory immediately. Instead, we detect that a subpage of THP is not in
> use in page_remove_rmap() and queue the THP for splitting if memory
> pressure comes. Splitting will free up unused subpages.
>
> Splitting the page right away is not an option due to locking context in
> the place where we can detect partial unmap. It also might be
> counterproductive since in many cases partial unmap happens during
> exit(2) if a THP crosses a VMA boundary.
>
> The function deferred_split_huge_page() is used to queue a page for
> splitting. The splitting itself will happen when we get memory pressure
> via shrinker interface.
> "
>
> I do wonder what these locking contexts are exactly, and if we could
> also do the same thing on ordinary munmap -- because I assume it can be
> similarly problematic for some applications.

This is a good question regarding munmap. One main difference is
munmap takes mmap_lock in write mode and usually performance critical
applications avoid such operations.

> The "exit()" case might
> indeed be interesting, but I really do wonder if this is even observable
> in actual numbers: I'm not so sure about the "many cases" but I might be
> wrong, of course.

I am not worried about the exit(). The whole THP will get freed and be
removed from the deferred list as well. Note that the deferred list does
not hold a reference to the THP; instead there is a hook in the THP
destructor.

thanks,
Shakeel
David Hildenbrand Nov. 23, 2021, 4:56 p.m. UTC | #11
>> I do wonder which purpose the deferred split serves nowadays at all.
>> Fortunately, there is documentation: Documentation/vm/transhuge.rst:
>>
>> "
>> Unmapping part of THP (with munmap() or other way) is not going to free
>> memory immediately. Instead, we detect that a subpage of THP is not in
>> use in page_remove_rmap() and queue the THP for splitting if memory
>> pressure comes. Splitting will free up unused subpages.
>>
>> Splitting the page right away is not an option due to locking context in
>> the place where we can detect partial unmap. It also might be
>> counterproductive since in many cases partial unmap happens during
>> exit(2) if a THP crosses a VMA boundary.
>>
>> The function deferred_split_huge_page() is used to queue a page for
>> splitting. The splitting itself will happen when we get memory pressure
>> via shrinker interface.
>> "
>>
>> I do wonder what these locking contexts are exactly, and if we could
>> also do the same thing on ordinary munmap -- because I assume it can be
>> similarly problematic for some applications.
> 
> This is a good question regarding munmap. One main difference is
> munmap takes mmap_lock in write mode and usually performance critical
> applications avoid such operations.

Maybe we can extend it to most page zapping, if that makes things simpler.

> 
>> The "exit()" case might
>> indeed be interesting, but I really do wonder if this is even observable
>> in actual numbers: I'm not so sure about the "many cases" but I might be
>> wrong, of course.
> 
> I am not worried about the exit(). The whole THP will get freed and be
> removed from the deferred list as well. Note that the deferred list does
> not hold a reference to the THP; instead there is a hook in the THP
> destructor.

Yes, you're right. We'll run into the destructor either way.
Shakeel Butt Nov. 23, 2021, 5:17 p.m. UTC | #12
On Tue, Nov 23, 2021 at 8:57 AM David Hildenbrand <david@redhat.com> wrote:
>
[...]
> >>
> >> I do wonder what these locking contexts are exactly, and if we could
> >> also do the same thing on ordinary munmap -- because I assume it can be
> >> similarly problematic for some applications.
> >
> > This is a good question regarding munmap. One main difference is
> > munmap takes mmap_lock in write mode and usually performance critical
> > applications avoid such operations.
>
> Maybe we can extend it to most page zapping, if that makes things simpler.
>

Do you mean doing sync THP split for most of the page zapping functions
(but only if that makes things simpler)?
David Hildenbrand Nov. 23, 2021, 5:20 p.m. UTC | #13
On 23.11.21 18:17, Shakeel Butt wrote:
> On Tue, Nov 23, 2021 at 8:57 AM David Hildenbrand <david@redhat.com> wrote:
>>
> [...]
>>>>
>>>> I do wonder what these locking contexts are exactly, and if we could
>>>> also do the same thing on ordinary munmap -- because I assume it can be
>>>> similarly problematic for some applications.
>>>
>>> This is a good question regarding munmap. One main difference is
>>> munmap takes mmap_lock in write mode and usually performance critical
>>> applications avoid such operations.
>>
>> Maybe we can extend it to most page zapping, if that makes things simpler.
>>
> 
> Do you mean doing sync THP split for most of the page zapping functions
> (but only if that makes things simpler)?
> 

Yes -- if there are no downsides.
Shakeel Butt Nov. 23, 2021, 5:24 p.m. UTC | #14
On Tue, Nov 23, 2021 at 9:20 AM David Hildenbrand <david@redhat.com> wrote:
>
> On 23.11.21 18:17, Shakeel Butt wrote:
> > On Tue, Nov 23, 2021 at 8:57 AM David Hildenbrand <david@redhat.com> wrote:
> >>
> > [...]
> >>>>
> >>>> I do wonder what these locking contexts are exactly, and if we could
> >>>> also do the same thing on ordinary munmap -- because I assume it can be
> >>>> similarly problematic for some applications.
> >>>
> >>> This is a good question regarding munmap. One main difference is
> >>> munmap takes mmap_lock in write mode and usually performance critical
> >>> applications avoid such operations.
> >>
> >> Maybe we can extend it to most page zapping, if that makes things simpler.
> >>
> >
> > Do you mean doing sync THP split for most of the page zapping functions
> > (but only if that makes things simpler)?
> >
>
> Yes -- if there are no downsides.
>

I will try. At the moment the assumption of "Not null zap_details
implies leave swap entries" is giving me a headache.

Thanks for the suggestions and your time.
David Hildenbrand Nov. 23, 2021, 5:26 p.m. UTC | #15
On 23.11.21 18:24, Shakeel Butt wrote:
> On Tue, Nov 23, 2021 at 9:20 AM David Hildenbrand <david@redhat.com> wrote:
>>
>> On 23.11.21 18:17, Shakeel Butt wrote:
>>> On Tue, Nov 23, 2021 at 8:57 AM David Hildenbrand <david@redhat.com> wrote:
>>>>
>>> [...]
>>>>>>
>>>>>> I do wonder what these locking contexts are exactly, and if we could
>>>>>> also do the same thing on ordinary munmap -- because I assume it can be
>>>>>> similarly problematic for some applications.
>>>>>
>>>>> This is a good question regarding munmap. One main difference is
>>>>> munmap takes mmap_lock in write mode and usually performance critical
>>>>> applications avoid such operations.
>>>>
>>>> Maybe we can extend it to most page zapping, if that makes things simpler.
>>>>
>>>
>>> Do you mean doing sync THP split for most of the page zapping functions
>>> (but only if that makes things simpler)?
>>>
>>
>> Yes -- if there are no downsides.
>>
> 
> I will try. At the moment the assumption of "Not null zap_details
> implies leave swap entries" is giving me a headache.

Not only you, did you stumble over

https://lkml.kernel.org/r/20211115134951.85286-1-peterx@redhat.com

already?
Shakeel Butt Nov. 23, 2021, 5:28 p.m. UTC | #16
On Tue, Nov 23, 2021 at 9:26 AM David Hildenbrand <david@redhat.com> wrote:
>
> On 23.11.21 18:24, Shakeel Butt wrote:
> > On Tue, Nov 23, 2021 at 9:20 AM David Hildenbrand <david@redhat.com> wrote:
> >>
> >> On 23.11.21 18:17, Shakeel Butt wrote:
> >>> On Tue, Nov 23, 2021 at 8:57 AM David Hildenbrand <david@redhat.com> wrote:
> >>>>
> >>> [...]
> >>>>>>
> >>>>>> I do wonder what these locking contexts are exactly, and if we could
> >>>>>> also do the same thing on ordinary munmap -- because I assume it can be
> >>>>>> similarly problematic for some applications.
> >>>>>
> >>>>> This is a good question regarding munmap. One main difference is
> >>>>> munmap takes mmap_lock in write mode and usually performance critical
> >>>>> applications avoid such operations.
> >>>>
> >>>> Maybe we can extend it to most page zapping, if that makes things simpler.
> >>>>
> >>>
> >>> Do you mean doing sync THP split for most of the page zapping functions
> >>> (but only if that makes things simpler)?
> >>>
> >>
> >> Yes -- if there are no downsides.
> >>
> >
> > I will try. At the moment the assumption of "Not null zap_details
> > implies leave swap entries" is giving me a headache.
>
> Not only you, did you stumble over
>
> https://lkml.kernel.org/r/20211115134951.85286-1-peterx@redhat.com
>
> already?
>

Oh thanks for the pointer. I missed that. I will take a look. Thanks again.
Peter Xu Nov. 25, 2021, 10:09 a.m. UTC | #17
On Tue, Nov 23, 2021 at 09:28:34AM -0800, Shakeel Butt wrote:
> On Tue, Nov 23, 2021 at 9:26 AM David Hildenbrand <david@redhat.com> wrote:
> >
> > On 23.11.21 18:24, Shakeel Butt wrote:
> > > On Tue, Nov 23, 2021 at 9:20 AM David Hildenbrand <david@redhat.com> wrote:
> > >>
> > >> On 23.11.21 18:17, Shakeel Butt wrote:
> > >>> On Tue, Nov 23, 2021 at 8:57 AM David Hildenbrand <david@redhat.com> wrote:
> > >>>>
> > >>> [...]
> > >>>>>>
> > >>>>>> I do wonder what these locking contexts are exactly, and if we could
> > >>>>>> also do the same thing on ordinary munmap -- because I assume it can be
> > >>>>>> similarly problematic for some applications.
> > >>>>>
> > >>>>> This is a good question regarding munmap. One main difference is
> > >>>>> munmap takes mmap_lock in write mode and usually performance critical
> > >>>>> applications avoid such operations.
> > >>>>
> > >>>> Maybe we can extend it to most page zapping, if that makes things simpler.
> > >>>>
> > >>>
> > >>> Do you mean doing sync THP split for most of the page zapping functions
> > >>> (but only if that makes things simpler)?
> > >>>
> > >>
> > >> Yes -- if there are no downsides.
> > >>
> > >
> > > I will try. At the moment the assumption of "Not null zap_details
> > > implies leave swap entries" is giving me a headache.
> >
> > Not only you, did you stumble over
> >
> > https://lkml.kernel.org/r/20211115134951.85286-1-peterx@redhat.com

(thanks for raising this, David)

> >
> > already?
> >
> 
> Oh thanks for the pointer. I missed that. I will take a look. Thanks again.

Hi, Shakeel, 

I saw your v2 has started to pass in zap_details, so we need to know the
side effect of that skip-swap-entry thing, because with your v2 code
munmap() will start to skip swap entries too (while it did not before).

I saw that you didn't mention this in the v2 patch, either in the commit
message or in the code, so I am not sure whether you dug into that. I
think it needs some double checking (or feel free to start this digging
by reviewing my small patch above :).

Thanks,
Peter Xu Nov. 25, 2021, 10:24 a.m. UTC | #18
On Mon, Nov 22, 2021 at 10:40:54AM -0800, Shakeel Butt wrote:
> > Do we have a performance evaluation of how much overhead is added,
> > e.g., for a single 4k MADV_DONTNEED call on a THP, or for a
> > MADV_DONTNEED call that covers the whole THP?
> 
> I did a simple benchmark of madvise(MADV_DONTNEED) on 10000 THPs on
> x86 for both settings you suggested. I don't see any statistically
> significant difference with and without the patch. Let me know if you
> want me to try something else.

I'm a bit surprised that the sync thp split didn't bring any extra
overhead.

"Unmap whole thp" is understandable from that pov, because afaict that
won't even trigger any thp split anyway, even delayed, in the simplest
case where only this process mapped this thp, and mapped it once.

For "unmap 4k upon thp" IIUC that's the worst case, and zapping the 4k
itself should be fast; what I don't understand is that a thp split
requires all the hand-made work of copying thp flags into the small
pages and so on, so I thought there should at least be some measurable
overhead.  Shakeel, could there be something overlooked in the test, or
is it me that overlooked something?

I had the same concern as what Kirill/Matthew raised in the other thread
- I'm worried that proactively splitting simply because any 4k page is
zapped might quickly break up the 2m thps in the system, and I'm not sure
whether it'll aggravate the fragmentation of the system memory in
general.  I'm also not sure whether that's ideal for some very common
workloads that frequently use DONTNEED to proactively drop some pages.

To me, the old deferred split has a point in that it'll only be done
when the system or the cgroup is low on memory; that means we're in an
extreme case, so we'd better start to worry about page allocation
failures rather than the number of thps and memory performance.  v2 even
takes unmap() into account, so that'll further amplify that effect, imho.

I'm wondering whether MADV_SPLIT would make more sense, so as to keep
the old DONTNEED/unmap behaviors; however, before that I think I should
understand the test results first, because besides the missing 2m pages
that'll be another important factor for "whether a new interface is more
welcome" from a perf pov.

Thanks,
David Hildenbrand Nov. 25, 2021, 10:32 a.m. UTC | #19
On 25.11.21 11:24, Peter Xu wrote:
> On Mon, Nov 22, 2021 at 10:40:54AM -0800, Shakeel Butt wrote:
>>> Do we have a performance evaluation of how much overhead is added,
>>> e.g., for a single 4k MADV_DONTNEED call on a THP, or for a
>>> MADV_DONTNEED call that covers the whole THP?
>>
>> I did a simple benchmark of madvise(MADV_DONTNEED) on 10000 THPs on
>> x86 for both settings you suggested. I don't see any statistically
>> significant difference with and without the patch. Let me know if you
>> want me to try something else.
> 
> I'm a bit surprised that the sync thp split didn't bring any extra
> overhead.
>
> "Unmap whole thp" is understandable from that pov, because afaict that
> won't even trigger any thp split anyway, even delayed, in the simplest
> case where only this process mapped this thp, and mapped it once.
>
> For "unmap 4k upon thp" IIUC that's the worst case, and zapping the 4k
> itself should be fast; what I don't understand is that a thp split
> requires all the hand-made work of copying thp flags into the small
> pages and so on, so I thought there should at least be some measurable
> overhead.  Shakeel, could there be something overlooked in the test, or
> is it me that overlooked something?
> 
> I had the same concern as what Kirill/Matthew raised in the other thread
> - I'm worried that proactively splitting simply because any 4k page is
> zapped might quickly break up the 2m thps in the system, and I'm not sure
> whether it'll aggravate the fragmentation of the system memory in
> general.  I'm also not sure whether that's ideal for some very common
> workloads that frequently use DONTNEED to proactively drop some pages.

The pageblock corresponding to the THP is movable. So (unless we start
spilling unmovable allocations into movable pageblocks) we'd only place
movable allocations in there. Compaction will be able to migrate to
re-create a free THP.

In contrast, I think compaction will happily skip over the THP and
ignore it, because it has no clue that the THP could be repurposed by
split+migrate (at least I am not aware of code that does it).

Unless I am missing something, with the above in mind it could make
sense to split as soon as possible, even before we're under memory
pressure -- for example, for proactive compaction.

[proactive compaction could try splitting first as well I think]
Shakeel Butt Nov. 25, 2021, 5:14 p.m. UTC | #20
On Thu, Nov 25, 2021 at 2:09 AM Peter Xu <peterx@redhat.com> wrote:
>
[...]
>
> Hi, Shakeel,
>
> I saw your v2 has started to pass in zap_details, so we need to know the
> side effect of that skip-swap-entry thing, because with your v2 code
> munmap() will start to skip swap entries too (while it did not before).
>
> I saw that you didn't mention this in the v2 patch, either in the commit
> message or in the code, so I am not sure whether you dug into that. I
> think it needs some double checking (or feel free to start this digging
> by reviewing my small patch above :).
>

I remember mentioning in the patch that this has a dependency on your
rfc patch. Just checked again and yeah, I only mentioned it in the
'changes since v1' section. I will add it more explicitly in the
following versions.

I will take a close look at your patch as well.
Peter Xu Nov. 26, 2021, midnight UTC | #21
On Thu, Nov 25, 2021 at 09:14:30AM -0800, Shakeel Butt wrote:
> On Thu, Nov 25, 2021 at 2:09 AM Peter Xu <peterx@redhat.com> wrote:
> >
> [...]
> >
> > Hi, Shakeel,
> >
> > I saw your v2 has started to pass in zap_details, so we need to know the
> > side effect of that skip-swap-entry thing, because with your v2 code
> > munmap() will start to skip swap entries too (while it did not before).
> >
> > I saw that you didn't mention this in the v2 patch, either in the commit
> > message or in the code, so I am not sure whether you dug into that. I
> > think it needs some double checking (or feel free to start this digging
> > by reviewing my small patch above :).
> >
> 
> I remember mentioning in the patch that this has a dependency on your
> rfc patch. Just checked again and yeah, I only mentioned it in the
> 'changes since v1' section. I will add it more explicitly in the
> following versions.

Apologies for missing that section.  Yeah, depending on that patch should work.

> 
> I will take a close look at your patch as well.

Thanks a lot,
Peter Xu Nov. 26, 2021, 2:52 a.m. UTC | #22
On Thu, Nov 25, 2021 at 11:32:08AM +0100, David Hildenbrand wrote:
> On 25.11.21 11:24, Peter Xu wrote:
> > On Mon, Nov 22, 2021 at 10:40:54AM -0800, Shakeel Butt wrote:
> >>> Do we have a performance evaluation of how much overhead is added,
> >>> e.g., for a single 4k MADV_DONTNEED call on a THP, or for a
> >>> MADV_DONTNEED call that covers the whole THP?
> >>
> >> I did a simple benchmark of madvise(MADV_DONTNEED) on 10000 THPs on
> >> x86 for both settings you suggested. I don't see any statistically
> >> significant difference with and without the patch. Let me know if you
> >> want me to try something else.
> > 
> > I'm a bit surprised that the sync thp split didn't bring any extra
> > overhead.
> >
> > "Unmap whole thp" is understandable from that pov, because afaict that
> > won't even trigger any thp split anyway, even delayed, in the simplest
> > case where only this process mapped this thp, and mapped it once.
> >
> > For "unmap 4k upon thp" IIUC that's the worst case, and zapping the 4k
> > itself should be fast; what I don't understand is that a thp split
> > requires all the hand-made work of copying thp flags into the small
> > pages and so on, so I thought there should at least be some measurable
> > overhead.  Shakeel, could there be something overlooked in the test, or
> > is it me that overlooked something?
> > 
> > I had the same concern as what Kirill/Matthew raised in the other thread
> > - I'm worried that proactively splitting simply because any 4k page is
> > zapped might quickly break up the 2m thps in the system, and I'm not sure
> > whether it'll aggravate the fragmentation of the system memory in
> > general.  I'm also not sure whether that's ideal for some very common
> > workloads that frequently use DONTNEED to proactively drop some pages.
> 
> The pageblock corresponding to the THP is movable. So (unless we start
> spilling unmovable allocations into movable pageblocks) we'd only place
> movable allocations in there. Compaction will be able to migrate to
> re-create a free THP.
> 
> In contrast, I think compaction will happily skip over the THP and
> ignore it, because it has no clue that the THP could be repurposed by
> split+migrate (at least I am not aware of code that does it).
> 
> Unless I am missing something, with the above in mind it could make
> sense to split as soon as possible, even before we're under memory
> pressure -- for example, for proactive compaction.
> 
> [proactive compaction could try splitting first as well I think]

But we can't rely on proactive compaction for rapid operations, because it's
still adding overhead to the overall system by split+merge, right?

+compaction_proactiveness
+========================
+ ...
+Note that compaction has a non-trivial system-wide impact as pages
+belonging to different processes are moved around, which could also lead
+to latency spikes in unsuspecting applications. The kernel employs
+various heuristics to avoid wasting CPU cycles if it detects that
+proactive compaction is not being effective.

Delaying split makes sense to me because after all the kernel is not aware of
the userspace's preference, so the best thing is to do nothing until necessary.

Proactively splitting thps in dontneed/unmap adds the assumption that
userspace wants to break up the pages by default.  It's 100% true for
Shakeel's use case, but I'm not sure whether it will always be true.
That's why I thought maybe a new interface is more proper, so we at
least won't break anyone by accident.
Shakeel Butt Nov. 26, 2021, 3:21 a.m. UTC | #23
On Thu, Nov 25, 2021 at 2:24 AM Peter Xu <peterx@redhat.com> wrote:
>
> On Mon, Nov 22, 2021 at 10:40:54AM -0800, Shakeel Butt wrote:
> > > Do we have a performance evaluation of how much overhead is added,
> > > e.g., for a single 4k MADV_DONTNEED call on a THP, or for a
> > > MADV_DONTNEED call that covers the whole THP?
> >
> > I did a simple benchmark of madvise(MADV_DONTNEED) on 10000 THPs on
> > x86 for both settings you suggested. I don't see any statistically
> > significant difference with and without the patch. Let me know if you
> > want me to try something else.
>
> I'm a bit surprised that the sync thp split didn't bring any extra
> overhead.
>
> "Unmap whole thp" is understandable from that pov, because afaict that
> won't even trigger any thp split anyway, even delayed, in the simplest
> case where only this process mapped this thp, and mapped it once.
>
> For "unmap 4k upon thp" IIUC that's the worst case, and zapping the 4k
> itself should be fast; what I don't understand is that a thp split
> requires all the hand-made work of copying thp flags into the small
> pages and so on, so I thought there should at least be some measurable
> overhead.  Shakeel, could there be something overlooked in the test, or
> is it me that overlooked something?
>

Thanks for making me rerun this, and yes, indeed I had a very silly bug
in the benchmark code (i.e. madvising the same page for the whole loop);
with it fixed, this is indeed several times slower than without the
patch (sorry David for misleading you).
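
For reference, a minimal sketch of the corrected benchmark loop (sizes
and counts are assumptions and the timing harness is omitted; this is
not the actual test code):

    #include <stdlib.h>
    #include <sys/mman.h>

    #define NR_THPS 10000UL
    #define THP_SZ  (2UL << 20)

    int main(void)
    {
        char *buf = aligned_alloc(THP_SZ, NR_THPS * THP_SZ);

        madvise(buf, NR_THPS * THP_SZ, MADV_HUGEPAGE);
        for (unsigned long i = 0; i < NR_THPS * THP_SZ; i += 4096)
            buf[i] = 1;  /* fault in, hopefully as THPs */

        /* the bug was madvising the same page on every iteration;
         * the fix is to hit a different THP each time */
        for (unsigned long i = 0; i < NR_THPS; i++)
            madvise(buf + i * THP_SZ, 4096, MADV_DONTNEED);
        return 0;
    }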

To better understand what is happening, I profiled the benchmark:

-   31.27%     0.01%  dontneed  [kernel.kallsyms]  [k] zap_page_range_sync
   - 31.27% zap_page_range_sync
      - 30.25% split_local_deferred_list
         - 30.16% split_huge_page_to_list
            - 21.05% try_to_migrate
               + rmap_walk_anon
            - 7.47% remove_migration_ptes
               + 7.34% rmap_walk_locked
      + 1.02% zap_page_range_details

The overhead is not due to copying page flags but rather due to two rmap walks.
I don't think this much overhead is justified for current users of MADV_DONTNEED
and munmap. I have to rethink the approach.

thanks,
Shakeel
Peter Xu Nov. 26, 2021, 4:12 a.m. UTC | #24
On Thu, Nov 25, 2021 at 07:21:54PM -0800, Shakeel Butt wrote:
> On Thu, Nov 25, 2021 at 2:24 AM Peter Xu <peterx@redhat.com> wrote:
> >
> > On Mon, Nov 22, 2021 at 10:40:54AM -0800, Shakeel Butt wrote:
> > > > Do we have a performance evaluation of how much overhead is added,
> > > > e.g., for a single 4k MADV_DONTNEED call on a THP, or for a
> > > > MADV_DONTNEED call that covers the whole THP?
> > >
> > > I did a simple benchmark of madvise(MADV_DONTNEED) on 10000 THPs on
> > > x86 for both settings you suggested. I don't see any statistically
> > > significant difference with and without the patch. Let me know if you
> > > want me to try something else.
> >
> > I'm a bit surprised that the sync thp split didn't bring any extra
> > overhead.
> >
> > "Unmap whole thp" is understandable from that pov, because afaict that
> > won't even trigger any thp split anyway, even delayed, in the simplest
> > case where only this process mapped this thp, and mapped it once.
> >
> > For "unmap 4k upon thp" IIUC that's the worst case, and zapping the 4k
> > itself should be fast; what I don't understand is that a thp split
> > requires all the hand-made work of copying thp flags into the small
> > pages and so on, so I thought there should at least be some measurable
> > overhead.  Shakeel, could there be something overlooked in the test, or
> > is it me that overlooked something?
> >
> 
> Thanks for making me rerun this, and yes, indeed I had a very silly bug
> in the benchmark code (i.e. madvising the same page for the whole loop);
> with it fixed, this is indeed several times slower than without the
> patch (sorry David for misleading you).
> 
> To better understand what is happening, I profiled the benchmark:
> 
> -   31.27%     0.01%  dontneed  [kernel.kallsyms]  [k] zap_page_range_sync
>    - 31.27% zap_page_range_sync
>       - 30.25% split_local_deferred_list
>          - 30.16% split_huge_page_to_list
>             - 21.05% try_to_migrate
>                + rmap_walk_anon
>             - 7.47% remove_migration_ptes
>                + 7.34% rmap_walk_locked
>       + 1.02% zap_page_range_details

Makes sense, thanks for verifying it, Shakeel.  I forgot it'll also walk
itself.

I believe this effect will be amplified when the mapping is shared,
e.g. a shmem file thp shared between processes.  What's worse, when one
process DONTNEEDs one 4k page, all the other mms will need to split
their huge pmd without even noticing, so they suffer the perf
degradation directly.

> 
> The overhead is not due to copying page flags but rather due to two rmap walks.
> I don't think this much overhead is justified for current users of MADV_DONTNEED
> and munmap. I have to rethink the approach.

Some side notes: I dug out the old MADV_COLLAPSE proposal right after I
thought about MADV_SPLIT (or any of its variants):

https://lore.kernel.org/all/d098c392-273a-36a4-1a29-59731cdf5d3d@google.com/

My memory was that there was some issue to be solved so it was blocked;
however, when I read the thread it sounds like the list was mostly
reaching a consensus on MADV_COLLAPSE being beneficial.  Still copying
DavidR in case I missed something important.

If we think MADV_COLLAPSE can help to implement a userspace (and more
importantly, data-aware) khugepaged, then MADV_SPLIT can perhaps be the
other side of kcompactd.

That's probably a bit off topic for this discussion of the specific use
case, but so far it all seems reasonable and worth discussing.
David Hildenbrand Nov. 26, 2021, 9:04 a.m. UTC | #25
On 26.11.21 03:52, Peter Xu wrote:
> On Thu, Nov 25, 2021 at 11:32:08AM +0100, David Hildenbrand wrote:
>> On 25.11.21 11:24, Peter Xu wrote:
>>> On Mon, Nov 22, 2021 at 10:40:54AM -0800, Shakeel Butt wrote:
>>>>> Do we have a performance evaluation of how much overhead is added,
>>>>> e.g., for a single 4k MADV_DONTNEED call on a THP, or for a
>>>>> MADV_DONTNEED call that covers the whole THP?
>>>>
>>>> I did a simple benchmark of madvise(MADV_DONTNEED) on 10000 THPs on
>>>> x86 for both settings you suggested. I don't see any statistically
>>>> significant difference with and without the patch. Let me know if you
>>>> want me to try something else.
>>>
>>> I'm a bit surprised that the sync thp split didn't bring any extra
>>> overhead.
>>>
>>> "Unmap whole thp" is understandable from that pov, because afaict that
>>> won't even trigger any thp split anyway, even delayed, in the simplest
>>> case where only this process mapped this thp, and mapped it once.
>>>
>>> For "unmap 4k upon thp" IIUC that's the worst case, and zapping the 4k
>>> itself should be fast; what I don't understand is that a thp split
>>> requires all the hand-made work of copying thp flags into the small
>>> pages and so on, so I thought there should at least be some measurable
>>> overhead.  Shakeel, could there be something overlooked in the test, or
>>> is it me that overlooked something?
>>>
>>> I had the same concern as what Kirill/Matthew raised in the other thread
>>> - I'm worried that proactively splitting simply because any 4k page is
>>> zapped might quickly break up the 2m thps in the system, and I'm not sure
>>> whether it'll aggravate the fragmentation of the system memory in
>>> general.  I'm also not sure whether that's ideal for some very common
>>> workloads that frequently use DONTNEED to proactively drop some pages.
>>
>> The pageblock corresponding to the THP is movable. So (unless we start
>> spilling unmovable allocations into movable pageblocks) we'd only place
>> movable allocations in there. Compaction will be able to migrate to
>> re-create a free THP.
>>
>> In contrast, I think compaction will happily skip over the THP and
>> ignore it, because it has no clue that the THP could be repurposed by
>> split+migrate (at least I am not aware of code that does it).
>>
>> Unless I am missing something, with the above in mind it could make
>> sense to split as soon as possible, even before we're under memory
>> pressure -- for example, for proactive compaction.
>>
>> [proactive compaction could try splitting first as well I think]
> 
> But we can't rely on proactive compaction for rapid operations, because it's
> still adding overhead to the overall system by split+merge, right?

Yes, but there is also direct compaction that can be triggered without
the shrinker getting involved. I think we can summarize it as "there
might not be a right or wrong time to split". An application that
MADV_DONTNEEDs/munmaps sub-THP memory told us that it doesn't want to
consume that memory, yet it looks like it's still consuming it.

I do wonder how THPs on the deferred split queue behave with respect to
page migration -- memory offlining, alloc_contig_range(). I saw reports
that there are some cases where THP can be problematic when
stress-testing THP:
https://lkml.kernel.org/r/20210903162102.GA10039@mam-gdavis-dt

But not sure if that's related to deferred splitting. Most probably not.

> 
> +compaction_proactiveness
> +========================
> + ...
> +Note that compaction has a non-trivial system-wide impact as pages
> +belonging to different processes are moved around, which could also lead
> +to latency spikes in unsuspecting applications. The kernel employs
> +various heuristics to avoid wasting CPU cycles if it detects that
> +proactive compaction is not being effective.
> 
> Delaying split makes sense to me because after all the kernel is not aware of
> the userspace's preference, so the best thing is to do nothing until necessary.
> 
> Proactively splitting thps in dontneed/unmap adds the assumption that
> userspace wants to break up the pages by default.  It's 100% true for
> Shakeel's use case, but I'm not sure whether it will always be true.
> That's why I thought maybe a new interface is more proper, so we at
> least won't break anyone by accident.

Well, we already broke the PMD into PTEs. So the performance gain at
least for that user is really gone until we "fix that" again via
khugepaged -- which might just be configured to not "fix" if there are
empty PTEs.

It for sure is interesting if you have a COW huge page and only one
party zaps/unmaps some part.
David Hildenbrand Nov. 26, 2021, 9:16 a.m. UTC | #26
>>
>> Thanks for making me rerun this, and yes, indeed I had a very silly bug
>> in the benchmark code (i.e. madvising the same page for the whole loop);
>> with it fixed, this is indeed several times slower than without the
>> patch (sorry David for misleading you).

No worries, BUGs happen :)

>>
>> To better understand what is happening, I profiled the benchmark:
>>
>> -   31.27%     0.01%  dontneed  [kernel.kallsyms]  [k] zap_page_range_sync
>>    - 31.27% zap_page_range_sync
>>       - 30.25% split_local_deferred_list
>>          - 30.16% split_huge_page_to_list
>>             - 21.05% try_to_migrate
>>                + rmap_walk_anon
>>             - 7.47% remove_migration_ptes
>>                + 7.34% rmap_walk_locked
>>       + 1.02% zap_page_range_details
> 
> Makes sense, thanks for verifying it, Shakeel.  I forgot it'll also walk
> itself.
> 
> I believe this effect will be amplified when the mapping is shared,
> e.g. a shmem file thp shared between processes.  What's worse, when one
> process DONTNEEDs one 4k page, all the other mms will need to split
> their huge pmd without even noticing, so they suffer the perf
> degradation directly.

Would this really apply to MADV_DONTNEED on shmem, and would deferred
splitting apply on shmem? I'm constantly confused about shmem vs. anon,
but I would have assumed that shmem is fd-based and we wouldn't end up
in rmap_walk_anon. For shmem, the pagecache would contain the THP which
would stick around and deferred splits don't even apply.

But again, I'm constantly confused so I'd love to be enlighted.

> 
>>
>> The overhead is not due to copying page flags but rather due to two rmap walks.
>> I don't think this much overhead is justified for current users of MADV_DONTNEED
>> and munmap. I have to rethink the approach.

Most probably not.

> 
> Some side notes: I dug out the old MADV_COLLAPSE proposal right after I
> thought about MADV_SPLIT (or any of its variants):
> 
> https://lore.kernel.org/all/d098c392-273a-36a4-1a29-59731cdf5d3d@google.com/
> 
> My memory was that there was some issue to be solved, so it was blocked; however,
> when I read the thread it sounds like the list had mostly reached a consensus
> that MADV_COLLAPSE would be beneficial.  Still copying DavidR in case I
> missed something important.
> 
> If we think MADV_COLLAPSE can help to implement a userspace (and more
> importantly, data-aware) khugepaged, then MADV_SPLIT can be the other side of
> kcompactd, perhaps.
> 
> That's probably a bit off topic of this specific discussion on the specific use
> case, but so far it seems all reasonable and discussable.

User space can trigger a split manually using some MADV hackery. But it
can only be used for the use case here, where we actually want to zap a
page.

1. MADV_FREE a single 4k page in the range. This will split the PMD->PTE
   and the compound page.
2. MADV_DONTNEED either the complete range or the single 4k page.
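
A rough userspace sketch of that two-step trick (hypothetical, error
handling elided; assumes THP is enabled so the mapping ends up backed
by a PMD-mapped anon THP):

#include <stdlib.h>
#include <string.h>
#include <sys/mman.h>

#define THP_SIZE (2UL << 20)

int main(void)
{
	char *buf = aligned_alloc(THP_SIZE, THP_SIZE);

	madvise(buf, THP_SIZE, MADV_HUGEPAGE);
	memset(buf, 1, THP_SIZE);	/* fault in, hopefully as one THP */

	/* 1. MADV_FREE a single 4k page: splits the PMD into PTEs and
	 *    splits the compound page. */
	madvise(buf, 4096, MADV_FREE);
	/* 2. MADV_DONTNEED the page (or the whole range) to zap it. */
	madvise(buf, 4096, MADV_DONTNEED);
	return 0;
}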
Peter Xu Nov. 26, 2021, 9:39 a.m. UTC | #27
On Fri, Nov 26, 2021 at 10:16:58AM +0100, David Hildenbrand wrote:
> Would this really apply to MADV_DONTNEED on shmem, and would deferred
> splitting apply on shmem? I'm constantly confused about shmem vs. anon,
> but I would have assumed that shmem is fd-based and we wouldn't end up
> in rmap_walk_anon. For shmem, the pagecache would contain the THP which
> would stick around and deferred splits don't even apply.

Good point.. when we split on shmem we just clear the pmd, so yeah I don't
think we'll ever add it to the deferred list.

> User space can trigger a split manually using some MADV hackery. But it
> can only be used for the use case here, where we actually want to zap a
> page.
> 
> 1. MADV_FREE a single 4k page in the range. This will split the PMD->PTE
>    and the compound page.

Seems to be a very implicit but working solution indeed.

> 2. MADV_DONTNEED either the complete range or the single 4k page.

Is this what this patch is working on?

Thanks,
Yang Shi Nov. 29, 2021, 9:32 p.m. UTC | #28
On Fri, Nov 26, 2021 at 1:17 AM David Hildenbrand <david@redhat.com> wrote:
>
> >>
> >> Thanks for making me rerun this and yes indeed I had a very silly bug in the
> >> benchmark code (i.e. madvise the same page for the whole loop) and this is
> >> indeed several times slower than without the patch (sorry David for misleading
> >> you).
>
> No worries, BUGs happen :)
>
> >>
> >> To better understand what is happening, I profiled the benchmark:
> >>
> >> -   31.27%     0.01%  dontneed  [kernel.kallsyms]  [k] zap_page_range_sync
> >>    - 31.27% zap_page_range_sync
> >>       - 30.25% split_local_deferred_list
> >>          - 30.16% split_huge_page_to_list
> >>             - 21.05% try_to_migrate
> >>                + rmap_walk_anon
> >>             - 7.47% remove_migration_ptes
> >>                + 7.34% rmap_walk_locked
> >>       + 1.02% zap_page_range_details
> >
> > Makes sense, thanks for verifying it, Shakeel.  I forgot it'll also walk
> > itself.
> >
> > I believe this effect will be exaggerated when the mapping is shared,
> > e.g. a shmem file thp between processes.  What's worse is that when one process
> > DONTNEEDs one 4k page, all the other mms will need to split the huge pmd without
> > even noticing, so that's a direct perf degradation they suffer.
>
> Would this really apply to MADV_DONTNEED on shmem, and would deferred
> splitting apply on shmem? I'm constantly confused about shmem vs. anon,
> but I would have assumed that shmem is fd-based and we wouldn't end up
> in rmap_walk_anon. For shmem, the pagecache would contain the THP which
> would stick around and deferred splits don't even apply.

The deferred split is anon THP only; it doesn't apply to shmem THP.

As for the rmap walks: there are two rmap walks for anon THP, but just one
for shmem THP. Both need one rmap walk to unmap the THP before doing
the actual split, but anon THP needs another rmap walk to reinstall PTEs
for still-mapped pages; this is not needed for shmem pages
since they can still be reached via the page cache.

>
> But again, I'm constantly confused so I'd love to be enlighted.
>
> >
> >>
> >> The overhead is not due to copying page flags but rather due to two rmap walks.
> >> I don't think this much overhead is justified for current users of MADV_DONTNEED
> >> and munmap. I have to rethink the approach.
>
> Most probably not.
>
> >
> > Some side notes: I dug out the old MADV_COLLAPSE proposal right after I
> > thought about MADV_SPLIT (or any of its variants):
> >
> > https://lore.kernel.org/all/d098c392-273a-36a4-1a29-59731cdf5d3d@google.com/
> >
> > My memory was that there was some issue to be solved, so it was blocked; however,
> > when I read the thread it sounds like the list had mostly reached a consensus
> > that MADV_COLLAPSE would be beneficial.  Still copying DavidR in case I
> > missed something important.
> >
> > If we think MADV_COLLAPSE can help to implement a userspace (and more
> > importantly, data-aware) khugepaged, then MADV_SPLIT can be the other side of
> > kcompactd, perhaps.
> >
> > That's probably a bit off topic of this specific discussion on the specific use
> > case, but so far it seems all reasonable and discussable.
>
> User space can trigger a split manually using some MADV hackery. But it
> can only be used for the use case here, where we actually want to zap a
> page.
>
> 1. MADV_FREE a single 4k page in the range. This will split the PMD->PTE
>    and the compound page.
> 2. MADV_DONTNEED either the complete range or the single 4k page.
>
> --
> Thanks,
>
> David / dhildenb
>
Yang Shi Nov. 29, 2021, 10 p.m. UTC | #29
On Fri, Nov 26, 2021 at 1:04 AM David Hildenbrand <david@redhat.com> wrote:
>
> On 26.11.21 03:52, Peter Xu wrote:
> > On Thu, Nov 25, 2021 at 11:32:08AM +0100, David Hildenbrand wrote:
> >> On 25.11.21 11:24, Peter Xu wrote:
> >>> On Mon, Nov 22, 2021 at 10:40:54AM -0800, Shakeel Butt wrote:
> >>>>> Do we have a performance evaluation of how much overhead is added, e.g., for
> >>>>> a single 4k MADV_DONTNEED call on a THP or on a MADV_DONTNEED call that
> >>>>> covers the whole THP?
> >>>>
> >>>> I did a simple benchmark of madvise(MADV_DONTNEED) on 10000 THPs on
> >>>> x86 for both settings you suggested. I don't see any statistically
> >>>> significant difference with and without the patch. Let me know if you
> >>>> want me to try something else.
> >>>
> >>> I'm a bit surprised that the sync thp split didn't bring any extra overhead.
> >>>
> >>> "unmap whole thp" is understandable from that pov, because afaict that won't
> >>> even trigger any thp split, even a deferred one, in the simplest case where
> >>> only this process mapped this thp, and mapped it only once.
> >>>
> >>> For "unmap 4k upon thp" IIUC that's the worst case; zapping 4k should be
> >>> fast, but thp split requires hand-made work for copying thp flags into small
> >>> pages and so on, so I thought there should at least be some measurable
> >>> overhead.  Shakeel, could something have been overlooked in the test, or is
> >>> it me that overlooked something?
> >>>
> >>> I had the same concern as what Kirill/Matthew raised in the other thread - I'm
> >>> worried that proactively splitting simply because any 4k page is zapped might
> >>> quickly break up the 2m thps in the system, and I'm not sure whether it'll
> >>> exaggerate the fragmentation of system memory in general.  I'm also not sure
> >>> whether that's ideal for some very common workloads that frequently use
> >>> DONTNEED to proactively drop some pages.
> >>
> >> The pageblock corresponding to the THP is movable. So (unless we start
> >> spilling unmovable allocations into movable pageblocks) we'd only place
> >> movable allocations in there. Compaction will be able to migrate to
> >> re-create a free THP.
> >>
> >> In contrast, I think compaction will happily skip over the THP and
> >> ignore it, because it has no clue that the THP could be repurposed by
> >> split+migrate (at least I am not aware of code that does it).
> >>
> >> Unless I am missing something, with the above in mind it could make
> >> sense to split as soon as possible, even before we're under memory
> >> pressure -- for example, for proactive compaction.
> >>
> >> [proactive compaction could try splitting first as well I think]
> >
> > But we can't rely on proactive compaction for rapid operations, because it's
> > still adding overhead to the overall system by split+merge, right?
>
> Yes, but there is also direct compaction that can be triggered without
> the shrinker getting involved. I think we can summarize as "there might
> not be a right or wrong when to split". An application that
> MADV_DONTNEEDs/munmaps sub-THP memory told us that it doesn't want to
> consume memory, yet it looks like it's still consuming that memory.
>
> I do wonder how THPs on the deferred split queue behave with respect to
> page migration -- memory offlining, alloc_contig_range(). I saw reports

IIUC the old page that is on the deferred split queue would be freed and
taken off the list once the migration is done. But the new page is *NOT*
added to the deferred split list at all. This would not cause any
severe bugs, but the partially unmapped pages can no longer be shrunk
under memory pressure.

> that there are some cases where THP can be problematic when
> stress-testing THP:
> https://lkml.kernel.org/r/20210903162102.GA10039@mam-gdavis-dt
>
> But not sure if that's related to deferred splitting. Most probably not.
>
> >
> > +compaction_proactiveness
> > +========================
> > + ...
> > +Note that compaction has a non-trivial system-wide impact as pages
> > +belonging to different processes are moved around, which could also lead
> > +to latency spikes in unsuspecting applications. The kernel employs
> > +various heuristics to avoid wasting CPU cycles if it detects that
> > +proactive compaction is not being effective.
> >
> > Delaying the split makes sense to me because, after all, the kernel is not
> > aware of userspace's preference, so the best thing is to do nothing until necessary.
> >
> > Proactively splitting thps in dontneed/unmap adds an assumption that userspace
> > wants to break up the pages by default.  It's 100% true for Shakeel's use case,
> > but I'm not sure whether it will always be true.  That's why I thought maybe a
> > new interface would be more proper, so we at least won't break anyone by accident.
>
> Well, we already broke the PMD into PTEs. So the performance gain at
> least for that user is really gone until we "fix that" again via
> khugepaged -- which might just be configured to not "fix" if there are
> empty PTEs.
>
> It for sure is interesting if you have a COW huge page and only one
> party zaps/unmaps some part.
>
>
> --
> Thanks,
>
> David / dhildenb
>
Pankaj Gupta Dec. 8, 2021, 1:23 p.m. UTC | #30
> >> Many applications do sophisticated management of their heap memory for
> >> better performance but with low cost. We have a bunch of such
> >> applications running on our production and examples include caching and
> >> data storage services. These applications keep their hot data on the
> >> THPs for better performance and release the cold data through
> >> MADV_DONTNEED to keep the memory cost low.
> >>
> >> The kernel defers the split and release of THPs until there is memory
> >> pressure. This complicates the memory management of these
> >> sophisticated applications which then needs to look into low level
> >> kernel handling of THPs to better gauge their headroom for expansion. In
> >> addition these applications are very latency sensitive and would prefer
> >> to not face memory reclaim due to non-deterministic nature of reclaim.
> >>
> >> This patch lets such applications not worry about the low level handling
> >> of THPs in the kernel and splits the THPs synchronously on
> >> MADV_DONTNEED.
> >
> > I've been wondering about whether this is really the right strategy
> > (and this goes wider than just this one, new case)
> >
> > We chose to use a 2MB page here, based on whatever heuristics are
> > currently in play.  Now userspace is telling us we were wrong and should
> > have used smaller pages.
>
> IIUC, not necessarily, unfortunately.
>
> User space might be discarding the whole 2MB either via a single call
> (MADV_DONTNEED a 2MB range as done by virtio-balloon with "free page
> reporting" or by virtio-mem in QEMU). In that case, there is nothing to
> migrate and we were not doing anything wrong.
>
> But more extreme, user space might be discarding the whole THP in small
> pieces shortly over time. This for example happens when a VM inflates
> the memory balloon via virtio-balloon. All inflation requests are 4k,
> resulting in a 4k MADV_DONTNEED calls. If we end up inflating a THP
> range inside of the VM, mapping to a THP range inside the hypervisor,
> we'll essentially free a THP in the hypervisor piece by piece using
> individual MADV_DONTNEED calls -- this happens frequently. Something
> similar can happen when de-fragmentation inside the VM "moves around"
> inflated 4k pages piece by piece to essentially form a huge inflated
> range -- this happens less frequently as of now. In both cases,
> migration is counter-productive, as we're just about to free the whole
> range either way.
>
> (yes, there are ways to optimize, for example using hugepage ballooning
> or merging MADV_DONTNEED calls in the hypervisor, but what I described
> is what we currently implement in hypervisors like QEMU, because there
> are corner cases for everything)

It seems this can happen when the guest is using huge pages or THP. If we
end up not freeing hypervisor memory (THP) until memory pressure mounts,
can this be a problem for "free page reporting" as well?

>
> Long story short: it's hard to tell what will happen next based on a
> single MADV_DONTNEED call. Page compaction, in comparison, doesn't care
> and optimizes the layout as it observes it.
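
On the hypervisor side, the inflation pattern described above boils down
to something like the following sketch ("guest_ram" is a hypothetical
THP-backed guest-RAM mapping; the hypervisor issues one such call per 4k
inflation request):

#include <sys/mman.h>

#define PAGE_SIZE 4096UL
#define THP_SIZE  (2UL << 20)

static void discard_thp_piecewise(char *guest_ram)
{
	unsigned long off;

	/*
	 * One MADV_DONTNEED per ballooned 4k page; splitting or
	 * migrating the THP on the first call would be wasted work,
	 * since the whole 2M range is about to be freed anyway.
	 */
	for (off = 0; off < THP_SIZE; off += PAGE_SIZE)
		madvise(guest_ram + off, PAGE_SIZE, MADV_DONTNEED);
}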
David Rientjes Jan. 24, 2022, 6:48 p.m. UTC | #31
On Fri, 26 Nov 2021, Peter Xu wrote:

> Some side notes: I dug out the old MADV_COLLAPSE proposal right after I
> thought about MADV_SPLIT (or any of its variants):
> 
> https://lore.kernel.org/all/d098c392-273a-36a4-1a29-59731cdf5d3d@google.com/
> 
> My memory was that there was some issue to be solved, so it was blocked; however,
> when I read the thread it sounds like the list had mostly reached a consensus
> that MADV_COLLAPSE would be beneficial.  Still copying DavidR in case I
> missed something important.
> 
> If we think MADV_COLLAPSE can help to implement a userspace (and more
> importantly, data-aware) khugepaged, then MADV_SPLIT can be the other side of
> kcompactd, perhaps.
> 
> That's probably a bit off topic of this specific discussion on the specific use
> case, but so far it seems all reasonable and discussable.
> 

Hi Peter,

Providing a (late) update since we now have some better traction on this, 
I think we'll be ready to post an RFC soon that introduces MADV_COLLAPSE.  
The work is being driven by Zach, now cc'd.

Let's also include SeongJae Park <sj@kernel.org> as well and keep him in 
the loop since DAMON could easily be extended with a DAMOS_COLLAPSE action 
to use MADV_COLLAPSE for hot regions of memory.

Idea for initial approach:

 - MADV_COLLAPSE core code based on the proposal you cite above for anon 
   memory as the inaugural support, collapse memory into thp in process 
   context

 - Batching support to collapse ranges of memory into multiple THP

 - Wire this up for madvise(2) (and process_madvise(2))

 - Enlightenment for file-backed thp

I think Zach's RFC will cover the first three, it could be debated if the 
initial patch series *must* support file-backed thp.  We'll see based on 
the feedback to the RFC.
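
From userspace, the first three items could look roughly like this
(hypothetical sketch; MADV_COLLAPSE and its value are assumptions, as
the advice is not defined by any released kernel headers at the time of
this discussion):

#include <sys/mman.h>

#ifndef MADV_COLLAPSE
#define MADV_COLLAPSE 25	/* assumed value, for illustration only */
#endif

/* Ask the kernel to collapse a hot, 2M-aligned range back into THPs
 * in process context, e.g. driven by userspace access profiling. */
static int collapse_hot_range(void *aligned_addr, size_t len)
{
	return madvise(aligned_addr, len, MADV_COLLAPSE);
}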

There's also an extension where MADV_COLLAPSE could be potentially useful 
for hugetlb backed memory.  We have another effort underway that we've 
been talking with Mike Kravetz about that allows hugetlb memory to be 
mapped at multiple levels of the page tables.  There are several use cases 
but one of the driving factors is the performance of post-copy live 
migration; in this case, you'd be able to send smaller sized pages over 
the wire rather than, say, a 1GB gigantic page.

In this case, MADV_COLLAPSE could be useful to map smaller pages by 
a larger page table entry before all of the smaller pages have been live 
migrated.

That said, we have not invested time into an MADV_SPLIT yet.

Do you (or anybody else) have concerns about this approach?  Ideas for 
extensions?

Thanks!

Patch

diff --git a/include/linux/mmzone.h b/include/linux/mmzone.h
index 58e744b78c2c..7fa0035128b9 100644
--- a/include/linux/mmzone.h
+++ b/include/linux/mmzone.h
@@ -795,6 +795,11 @@  struct deferred_split {
 	struct list_head split_queue;
 	unsigned long split_queue_len;
 };
+void split_local_deferred_list(struct list_head *defer_list);
+#else
+static inline void split_local_deferred_list(struct list_head *defer_list)
+{
+}
 #endif
 
 /*
diff --git a/include/linux/sched.h b/include/linux/sched.h
index 9d27fd0ce5df..a984bb6509d9 100644
--- a/include/linux/sched.h
+++ b/include/linux/sched.h
@@ -1412,6 +1412,10 @@  struct task_struct {
 	struct mem_cgroup		*active_memcg;
 #endif
 
+#ifdef CONFIG_TRANSPARENT_HUGEPAGE
+	struct list_head		*deferred_split_list;
+#endif
+
 #ifdef CONFIG_BLK_CGROUP
 	struct request_queue		*throttle_queue;
 #endif
diff --git a/include/linux/sched/mm.h b/include/linux/sched/mm.h
index fdf742e15e27..9b438c7e811e 100644
--- a/include/linux/sched/mm.h
+++ b/include/linux/sched/mm.h
@@ -374,6 +374,17 @@  set_active_memcg(struct mem_cgroup *memcg)
 }
 #endif
 
+#ifdef CONFIG_TRANSPARENT_HUGEPAGE
+static inline void set_local_deferred_list(struct list_head *list)
+{
+	current->deferred_split_list = list;
+}
+#else
+static inline void set_local_deferred_list(struct list_head *list)
+{
+}
+#endif
+
 #ifdef CONFIG_MEMBARRIER
 enum {
 	MEMBARRIER_STATE_PRIVATE_EXPEDITED_READY		= (1U << 0),
diff --git a/kernel/fork.c b/kernel/fork.c
index 01af6129aa38..8197b8ed4b7a 100644
--- a/kernel/fork.c
+++ b/kernel/fork.c
@@ -1019,6 +1019,9 @@  static struct task_struct *dup_task_struct(struct task_struct *orig, int node)
 
 #ifdef CONFIG_MEMCG
 	tsk->active_memcg = NULL;
+#endif
+#ifdef CONFIG_TRANSPARENT_HUGEPAGE
+	tsk->deferred_split_list = NULL;
 #endif
 	return tsk;
 
diff --git a/mm/huge_memory.c b/mm/huge_memory.c
index e5483347291c..2f73eeecb857 100644
--- a/mm/huge_memory.c
+++ b/mm/huge_memory.c
@@ -2754,6 +2754,7 @@  void free_transhuge_page(struct page *page)
 void deferred_split_huge_page(struct page *page)
 {
 	struct deferred_split *ds_queue = get_deferred_split_queue(page);
+	struct list_head *list = current->deferred_split_list;
 #ifdef CONFIG_MEMCG
 	struct mem_cgroup *memcg = page_memcg(compound_head(page));
 #endif
@@ -2774,7 +2775,14 @@  void deferred_split_huge_page(struct page *page)
 	if (PageSwapCache(page))
 		return;
 
+	if (list && list_empty(page_deferred_list(page))) {
+		/* corresponding put in split_local_deferred_list. */
+		get_page(page);
+		list_add(page_deferred_list(page), list);
+	}
+
 	spin_lock_irqsave(&ds_queue->split_queue_lock, flags);
+
 	if (list_empty(page_deferred_list(page))) {
 		count_vm_event(THP_DEFERRED_SPLIT_PAGE);
 		list_add_tail(page_deferred_list(page), &ds_queue->split_queue);
@@ -2801,6 +2809,48 @@  static unsigned long deferred_split_count(struct shrinker *shrink,
 	return READ_ONCE(ds_queue->split_queue_len);
 }
 
+void split_local_deferred_list(struct list_head *defer_list)
+{
+	struct list_head *pos, *next;
+	struct page *page;
+
+	/* First iteration for split. */
+	list_for_each_safe(pos, next, defer_list) {
+		page = list_entry((void *)pos, struct page, deferred_list);
+		page = compound_head(page);
+
+		if (!trylock_page(page))
+			continue;
+
+		if (split_huge_page(page)) {
+			unlock_page(page);
+			continue;
+		}
+		/* split_huge_page() removes page from list on success */
+		unlock_page(page);
+
+		/* corresponding get in deferred_split_huge_page. */
+		put_page(page);
+	}
+
+	/* Second iteration to putback failed pages. */
+	list_for_each_safe(pos, next, defer_list) {
+		struct deferred_split *ds_queue;
+		unsigned long flags;
+
+		page = list_entry((void *)pos, struct page, deferred_list);
+		page = compound_head(page);
+		ds_queue = get_deferred_split_queue(page);
+
+		spin_lock_irqsave(&ds_queue->split_queue_lock, flags);
+		list_move(page_deferred_list(page), &ds_queue->split_queue);
+		spin_unlock_irqrestore(&ds_queue->split_queue_lock, flags);
+
+		/* corresponding get in deferred_split_huge_page. */
+		put_page(page);
+	}
+}
+
 static unsigned long deferred_split_scan(struct shrinker *shrink,
 		struct shrink_control *sc)
 {
diff --git a/mm/madvise.c b/mm/madvise.c
index 8c927202bbe6..15614115e359 100644
--- a/mm/madvise.c
+++ b/mm/madvise.c
@@ -762,7 +762,15 @@  static int madvise_free_single_vma(struct vm_area_struct *vma,
 static long madvise_dontneed_single_vma(struct vm_area_struct *vma,
 					unsigned long start, unsigned long end)
 {
+	LIST_HEAD(list);
+
+	set_local_deferred_list(&list);
+
 	zap_page_range(vma, start, end - start);
+
+	set_local_deferred_list(NULL);
+	split_local_deferred_list(&list);
+
 	return 0;
 }
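
For completeness, a sketch of the kind of micro-benchmark discussed
upthread (sizes and counts are assumptions; the thread used 10000 THPs;
error handling elided). Note that the pointer advances by THP_SIZE each
iteration -- madvising the same page every time was the bug in the
original benchmark:

#include <stdlib.h>
#include <string.h>
#include <sys/mman.h>

#define NR_THPS  1024UL
#define THP_SIZE (2UL << 20)

int main(void)
{
	char *buf = aligned_alloc(THP_SIZE, NR_THPS * THP_SIZE);
	unsigned long i;

	madvise(buf, NR_THPS * THP_SIZE, MADV_HUGEPAGE);
	memset(buf, 1, NR_THPS * THP_SIZE);	/* fault everything in */

	/* Worst case for the synchronous split: zap 4k out of each THP. */
	for (i = 0; i < NR_THPS; i++)
		madvise(buf + i * THP_SIZE, 4096, MADV_DONTNEED);
	return 0;
}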