
[v3] mm/khugepaged: sched to numa node when collapse huge page

Message ID 20220317065024.2635069-1-maobibo@loongson.cn (mailing list archive)
State New
Headers show
Series [v3] mm/khugepaged: sched to numa node when collapse huge page

Commit Message

bibo mao March 17, 2022, 6:50 a.m. UTC
When collapsing a huge page, khugepaged copies the contents of the
original small pages into the newly allocated huge page. The destination
node is chosen as the node holding most of the source pages, but the THP
daemon itself is not scheduled on that node, so performance may suffer:
the copy crosses nodes and the target node's CPU cache is not utilized.
With this patch, the khugepaged daemon switches to the NUMA node of the
new huge page before copying. This saves copying time and makes better
use of the local cache.

With this patch, SPECint 2006 base performance improves by 6% on a
Loongson 3C5000L platform with 32 cores and 8 NUMA nodes.

Signed-off-by: Bibo Mao <maobibo@loongson.cn>
---
changelog:
V2: remove node record for thp daemon
V3: remove unlikely statement
---
 mm/khugepaged.c | 8 ++++++++
 1 file changed, 8 insertions(+)
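
For context, the destination node above is the node that holds most of the
source subpages. A simplified sketch of that selection, roughly mirroring
what mm/khugepaged.c does under CONFIG_NUMA via khugepaged_node_load[] and
khugepaged_find_target_node() (illustrative only, not the exact in-tree
code, which also includes tie-breaking between nodes):

static int pick_collapse_target_node(const int node_load[MAX_NUMNODES])
{
	int nid, target_node = 0, max_value = 0;

	/* Pick the node holding the largest number of source subpages. */
	for (nid = 0; nid < MAX_NUMNODES; nid++) {
		if (node_load[nid] > max_value) {
			max_value = node_load[nid];
			target_node = nid;
		}
	}
	return target_node;
}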

Comments

Andrew Morton April 27, 2022, 8:48 p.m. UTC | #1
On Thu, 17 Mar 2022 02:50:24 -0400 Bibo Mao <maobibo@loongson.cn> wrote:

> collapse huge page will copy huge page from general small pages,
> dest node is calculated from most one of source pages, however
> THP daemon is not scheduled on dest node. The performance may be
> poor since huge page copying across nodes, also cache is not used
> for target node. With this patch, khugepaged daemon switches to
> the same numa node with huge page. It saves copying time and makes
> use of local cache better.
> 
> With this patch, specint 2006 base performance is improved with 6%
> on Loongson 3C5000L platform with 32 cores and 8 numa nodes.
> 

Are there any acks for this one please?

> --- a/mm/khugepaged.c
> +++ b/mm/khugepaged.c
> @@ -1066,6 +1066,7 @@ static void collapse_huge_page(struct mm_struct *mm,
>  	struct vm_area_struct *vma;
>  	struct mmu_notifier_range range;
>  	gfp_t gfp;
> +	const struct cpumask *cpumask;
>  
>  	VM_BUG_ON(address & ~HPAGE_PMD_MASK);
>  
> @@ -1079,6 +1080,13 @@ static void collapse_huge_page(struct mm_struct *mm,
>  	 * that. We will recheck the vma after taking it again in write mode.
>  	 */
>  	mmap_read_unlock(mm);
> +
> +	/* sched to specified node before huage page memory copy */
> +	if (task_node(current) != node) {
> +		cpumask = cpumask_of_node(node);
> +		if (!cpumask_empty(cpumask))
> +			set_cpus_allowed_ptr(current, cpumask);
> +	}
>  	new_page = khugepaged_alloc_page(hpage, gfp, node);
>  	if (!new_page) {
>  		result = SCAN_ALLOC_HUGE_PAGE_FAIL;
> -- 
> 2.31.1
>
Yang Shi April 27, 2022, 10:29 p.m. UTC | #2
On Wed, Apr 27, 2022 at 1:48 PM Andrew Morton <akpm@linux-foundation.org> wrote:
>
> On Thu, 17 Mar 2022 02:50:24 -0400 Bibo Mao <maobibo@loongson.cn> wrote:
>
> > collapse huge page will copy huge page from general small pages,
> > dest node is calculated from most one of source pages, however
> > THP daemon is not scheduled on dest node. The performance may be
> > poor since huge page copying across nodes, also cache is not used
> > for target node. With this patch, khugepaged daemon switches to
> > the same numa node with huge page. It saves copying time and makes
> > use of local cache better.
> >
> > With this patch, specint 2006 base performance is improved with 6%
> > on Loongson 3C5000L platform with 32 cores and 8 numa nodes.
> >
>
> Are there any acks for this one please?

TBH, I'm a little bit reluctant about this patch. I agree that running
khugepaged on the same node as the source and destination pages could
reduce cross-socket traffic and use the cache more efficiently. But I'm
not sure whether it is really worth it. For example, on a busy system
khugepaged may jump from CPU to CPU, which may interfere with the
scheduler, and khugepaged has to wait to run on the target CPU, which may
take an indefinite amount of time. In addition, the gain also depends on
the locality of the source pages (how many of them are on the same node),
how often khugepaged is woken up on a different node, etc.

Even if it were proven worthwhile, I'd prefer that set_cpus_allowed_ptr()
be called between mmap_read_unlock() and mmap_write_lock(), to avoid
wasted effort on some error paths.

>
> > --- a/mm/khugepaged.c
> > +++ b/mm/khugepaged.c
> > @@ -1066,6 +1066,7 @@ static void collapse_huge_page(struct mm_struct *mm,
> >       struct vm_area_struct *vma;
> >       struct mmu_notifier_range range;
> >       gfp_t gfp;
> > +     const struct cpumask *cpumask;
> >
> >       VM_BUG_ON(address & ~HPAGE_PMD_MASK);
> >
> > @@ -1079,6 +1080,13 @@ static void collapse_huge_page(struct mm_struct *mm,
> >        * that. We will recheck the vma after taking it again in write mode.
> >        */
> >       mmap_read_unlock(mm);
> > +
> > +     /* sched to specified node before huage page memory copy */
> > +     if (task_node(current) != node) {
> > +             cpumask = cpumask_of_node(node);
> > +             if (!cpumask_empty(cpumask))
> > +                     set_cpus_allowed_ptr(current, cpumask);
> > +     }
> >       new_page = khugepaged_alloc_page(hpage, gfp, node);
> >       if (!new_page) {
> >               result = SCAN_ALLOC_HUGE_PAGE_FAIL;
> > --
> > 2.31.1
> >
David Hildenbrand April 28, 2022, 10:07 a.m. UTC | #3
On 27.04.22 22:48, Andrew Morton wrote:
> On Thu, 17 Mar 2022 02:50:24 -0400 Bibo Mao <maobibo@loongson.cn> wrote:
> 
>> collapse huge page will copy huge page from general small pages,
>> dest node is calculated from most one of source pages, however
>> THP daemon is not scheduled on dest node. The performance may be
>> poor since huge page copying across nodes, also cache is not used
>> for target node. With this patch, khugepaged daemon switches to
>> the same numa node with huge page. It saves copying time and makes
>> use of local cache better.
>>
>> With this patch, specint 2006 base performance is improved with 6%
>> on Loongson 3C5000L platform with 32 cores and 8 numa nodes.
>>
> 
> Are there any acks for this one please?

I'll have a look!
Peter Xu April 28, 2022, 1:50 p.m. UTC | #4
Hi, Bibo,

On Thu, Mar 17, 2022 at 02:50:24AM -0400, Bibo Mao wrote:
> collapse huge page will copy huge page from general small pages,
> dest node is calculated from most one of source pages, however
> THP daemon is not scheduled on dest node. The performance may be
> poor since huge page copying across nodes, also cache is not used
> for target node. With this patch, khugepaged daemon switches to
> the same numa node with huge page. It saves copying time and makes
> use of local cache better.
> 
> With this patch, specint 2006 base performance is improved with 6%
> on Loongson 3C5000L platform with 32 cores and 8 numa nodes.

I'm not familiar with specint, so a genuine question is whether it'll make
a real difference in real-world workloads. I would assume that in the real
world the memory affinity to the processors changes relatively slowly on
tuned systems, so even if khugepaged copies a bit more slowly it won't
affect the real workload much once the collapse completes?

The other question is, if this makes sense, whether it is applicable to
file THPs too (collapse_file)?

Thanks,

> 
> Signed-off-by: Bibo Mao <maobibo@loongson.cn>
> ---
> changelog:
> V2: remove node record for thp daemon
> V3: remove unlikely statement
> ---
>  mm/khugepaged.c | 8 ++++++++
>  1 file changed, 8 insertions(+)
> 
> diff --git a/mm/khugepaged.c b/mm/khugepaged.c
> index 131492fd1148..b3cf0885f5a2 100644
> --- a/mm/khugepaged.c
> +++ b/mm/khugepaged.c
> @@ -1066,6 +1066,7 @@ static void collapse_huge_page(struct mm_struct *mm,
>  	struct vm_area_struct *vma;
>  	struct mmu_notifier_range range;
>  	gfp_t gfp;
> +	const struct cpumask *cpumask;
>  
>  	VM_BUG_ON(address & ~HPAGE_PMD_MASK);
>  
> @@ -1079,6 +1080,13 @@ static void collapse_huge_page(struct mm_struct *mm,
>  	 * that. We will recheck the vma after taking it again in write mode.
>  	 */
>  	mmap_read_unlock(mm);
> +
> +	/* sched to specified node before huage page memory copy */
> +	if (task_node(current) != node) {
> +		cpumask = cpumask_of_node(node);
> +		if (!cpumask_empty(cpumask))
> +			set_cpus_allowed_ptr(current, cpumask);
> +	}
>  	new_page = khugepaged_alloc_page(hpage, gfp, node);
>  	if (!new_page) {
>  		result = SCAN_ALLOC_HUGE_PAGE_FAIL;
David Hildenbrand April 28, 2022, 3:17 p.m. UTC | #5
On 17.03.22 07:50, Bibo Mao wrote:
> collapse huge page will copy huge page from general small pages,
> dest node is calculated from most one of source pages, however
> THP daemon is not scheduled on dest node. The performance may be
> poor since huge page copying across nodes, also cache is not used
> for target node. With this patch, khugepaged daemon switches to
> the same numa node with huge page. It saves copying time and makes
> use of local cache better.
> 
> With this patch, specint 2006 base performance is improved with 6%
> on Loongson 3C5000L platform with 32 cores and 8 numa nodes.

If it helps, that's nice as long as it doesn't hurt other cases.

> 
> Signed-off-by: Bibo Mao <maobibo@loongson.cn>
> ---
> changelog:
> V2: remove node record for thp daemon
> V3: remove unlikely statement
> ---
>  mm/khugepaged.c | 8 ++++++++
>  1 file changed, 8 insertions(+)
> 
> diff --git a/mm/khugepaged.c b/mm/khugepaged.c
> index 131492fd1148..b3cf0885f5a2 100644
> --- a/mm/khugepaged.c
> +++ b/mm/khugepaged.c
> @@ -1066,6 +1066,7 @@ static void collapse_huge_page(struct mm_struct *mm,
>  	struct vm_area_struct *vma;
>  	struct mmu_notifier_range range;
>  	gfp_t gfp;
> +	const struct cpumask *cpumask;
>  
>  	VM_BUG_ON(address & ~HPAGE_PMD_MASK);
>  
> @@ -1079,6 +1080,13 @@ static void collapse_huge_page(struct mm_struct *mm,
>  	 * that. We will recheck the vma after taking it again in write mode.
>  	 */
>  	mmap_read_unlock(mm);
> +
> +	/* sched to specified node before huage page memory copy */

huage? I assume "huge"

> +	if (task_node(current) != node) {
> +		cpumask = cpumask_of_node(node);
> +		if (!cpumask_empty(cpumask))
> +			set_cpus_allowed_ptr(current, cpumask);
> +	}

I wonder if that will always be optimized out without NUMA and if we
want to check for IS_ENABLED(CONFIG_NUMA).


Regarding comments from others, I agree: I think what we'd actually want
is something like "try to reschedule to one of these CPUs immediately;
if they are all busy, just stay here".
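
A rough, untested sketch of that idea (illustration only: guard on
CONFIG_NUMA and migrate only when the target node currently has an idle
CPU; this is inherently racy, as the CPU may become busy again before
khugepaged actually moves there):

	/* Sketch: move towards the target node only if it looks idle right now. */
	if (IS_ENABLED(CONFIG_NUMA) && task_node(current) != node) {
		const struct cpumask *cpumask = cpumask_of_node(node);
		int cpu;

		for_each_cpu(cpu, cpumask) {
			if (available_idle_cpu(cpu)) {
				set_cpus_allowed_ptr(current, cpumask);
				break;
			}
		}
	}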


Also, I do wonder if there could already be scenarios where someone
wants to let khugepaged run only on selected housekeeping CPUs (e.g.,
when pinning VCPUs in a VM to physical CPUs). It might even degrade the
VM performance in that case if we schedule something unrelated on these
CPUs. (I don't know which interfaces we might already have to configure
housekeeping CPUs for kthreads).

I can spot in kernel/kthread.c:kthread()

set_cpus_allowed_ptr(current, housekeeping_cpumask(HK_TYPE_KTHREAD));

Hmmmmm ...
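
One way to respect such restrictions, sketched under the assumption that
intersecting with the kthread housekeeping mask is an acceptable policy
(illustrative only; it still would not preserve an admin's explicit
affinity settings on the khugepaged task):

	/* Sketch: never move khugepaged outside the kthread housekeeping CPUs. */
	if (IS_ENABLED(CONFIG_NUMA) && task_node(current) != node) {
		cpumask_var_t mask;

		if (alloc_cpumask_var(&mask, GFP_KERNEL)) {
			cpumask_and(mask, cpumask_of_node(node),
				    housekeeping_cpumask(HK_TYPE_KTHREAD));
			if (!cpumask_empty(mask))
				set_cpus_allowed_ptr(current, mask);
			free_cpumask_var(mask);
		}
	}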
Peter Xu April 28, 2022, 4:34 p.m. UTC | #6
On Thu, Apr 28, 2022 at 05:17:07PM +0200, David Hildenbrand wrote:
> On 17.03.22 07:50, Bibo Mao wrote:
> > collapse huge page will copy huge page from general small pages,
> > dest node is calculated from most one of source pages, however
> > THP daemon is not scheduled on dest node. The performance may be
> > poor since huge page copying across nodes, also cache is not used
> > for target node. With this patch, khugepaged daemon switches to
> > the same numa node with huge page. It saves copying time and makes
> > use of local cache better.
> > 
> > With this patch, specint 2006 base performance is improved with 6%
> > on Loongson 3C5000L platform with 32 cores and 8 numa nodes.
> 
> If it helps, that's nice as long as it doesn't hurt other cases.
> 
> > 
> > Signed-off-by: Bibo Mao <maobibo@loongson.cn>
> > ---
> > changelog:
> > V2: remove node record for thp daemon
> > V3: remove unlikely statement
> > ---
> >  mm/khugepaged.c | 8 ++++++++
> >  1 file changed, 8 insertions(+)
> > 
> > diff --git a/mm/khugepaged.c b/mm/khugepaged.c
> > index 131492fd1148..b3cf0885f5a2 100644
> > --- a/mm/khugepaged.c
> > +++ b/mm/khugepaged.c
> > @@ -1066,6 +1066,7 @@ static void collapse_huge_page(struct mm_struct *mm,
> >  	struct vm_area_struct *vma;
> >  	struct mmu_notifier_range range;
> >  	gfp_t gfp;
> > +	const struct cpumask *cpumask;
> >  
> >  	VM_BUG_ON(address & ~HPAGE_PMD_MASK);
> >  
> > @@ -1079,6 +1080,13 @@ static void collapse_huge_page(struct mm_struct *mm,
> >  	 * that. We will recheck the vma after taking it again in write mode.
> >  	 */
> >  	mmap_read_unlock(mm);
> > +
> > +	/* sched to specified node before huage page memory copy */
> 
> huage? I assume "huge"
> 
> > +	if (task_node(current) != node) {
> > +		cpumask = cpumask_of_node(node);
> > +		if (!cpumask_empty(cpumask))
> > +			set_cpus_allowed_ptr(current, cpumask);
> > +	}
> 
> I wonder if that will always be optimized out without NUMA and if we
> want to check for IS_ENABLED(CONFIG_NUMA).
> 
> 
> Regarding comments from others, I agree: I think what we'd actually want
> is something like "try to reschedule to one of these CPUs immediately.
> If they are all busy, just stay here.
> 
> 
> Also, I do wonder if there could already be scenarios where someone
> wants to let khugepaged run only on selected housekeeping CPUs (e.g.,
> when pinning VCPUs in a VM to physical CPUs). It might even degrade the
> VM performance in that case if we schedule something unrelated on these
> CPUs. (I don't know which interfaces we might already have to configure
> housekeeping CPUs for kthreads).
> 
> I can spot in kernel/kthread.c:kthread()
> 
> set_cpus_allowed_ptr(current, housekeeping_cpumask(HK_TYPE_KTHREAD));
> 
> Hmmmmm ...

Yes, that's a valid point; for RT, AFAIK many users tune the kernel
threads specifically on demand by pinning them. So I'm not sure whether
this new behaviour could break some existing users, by either (1) trying
to pin khugepaged onto some isolated cores (which could cause spikes?),
or (2) messing up the admin's previous pin settings on the khugepaged
kthread.

The other thing is that khugepaged's movement across cores seems to be
quite random, because the pages it scans can be unpredictably stored on
different NUMA nodes, so it can easily start bouncing on some hosts, and
that does sound questionable, as I raised in the (pure) question
previously on the 2nd point, irrespective of the benchmark result.
Andrew Morton May 13, 2022, 12:36 a.m. UTC | #7
On Thu, 28 Apr 2022 12:34:07 -0400 Peter Xu <peterx@redhat.com> wrote:

> On Thu, Apr 28, 2022 at 05:17:07PM +0200, David Hildenbrand wrote:
> > On 17.03.22 07:50, Bibo Mao wrote:
> > > collapse huge page will copy huge page from general small pages,
> > > dest node is calculated from most one of source pages, however
> > > THP daemon is not scheduled on dest node. The performance may be
> > > poor since huge page copying across nodes, also cache is not used
> > > for target node. With this patch, khugepaged daemon switches to
> > > the same numa node with huge page. It saves copying time and makes
> > > use of local cache better.
> > > 
> > > With this patch, specint 2006 base performance is improved with 6%
> > > on Loongson 3C5000L platform with 32 cores and 8 numa nodes.
> > 
> > If it helps, that's nice as long as it doesn't hurt other cases.
> > 

Quite a bit of doubtful feedback and we have yet to hear from the
author.  I'll drop the patch.

Bibo, please resend at a later time if you feel the patch remains
desirable.  Please attempt to address the feedback via code changes
and/or changelogging.
bibo mao May 13, 2022, 1:19 a.m. UTC | #8
On 2022/5/13 08:36, Andrew Morton wrote:
> On Thu, 28 Apr 2022 12:34:07 -0400 Peter Xu <peterx@redhat.com> wrote:
> 
>> On Thu, Apr 28, 2022 at 05:17:07PM +0200, David Hildenbrand wrote:
>>> On 17.03.22 07:50, Bibo Mao wrote:
>>>> collapse huge page will copy huge page from general small pages,
>>>> dest node is calculated from most one of source pages, however
>>>> THP daemon is not scheduled on dest node. The performance may be
>>>> poor since huge page copying across nodes, also cache is not used
>>>> for target node. With this patch, khugepaged daemon switches to
>>>> the same numa node with huge page. It saves copying time and makes
>>>> use of local cache better.
>>>>
>>>> With this patch, specint 2006 base performance is improved with 6%
>>>> on Loongson 3C5000L platform with 32 cores and 8 numa nodes.
>>>
>>> If it helps, that's nice as long as it doesn't hurt other cases.
>>>
> 
> Quite a bit of doubtful feedback and we have yet to hear from the
> author.  I'll drop the patch.
> 
> Bibo, please resend at a later time if you feel the patch remains
> desirable.  Please attempt to address the feedback via code changes
> and/or changelogging.
Sorry for the late response; the mail was filtered and I did not notice it. The result is not so pronounced once the bandwidth between nodes is improved; it is only about a 1% improvement for specint2006 with 32 cores on my box.

For now I do not see a negative effect from this patch, unless the user wants to keep some cores separated from the khugepaged daemon process.


regards
bibo,mao
bibo mao May 13, 2022, 1:29 a.m. UTC | #9
On 2022/5/13 09:19, maobibo wrote:
> 
> 
> On 2022/5/13 08:36, Andrew Morton wrote:
>> On Thu, 28 Apr 2022 12:34:07 -0400 Peter Xu <peterx@redhat.com> wrote:
>>
>>> On Thu, Apr 28, 2022 at 05:17:07PM +0200, David Hildenbrand wrote:
>>>> On 17.03.22 07:50, Bibo Mao wrote:
>>>>> collapse huge page will copy huge page from general small pages,
>>>>> dest node is calculated from most one of source pages, however
>>>>> THP daemon is not scheduled on dest node. The performance may be
>>>>> poor since huge page copying across nodes, also cache is not used
>>>>> for target node. With this patch, khugepaged daemon switches to
>>>>> the same numa node with huge page. It saves copying time and makes
>>>>> use of local cache better.
>>>>>
>>>>> With this patch, specint 2006 base performance is improved with 6%
>>>>> on Loongson 3C5000L platform with 32 cores and 8 numa nodes.
>>>>
>>>> If it helps, that's nice as long as it doesn't hurt other cases.
>>>>
>>
>> Quite a bit of doubtful feedback and we have yet to hear from the
>> author.  I'll drop the patch.
>>
>> Bibo, please resend at a later time if you feel the patch remains
>> desirable.  Please attempt to address the feedback via code changes
>> and/or changelogging.
> Sorry for the late response, the mail is filtered and I did not notice that. The result is not so obvious after bandwidth is improved between nodes, it is only about 1% improvement for specint2006 for 32 core on my box.
> 
> Now I do not see negative effective about this patch unless user wants to keep some cores separated from khugepaged daemon process.

Can we provide an extra parameter to let the khugepaged daemon's scheduling be either bound to the node or left free? If so, I will provide an updated patch.
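
If that route were taken, the existing khugepaged sysfs directory would be
the natural place for such a switch. A rough, hypothetical sketch of the
knob (the attribute name numa_bind and the khugepaged_numa_bind variable
are invented here for illustration and do not exist in the kernel;
collapse_huge_page() would check the flag before calling
set_cpus_allowed_ptr()):

/* Hypothetical: /sys/kernel/mm/transparent_hugepage/khugepaged/numa_bind */
static unsigned int khugepaged_numa_bind __read_mostly = 1;

static ssize_t numa_bind_show(struct kobject *kobj,
			      struct kobj_attribute *attr, char *buf)
{
	return sysfs_emit(buf, "%u\n", khugepaged_numa_bind);
}

static ssize_t numa_bind_store(struct kobject *kobj,
			       struct kobj_attribute *attr,
			       const char *buf, size_t count)
{
	unsigned int val;
	int err = kstrtouint(buf, 10, &val);

	if (err || val > 1)
		return -EINVAL;
	khugepaged_numa_bind = val;
	return count;
}
static struct kobj_attribute numa_bind_attr =
	__ATTR(numa_bind, 0644, numa_bind_show, numa_bind_store);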

> 
> 
> regards
> bibo,mao
Andrew Morton May 13, 2022, 1:49 a.m. UTC | #10
On Fri, 13 May 2022 09:29:07 +0800 maobibo <maobibo@loongson.cn> wrote:

> 
> 
> >> and/or changelogging.
> > Sorry for the late response, the mail is filtered and I did not notice that. The result is not so obvious after bandwidth is improved between nodes, it is only about 1% improvement for specint2006 for 32 core on my box.
> > 
> > Now I do not see negative effective about this patch unless user wants to keep some cores separated from khugepaged daemon process.
> 
> Can we provide an extra parameter to let khugepaged daemon scheduling binded to node or freely? If can, I will provide updated patch.

It has always surprised me that we have a single khugepaged thread.  If
we had a thread per node, you'd be all fixed up, yes?

Ditto ksmd.
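
A very rough sketch of what starting one khugepaged instance per memory
node could look like (the thread function khugepaged_node_fn() is
hypothetical, and a real conversion would also need per-node scan lists,
error unwinding and shutdown handling):

static int start_khugepaged_per_node(void)
{
	int nid;

	for_each_node_state(nid, N_MEMORY) {
		struct task_struct *t;

		t = kthread_create_on_node(khugepaged_node_fn,
					   (void *)(long)nid, nid,
					   "khugepaged/%d", nid);
		if (IS_ERR(t))
			return PTR_ERR(t);
		/* Keep each instance on the CPUs of its own node. */
		kthread_bind_mask(t, cpumask_of_node(nid));
		wake_up_process(t);
	}
	return 0;
}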
bibo mao May 13, 2022, 1:59 a.m. UTC | #11
On 2022/5/13 09:49, Andrew Morton wrote:
> On Fri, 13 May 2022 09:29:07 +0800 maobibo <maobibo@loongson.cn> wrote:
> 
>>
>>
>>>> and/or changelogging.
>>> Sorry for the late response, the mail is filtered and I did not notice that. The result is not so obvious after bandwidth is improved between nodes, it is only about 1% improvement for specint2006 for 32 core on my box.
>>>
>>> Now I do not see negative effective about this patch unless user wants to keep some cores separated from khugepaged daemon process.
>>
>> Can we provide an extra parameter to let khugepaged daemon scheduling binded to node or freely? If can, I will provide updated patch.
> 
> It has always surprised me that we have a single khugepaged thread.  If
> we had a thread per node, you'd be all fixed up, yes?
Yes, a thread per node would solve this issue, and it could also speed up huge page scanning. That should be useful for workloads that run for a short time, like specint.
> 
> Ditto ksmd.
Yang Shi May 13, 2022, 2:40 a.m. UTC | #12
On Thu, May 12, 2022 at 6:49 PM Andrew Morton <akpm@linux-foundation.org> wrote:
>
> On Fri, 13 May 2022 09:29:07 +0800 maobibo <maobibo@loongson.cn> wrote:
>
> >
> >
> > >> and/or changelogging.
> > > Sorry for the late response, the mail is filtered and I did not notice that. The result is not so obvious after bandwidth is improved between nodes, it is only about 1% improvement for specint2006 for 32 core on my box.
> > >
> > > Now I do not see negative effective about this patch unless user wants to keep some cores separated from khugepaged daemon process.
> >
> > Can we provide an extra parameter to let khugepaged daemon scheduling binded to node or freely? If can, I will provide updated patch.
>
> It has always surprised me that we have a single khugepaged thread.  If
> we had a thread per node, you'd be all fixed up, yes?

Actually, I was thinking about this before, but TBH I didn't see much
benefit from that approach. khugepaged scans VMAs, and the mapped pages
may be spread across all nodes. It is not like kswapd, which can scan the
LRUs of a specific node.

>
> Ditto ksmd.
>

Patch

diff --git a/mm/khugepaged.c b/mm/khugepaged.c
index 131492fd1148..b3cf0885f5a2 100644
--- a/mm/khugepaged.c
+++ b/mm/khugepaged.c
@@ -1066,6 +1066,7 @@  static void collapse_huge_page(struct mm_struct *mm,
 	struct vm_area_struct *vma;
 	struct mmu_notifier_range range;
 	gfp_t gfp;
+	const struct cpumask *cpumask;
 
 	VM_BUG_ON(address & ~HPAGE_PMD_MASK);
 
@@ -1079,6 +1080,13 @@  static void collapse_huge_page(struct mm_struct *mm,
 	 * that. We will recheck the vma after taking it again in write mode.
 	 */
 	mmap_read_unlock(mm);
+
+	/* sched to specified node before huage page memory copy */
+	if (task_node(current) != node) {
+		cpumask = cpumask_of_node(node);
+		if (!cpumask_empty(cpumask))
+			set_cpus_allowed_ptr(current, cpumask);
+	}
 	new_page = khugepaged_alloc_page(hpage, gfp, node);
 	if (!new_page) {
 		result = SCAN_ALLOC_HUGE_PAGE_FAIL;