Message ID | 20220317065024.2635069-1-maobibo@loongson.cn
---|---
State | New
Series | [v3] mm/khugepaged: sched to numa node when collapse huge page
On Thu, 17 Mar 2022 02:50:24 -0400 Bibo Mao <maobibo@loongson.cn> wrote:

> Collapsing a huge page copies its contents from the original small
> pages. The destination node is chosen as the node holding most of the
> source pages, but the THP daemon is not scheduled on that node, so
> performance may suffer: the copy crosses nodes and the target node's
> cache goes unused. With this patch, the khugepaged daemon switches to
> the same NUMA node as the huge page, saving copy time and making
> better use of the local cache.
>
> With this patch, specint 2006 base performance improves by 6% on a
> Loongson 3C5000L platform with 32 cores and 8 NUMA nodes.

Are there any acks for this one please?

> --- a/mm/khugepaged.c
> +++ b/mm/khugepaged.c
> @@ -1066,6 +1066,7 @@ static void collapse_huge_page(struct mm_struct *mm,
>  	struct vm_area_struct *vma;
>  	struct mmu_notifier_range range;
>  	gfp_t gfp;
> +	const struct cpumask *cpumask;
>
>  	VM_BUG_ON(address & ~HPAGE_PMD_MASK);
>
> @@ -1079,6 +1080,13 @@ static void collapse_huge_page(struct mm_struct *mm,
>  	 * that. We will recheck the vma after taking it again in write mode.
>  	 */
>  	mmap_read_unlock(mm);
> +
> +	/* sched to specified node before huage page memory copy */
> +	if (task_node(current) != node) {
> +		cpumask = cpumask_of_node(node);
> +		if (!cpumask_empty(cpumask))
> +			set_cpus_allowed_ptr(current, cpumask);
> +	}
>  	new_page = khugepaged_alloc_page(hpage, gfp, node);
>  	if (!new_page) {
>  		result = SCAN_ALLOC_HUGE_PAGE_FAIL;
> --
> 2.31.1
On Wed, Apr 27, 2022 at 1:48 PM Andrew Morton <akpm@linux-foundation.org> wrote:
>
> On Thu, 17 Mar 2022 02:50:24 -0400 Bibo Mao <maobibo@loongson.cn> wrote:
>
> > Collapsing a huge page copies its contents from the original small
> > pages. The destination node is chosen as the node holding most of the
> > source pages, but the THP daemon is not scheduled on that node, so
> > performance may suffer: the copy crosses nodes and the target node's
> > cache goes unused. With this patch, the khugepaged daemon switches to
> > the same NUMA node as the huge page, saving copy time and making
> > better use of the local cache.
> >
> > With this patch, specint 2006 base performance improves by 6% on a
> > Loongson 3C5000L platform with 32 cores and 8 NUMA nodes.
>
> Are there any acks for this one please?

TBH, I'm a little bit reluctant about this patch. I agree that running
khugepaged on the same node as the source and destination pages could
reduce cross-socket traffic and use the cache more efficiently. But I'm
not sure whether it is really worth it. For example, on a busy system
khugepaged may jump from CPU to CPU, which may interfere with the
scheduler, and khugepaged has to wait to run on the target CPU, which
may take an indefinite amount of time. In addition, the gain also
depends on the locality of the source pages (how many of them are on
the same node), how often khugepaged is woken up on a different node,
and so on.

Even if it were proven worth it, I'd prefer that set_cpus_allowed_ptr()
be called between mmap_read_unlock() and mmap_write_lock(), to avoid
wasting the effort on some error paths.

> > --- a/mm/khugepaged.c
> > +++ b/mm/khugepaged.c
> > @@ -1066,6 +1066,7 @@ static void collapse_huge_page(struct mm_struct *mm,
> >  	struct vm_area_struct *vma;
> >  	struct mmu_notifier_range range;
> >  	gfp_t gfp;
> > +	const struct cpumask *cpumask;
> >
> >  	VM_BUG_ON(address & ~HPAGE_PMD_MASK);
> >
> > @@ -1079,6 +1080,13 @@ static void collapse_huge_page(struct mm_struct *mm,
> >  	 * that. We will recheck the vma after taking it again in write mode.
> >  	 */
> >  	mmap_read_unlock(mm);
> > +
> > +	/* sched to specified node before huage page memory copy */
> > +	if (task_node(current) != node) {
> > +		cpumask = cpumask_of_node(node);
> > +		if (!cpumask_empty(cpumask))
> > +			set_cpus_allowed_ptr(current, cpumask);
> > +	}
> >  	new_page = khugepaged_alloc_page(hpage, gfp, node);
> >  	if (!new_page) {
> >  		result = SCAN_ALLOC_HUGE_PAGE_FAIL;
> > --
> > 2.31.1
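To make Yang's placement suggestion above concrete: a minimal sketch
(illustrative only, never posted) against the collapse_huge_page() of
that era, with the hop moved to the gap right before mmap_write_lock(),
after every step that can still fail:

	/*
	 * By this point khugepaged_alloc_page(), mem_cgroup_charge()
	 * and hugepage_vma_revalidate() have all succeeded, so none of
	 * their error paths pays for a pointless CPU hop.
	 */
	mmap_read_unlock(mm);

	/* hop to the destination node just before the actual copy */
	if (task_node(current) != node) {
		const struct cpumask *cpumask = cpumask_of_node(node);

		if (!cpumask_empty(cpumask))
			set_cpus_allowed_ptr(current, cpumask);
	}

	mmap_write_lock(mm);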
On 27.04.22 22:48, Andrew Morton wrote:
> On Thu, 17 Mar 2022 02:50:24 -0400 Bibo Mao <maobibo@loongson.cn> wrote:
>
>> Collapsing a huge page copies its contents from the original small
>> pages. The destination node is chosen as the node holding most of the
>> source pages, but the THP daemon is not scheduled on that node, so
>> performance may suffer: the copy crosses nodes and the target node's
>> cache goes unused. With this patch, the khugepaged daemon switches to
>> the same NUMA node as the huge page, saving copy time and making
>> better use of the local cache.
>>
>> With this patch, specint 2006 base performance improves by 6% on a
>> Loongson 3C5000L platform with 32 cores and 8 NUMA nodes.
>
> Are there any acks for this one please?

I'll have a look!
Hi, Bibo,

On Thu, Mar 17, 2022 at 02:50:24AM -0400, Bibo Mao wrote:
> Collapsing a huge page copies its contents from the original small
> pages. The destination node is chosen as the node holding most of the
> source pages, but the THP daemon is not scheduled on that node, so
> performance may suffer: the copy crosses nodes and the target node's
> cache goes unused. With this patch, the khugepaged daemon switches to
> the same NUMA node as the huge page, saving copy time and making
> better use of the local cache.
>
> With this patch, specint 2006 base performance improves by 6% on a
> Loongson 3C5000L platform with 32 cores and 8 NUMA nodes.

I'm totally unfamiliar with specint, so a pure question: will this make
a real difference in real-world workloads? I assume that in the real
world the memory affinity to the processors changes relatively slowly
on tuned systems, so even if khugepaged copies a bit more slowly, it
should not affect the real workload much after the movement completes?

The other question: if it makes sense, is it applicable to file THPs
too (collapse_file)?

Thanks,

> Signed-off-by: Bibo Mao <maobibo@loongson.cn>
> ---
> changelog:
> V2: remove node record for thp daemon
> V3: remove unlikely statement
> ---
>  mm/khugepaged.c | 8 ++++++++
>  1 file changed, 8 insertions(+)
>
> diff --git a/mm/khugepaged.c b/mm/khugepaged.c
> index 131492fd1148..b3cf0885f5a2 100644
> --- a/mm/khugepaged.c
> +++ b/mm/khugepaged.c
> @@ -1066,6 +1066,7 @@ static void collapse_huge_page(struct mm_struct *mm,
>  	struct vm_area_struct *vma;
>  	struct mmu_notifier_range range;
>  	gfp_t gfp;
> +	const struct cpumask *cpumask;
>
>  	VM_BUG_ON(address & ~HPAGE_PMD_MASK);
>
> @@ -1079,6 +1080,13 @@ static void collapse_huge_page(struct mm_struct *mm,
>  	 * that. We will recheck the vma after taking it again in write mode.
>  	 */
>  	mmap_read_unlock(mm);
> +
> +	/* sched to specified node before huage page memory copy */
> +	if (task_node(current) != node) {
> +		cpumask = cpumask_of_node(node);
> +		if (!cpumask_empty(cpumask))
> +			set_cpus_allowed_ptr(current, cpumask);
> +	}
>  	new_page = khugepaged_alloc_page(hpage, gfp, node);
>  	if (!new_page) {
>  		result = SCAN_ALLOC_HUGE_PAGE_FAIL;
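For context on Peter's second question: collapse_file() in the same
file also receives the target node and allocates its destination huge
page via khugepaged_alloc_page(), so a hypothetical extension (never
posted; shown only as a sketch) could mirror the anonymous path right
before that allocation:

	/*
	 * Hypothetical sketch only -- not part of the posted patch.
	 * collapse_file() also allocates its destination huge page on
	 * 'node', so the same pre-copy hop would slot in right before
	 * its khugepaged_alloc_page() call.
	 */
	if (task_node(current) != node) {
		const struct cpumask *cpumask = cpumask_of_node(node);

		if (!cpumask_empty(cpumask))
			set_cpus_allowed_ptr(current, cpumask);
	}
	new_page = khugepaged_alloc_page(hpage, gfp, node);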
On 17.03.22 07:50, Bibo Mao wrote:
> Collapsing a huge page copies its contents from the original small
> pages. The destination node is chosen as the node holding most of the
> source pages, but the THP daemon is not scheduled on that node, so
> performance may suffer: the copy crosses nodes and the target node's
> cache goes unused. With this patch, the khugepaged daemon switches to
> the same NUMA node as the huge page, saving copy time and making
> better use of the local cache.
>
> With this patch, specint 2006 base performance improves by 6% on a
> Loongson 3C5000L platform with 32 cores and 8 NUMA nodes.

If it helps, that's nice, as long as it doesn't hurt other cases.

> Signed-off-by: Bibo Mao <maobibo@loongson.cn>
> ---
> changelog:
> V2: remove node record for thp daemon
> V3: remove unlikely statement
> ---
>  mm/khugepaged.c | 8 ++++++++
>  1 file changed, 8 insertions(+)
>
> diff --git a/mm/khugepaged.c b/mm/khugepaged.c
> index 131492fd1148..b3cf0885f5a2 100644
> --- a/mm/khugepaged.c
> +++ b/mm/khugepaged.c
> @@ -1066,6 +1066,7 @@ static void collapse_huge_page(struct mm_struct *mm,
>  	struct vm_area_struct *vma;
>  	struct mmu_notifier_range range;
>  	gfp_t gfp;
> +	const struct cpumask *cpumask;
>
>  	VM_BUG_ON(address & ~HPAGE_PMD_MASK);
>
> @@ -1079,6 +1080,13 @@ static void collapse_huge_page(struct mm_struct *mm,
>  	 * that. We will recheck the vma after taking it again in write mode.
>  	 */
>  	mmap_read_unlock(mm);
> +
> +	/* sched to specified node before huage page memory copy */

huage? I assume "huge".

> +	if (task_node(current) != node) {
> +		cpumask = cpumask_of_node(node);
> +		if (!cpumask_empty(cpumask))
> +			set_cpus_allowed_ptr(current, cpumask);
> +	}

I wonder if that will always be optimized out without NUMA, and whether
we want to check for IS_ENABLED(CONFIG_NUMA).

Regarding comments from others, I agree: I think what we'd actually
want is something like "try to reschedule to one of these CPUs
immediately; if they are all busy, just stay here".

Also, I do wonder if there could already be scenarios where someone
wants to let khugepaged run only on selected housekeeping CPUs (e.g.,
when pinning VCPUs in a VM to physical CPUs). It might even degrade the
VM performance in that case if we schedule something unrelated on these
CPUs. (I don't know which interfaces we might already have to configure
housekeeping CPUs for kthreads.)

I can spot in kernel/kthread.c:kthread()

	set_cpus_allowed_ptr(current, housekeeping_cpumask(HK_TYPE_KTHREAD));

Hmmmmm ...
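A hedged sketch of the IS_ENABLED(CONFIG_NUMA) idea David raises (the
helper name is invented for illustration; the comment notes how it
still falls short of his "if they are all busy, just stay here"
semantics):

/* Hypothetical helper, for illustration only. */
static void khugepaged_try_hop_to_node(int node)
{
	const struct cpumask *cpumask;

	/* On !NUMA builds this makes the whole body compile away. */
	if (!IS_ENABLED(CONFIG_NUMA))
		return;

	if (task_node(current) == node)
		return;

	cpumask = cpumask_of_node(node);
	if (cpumask_empty(cpumask))
		return;

	/*
	 * Caveat: this still pins khugepaged to the node's CPUs rather
	 * than "try once, otherwise stay here"; that softer semantic
	 * has no direct equivalent in set_cpus_allowed_ptr() and would
	 * need scheduler support.
	 */
	set_cpus_allowed_ptr(current, cpumask);
}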
On Thu, Apr 28, 2022 at 05:17:07PM +0200, David Hildenbrand wrote:
> On 17.03.22 07:50, Bibo Mao wrote:
> > Collapsing a huge page copies its contents from the original small
> > pages. The destination node is chosen as the node holding most of the
> > source pages, but the THP daemon is not scheduled on that node, so
> > performance may suffer: the copy crosses nodes and the target node's
> > cache goes unused. With this patch, the khugepaged daemon switches to
> > the same NUMA node as the huge page, saving copy time and making
> > better use of the local cache.
> >
> > With this patch, specint 2006 base performance improves by 6% on a
> > Loongson 3C5000L platform with 32 cores and 8 NUMA nodes.
>
> If it helps, that's nice, as long as it doesn't hurt other cases.
>
> > Signed-off-by: Bibo Mao <maobibo@loongson.cn>
> > ---
> > changelog:
> > V2: remove node record for thp daemon
> > V3: remove unlikely statement
> > ---
> >  mm/khugepaged.c | 8 ++++++++
> >  1 file changed, 8 insertions(+)
> >
> > diff --git a/mm/khugepaged.c b/mm/khugepaged.c
> > index 131492fd1148..b3cf0885f5a2 100644
> > --- a/mm/khugepaged.c
> > +++ b/mm/khugepaged.c
> > @@ -1066,6 +1066,7 @@ static void collapse_huge_page(struct mm_struct *mm,
> >  	struct vm_area_struct *vma;
> >  	struct mmu_notifier_range range;
> >  	gfp_t gfp;
> > +	const struct cpumask *cpumask;
> >
> >  	VM_BUG_ON(address & ~HPAGE_PMD_MASK);
> >
> > @@ -1079,6 +1080,13 @@ static void collapse_huge_page(struct mm_struct *mm,
> >  	 * that. We will recheck the vma after taking it again in write mode.
> >  	 */
> >  	mmap_read_unlock(mm);
> > +
> > +	/* sched to specified node before huage page memory copy */
>
> huage? I assume "huge".
>
> > +	if (task_node(current) != node) {
> > +		cpumask = cpumask_of_node(node);
> > +		if (!cpumask_empty(cpumask))
> > +			set_cpus_allowed_ptr(current, cpumask);
> > +	}
>
> I wonder if that will always be optimized out without NUMA, and whether
> we want to check for IS_ENABLED(CONFIG_NUMA).
>
> Regarding comments from others, I agree: I think what we'd actually
> want is something like "try to reschedule to one of these CPUs
> immediately; if they are all busy, just stay here".
>
> Also, I do wonder if there could already be scenarios where someone
> wants to let khugepaged run only on selected housekeeping CPUs (e.g.,
> when pinning VCPUs in a VM to physical CPUs). It might even degrade the
> VM performance in that case if we schedule something unrelated on these
> CPUs. (I don't know which interfaces we might already have to configure
> housekeeping CPUs for kthreads.)
>
> I can spot in kernel/kthread.c:kthread()
>
> 	set_cpus_allowed_ptr(current, housekeeping_cpumask(HK_TYPE_KTHREAD));
>
> Hmmmmm ...

Yes, that's a valid point; for RT, AFAIK many users tune kernel threads
specifically on demand by pinning them. So I'm not sure whether this
new algorithm could already break some users, by either (1) trying to
pin khugepaged onto some isolated cores (which can cause spikes?), or
(2) messing up the admin's previous pin settings on the khugepaged
kthread.

The other thing is that khugepaged's movement across the cores seems
quite random, because the pages it scans can be unpredictably stored on
different NUMA nodes, so logically it can easily start bouncing on some
hosts, and that does sound questionable; it is the (pure) question I
raised earlier on this second point, independent of the benchmark
result.
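One hedged way to combine the node hop with the housekeeping mask David
spotted, assuming that intersecting with
housekeeping_cpumask(HK_TYPE_KTHREAD) is an acceptable policy (the
helper name is hypothetical):

#include <linux/cpumask.h>
#include <linux/sched/isolation.h>

/*
 * Hypothetical sketch: only hop onto CPUs that are both on the target
 * node and permitted for kthreads by the housekeeping setup.
 */
static void khugepaged_hop_to_node_hk(int node)
{
	cpumask_var_t mask;

	if (task_node(current) == node)
		return;
	if (!alloc_cpumask_var(&mask, GFP_KERNEL))
		return;

	/* cpumask_and() returns false when the intersection is empty */
	if (cpumask_and(mask, cpumask_of_node(node),
			housekeeping_cpumask(HK_TYPE_KTHREAD)))
		set_cpus_allowed_ptr(current, mask);

	free_cpumask_var(mask);
}

Note that this would respect the boot-time isolation masks but still
override an admin's runtime pinning of the khugepaged kthread, which is
Peter's second concern.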
On Thu, 28 Apr 2022 12:34:07 -0400 Peter Xu <peterx@redhat.com> wrote:

> On Thu, Apr 28, 2022 at 05:17:07PM +0200, David Hildenbrand wrote:
> > On 17.03.22 07:50, Bibo Mao wrote:
> > > Collapsing a huge page copies its contents from the original small
> > > pages. The destination node is chosen as the node holding most of
> > > the source pages, but the THP daemon is not scheduled on that node,
> > > so performance may suffer: the copy crosses nodes and the target
> > > node's cache goes unused. With this patch, the khugepaged daemon
> > > switches to the same NUMA node as the huge page, saving copy time
> > > and making better use of the local cache.
> > >
> > > With this patch, specint 2006 base performance improves by 6% on a
> > > Loongson 3C5000L platform with 32 cores and 8 NUMA nodes.
> >
> > If it helps, that's nice, as long as it doesn't hurt other cases.

Quite a bit of doubtful feedback, and we have yet to hear from the
author. I'll drop the patch.

Bibo, please resend at a later time if you feel the patch remains
desirable. Please attempt to address the feedback via code changes
and/or changelogging.
On 2022/5/13 08:36, Andrew Morton wrote:
> On Thu, 28 Apr 2022 12:34:07 -0400 Peter Xu <peterx@redhat.com> wrote:
>
>> On Thu, Apr 28, 2022 at 05:17:07PM +0200, David Hildenbrand wrote:
>>> On 17.03.22 07:50, Bibo Mao wrote:
>>>> Collapsing a huge page copies its contents from the original small
>>>> pages. The destination node is chosen as the node holding most of
>>>> the source pages, but the THP daemon is not scheduled on that node,
>>>> so performance may suffer: the copy crosses nodes and the target
>>>> node's cache goes unused. With this patch, the khugepaged daemon
>>>> switches to the same NUMA node as the huge page, saving copy time
>>>> and making better use of the local cache.
>>>>
>>>> With this patch, specint 2006 base performance improves by 6% on a
>>>> Loongson 3C5000L platform with 32 cores and 8 NUMA nodes.
>>>
>>> If it helps, that's nice, as long as it doesn't hurt other cases.
>
> Quite a bit of doubtful feedback, and we have yet to hear from the
> author. I'll drop the patch.
>
> Bibo, please resend at a later time if you feel the patch remains
> desirable. Please attempt to address the feedback via code changes
> and/or changelogging.

Sorry for the late response; the mail was filtered and I did not notice
it. The result is not as obvious after the bandwidth between nodes is
improved: it is only about a 1% improvement for specint2006 with 32
cores on my box.

For now I do not see a negative effect from this patch, unless the user
wants to keep some cores separated from the khugepaged daemon process.

regards
bibo,mao
On 2022/5/13 09:19, maobibo wrote:
> On 2022/5/13 08:36, Andrew Morton wrote:
>> On Thu, 28 Apr 2022 12:34:07 -0400 Peter Xu <peterx@redhat.com> wrote:
>>
>>> On Thu, Apr 28, 2022 at 05:17:07PM +0200, David Hildenbrand wrote:
>>>> On 17.03.22 07:50, Bibo Mao wrote:
>>>>> Collapsing a huge page copies its contents from the original small
>>>>> pages. The destination node is chosen as the node holding most of
>>>>> the source pages, but the THP daemon is not scheduled on that
>>>>> node, so performance may suffer: the copy crosses nodes and the
>>>>> target node's cache goes unused. With this patch, the khugepaged
>>>>> daemon switches to the same NUMA node as the huge page, saving
>>>>> copy time and making better use of the local cache.
>>>>>
>>>>> With this patch, specint 2006 base performance improves by 6% on a
>>>>> Loongson 3C5000L platform with 32 cores and 8 NUMA nodes.
>>>>
>>>> If it helps, that's nice, as long as it doesn't hurt other cases.
>>
>> Quite a bit of doubtful feedback, and we have yet to hear from the
>> author. I'll drop the patch.
>>
>> Bibo, please resend at a later time if you feel the patch remains
>> desirable. Please attempt to address the feedback via code changes
>> and/or changelogging.
> Sorry for the late response; the mail was filtered and I did not
> notice it. The result is not as obvious after the bandwidth between
> nodes is improved: it is only about a 1% improvement for specint2006
> with 32 cores on my box.
>
> For now I do not see a negative effect from this patch, unless the
> user wants to keep some cores separated from the khugepaged daemon
> process.

Could we provide an extra parameter that lets the khugepaged daemon be
scheduled either bound to the node or freely? If so, I will provide an
updated patch.

> regards
> bibo,mao
On Fri, 13 May 2022 09:29:07 +0800 maobibo <maobibo@loongson.cn> wrote:

> >> and/or changelogging.
> > Sorry for the late response; the mail was filtered and I did not
> > notice it. The result is not as obvious after the bandwidth between
> > nodes is improved: it is only about a 1% improvement for specint2006
> > with 32 cores on my box.
> >
> > For now I do not see a negative effect from this patch, unless the
> > user wants to keep some cores separated from the khugepaged daemon
> > process.
>
> Could we provide an extra parameter that lets the khugepaged daemon be
> scheduled either bound to the node or freely? If so, I will provide an
> updated patch.

It has always surprised me that we have a single khugepaged thread. If
we had a thread per node, you'd be all fixed up, yes?

Ditto ksmd.
On 2022/5/13 09:49, Andrew Morton wrote:
> On Fri, 13 May 2022 09:29:07 +0800 maobibo <maobibo@loongson.cn> wrote:
>
>>>> and/or changelogging.
>>> Sorry for the late response; the mail was filtered and I did not
>>> notice it. The result is not as obvious after the bandwidth between
>>> nodes is improved: it is only about a 1% improvement for specint2006
>>> with 32 cores on my box.
>>>
>>> For now I do not see a negative effect from this patch, unless the
>>> user wants to keep some cores separated from the khugepaged daemon
>>> process.
>>
>> Could we provide an extra parameter that lets the khugepaged daemon
>> be scheduled either bound to the node or freely? If so, I will
>> provide an updated patch.
>
> It has always surprised me that we have a single khugepaged thread. If
> we had a thread per node, you'd be all fixed up, yes?

Yes, a thread per node would solve this issue, and it could also speed
up huge page scanning. That should be useful for workloads that run for
a short time, like specint.

> Ditto ksmd.
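Purely as a thought experiment on Andrew's per-node idea (nothing like
this was posted; khugepaged_node_fn and the overall shape are invented
for illustration):

/* Hypothetical per-node scan loop. */
static int khugepaged_node_fn(void *data);

/*
 * Thought-experiment sketch: one khugepaged instance per memory node,
 * each bound to its own node's CPUs.
 */
static int __init start_per_node_khugepaged(void)
{
	struct task_struct *tsk;
	int nid;

	for_each_node_state(nid, N_MEMORY) {
		tsk = kthread_create_on_node(khugepaged_node_fn, NULL,
					     nid, "khugepaged/%d", nid);
		if (IS_ERR(tsk))
			return PTR_ERR(tsk);
		kthread_bind_mask(tsk, cpumask_of_node(nid));
		wake_up_process(tsk);
	}
	return 0;
}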
On Thu, May 12, 2022 at 6:49 PM Andrew Morton <akpm@linux-foundation.org> wrote:
>
> On Fri, 13 May 2022 09:29:07 +0800 maobibo <maobibo@loongson.cn> wrote:
>
> > >> and/or changelogging.
> > > Sorry for the late response; the mail was filtered and I did not
> > > notice it. The result is not as obvious after the bandwidth between
> > > nodes is improved: it is only about a 1% improvement for
> > > specint2006 with 32 cores on my box.
> > >
> > > For now I do not see a negative effect from this patch, unless the
> > > user wants to keep some cores separated from the khugepaged daemon
> > > process.
> >
> > Could we provide an extra parameter that lets the khugepaged daemon
> > be scheduled either bound to the node or freely? If so, I will
> > provide an updated patch.
>
> It has always surprised me that we have a single khugepaged thread. If
> we had a thread per node, you'd be all fixed up, yes?

Actually I was thinking about this before, but TBH I didn't see too
much benefit in this approach. khugepaged scans VMAs, and the mapped
pages may be spread across all nodes. It is not like kswapd, which
scans the LRUs of one specific node.

> Ditto ksmd.
diff --git a/mm/khugepaged.c b/mm/khugepaged.c
index 131492fd1148..b3cf0885f5a2 100644
--- a/mm/khugepaged.c
+++ b/mm/khugepaged.c
@@ -1066,6 +1066,7 @@ static void collapse_huge_page(struct mm_struct *mm,
 	struct vm_area_struct *vma;
 	struct mmu_notifier_range range;
 	gfp_t gfp;
+	const struct cpumask *cpumask;

 	VM_BUG_ON(address & ~HPAGE_PMD_MASK);

@@ -1079,6 +1080,13 @@ static void collapse_huge_page(struct mm_struct *mm,
 	 * that. We will recheck the vma after taking it again in write mode.
 	 */
 	mmap_read_unlock(mm);
+
+	/* sched to specified node before huage page memory copy */
+	if (task_node(current) != node) {
+		cpumask = cpumask_of_node(node);
+		if (!cpumask_empty(cpumask))
+			set_cpus_allowed_ptr(current, cpumask);
+	}
 	new_page = khugepaged_alloc_page(hpage, gfp, node);
 	if (!new_page) {
 		result = SCAN_ALLOC_HUGE_PAGE_FAIL;
Collapsing a huge page copies its contents from the original small
pages. The destination node is chosen as the node holding most of the
source pages, but the THP daemon is not scheduled on that node, so
performance may suffer: the copy crosses nodes and the target node's
cache goes unused. With this patch, the khugepaged daemon switches to
the same NUMA node as the huge page, saving copy time and making
better use of the local cache.

With this patch, specint 2006 base performance improves by 6% on a
Loongson 3C5000L platform with 32 cores and 8 NUMA nodes.

Signed-off-by: Bibo Mao <maobibo@loongson.cn>
---
changelog:
V2: remove node record for thp daemon
V3: remove unlikely statement
---
 mm/khugepaged.c | 8 ++++++++
 1 file changed, 8 insertions(+)