Message ID: 20180627084954.73ucqvac62v5gje4@techsingularity.net
State: Not Applicable, archived
On Tue, Jul 17, 2018 at 10:45:51AM +0200, Jirka Hladky wrote:
> Hi Mel,
>
> we have compared 4.18 +
> git://git.kernel.org/pub/scm/linux/kernel/git/mel/linux.git
> sched-numa-fast-crossnode-v1r12 against the 4.16 kernel and the
> performance results look very good!
>

Excellent, thanks to both Kamil and yourself for collecting the data.
It's helpful to have independent verification.

> We see performance gains of about 10-20% for SPECjbb2005. NAS results
> are a little noisy but show overall performance gains as well (total
> runtime reduced from 6 hours 34 minutes to 6 hours 26 minutes, to give
> you a specific example).

Great.

> The only benchmark showing a slight regression is stream - but the
> regression is just a few percent (up to 10%) and I think it's not a
> real concern given that it's an artificial benchmark.
>

Agreed.

> How is your testing going? Do you think that the
> sched-numa-fast-crossnode-v1r12 series can make it into 4.18?
>

My own testing has completed and the results are within expectations; I
saw no red flags. Unfortunately, I consider it unlikely they'll be merged
for 4.18. Srikar Dronamraju's series is likely to need another update
and I would need to rebase my patches on top of that. Given the scope
and complexity, I find it unlikely they would be accepted for an -rc,
particularly this late in the -rc cycle. Whether we hit the 4.19 merge
window or not will depend on when Srikar's series gets updated.

> Thanks a lot for your efforts to improve the performance!

My pleasure.
Resending in the plain text mode.

> My own testing has completed and the results are within expectations; I
> saw no red flags. Unfortunately, I consider it unlikely they'll be merged
> for 4.18. Srikar Dronamraju's series is likely to need another update
> and I would need to rebase my patches on top of that. Given the scope
> and complexity, I find it unlikely they would be accepted for an -rc,
> particularly this late in the -rc cycle. Whether we hit the 4.19 merge
> window or not will depend on when Srikar's series gets updated.

Hi Mel,

we collaborated back in July on the scheduler patch improving performance
by allowing faster memory migration. You came up with the
"sched-numa-fast-crossnode-v1r12" series here:

https://git.kernel.org/pub/scm/linux/kernel/git/mel/linux.git

which has shown good performance results both in your and our testing.

Do you have any update on the latest status? Is there a plan to merge
this series into the 4.19 kernel? We have just tested 4.19.0-0.rc1.1 and,
based on the results, it seems that the patch is not included (and I don't
see it listed in git shortlog v4.18..v4.19-rc1 ./kernel/sched).

With 4.19rc1 we see a performance drop of
  * up to 40% (NAS bench) relative to 4.18 + sched-numa-fast-crossnode-v1r12
  * up to 20% (NAS, Stream, SPECjbb2005, SPECjvm2008) relative to 4.18 vanilla

It's unclear what the next steps should be - should we wait for
"sched-numa-fast-crossnode-v1r12" to be merged, or should we start looking
at what has caused the drop in performance going from 4.18 to 4.19rc1?

We would appreciate any guidance on how to proceed.

Thanks a lot!

Jirka
On Mon, Sep 03, 2018 at 05:07:15PM +0200, Jirka Hladky wrote:
> Resending in the plain text mode.
>
> Hi Mel,
>
> we collaborated back in July on the scheduler patch improving
> performance by allowing faster memory migration. You came up with the
> "sched-numa-fast-crossnode-v1r12" series here:
>
> https://git.kernel.org/pub/scm/linux/kernel/git/mel/linux.git
>
> which has shown good performance results both in your and our testing.
>

I remember.

> Do you have any update on the latest status? Is there a plan to merge
> this series into the 4.19 kernel? We have just tested 4.19.0-0.rc1.1
> and, based on the results, it seems that the patch is not included (and
> I don't see it listed in git shortlog v4.18..v4.19-rc1 ./kernel/sched).
>

Srikar's series that mine depended upon was only partially merged due to
a review bottleneck. He posted a v2, but it was during the merge window
and it will likely need a v3 to avoid falling through the cracks. When it
is merged, I'll rebase my series on top and post it. While I didn't check
against 4.19-rc1, I did find that rebasing on top of the partial series
in 4.18 did not have as big an improvement.

> With 4.19rc1 we see a performance drop of
> * up to 40% (NAS bench) relative to 4.18 + sched-numa-fast-crossnode-v1r12
> * up to 20% (NAS, Stream, SPECjbb2005, SPECjvm2008) relative to 4.18 vanilla
>
> It's unclear what the next steps should be - should we wait for
> "sched-numa-fast-crossnode-v1r12" to be merged, or should we start
> looking at what has caused the drop in performance going from 4.18 to
> 4.19rc1?
>

Both are valid options. If you take the latter option, I suggest looking
at whether 2d4056fafa196e1ab4e7161bae4df76f9602d56d is the source of the
issue, as at least one auto-bisection found that it may be problematic.
Whether it is an issue or not depends heavily on the number of threads
relative to the socket size.
Hi Mel,

thanks for sharing the background information! We will check whether
2d4056fafa196e1ab4e7161bae4df76f9602d56d is causing the current
regression in 4.19rc1 and let you know the outcome.

Jirka
Hi Mel,

we have results with 2d4056fafa196e1ab4e7161bae4df76f9602d56d reverted.

  * Compared to 4.18, there is still a performance regression - especially
    with NAS (the sp_C_x subtest) and SPECjvm2008. On 4-node NUMA systems,
    the regression is around 10-15%.
  * Compared to 4.19rc1, there is a clear gain of around 20% across all
    benchmarks.

While reverting 2d4056fafa196e1ab4e7161bae4df76f9602d56d has helped a lot,
there is another issue as well. Could you please recommend some commit
prior to 2d4056fafa196e1ab4e7161bae4df76f9602d56d to try?

Regarding the current results, how do we proceed? Could you please contact
Srikar and ask for advice, or should we contact him directly?

Thanks a lot!

Jirka
On Thu, Sep 06, 2018 at 10:16:28AM +0200, Jirka Hladky wrote:
> Hi Mel,
>
> we have results with 2d4056fafa196e1ab4e7161bae4df76f9602d56d reverted.
>
>   * Compared to 4.18, there is still a performance regression -
>     especially with NAS (the sp_C_x subtest) and SPECjvm2008. On 4-node
>     NUMA systems, the regression is around 10-15%.
>   * Compared to 4.19rc1, there is a clear gain of around 20% across all
>     benchmarks.
>

Ok.

> While reverting 2d4056fafa196e1ab4e7161bae4df76f9602d56d has helped a
> lot, there is another issue as well. Could you please recommend some
> commit prior to 2d4056fafa196e1ab4e7161bae4df76f9602d56d to try?
>

Maybe 305c1fac3225dfa7eeb89bfe91b7335a6edd5172. That introduces a weird
condition in terms of idle CPU handling that has been problematic.

> Regarding the current results, how do we proceed? Could you please
> contact Srikar and ask for advice, or should we contact him directly?
>

I would suggest contacting Srikar directly. While I'm working on a series
that touches on some similar areas, there is no guarantee it'll be a
success, as I'm not primarily upstream-focused at the moment.

Restarting the thread would also end up with a much more sensible cc list.
> Maybe 305c1fac3225dfa7eeb89bfe91b7335a6edd5172. That introduces a weird
> condition in terms of idle CPU handling that has been problematic.

We will try that, thanks!

> I would suggest contacting Srikar directly.

I will do that right away. Whom should I put on Cc? Just you and
linux-kernel@vger.kernel.org? Should I put Ingo and Peter on Cc as well?

$ scripts/get_maintainer.pl -f kernel/sched
Ingo Molnar <mingo@redhat.com> (maintainer:SCHEDULER)
Peter Zijlstra <peterz@infradead.org> (maintainer:SCHEDULER)
linux-kernel@vger.kernel.org (open list:SCHEDULER)

Jirka
Hi Mel,

we have tried to revert the following two commits:

  305c1fac3225
  2d4056fafa196e1ab

We had to revert 10864a9e222048a862da2c21efa28929a4dfed15 as well. The
performance of that kernel was better than with only 2d4056fafa196e1ab
reverted, but still worse than the performance of the 4.18 kernel.

Since the patch series from Srikar shows very good results, we will wait
until it's merged into the mainline kernel and stop the bisecting efforts
for now. Your sched-numa-fast-crossnode-v1r12 series (on top of 4.18)
gives slightly better results than Srikar's series in some cases, so it
would be really great if both series could be merged together. Removing
the NUMA migration rate limit helps performance.

Thanks a lot for your help on this!

Jirka
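The diff below closes out the archived thread. Its include/linux/mmzone.h,
include/trace/events/migrate.h, mm/migrate.c and mm/page_alloc.c hunks
remove the NUMA balancing migration rate limiting wholesale, while the
kernel/sched/fair.c hunks add an early fast path to
should_numa_migrate_memory() - first faults and private faults migrate
immediately during a task's first few scan passes - and make
migrate_task_rq_fair() more selective about resetting the scan period on
cross-node moves.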
diff --git a/include/linux/mmzone.h b/include/linux/mmzone.h
index 0dbe1d5bb936..eea5f82ca447 100644
--- a/include/linux/mmzone.h
+++ b/include/linux/mmzone.h
@@ -669,11 +669,6 @@ typedef struct pglist_data {
 	struct task_struct *kcompactd;
 #endif
 #ifdef CONFIG_NUMA_BALANCING
-	/* Rate limiting time interval */
-	unsigned long numabalancing_migrate_next_window;
-
-	/* Number of pages migrated during the rate limiting time interval */
-	unsigned long numabalancing_migrate_nr_pages;
 	int active_node_migrate;
 #endif
 	/*
diff --git a/include/trace/events/migrate.h b/include/trace/events/migrate.h
index 711372845945..de8c73f9abcf 100644
--- a/include/trace/events/migrate.h
+++ b/include/trace/events/migrate.h
@@ -71,32 +71,6 @@ TRACE_EVENT(mm_migrate_pages,
 		__print_symbolic(__entry->reason, MIGRATE_REASON))
 );

-TRACE_EVENT(mm_numa_migrate_ratelimit,
-
-	TP_PROTO(struct task_struct *p, int dst_nid, unsigned long nr_pages),
-
-	TP_ARGS(p, dst_nid, nr_pages),
-
-	TP_STRUCT__entry(
-		__array(char, comm, TASK_COMM_LEN)
-		__field(pid_t, pid)
-		__field(int, dst_nid)
-		__field(unsigned long, nr_pages)
-	),
-
-	TP_fast_assign(
-		memcpy(__entry->comm, p->comm, TASK_COMM_LEN);
-		__entry->pid = p->pid;
-		__entry->dst_nid = dst_nid;
-		__entry->nr_pages = nr_pages;
-	),
-
-	TP_printk("comm=%s pid=%d dst_nid=%d nr_pages=%lu",
-		__entry->comm,
-		__entry->pid,
-		__entry->dst_nid,
-		__entry->nr_pages)
-);
 #endif /* _TRACE_MIGRATE_H */

 /* This part must be outside protection */
diff --git a/kernel/sched/fair.c b/kernel/sched/fair.c
index 6ca3be059872..c020af2c58ec 100644
--- a/kernel/sched/fair.c
+++ b/kernel/sched/fair.c
@@ -1394,6 +1394,17 @@ bool should_numa_migrate_memory(struct task_struct *p, struct page * page,
 	int last_cpupid, this_cpupid;

 	this_cpupid = cpu_pid_to_cpupid(dst_cpu, current->pid);
+	last_cpupid = page_cpupid_xchg_last(page, this_cpupid);
+
+	/*
+	 * Allow first faults or private faults to migrate immediately early in
+	 * the lifetime of a task. The magic number 4 is based on waiting for
+	 * two full passes of the "multi-stage node selection" test that is
+	 * executed below.
+	 */
+	if ((p->numa_preferred_nid == -1 || p->numa_scan_seq <= 4) &&
+	    (cpupid_pid_unset(last_cpupid) || cpupid_match_pid(p, last_cpupid)))
+		return true;

 	/*
 	 * Multi-stage node selection is used in conjunction with a periodic
@@ -1412,7 +1423,6 @@ bool should_numa_migrate_memory(struct task_struct *p, struct page * page,
 	 * This quadric squishes small probabilities, making it less likely we
 	 * act on an unlikely task<->page relation.
 	 */
-	last_cpupid = page_cpupid_xchg_last(page, this_cpupid);
 	if (!cpupid_pid_unset(last_cpupid) &&
 			cpupid_to_nid(last_cpupid) != dst_nid)
 		return false;
@@ -6702,6 +6712,9 @@ static void migrate_task_rq_fair(struct task_struct *p, int new_cpu __maybe_unused)
 	p->se.exec_start = 0;

 #ifdef CONFIG_NUMA_BALANCING
+	if (!static_branch_likely(&sched_numa_balancing))
+		return;
+
 	if (!p->mm || (p->flags & PF_EXITING))
 		return;

@@ -6709,8 +6722,26 @@ static void migrate_task_rq_fair(struct task_struct *p, int new_cpu __maybe_unused)
 		int src_nid = cpu_to_node(task_cpu(p));
 		int dst_nid = cpu_to_node(new_cpu);

-		if (src_nid != dst_nid)
-			p->numa_scan_period = task_scan_start(p);
+		if (src_nid == dst_nid)
+			return;
+
+		/*
+		 * Allow resets if faults have been trapped before one scan
+		 * has completed. This is most likely due to a new task that
+		 * is pulled cross-node due to wakeups or load balancing.
+		 */
+		if (p->numa_scan_seq) {
+			/*
+			 * Avoid scan adjustments if moving to the preferred
+			 * node or if the task was not previously running on
+			 * the preferred node.
+			 */
+			if (dst_nid == p->numa_preferred_nid ||
+			    (p->numa_preferred_nid != -1 && src_nid != p->numa_preferred_nid))
+				return;
+		}
+
+		p->numa_scan_period = task_scan_start(p);
 	}
 #endif
 }
diff --git a/mm/migrate.c b/mm/migrate.c
index c7749902a160..f935f4781036 100644
--- a/mm/migrate.c
+++ b/mm/migrate.c
@@ -1856,54 +1856,6 @@ static struct page *alloc_misplaced_dst_page(struct page *page,
 	return newpage;
 }

-/*
- * page migration rate limiting control.
- * Do not migrate more than @pages_to_migrate in a @migrate_interval_millisecs
- * window of time. Default here says do not migrate more than 1280M per second.
- */
-static unsigned int migrate_interval_millisecs __read_mostly = 100;
-static unsigned int ratelimit_pages __read_mostly = 128 << (20 - PAGE_SHIFT);
-
-/* Returns true if the node is migrate rate-limited after the update */
-static bool numamigrate_update_ratelimit(pg_data_t *pgdat,
-					unsigned long nr_pages)
-{
-	unsigned long next_window, interval;
-
-	next_window = READ_ONCE(pgdat->numabalancing_migrate_next_window);
-	interval = msecs_to_jiffies(migrate_interval_millisecs);
-
-	/*
-	 * Rate-limit the amount of data that is being migrated to a node.
-	 * Optimal placement is no good if the memory bus is saturated and
-	 * all the time is being spent migrating!
-	 */
-	if (time_after(jiffies, next_window)) {
-		if (xchg(&pgdat->numabalancing_migrate_nr_pages, 0)) {
-			do {
-				next_window += interval;
-			} while (unlikely(time_after(jiffies, next_window)));
-
-			WRITE_ONCE(pgdat->numabalancing_migrate_next_window,
-				   next_window);
-		}
-	}
-
-	if (pgdat->numabalancing_migrate_nr_pages > ratelimit_pages) {
-		trace_mm_numa_migrate_ratelimit(current, pgdat->node_id,
-								nr_pages);
-		return true;
-	}
-
-	/*
-	 * This is an unlocked non-atomic update so errors are possible.
-	 * The consequences are failing to migrate when we potentiall should
-	 * have which is not severe enough to warrant locking. If it is ever
-	 * a problem, it can be converted to a per-cpu counter.
-	 */
-	pgdat->numabalancing_migrate_nr_pages += nr_pages;
-	return false;
-}
-
 static int numamigrate_isolate_page(pg_data_t *pgdat, struct page *page)
 {
 	int page_lru;
@@ -1976,14 +1928,6 @@ int migrate_misplaced_page(struct page *page, struct vm_area_struct *vma,
 	if (page_is_file_cache(page) && PageDirty(page))
 		goto out;

-	/*
-	 * Rate-limit the amount of data that is being migrated to a node.
-	 * Optimal placement is no good if the memory bus is saturated and
-	 * all the time is being spent migrating!
-	 */
-	if (numamigrate_update_ratelimit(pgdat, 1))
-		goto out;
-
 	isolated = numamigrate_isolate_page(pgdat, page);
 	if (!isolated)
 		goto out;
@@ -2030,14 +1974,6 @@ int migrate_misplaced_transhuge_page(struct mm_struct *mm,
 	unsigned long mmun_start = address & HPAGE_PMD_MASK;
 	unsigned long mmun_end = mmun_start + HPAGE_PMD_SIZE;

-	/*
-	 * Rate-limit the amount of data that is being migrated to a node.
-	 * Optimal placement is no good if the memory bus is saturated and
-	 * all the time is being spent migrating!
-	 */
-	if (numamigrate_update_ratelimit(pgdat, HPAGE_PMD_NR))
-		goto out_dropref;
-
 	new_page = alloc_pages_node(node,
 		(GFP_TRANSHUGE_LIGHT | __GFP_THISNODE),
 		HPAGE_PMD_ORDER);
@@ -2134,7 +2070,6 @@ int migrate_misplaced_transhuge_page(struct mm_struct *mm,

 out_fail:
 	count_vm_events(PGMIGRATE_FAIL, HPAGE_PMD_NR);
-out_dropref:
 	ptl = pmd_lock(mm, pmd);
 	if (pmd_same(*pmd, entry)) {
 		entry = pmd_modify(entry, vma->vm_page_prot);
diff --git a/mm/page_alloc.c b/mm/page_alloc.c
index a4fc9b0798df..9049e7b26e92 100644
--- a/mm/page_alloc.c
+++ b/mm/page_alloc.c
@@ -6211,11 +6211,6 @@ static void __paginginit free_area_init_core(struct pglist_data *pgdat)
 	int nid = pgdat->node_id;

 	pgdat_resize_init(pgdat);
-#ifdef CONFIG_NUMA_BALANCING
-	pgdat->numabalancing_migrate_nr_pages = 0;
-	pgdat->active_node_migrate = 0;
-	pgdat->numabalancing_migrate_next_window = jiffies;
-#endif
 #ifdef CONFIG_TRANSPARENT_HUGEPAGE
 	spin_lock_init(&pgdat->split_queue_lock);
 	INIT_LIST_HEAD(&pgdat->split_queue);
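As a reading aid for the should_numa_migrate_memory() hunk above, here is a
minimal userspace sketch of the early-migration decision it adds. This is an
illustration, not kernel code: struct task, CPUPID_UNSET, and the helper
migrate_fast_path() are hypothetical stand-ins for the kernel's task_struct
and cpupid machinery.

/*
 * Minimal userspace sketch (stand-ins, not kernel code) of the fast path
 * the patch adds: during the first few NUMA scan passes of a task's
 * lifetime, first faults (no previous cpupid recorded) and private faults
 * (the last fault was by this task) migrate immediately instead of
 * waiting for the "multi-stage node selection" filter.
 */
#include <stdbool.h>
#include <stdio.h>

struct task {                      /* stand-in for task_struct */
	int pid;
	int numa_preferred_nid;    /* -1 until a preferred node is chosen */
	int numa_scan_seq;         /* completed NUMA scan passes */
};

#define CPUPID_UNSET (-1)          /* stand-in for the kernel's encoding */

static bool cpupid_pid_unset(int last_cpupid)
{
	return last_cpupid == CPUPID_UNSET;
}

static bool cpupid_match_pid(const struct task *p, int last_cpupid)
{
	return last_cpupid == p->pid; /* simplified: the real cpupid packs cpu and pid bits */
}

/* Mirrors the condition inserted at the top of should_numa_migrate_memory(). */
static bool migrate_fast_path(const struct task *p, int last_cpupid)
{
	return (p->numa_preferred_nid == -1 || p->numa_scan_seq <= 4) &&
	       (cpupid_pid_unset(last_cpupid) || cpupid_match_pid(p, last_cpupid));
}

int main(void)
{
	struct task young = { .pid = 42, .numa_preferred_nid = -1, .numa_scan_seq = 1 };
	struct task old   = { .pid = 42, .numa_preferred_nid = 0,  .numa_scan_seq = 9 };

	/* Early in life, a private fault migrates at once (prints 1)... */
	printf("young task, private fault: %d\n", migrate_fast_path(&young, 42));
	/* ...while a mature task falls through to the stricter filter (prints 0). */
	printf("mature task, private fault: %d\n", migrate_fast_path(&old, 42));
	return 0;
}

Per the comment in the diff itself, the magic number 4 corresponds to waiting
for two full passes of the multi-stage node selection test before the
stricter filtering takes over.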