
[RFC,1/1] mm: page_alloc: replace mm_percpu_wq with kthreads in drain_all_pages

Message ID 20220225012819.1807147-1-surenb@google.com (mailing list archive)
State New
Series [RFC,1/1] mm: page_alloc: replace mm_percpu_wq with kthreads in drain_all_pages

Commit Message

Suren Baghdasaryan Feb. 25, 2022, 1:28 a.m. UTC
Sending as an RFC to confirm if this is the right direction and to
clarify if other tasks currently executed on mm_percpu_wq should be
also moved to kthreads. The patch seems stable in testing but I want
to collect more performance data before submitting a non-RFC version.


Currently drain_all_pages uses mm_percpu_wq to drain pages from pcp
list during direct reclaim. The tasks on a workqueue can be delayed
by other tasks in the workqueues using the same per-cpu worker pool.
This results in sizable delays in drain_all_pages when cpus are highly
contended.
Memory management operations designed to relieve memory pressure
should not be blocked by other tasks, especially if the task in
direct reclaim has a higher priority than the blocking tasks.
Replace the usage of mm_percpu_wq with per-cpu low priority FIFO
kthreads to execute draining tasks.

Suggested-by: Petr Mladek <pmladek@suse.com>
Signed-off-by: Suren Baghdasaryan <surenb@google.com>
---
 mm/page_alloc.c | 84 ++++++++++++++++++++++++++++++++++++++++---------
 1 file changed, 70 insertions(+), 14 deletions(-)

Comments

Petr Mladek March 1, 2022, 12:25 p.m. UTC | #1
On Thu 2022-02-24 17:28:19, Suren Baghdasaryan wrote:
> Sending as an RFC to confirm if this is the right direction and to
> clarify if other tasks currently executed on mm_percpu_wq should be
> also moved to kthreads. The patch seems stable in testing but I want
> to collect more performance data before submitting a non-RFC version.
> 
> 
> Currently drain_all_pages uses mm_percpu_wq to drain pages from pcp
> list during direct reclaim. The tasks on a workqueue can be delayed
> by other tasks in the workqueues using the same per-cpu worker pool.
> This results in sizable delays in drain_all_pages when cpus are highly
> contended.
> Memory management operations designed to relieve memory pressure should
> not be allowed to block by other tasks, especially if the task in direct
> reclaim has higher priority than the blocking tasks.
> Replace the usage of mm_percpu_wq with per-cpu low priority FIFO
> kthreads to execute draining tasks.
> 
> Suggested-by: Petr Mladek <pmladek@suse.com>
> Signed-off-by: Suren Baghdasaryan <surenb@google.com>

The patch looks good to me. See a few comments below about things
I was in doubt about. But I do not see any real problem with
this approach.

> ---
>  mm/page_alloc.c | 84 ++++++++++++++++++++++++++++++++++++++++---------
>  1 file changed, 70 insertions(+), 14 deletions(-)
> 
> diff --git a/mm/page_alloc.c b/mm/page_alloc.c
> index 3589febc6d31..c9ab2cf4b05b 100644
> --- a/mm/page_alloc.c
> +++ b/mm/page_alloc.c
> @@ -2209,6 +2210,58 @@ _deferred_grow_zone(struct zone *zone, unsigned int order)
>  
>  #endif /* CONFIG_DEFERRED_STRUCT_PAGE_INIT */
>  
> +static void drain_local_pages_func(struct kthread_work *work);
> +
> +static int alloc_drain_worker(unsigned int cpu)
> +{
> +	struct pcpu_drain *drain;
> +
> +	mutex_lock(&pcpu_drain_mutex);
> +	drain = per_cpu_ptr(&pcpu_drain, cpu);
> +	drain->worker = kthread_create_worker_on_cpu(cpu, 0, "pg_drain/%u", cpu);
> +	if (IS_ERR(drain->worker)) {
> +		drain->worker = NULL;
> +		pr_err("Failed to create pg_drain/%u\n", cpu);
> +		goto out;
> +	}
> +	/* Ensure the thread is not blocked by normal priority tasks */
> +	sched_set_fifo_low(drain->worker->task);
> +	kthread_init_work(&drain->work, drain_local_pages_func);
> +out:
> +	mutex_unlock(&pcpu_drain_mutex);
> +
> +	return 0;
> +}
> +
> +static int free_drain_worker(unsigned int cpu)
> +{
> +	struct pcpu_drain *drain;
> +
> +	mutex_lock(&pcpu_drain_mutex);
> +	drain = per_cpu_ptr(&pcpu_drain, cpu);
> +	kthread_cancel_work_sync(&drain->work);

I do not see how CPU down was handled in the original code.

Note that workqueues call unbind_workers() when a CPU
is going down. The pending work items might be processed
on another CPU. From this POV, the new code looks safer.

> +	kthread_destroy_worker(drain->worker);
> +	drain->worker = NULL;
> +	mutex_unlock(&pcpu_drain_mutex);
> +
> +	return 0;
> +}
> +
> +static void __init init_drain_workers(void)
> +{
> +	unsigned int cpu;
> +
> +	for_each_online_cpu(cpu)
> +		alloc_drain_worker(cpu);

I thought about whether this needs to be called under cpus_read_lock(),
and I think that the code should be safe as it is. There
is this call chain:

  + kernel_init_freeable()
    + page_alloc_init_late()
      + init_drain_workers()

It is called after smp_init() but before the init process
is executed. I guess that nobody could trigger CPU hotplug
at this stage, so there is no need to synchronize
against it.

> +
> +	if (cpuhp_setup_state_nocalls(CPUHP_AP_ONLINE_DYN,
> +					"page_alloc/drain:online",
> +					alloc_drain_worker,
> +					free_drain_worker)) {
> +		pr_err("page_alloc_drain: Failed to allocate a hotplug state\n");

I am not sure if there are any special requirements about the
ordering vs. other CPU hotplug operations.

Just note that the per-CPU workqueues are started/stopped
via CPUHP_AP_WORKQUEUE_ONLINE. They become available slightly
earlier, before CPUHP_AP_ONLINE_DYN, when a CPU is being
enabled.
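
For illustration, a minimal sketch of how a dedicated hotplug state
could be used instead of the dynamic range. CPUHP_AP_PAGE_ALLOC_DRAIN_ONLINE
is a hypothetical state (not defined in the kernel) that would have to
be added to enum cpuhp_state after CPUHP_AP_WORKQUEUE_ONLINE:

	/*
	 * Hypothetical: an explicit state placed right after
	 * CPUHP_AP_WORKQUEUE_ONLINE, so the drain workers come up at a
	 * well-defined point instead of somewhere in CPUHP_AP_ONLINE_DYN.
	 */
	if (cpuhp_setup_state_nocalls(CPUHP_AP_PAGE_ALLOC_DRAIN_ONLINE,
				      "page_alloc/drain:online",
				      alloc_drain_worker,
				      free_drain_worker))
		pr_err("page_alloc_drain: Failed to set up hotplug state\n");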

> +	}
> +}
> +
>  void __init page_alloc_init_late(void)
>  {
>  	struct zone *zone;

Best Regards,
Petr
Suren Baghdasaryan March 1, 2022, 9:12 p.m. UTC | #2
On Tue, Mar 1, 2022 at 4:25 AM Petr Mladek <pmladek@suse.com> wrote:
>
> On Thu 2022-02-24 17:28:19, Suren Baghdasaryan wrote:
> > Sending as an RFC to confirm if this is the right direction and to
> > clarify if other tasks currently executed on mm_percpu_wq should be
> > also moved to kthreads. The patch seems stable in testing but I want
> > to collect more performance data before submitting a non-RFC version.
> >
> >
> > Currently drain_all_pages uses mm_percpu_wq to drain pages from pcp
> > list during direct reclaim. The tasks on a workqueue can be delayed
> > by other tasks in the workqueues using the same per-cpu worker pool.
> > This results in sizable delays in drain_all_pages when cpus are highly
> > contended.
> > Memory management operations designed to relieve memory pressure should
> > not be allowed to block by other tasks, especially if the task in direct
> > reclaim has higher priority than the blocking tasks.
> > Replace the usage of mm_percpu_wq with per-cpu low priority FIFO
> > kthreads to execute draining tasks.
> >
> > Suggested-by: Petr Mladek <pmladek@suse.com>
> > Signed-off-by: Suren Baghdasaryan <surenb@google.com>
>
> The patch looks good to me. See few comments below about things
> where I was in doubts. But I do not see any real problem with
> this approach.

Thanks for the review, Petr. One question inline.

Other than that, I would like to check:
1. Whether using low-priority FIFO for these kthreads is warranted. From
https://lore.kernel.org/all/CAEe=Sxmow-jx60cDjFMY7qi7+KVc+BT++BTdwC5+G9E=1soMmQ@mail.gmail.com/#t
my understanding was that we want this work to be done by an RT
kthread_worker, but maybe that's not appropriate here?
2. Whether we want to move any other work done on mm_percpu_wq
(vmstat_work, lru_add_drain_all) to these kthreads.
If what I have currently is OK, I'll post the first version.
Thanks,
Suren.



>
> > ---
> >  mm/page_alloc.c | 84 ++++++++++++++++++++++++++++++++++++++++---------
> >  1 file changed, 70 insertions(+), 14 deletions(-)
> >
> > diff --git a/mm/page_alloc.c b/mm/page_alloc.c
> > index 3589febc6d31..c9ab2cf4b05b 100644
> > --- a/mm/page_alloc.c
> > +++ b/mm/page_alloc.c
> > @@ -2209,6 +2210,58 @@ _deferred_grow_zone(struct zone *zone, unsigned int order)
> >
> >  #endif /* CONFIG_DEFERRED_STRUCT_PAGE_INIT */
> >
> > +static void drain_local_pages_func(struct kthread_work *work);
> > +
> > +static int alloc_drain_worker(unsigned int cpu)
> > +{
> > +     struct pcpu_drain *drain;
> > +
> > +     mutex_lock(&pcpu_drain_mutex);
> > +     drain = per_cpu_ptr(&pcpu_drain, cpu);
> > +     drain->worker = kthread_create_worker_on_cpu(cpu, 0, "pg_drain/%u", cpu);
> > +     if (IS_ERR(drain->worker)) {
> > +             drain->worker = NULL;
> > +             pr_err("Failed to create pg_drain/%u\n", cpu);
> > +             goto out;
> > +     }
> > +     /* Ensure the thread is not blocked by normal priority tasks */
> > +     sched_set_fifo_low(drain->worker->task);
> > +     kthread_init_work(&drain->work, drain_local_pages_func);
> > +out:
> > +     mutex_unlock(&pcpu_drain_mutex);
> > +
> > +     return 0;
> > +}
> > +
> > +static int free_drain_worker(unsigned int cpu)
> > +{
> > +     struct pcpu_drain *drain;
> > +
> > +     mutex_lock(&pcpu_drain_mutex);
> > +     drain = per_cpu_ptr(&pcpu_drain, cpu);
> > +     kthread_cancel_work_sync(&drain->work);
>
> I do see not how CPU down was handled in the original code.
>
> Note that workqueues call unbind_workers() when a CPU
> is going down. The pending work items might be proceed
> on another CPU. From this POV, the new code looks more
> safe.
>
> > +     kthread_destroy_worker(drain->worker);
> > +     drain->worker = NULL;
> > +     mutex_unlock(&pcpu_drain_mutex);
> > +
> > +     return 0;
> > +}
> > +
> > +static void __init init_drain_workers(void)
> > +{
> > +     unsigned int cpu;
> > +
> > +     for_each_online_cpu(cpu)
> > +             alloc_drain_worker(cpu);
>
> I though whether this need to be called under cpus_read_lock();
> And I think that the code should be safe as it is. There
> is this call chain:
>
>   + kernel_init_freeable()
>     + page_alloc_init_late()
>       + init_drain_workers()
>
> It is called after smp_init() but before the init process
> is executed. I guess that nobody could trigger CPU hotplug
> at this state. So there there is no need to synchronize
> against it.

Should I add a comment here to describe why we don't need
cpus_read_lock() here (because the init process is not yet active
at this point)?

>
> > +
> > +     if (cpuhp_setup_state_nocalls(CPUHP_AP_ONLINE_DYN,
> > +                                     "page_alloc/drain:online",
> > +                                     alloc_drain_worker,
> > +                                     free_drain_worker)) {
> > +             pr_err("page_alloc_drain: Failed to allocate a hotplug state\n");
>
> I am not sure if there are any special requirements about the
> ordering vs. other CPU hotplug operations.
>
> Just note that the per-CPU workqueues are started/stopped
> via CPUHP_AP_WORKQUEUE_ONLINE. They are available slightly
> earlier before CPUHP_AP_ONLINE_DYN when the CPU is being
> enabled.
>
> > +     }
> > +}
> > +
> >  void __init page_alloc_init_late(void)
> >  {
> >       struct zone *zone;
>
> Best Regards,
> Petr
Hillf Danton March 2, 2022, 12:21 a.m. UTC | #3
On Thu, 24 Feb 2022 17:28:19 -0800 Suren Baghdasaryan wrote:
> Sending as an RFC to confirm if this is the right direction and to
> clarify if other tasks currently executed on mm_percpu_wq should be
> also moved to kthreads. The patch seems stable in testing but I want
> to collect more performance data before submitting a non-RFC version.
> 
> 
> Currently drain_all_pages uses mm_percpu_wq to drain pages from pcp
> list during direct reclaim. The tasks on a workqueue can be delayed
> by other tasks in the workqueues using the same per-cpu worker pool.

The pending work items may each be freeing a couple of slabs/pages.
Who knows?

> This results in sizable delays in drain_all_pages when cpus are highly
> contended.
> Memory management operations designed to relieve memory pressure should
> not be allowed to block by other tasks, especially if the task in direct
> reclaim has higher priority than the blocking tasks.

I wonder why priority is the right cure for tight memory - otherwise it
would not be a problem, given a direct reclaimer of higher priority.

Off-topic question - why does it make sense in the first place for a task
of lower priority to peel pages off from another of higher priority, if
priority is considered in direct reclaim?

> Replace the usage of mm_percpu_wq with per-cpu low priority FIFO
> kthreads to execute draining tasks.
> 
> Suggested-by: Petr Mladek <pmladek@suse.com>
> Signed-off-by: Suren Baghdasaryan <surenb@google.com>
> ---
>  mm/page_alloc.c | 84 ++++++++++++++++++++++++++++++++++++++++---------
>  1 file changed, 70 insertions(+), 14 deletions(-)
> 
> diff --git a/mm/page_alloc.c b/mm/page_alloc.c
> index 3589febc6d31..c9ab2cf4b05b 100644
> --- a/mm/page_alloc.c
> +++ b/mm/page_alloc.c
> @@ -153,7 +153,8 @@ EXPORT_PER_CPU_SYMBOL(_numa_mem_);
>  /* work_structs for global per-cpu drains */
>  struct pcpu_drain {
>  	struct zone *zone;
> -	struct work_struct work;
> +	struct kthread_work work;
> +	struct kthread_worker *worker;
>  };
>  static DEFINE_MUTEX(pcpu_drain_mutex);
>  static DEFINE_PER_CPU(struct pcpu_drain, pcpu_drain);
> @@ -2209,6 +2210,58 @@ _deferred_grow_zone(struct zone *zone, unsigned int order)
>  
>  #endif /* CONFIG_DEFERRED_STRUCT_PAGE_INIT */
>  
> +static void drain_local_pages_func(struct kthread_work *work);
> +
> +static int alloc_drain_worker(unsigned int cpu)
> +{
> +	struct pcpu_drain *drain;
> +
> +	mutex_lock(&pcpu_drain_mutex);

Nit, see below.

> +	drain = per_cpu_ptr(&pcpu_drain, cpu);
> +	drain->worker = kthread_create_worker_on_cpu(cpu, 0, "pg_drain/%u", cpu);
> +	if (IS_ERR(drain->worker)) {
> +		drain->worker = NULL;
> +		pr_err("Failed to create pg_drain/%u\n", cpu);
> +		goto out;
> +	}
> +	/* Ensure the thread is not blocked by normal priority tasks */
> +	sched_set_fifo_low(drain->worker->task);
> +	kthread_init_work(&drain->work, drain_local_pages_func);
> +out:
> +	mutex_unlock(&pcpu_drain_mutex);
> +
> +	return 0;
> +}

alloc_drain_worker(unsigned int cpu)
  mutex_lock(&pcpu_drain_mutex);
  drain->worker = kthread_create_worker_on_cpu(cpu, 0, "pg_drain/%u", cpu);
    __kthread_create_worker(cpu, flags, namefmt, args);
      kzalloc(sizeof(*worker), GFP_KERNEL);
        kmalloc
          slab_alloc
	    new_slab
	      alloc_pages
	        __alloc_pages_slowpath
		  __alloc_pages_direct_reclaim 
		    drain_all_pages(NULL);
		      __drain_all_pages(zone, false);
		      if (unlikely(!mutex_trylock(&pcpu_drain_mutex))) {
		     		if (!zone)
		     			return;
		     		mutex_lock(&pcpu_drain_mutex);
		      }

Either a deadlock or no page drained wrt pcpu_drain_mutex, if I am not
missing something.

> +
> +static int free_drain_worker(unsigned int cpu)
> +{
> +	struct pcpu_drain *drain;
> +
> +	mutex_lock(&pcpu_drain_mutex);
> +	drain = per_cpu_ptr(&pcpu_drain, cpu);
> +	kthread_cancel_work_sync(&drain->work);
> +	kthread_destroy_worker(drain->worker);
> +	drain->worker = NULL;
> +	mutex_unlock(&pcpu_drain_mutex);
> +
> +	return 0;
> +}
> +
> +static void __init init_drain_workers(void)
> +{
> +	unsigned int cpu;
> +
> +	for_each_online_cpu(cpu)
> +		alloc_drain_worker(cpu);
> +
> +	if (cpuhp_setup_state_nocalls(CPUHP_AP_ONLINE_DYN,
> +					"page_alloc/drain:online",
> +					alloc_drain_worker,
> +					free_drain_worker)) {
> +		pr_err("page_alloc_drain: Failed to allocate a hotplug state\n");
> +	}
> +}
> +
>  void __init page_alloc_init_late(void)
>  {
>  	struct zone *zone;
> @@ -2245,6 +2298,8 @@ void __init page_alloc_init_late(void)
>  
>  	for_each_populated_zone(zone)
>  		set_zone_contiguous(zone);
> +
> +	init_drain_workers();
>  }
>  
>  #ifdef CONFIG_CMA
> @@ -3144,7 +3199,7 @@ void drain_local_pages(struct zone *zone)
>  		drain_pages(cpu);
>  }
>  
> -static void drain_local_pages_wq(struct work_struct *work)
> +static void drain_local_pages_func(struct kthread_work *work)
>  {
>  	struct pcpu_drain *drain;
>  
> @@ -3175,6 +3230,7 @@ static void drain_local_pages_wq(struct work_struct *work)
>  static void __drain_all_pages(struct zone *zone, bool force_all_cpus)
>  {
>  	int cpu;
> +	struct pcpu_drain *drain;
>  
>  	/*
>  	 * Allocate in the BSS so we won't require allocation in
> @@ -3182,13 +3238,6 @@ static void __drain_all_pages(struct zone *zone, bool force_all_cpus)
>  	 */
>  	static cpumask_t cpus_with_pcps;
>  
> -	/*
> -	 * Make sure nobody triggers this path before mm_percpu_wq is fully
> -	 * initialized.
> -	 */
> -	if (WARN_ON_ONCE(!mm_percpu_wq))
> -		return;
> -
>  	/*
>  	 * Do not drain if one is already in progress unless it's specific to
>  	 * a zone. Such callers are primarily CMA and memory hotplug and need
> @@ -3238,14 +3287,21 @@ static void __drain_all_pages(struct zone *zone, bool force_all_cpus)
>  	}
>  
>  	for_each_cpu(cpu, &cpus_with_pcps) {
> -		struct pcpu_drain *drain = per_cpu_ptr(&pcpu_drain, cpu);
> +		drain = per_cpu_ptr(&pcpu_drain, cpu);
>  
>  		drain->zone = zone;
> -		INIT_WORK(&drain->work, drain_local_pages_wq);
> -		queue_work_on(cpu, mm_percpu_wq, &drain->work);
> +		if (likely(drain->worker))
> +			kthread_queue_work(drain->worker, &drain->work);
> +	}
> +	/* Wait for kthreads to finish or drain itself */
> +	for_each_cpu(cpu, &cpus_with_pcps) {
> +		drain = per_cpu_ptr(&pcpu_drain, cpu);
> +
> +		if (likely(drain->worker))
> +			kthread_flush_work(&drain->work);
> +		else
> +			drain_local_pages_func(&drain->work);
>  	}
> -	for_each_cpu(cpu, &cpus_with_pcps)
> -		flush_work(&per_cpu_ptr(&pcpu_drain, cpu)->work);
>  
>  	mutex_unlock(&pcpu_drain_mutex);
>  }
> -- 
> 2.35.1.574.g5d30c73bfb-goog
> 
>
Petr Mladek March 2, 2022, 12:18 p.m. UTC | #4
On Tue 2022-03-01 13:12:19, Suren Baghdasaryan wrote:
> On Tue, Mar 1, 2022 at 4:25 AM Petr Mladek <pmladek@suse.com> wrote:
> >
> > On Thu 2022-02-24 17:28:19, Suren Baghdasaryan wrote:
> > > Sending as an RFC to confirm if this is the right direction and to
> > > clarify if other tasks currently executed on mm_percpu_wq should be
> > > also moved to kthreads. The patch seems stable in testing but I want
> > > to collect more performance data before submitting a non-RFC version.
> > >
> > >
> > > Currently drain_all_pages uses mm_percpu_wq to drain pages from pcp
> > > list during direct reclaim. The tasks on a workqueue can be delayed
> > > by other tasks in the workqueues using the same per-cpu worker pool.
> > > This results in sizable delays in drain_all_pages when cpus are highly
> > > contended.
> > > Memory management operations designed to relieve memory pressure should
> > > not be allowed to block by other tasks, especially if the task in direct
> > > reclaim has higher priority than the blocking tasks.
> > > Replace the usage of mm_percpu_wq with per-cpu low priority FIFO
> > > kthreads to execute draining tasks.
> > >
> > > Suggested-by: Petr Mladek <pmladek@suse.com>
> > > Signed-off-by: Suren Baghdasaryan <surenb@google.com>
> >
> > The patch looks good to me. See few comments below about things
> > where I was in doubts. But I do not see any real problem with
> > this approach.
> 
> Thanks for the review, Petr. One question inline.

Answering just this question.

> > > diff --git a/mm/page_alloc.c b/mm/page_alloc.c
> > > index 3589febc6d31..c9ab2cf4b05b 100644
> > > --- a/mm/page_alloc.c
> > > +++ b/mm/page_alloc.c
> > > +static void __init init_drain_workers(void)
> > > +{
> > > +     unsigned int cpu;
> > > +
> > > +     for_each_online_cpu(cpu)
> > > +             alloc_drain_worker(cpu);
> >
> > I though whether this need to be called under cpus_read_lock();
> > And I think that the code should be safe as it is. There
> > is this call chain:
> >
> >   + kernel_init_freeable()
> >     + page_alloc_init_late()
> >       + init_drain_workers()
> >
> > It is called after smp_init() but before the init process
> > is executed. I guess that nobody could trigger CPU hotplug
> > at this state. So there there is no need to synchronize
> > against it.
> 
> Should I add a comment here to describe why we don't need
> cpus_read_lock here (due to init process not being active at this
> time)?

I would add the comment. That said, I hope that I am right and
lock is not really needed ;-)
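
A minimal sketch of what such a comment could look like (wording is
illustrative only, not from the posted patch):

	/*
	 * No need for cpus_read_lock() here: this runs from
	 * page_alloc_init_late(), after smp_init() but before the init
	 * process is executed, so CPU hotplug cannot race with us yet.
	 */
	for_each_online_cpu(cpu)
		alloc_drain_worker(cpu);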

Best Regards,
Petr
Suren Baghdasaryan March 2, 2022, 11:06 p.m. UTC | #5
On Tue, Mar 1, 2022 at 4:22 PM Hillf Danton <hdanton@sina.com> wrote:
>
> On Thu, 24 Feb 2022 17:28:19 -0800 Suren Baghdasaryan wrote:
> > Sending as an RFC to confirm if this is the right direction and to
> > clarify if other tasks currently executed on mm_percpu_wq should be
> > also moved to kthreads. The patch seems stable in testing but I want
> > to collect more performance data before submitting a non-RFC version.
> >
> >
> > Currently drain_all_pages uses mm_percpu_wq to drain pages from pcp
> > list during direct reclaim. The tasks on a workqueue can be delayed
> > by other tasks in the workqueues using the same per-cpu worker pool.
>
> The pending works may be freeing a couple of slabs/pages each. Who knows?

If we are talking about work specifically scheduled on mm_percpu_wq,
then apart from drain_all_pages, mm_percpu_wq is used to execute
vmstat_update and lru_add_drain_cpu for draining pagevecs. If OTOH what
you mean is that the work might be blocked by, say, kswapd, which is
freeing memory, then sure, who knows...

>
> > This results in sizable delays in drain_all_pages when cpus are highly
> > contended.
> > Memory management operations designed to relieve memory pressure should
> > not be allowed to block by other tasks, especially if the task in direct
> > reclaim has higher priority than the blocking tasks.
>
> Wonder why priority is the right cure to tight memory - otherwise it was
> not a problem given a direct reclaimer of higher priority.
>
> Off topic question - why is it making sense from begining for a task of
> lower priority to peel pages off from another of higher priority if
> priority is considered in direct reclaim?

The way I understood your question is that you are asking why we have
to use workqueues of potentially lower priority to drain pages for a
potentially higher priority process in direct reclaim (which is
blocked waiting for workqueues to complete draining)?
If so, IIUC this mechanism was introduced in
https://lore.kernel.org/all/20170117092954.15413-4-mgorman@techsingularity.net
to avoid draining from IPI context (CC'ing Mel Gorman to correct me if
I'm wrong).
I think the issue here is that in the process we are losing
information about the priority of the process in direct reclaim, which
might lead to priority inversion.

I'm not sure at all if this is the right solution here, hence sending
this as RFC to gather more feedback.
The discussion that led to this patch starts here:
https://lore.kernel.org/all/YhNTcM9XtqA1zUUi@dhcp22.suse.cz (CC'ing
people who were involved in that discussion)

>
> > Replace the usage of mm_percpu_wq with per-cpu low priority FIFO
> > kthreads to execute draining tasks.
> >
> > Suggested-by: Petr Mladek <pmladek@suse.com>
> > Signed-off-by: Suren Baghdasaryan <surenb@google.com>
> > ---
> >  mm/page_alloc.c | 84 ++++++++++++++++++++++++++++++++++++++++---------
> >  1 file changed, 70 insertions(+), 14 deletions(-)
> >
> > diff --git a/mm/page_alloc.c b/mm/page_alloc.c
> > index 3589febc6d31..c9ab2cf4b05b 100644
> > --- a/mm/page_alloc.c
> > +++ b/mm/page_alloc.c
> > @@ -153,7 +153,8 @@ EXPORT_PER_CPU_SYMBOL(_numa_mem_);
> >  /* work_structs for global per-cpu drains */
> >  struct pcpu_drain {
> >       struct zone *zone;
> > -     struct work_struct work;
> > +     struct kthread_work work;
> > +     struct kthread_worker *worker;
> >  };
> >  static DEFINE_MUTEX(pcpu_drain_mutex);
> >  static DEFINE_PER_CPU(struct pcpu_drain, pcpu_drain);
> > @@ -2209,6 +2210,58 @@ _deferred_grow_zone(struct zone *zone, unsigned int order)
> >
> >  #endif /* CONFIG_DEFERRED_STRUCT_PAGE_INIT */
> >
> > +static void drain_local_pages_func(struct kthread_work *work);
> > +
> > +static int alloc_drain_worker(unsigned int cpu)
> > +{
> > +     struct pcpu_drain *drain;
> > +
> > +     mutex_lock(&pcpu_drain_mutex);
>
> Nit, see below.
>
> > +     drain = per_cpu_ptr(&pcpu_drain, cpu);
> > +     drain->worker = kthread_create_worker_on_cpu(cpu, 0, "pg_drain/%u", cpu);
> > +     if (IS_ERR(drain->worker)) {
> > +             drain->worker = NULL;
> > +             pr_err("Failed to create pg_drain/%u\n", cpu);
> > +             goto out;
> > +     }
> > +     /* Ensure the thread is not blocked by normal priority tasks */
> > +     sched_set_fifo_low(drain->worker->task);
> > +     kthread_init_work(&drain->work, drain_local_pages_func);
> > +out:
> > +     mutex_unlock(&pcpu_drain_mutex);
> > +
> > +     return 0;
> > +}
>
> alloc_drain_worker(unsigned int cpu)
>   mutex_lock(&pcpu_drain_mutex);
>   drain->worker = kthread_create_worker_on_cpu(cpu, 0, "pg_drain/%u", cpu);
>     __kthread_create_worker(cpu, flags, namefmt, args);
>       kzalloc(sizeof(*worker), GFP_KERNEL);
>         kmalloc
>           slab_alloc
>             new_slab
>               alloc_pages
>                 __alloc_pages_slowpath
>                   __alloc_pages_direct_reclaim
>                     drain_all_pages(NULL);
>                       __drain_all_pages(zone, false);
>                       if (unlikely(!mutex_trylock(&pcpu_drain_mutex))) {
>                                 if (!zone)
>                                         return;
>                                 mutex_lock(&pcpu_drain_mutex);
>                       }
>
> Either deadlock or no page drained wrt pcpu_drain_mutex if nothing missed.

Thanks for noticing it! I think this can be easily fixed by calling
kthread_create_worker_on_cpu outside of the pcpu_drain_mutex
protection and then assigning the result to drain->worker after taking
pcpu_drain_mutex.
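
A rough sketch of that fix (illustrative only; error handling kept
minimal and not part of the posted patch):

static int alloc_drain_worker(unsigned int cpu)
{
	struct kthread_worker *worker;
	struct pcpu_drain *drain;

	/*
	 * Create the worker outside pcpu_drain_mutex so that a direct
	 * reclaim triggered by this allocation cannot recurse on the
	 * mutex via drain_all_pages().
	 */
	worker = kthread_create_worker_on_cpu(cpu, 0, "pg_drain/%u", cpu);
	if (IS_ERR(worker)) {
		pr_err("Failed to create pg_drain/%u\n", cpu);
		return 0;
	}
	/* Ensure the thread is not blocked by normal priority tasks */
	sched_set_fifo_low(worker->task);

	mutex_lock(&pcpu_drain_mutex);
	drain = per_cpu_ptr(&pcpu_drain, cpu);
	drain->worker = worker;
	kthread_init_work(&drain->work, drain_local_pages_func);
	mutex_unlock(&pcpu_drain_mutex);

	return 0;
}
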
Thanks,
Suren.

>
> > +
> > +static int free_drain_worker(unsigned int cpu)
> > +{
> > +     struct pcpu_drain *drain;
> > +
> > +     mutex_lock(&pcpu_drain_mutex);
> > +     drain = per_cpu_ptr(&pcpu_drain, cpu);
> > +     kthread_cancel_work_sync(&drain->work);
> > +     kthread_destroy_worker(drain->worker);
> > +     drain->worker = NULL;
> > +     mutex_unlock(&pcpu_drain_mutex);
> > +
> > +     return 0;
> > +}
> > +
> > +static void __init init_drain_workers(void)
> > +{
> > +     unsigned int cpu;
> > +
> > +     for_each_online_cpu(cpu)
> > +             alloc_drain_worker(cpu);
> > +
> > +     if (cpuhp_setup_state_nocalls(CPUHP_AP_ONLINE_DYN,
> > +                                     "page_alloc/drain:online",
> > +                                     alloc_drain_worker,
> > +                                     free_drain_worker)) {
> > +             pr_err("page_alloc_drain: Failed to allocate a hotplug state\n");
> > +     }
> > +}
> > +
> >  void __init page_alloc_init_late(void)
> >  {
> >       struct zone *zone;
> > @@ -2245,6 +2298,8 @@ void __init page_alloc_init_late(void)
> >
> >       for_each_populated_zone(zone)
> >               set_zone_contiguous(zone);
> > +
> > +     init_drain_workers();
> >  }
> >
> >  #ifdef CONFIG_CMA
> > @@ -3144,7 +3199,7 @@ void drain_local_pages(struct zone *zone)
> >               drain_pages(cpu);
> >  }
> >
> > -static void drain_local_pages_wq(struct work_struct *work)
> > +static void drain_local_pages_func(struct kthread_work *work)
> >  {
> >       struct pcpu_drain *drain;
> >
> > @@ -3175,6 +3230,7 @@ static void drain_local_pages_wq(struct work_struct *work)
> >  static void __drain_all_pages(struct zone *zone, bool force_all_cpus)
> >  {
> >       int cpu;
> > +     struct pcpu_drain *drain;
> >
> >       /*
> >        * Allocate in the BSS so we won't require allocation in
> > @@ -3182,13 +3238,6 @@ static void __drain_all_pages(struct zone *zone, bool force_all_cpus)
> >        */
> >       static cpumask_t cpus_with_pcps;
> >
> > -     /*
> > -      * Make sure nobody triggers this path before mm_percpu_wq is fully
> > -      * initialized.
> > -      */
> > -     if (WARN_ON_ONCE(!mm_percpu_wq))
> > -             return;
> > -
> >       /*
> >        * Do not drain if one is already in progress unless it's specific to
> >        * a zone. Such callers are primarily CMA and memory hotplug and need
> > @@ -3238,14 +3287,21 @@ static void __drain_all_pages(struct zone *zone, bool force_all_cpus)
> >       }
> >
> >       for_each_cpu(cpu, &cpus_with_pcps) {
> > -             struct pcpu_drain *drain = per_cpu_ptr(&pcpu_drain, cpu);
> > +             drain = per_cpu_ptr(&pcpu_drain, cpu);
> >
> >               drain->zone = zone;
> > -             INIT_WORK(&drain->work, drain_local_pages_wq);
> > -             queue_work_on(cpu, mm_percpu_wq, &drain->work);
> > +             if (likely(drain->worker))
> > +                     kthread_queue_work(drain->worker, &drain->work);
> > +     }
> > +     /* Wait for kthreads to finish or drain itself */
> > +     for_each_cpu(cpu, &cpus_with_pcps) {
> > +             drain = per_cpu_ptr(&pcpu_drain, cpu);
> > +
> > +             if (likely(drain->worker))
> > +                     kthread_flush_work(&drain->work);
> > +             else
> > +                     drain_local_pages_func(&drain->work);
> >       }
> > -     for_each_cpu(cpu, &cpus_with_pcps)
> > -             flush_work(&per_cpu_ptr(&pcpu_drain, cpu)->work);
> >
> >       mutex_unlock(&pcpu_drain_mutex);
> >  }
> > --
> > 2.35.1.574.g5d30c73bfb-goog
> >
> >
>
Petr Mladek March 7, 2022, 4:35 p.m. UTC | #6
On Wed 2022-03-02 15:06:24, Suren Baghdasaryan wrote:
> On Tue, Mar 1, 2022 at 4:22 PM Hillf Danton <hdanton@sina.com> wrote:
> >
> > On Thu, 24 Feb 2022 17:28:19 -0800 Suren Baghdasaryan wrote:
> > > Sending as an RFC to confirm if this is the right direction and to
> > > clarify if other tasks currently executed on mm_percpu_wq should be
> > > also moved to kthreads. The patch seems stable in testing but I want
> > > to collect more performance data before submitting a non-RFC version.
> > >
> > >
> > > Currently drain_all_pages uses mm_percpu_wq to drain pages from pcp
> > > list during direct reclaim. The tasks on a workqueue can be delayed
> > > by other tasks in the workqueues using the same per-cpu worker pool.
> >
> > The pending works may be freeing a couple of slabs/pages each. Who knows?
> 
> If we are talking about work specifically scheduled on mm_percpu_wq
> then apart from drain_all_pages, mm_percpu_wq is used to execute
> vmstat_update and lru_add_drain_cpu for drainig pagevecs. If OTOH what
> you mean is that the work might be blocked by say kswapd, which is
> freeing memory, then sure, who knows...

Note that the same worker pool is used by many workqueues, and
work items in per-cpu workqueues are serialized on a single worker.
Another worker is used only when a work item goes into a sleeping wait.

My point is that "drain_all_pages" is not blocked only by other
work items using "mm_percpu_wq" but also by work items from many other
workqueues, including "system_wq".

These work items might do anything, including memory allocation and
freeing.

> >
> > > This results in sizable delays in drain_all_pages when cpus are highly
> > > contended.
> > > Memory management operations designed to relieve memory pressure should
> > > not be allowed to block by other tasks, especially if the task in direct
> > > reclaim has higher priority than the blocking tasks.
> >
> > Wonder why priority is the right cure to tight memory - otherwise it was
> > not a problem given a direct reclaimer of higher priority.
> >
> > Off topic question - why is it making sense from begining for a task of
> > lower priority to peel pages off from another of higher priority if
> > priority is considered in direct reclaim?
> 
> The way I understood your question is that you are asking why we have
> to use workqueues of potentially lower priority to drain pages for a
> potentially higher priority process in direct reclaim (which is
> blocked waiting for workqueues to complete draining)?
> If so, IIUC this mechanism was introduced in
> https://lore.kernel.org/all/20170117092954.15413-4-mgorman@techsingularity.net
> to avoid draining from IPI context (CC'ing Mel Gorman to correct me if
> I'm wrong).
> I think the issue here is that in the process we are losing
> information about the priority of the process in direct reclaim, which
> might lead to priority inversion.

Note that the priority of workqueue workers is static. It is defined
by the workqueue parameters.

The kthread_worker API allows creating custom kthreads. The user could
modify the priority as needed, which makes it possible to prevent
priority inversion. That can hardly be achieved with workqueues, where
the workers are heavily shared by unrelated tasks.

Best Regards,
Petr
Suren Baghdasaryan March 7, 2022, 4:48 p.m. UTC | #7
On Mon, Mar 7, 2022 at 8:35 AM Petr Mladek <pmladek@suse.com> wrote:
>
> On Wed 2022-03-02 15:06:24, Suren Baghdasaryan wrote:
> > On Tue, Mar 1, 2022 at 4:22 PM Hillf Danton <hdanton@sina.com> wrote:
> > >
> > > On Thu, 24 Feb 2022 17:28:19 -0800 Suren Baghdasaryan wrote:
> > > > Sending as an RFC to confirm if this is the right direction and to
> > > > clarify if other tasks currently executed on mm_percpu_wq should be
> > > > also moved to kthreads. The patch seems stable in testing but I want
> > > > to collect more performance data before submitting a non-RFC version.
> > > >
> > > >
> > > > Currently drain_all_pages uses mm_percpu_wq to drain pages from pcp
> > > > list during direct reclaim. The tasks on a workqueue can be delayed
> > > > by other tasks in the workqueues using the same per-cpu worker pool.
> > >
> > > The pending works may be freeing a couple of slabs/pages each. Who knows?
> >
> > If we are talking about work specifically scheduled on mm_percpu_wq
> > then apart from drain_all_pages, mm_percpu_wq is used to execute
> > vmstat_update and lru_add_drain_cpu for drainig pagevecs. If OTOH what
> > you mean is that the work might be blocked by say kswapd, which is
> > freeing memory, then sure, who knows...
>
> Note that the same worker pool is used by many workqueues. And
> work items in per-cpu workqueues are serialized on a single worker.
> Another worker is used only when a work goes into a sleeping wait.
>
> I want to say that "drain_all_pages" are not blocked only by other
> works using "mm_percpu_wq" but also by works from many other
> workqueues, including "system_wq".
>
> These works might do anything, including memory allocation, freeing.

Ah, I didn't know this (I think you mentioned it in one of your
previous replies but I missed it). Thank you for clarifying!

>
> > >
> > > > This results in sizable delays in drain_all_pages when cpus are highly
> > > > contended.
> > > > Memory management operations designed to relieve memory pressure should
> > > > not be allowed to block by other tasks, especially if the task in direct
> > > > reclaim has higher priority than the blocking tasks.
> > >
> > > Wonder why priority is the right cure to tight memory - otherwise it was
> > > not a problem given a direct reclaimer of higher priority.
> > >
> > > Off topic question - why is it making sense from begining for a task of
> > > lower priority to peel pages off from another of higher priority if
> > > priority is considered in direct reclaim?
> >
> > The way I understood your question is that you are asking why we have
> > to use workqueues of potentially lower priority to drain pages for a
> > potentially higher priority process in direct reclaim (which is
> > blocked waiting for workqueues to complete draining)?
> > If so, IIUC this mechanism was introduced in
> > https://lore.kernel.org/all/20170117092954.15413-4-mgorman@techsingularity.net
> > to avoid draining from IPI context (CC'ing Mel Gorman to correct me if
> > I'm wrong).
> > I think the issue here is that in the process we are losing
> > information about the priority of the process in direct reclaim, which
> > might lead to priority inversion.
>
> Note that priority of workqueue workers is static. It is defined
> by the workqueue parameters.
>
> kthread_worker API allows to create custom kthreads. The user could
> modify the priority as needed. It allows to prevent priority
> inversion. It can hardly be achieved by workques where the workers
> are heavily shared by unrelated tasks.

Yes, but I suspect we would not want to dynamically change the priority
of the kthreads performing drain_all_pages? I guess we could adjust it
based on the highest priority among the tasks waiting for
drain_all_pages, and that would eliminate priority inversion. However,
I'm not sure about the possible overhead associated with such dynamic
priority adjustments.
My RFC sets up the kthreads as low-priority FIFO threads. It's a
simplification and I'm not sure that is the right approach here...
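
For illustration only, a very rough sketch of what propagating the
caller's priority to the drain worker might look like, imagined inside
__drain_all_pages() just before queueing the work (the placement and
the restore of the original priority are omitted; this is not something
proposed in the patch):

	/*
	 * Hypothetical: if the task entering direct reclaim runs at RT
	 * priority, temporarily raise the drain worker to at least the
	 * same level before queueing the work.
	 */
	if (rt_task(current)) {
		struct sched_attr attr = {
			.size		= sizeof(struct sched_attr),
			.sched_policy	= SCHED_FIFO,
			.sched_priority	= current->rt_priority,
		};

		sched_setattr_nocheck(drain->worker->task, &attr);
	}
	kthread_queue_work(drain->worker, &drain->work);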

>
> Best Regards,
> Petr
Michal Hocko March 7, 2022, 5:04 p.m. UTC | #8
On Thu 24-02-22 17:28:19, Suren Baghdasaryan wrote:
> Sending as an RFC to confirm if this is the right direction and to
> clarify if other tasks currently executed on mm_percpu_wq should be
> also moved to kthreads. The patch seems stable in testing but I want
> to collect more performance data before submitting a non-RFC version.
> 
> 
> Currently drain_all_pages uses mm_percpu_wq to drain pages from pcp
> list during direct reclaim. The tasks on a workqueue can be delayed
> by other tasks in the workqueues using the same per-cpu worker pool.
> This results in sizable delays in drain_all_pages when cpus are highly
> contended.

This is not about cpus being highly contended. It is about too much work
on the WQ context.

> Memory management operations designed to relieve memory pressure should
> not be allowed to block by other tasks, especially if the task in direct
> reclaim has higher priority than the blocking tasks.

Agreed here.

> Replace the usage of mm_percpu_wq with per-cpu low priority FIFO
> kthreads to execute draining tasks.

This looks like a natural thing to do when the WQ context is not
suitable, but I am not sure the additional resources are really
justified. Large machines with a lot of cpus would end up with a lot of
kernel threads. Can we do better than that?

Would it be possible to have fewer workers (e.g. one, or one per NUMA
node) that would perform the work on a dedicated cpu by changing their
affinity? Or would that introduce an unacceptable overhead?

Or would it be possible to update the existing WQ code to use the
rescuer well before the WQ is completely clogged?
Suren Baghdasaryan March 7, 2022, 5:24 p.m. UTC | #9
On Mon, Mar 7, 2022 at 9:04 AM 'Michal Hocko' via kernel-team
<kernel-team@android.com> wrote:
>
> On Thu 24-02-22 17:28:19, Suren Baghdasaryan wrote:
> > Sending as an RFC to confirm if this is the right direction and to
> > clarify if other tasks currently executed on mm_percpu_wq should be
> > also moved to kthreads. The patch seems stable in testing but I want
> > to collect more performance data before submitting a non-RFC version.
> >
> >
> > Currently drain_all_pages uses mm_percpu_wq to drain pages from pcp
> > list during direct reclaim. The tasks on a workqueue can be delayed
> > by other tasks in the workqueues using the same per-cpu worker pool.
> > This results in sizable delays in drain_all_pages when cpus are highly
> > contended.
>
> This is not about cpus being highly contended. It is about too much work
> on the WQ context.

Ack.

>
> > Memory management operations designed to relieve memory pressure should
> > not be allowed to block by other tasks, especially if the task in direct
> > reclaim has higher priority than the blocking tasks.
>
> Agreed here.
>
> > Replace the usage of mm_percpu_wq with per-cpu low priority FIFO
> > kthreads to execute draining tasks.
>
> This looks like a natural thing to do when WQ context is not suitable
> but I am not sure the additional resources is really justified. Large
> machines with a lot of cpus would create a lot of kernel threads. Can we
> do better than that?
>
> Would it be possible to have fewer workers (e.g. 1 or one per numa node)
> and it would perform the work on a dedicated cpu by changing its
> affinity? Or would that introduce an unacceptable overhead?

Not sure, but I can try implementing per-node kthreads and measure the
performance of the reclaim path, comparing it with both the current and
the per-cpu approach.
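
As a rough illustration of that per-node alternative (all names and the
structure here are hypothetical, not code from this thread):

/* Hypothetical per-node variant of the drain workers. */
static struct kthread_worker **pg_drain_workers;

static int __init init_node_drain_workers(void)
{
	int nid;

	pg_drain_workers = kcalloc(nr_node_ids, sizeof(*pg_drain_workers),
				   GFP_KERNEL);
	if (!pg_drain_workers)
		return -ENOMEM;

	for_each_online_node(nid) {
		struct kthread_worker *worker;

		worker = kthread_create_worker(0, "pg_drain/n%d", nid);
		if (IS_ERR(worker))
			continue;
		/* Keep the worker on the CPUs of its node. */
		set_cpus_allowed_ptr(worker->task, cpumask_of_node(nid));
		sched_set_fifo_low(worker->task);
		pg_drain_workers[nid] = worker;
	}
	return 0;
}

Whether pinning by affinity is enough, or whether such workers would
also need the same hotplug handling as the per-cpu version, would still
have to be worked out.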

>
> Or would it be possible to update the existing WQ code to use rescuer
> well before the WQ is completely clogged?
> --
> Michal Hocko
> SUSE Labs
>
> --
> To unsubscribe from this group and stop receiving emails from it, send an email to kernel-team+unsubscribe@android.com.
>
Suren Baghdasaryan March 17, 2022, 11:04 p.m. UTC | #10
On Mon, Mar 7, 2022 at 9:24 AM Suren Baghdasaryan <surenb@google.com> wrote:
>
> On Mon, Mar 7, 2022 at 9:04 AM 'Michal Hocko' via kernel-team
> <kernel-team@android.com> wrote:
> >
> > On Thu 24-02-22 17:28:19, Suren Baghdasaryan wrote:
> > > Sending as an RFC to confirm if this is the right direction and to
> > > clarify if other tasks currently executed on mm_percpu_wq should be
> > > also moved to kthreads. The patch seems stable in testing but I want
> > > to collect more performance data before submitting a non-RFC version.
> > >
> > >
> > > Currently drain_all_pages uses mm_percpu_wq to drain pages from pcp
> > > list during direct reclaim. The tasks on a workqueue can be delayed
> > > by other tasks in the workqueues using the same per-cpu worker pool.
> > > This results in sizable delays in drain_all_pages when cpus are highly
> > > contended.
> >
> > This is not about cpus being highly contended. It is about too much work
> > on the WQ context.
>
> Ack.
>
> >
> > > Memory management operations designed to relieve memory pressure should
> > > not be allowed to block by other tasks, especially if the task in direct
> > > reclaim has higher priority than the blocking tasks.
> >
> > Agreed here.
> >
> > > Replace the usage of mm_percpu_wq with per-cpu low priority FIFO
> > > kthreads to execute draining tasks.
> >
> > This looks like a natural thing to do when WQ context is not suitable
> > but I am not sure the additional resources is really justified. Large
> > machines with a lot of cpus would create a lot of kernel threads. Can we
> > do better than that?
> >
> > Would it be possible to have fewer workers (e.g. 1 or one per numa node)
> > and it would perform the work on a dedicated cpu by changing its
> > affinity? Or would that introduce an unacceptable overhead?
>
> Not sure but I can try implementing per-node kthreads and measure the
> performance of the reclaim path, comparing with the current and with
> per-cpu approach.

Just to update on this RFC. In my testing I don't see a meaningful
improvement from using the kthreads yet. This might be due to my test
setup, so I'll keep exploring. Will post the next version only if I
get demonstrable improvements.
Thanks!

>
> >
> > Or would it be possible to update the existing WQ code to use rescuer
> > well before the WQ is completely clogged?
> > --
> > Michal Hocko
> > SUSE Labs
> >
> > --
> > To unsubscribe from this group and stop receiving emails from it, send an email to kernel-team+unsubscribe@android.com.
> >

Patch

diff --git a/mm/page_alloc.c b/mm/page_alloc.c
index 3589febc6d31..c9ab2cf4b05b 100644
--- a/mm/page_alloc.c
+++ b/mm/page_alloc.c
@@ -153,7 +153,8 @@  EXPORT_PER_CPU_SYMBOL(_numa_mem_);
 /* work_structs for global per-cpu drains */
 struct pcpu_drain {
 	struct zone *zone;
-	struct work_struct work;
+	struct kthread_work work;
+	struct kthread_worker *worker;
 };
 static DEFINE_MUTEX(pcpu_drain_mutex);
 static DEFINE_PER_CPU(struct pcpu_drain, pcpu_drain);
@@ -2209,6 +2210,58 @@  _deferred_grow_zone(struct zone *zone, unsigned int order)
 
 #endif /* CONFIG_DEFERRED_STRUCT_PAGE_INIT */
 
+static void drain_local_pages_func(struct kthread_work *work);
+
+static int alloc_drain_worker(unsigned int cpu)
+{
+	struct pcpu_drain *drain;
+
+	mutex_lock(&pcpu_drain_mutex);
+	drain = per_cpu_ptr(&pcpu_drain, cpu);
+	drain->worker = kthread_create_worker_on_cpu(cpu, 0, "pg_drain/%u", cpu);
+	if (IS_ERR(drain->worker)) {
+		drain->worker = NULL;
+		pr_err("Failed to create pg_drain/%u\n", cpu);
+		goto out;
+	}
+	/* Ensure the thread is not blocked by normal priority tasks */
+	sched_set_fifo_low(drain->worker->task);
+	kthread_init_work(&drain->work, drain_local_pages_func);
+out:
+	mutex_unlock(&pcpu_drain_mutex);
+
+	return 0;
+}
+
+static int free_drain_worker(unsigned int cpu)
+{
+	struct pcpu_drain *drain;
+
+	mutex_lock(&pcpu_drain_mutex);
+	drain = per_cpu_ptr(&pcpu_drain, cpu);
+	kthread_cancel_work_sync(&drain->work);
+	kthread_destroy_worker(drain->worker);
+	drain->worker = NULL;
+	mutex_unlock(&pcpu_drain_mutex);
+
+	return 0;
+}
+
+static void __init init_drain_workers(void)
+{
+	unsigned int cpu;
+
+	for_each_online_cpu(cpu)
+		alloc_drain_worker(cpu);
+
+	if (cpuhp_setup_state_nocalls(CPUHP_AP_ONLINE_DYN,
+					"page_alloc/drain:online",
+					alloc_drain_worker,
+					free_drain_worker)) {
+		pr_err("page_alloc_drain: Failed to allocate a hotplug state\n");
+	}
+}
+
 void __init page_alloc_init_late(void)
 {
 	struct zone *zone;
@@ -2245,6 +2298,8 @@  void __init page_alloc_init_late(void)
 
 	for_each_populated_zone(zone)
 		set_zone_contiguous(zone);
+
+	init_drain_workers();
 }
 
 #ifdef CONFIG_CMA
@@ -3144,7 +3199,7 @@  void drain_local_pages(struct zone *zone)
 		drain_pages(cpu);
 }
 
-static void drain_local_pages_wq(struct work_struct *work)
+static void drain_local_pages_func(struct kthread_work *work)
 {
 	struct pcpu_drain *drain;
 
@@ -3175,6 +3230,7 @@  static void drain_local_pages_wq(struct work_struct *work)
 static void __drain_all_pages(struct zone *zone, bool force_all_cpus)
 {
 	int cpu;
+	struct pcpu_drain *drain;
 
 	/*
 	 * Allocate in the BSS so we won't require allocation in
@@ -3182,13 +3238,6 @@  static void __drain_all_pages(struct zone *zone, bool force_all_cpus)
 	 */
 	static cpumask_t cpus_with_pcps;
 
-	/*
-	 * Make sure nobody triggers this path before mm_percpu_wq is fully
-	 * initialized.
-	 */
-	if (WARN_ON_ONCE(!mm_percpu_wq))
-		return;
-
 	/*
 	 * Do not drain if one is already in progress unless it's specific to
 	 * a zone. Such callers are primarily CMA and memory hotplug and need
@@ -3238,14 +3287,21 @@  static void __drain_all_pages(struct zone *zone, bool force_all_cpus)
 	}
 
 	for_each_cpu(cpu, &cpus_with_pcps) {
-		struct pcpu_drain *drain = per_cpu_ptr(&pcpu_drain, cpu);
+		drain = per_cpu_ptr(&pcpu_drain, cpu);
 
 		drain->zone = zone;
-		INIT_WORK(&drain->work, drain_local_pages_wq);
-		queue_work_on(cpu, mm_percpu_wq, &drain->work);
+		if (likely(drain->worker))
+			kthread_queue_work(drain->worker, &drain->work);
+	}
+	/* Wait for kthreads to finish or drain itself */
+	for_each_cpu(cpu, &cpus_with_pcps) {
+		drain = per_cpu_ptr(&pcpu_drain, cpu);
+
+		if (likely(drain->worker))
+			kthread_flush_work(&drain->work);
+		else
+			drain_local_pages_func(&drain->work);
 	}
-	for_each_cpu(cpu, &cpus_with_pcps)
-		flush_work(&per_cpu_ptr(&pcpu_drain, cpu)->work);
 
 	mutex_unlock(&pcpu_drain_mutex);
 }