
[1/1] mm: count time in drain_all_pages during direct reclaim as memory pressure

Message ID 20220219174940.2570901-1-surenb@google.com (mailing list archive)
State: New
Series: [1/1] mm: count time in drain_all_pages during direct reclaim as memory pressure

Commit Message

Suren Baghdasaryan Feb. 19, 2022, 5:49 p.m. UTC
When page allocation in the direct reclaim path fails, the system will
make one attempt to shrink per-cpu page lists and free pages from
high alloc reserves. Draining per-cpu pages into the buddy allocator can
be a very slow operation because it is done using workqueues and the
task in direct reclaim waits for all of them to finish before
proceeding. Currently this time is not accounted as psi memory stall.

While testing mobile devices under extreme memory pressure, when
allocations were failing during direct reclaim, we noticed that psi
events which would be expected under such conditions were not triggered.
After profiling these cases, it was determined that the reason for the
missing psi events was that a big chunk of the time spent in direct
reclaim is not accounted as memory stall, so psi would not reach the
levels at which an event is generated. Further investigation revealed
that the bulk of that unaccounted time was spent inside the
drain_all_pages call.

Annotate drain_all_pages and unreserve_highatomic_pageblock during
page allocation failure in the direct reclaim path so that delays
caused by these calls are accounted as memory stall.

Reported-by: Tim Murray <timmurray@google.com>
Signed-off-by: Suren Baghdasaryan <surenb@google.com>
---
 mm/page_alloc.c | 4 ++++
 1 file changed, 4 insertions(+)

Comments

Minchan Kim Feb. 20, 2022, 12:40 a.m. UTC | #1
On Sat, Feb 19, 2022 at 09:49:40AM -0800, Suren Baghdasaryan wrote:
> When page allocation in direct reclaim path fails, the system will
> make one attempt to shrink per-cpu page lists and free pages from
> high alloc reserves. Draining per-cpu pages into buddy allocator can
> be a very slow operation because it's done using workqueues and the
> task in direct reclaim waits for all of them to finish before

Yes, drain_all_pages is seriously slow (100ms - 150ms on Android),
especially when CPUs are fully packed. It was also spotted in CMA
allocation even when there was no memory pressure.

> proceeding. Currently this time is not accounted as psi memory stall.

Good spot.

> 
> While testing mobile devices under extreme memory pressure, when
> allocations are failing during direct reclaim, we notices that psi
> events which would be expected in such conditions were not triggered.
> After profiling these cases it was determined that the reason for
> missing psi events was that a big chunk of time spent in direct
> reclaim is not accounted as memory stall, therefore psi would not
> reach the levels at which an event is generated. Further investigation
> revealed that the bulk of that unaccounted time was spent inside
> drain_all_pages call.
> 
> Annotate drain_all_pages and unreserve_highatomic_pageblock during
> page allocation failure in the direct reclaim path so that delays
> caused by these calls are accounted as memory stall.
> 
> Reported-by: Tim Murray <timmurray@google.com>
> Signed-off-by: Suren Baghdasaryan <surenb@google.com>
> ---
>  mm/page_alloc.c | 4 ++++
>  1 file changed, 4 insertions(+)
> 
> diff --git a/mm/page_alloc.c b/mm/page_alloc.c
> index 3589febc6d31..7fd0d392b39b 100644
> --- a/mm/page_alloc.c
> +++ b/mm/page_alloc.c
> @@ -4639,8 +4639,12 @@ __alloc_pages_direct_reclaim(gfp_t gfp_mask, unsigned int order,
>  	 * Shrink them and try again
>  	 */
>  	if (!page && !drained) {
> +		unsigned long pflags;
> +
> +		psi_memstall_enter(&pflags);
>  		unreserve_highatomic_pageblock(ac, false);
>  		drain_all_pages(NULL);
> +		psi_memstall_leave(&pflags);

Instead of annotating the specific drain_all_pages, how about
moving the annotation from __perform_reclaim to
__alloc_pages_direct_reclaim?
Suren Baghdasaryan Feb. 20, 2022, 4:52 p.m. UTC | #2
On Sat, Feb 19, 2022 at 4:40 PM Minchan Kim <minchan@kernel.org> wrote:
>
> On Sat, Feb 19, 2022 at 09:49:40AM -0800, Suren Baghdasaryan wrote:
> > When page allocation in direct reclaim path fails, the system will
> > make one attempt to shrink per-cpu page lists and free pages from
> > high alloc reserves. Draining per-cpu pages into buddy allocator can
> > be a very slow operation because it's done using workqueues and the
> > task in direct reclaim waits for all of them to finish before
>
> Yes, drain_all_pages is serious slow(100ms - 150ms on Android)
> especially when CPUs are fully packed. It was also spotted in CMA
> allocation even when there was on no memory pressure.

Thanks for the input, Minchan!
In my tests I've seen 50-60ms delays in a single drain_all_pages, but
I can imagine there are cases worse than that.

>
> > proceeding. Currently this time is not accounted as psi memory stall.
>
> Good spot.
>
> >
> > While testing mobile devices under extreme memory pressure, when
> > allocations are failing during direct reclaim, we notices that psi
> > events which would be expected in such conditions were not triggered.
> > After profiling these cases it was determined that the reason for
> > missing psi events was that a big chunk of time spent in direct
> > reclaim is not accounted as memory stall, therefore psi would not
> > reach the levels at which an event is generated. Further investigation
> > revealed that the bulk of that unaccounted time was spent inside
> > drain_all_pages call.
> >
> > Annotate drain_all_pages and unreserve_highatomic_pageblock during
> > page allocation failure in the direct reclaim path so that delays
> > caused by these calls are accounted as memory stall.
> >
> > Reported-by: Tim Murray <timmurray@google.com>
> > Signed-off-by: Suren Baghdasaryan <surenb@google.com>
> > ---
> >  mm/page_alloc.c | 4 ++++
> >  1 file changed, 4 insertions(+)
> >
> > diff --git a/mm/page_alloc.c b/mm/page_alloc.c
> > index 3589febc6d31..7fd0d392b39b 100644
> > --- a/mm/page_alloc.c
> > +++ b/mm/page_alloc.c
> > @@ -4639,8 +4639,12 @@ __alloc_pages_direct_reclaim(gfp_t gfp_mask, unsigned int order,
> >        * Shrink them and try again
> >        */
> >       if (!page && !drained) {
> > +             unsigned long pflags;
> > +
> > +             psi_memstall_enter(&pflags);
> >               unreserve_highatomic_pageblock(ac, false);
> >               drain_all_pages(NULL);
> > +             psi_memstall_leave(&pflags);
>
> Instead of annotating the specific drain_all_pages, how about
> moving the annotation from __perform_reclaim to
> __alloc_pages_direct_reclaim?

I'm fine with that approach too. Let's wait for Johannes' input before
I make any changes.
Thanks,
Suren.
Michal Hocko Feb. 21, 2022, 8:55 a.m. UTC | #3
On Sat 19-02-22 09:49:40, Suren Baghdasaryan wrote:
> When page allocation in direct reclaim path fails, the system will
> make one attempt to shrink per-cpu page lists and free pages from
> high alloc reserves. Draining per-cpu pages into buddy allocator can
> be a very slow operation because it's done using workqueues and the
> task in direct reclaim waits for all of them to finish before
> proceeding. Currently this time is not accounted as psi memory stall.
> 
> While testing mobile devices under extreme memory pressure, when
> allocations are failing during direct reclaim, we notices that psi
> events which would be expected in such conditions were not triggered.
> After profiling these cases it was determined that the reason for
> missing psi events was that a big chunk of time spent in direct
> reclaim is not accounted as memory stall, therefore psi would not
> reach the levels at which an event is generated. Further investigation
> revealed that the bulk of that unaccounted time was spent inside
> drain_all_pages call.

It would be cool to have some numbers here.

> Annotate drain_all_pages and unreserve_highatomic_pageblock during
> page allocation failure in the direct reclaim path so that delays
> caused by these calls are accounted as memory stall.

If the draining is too slow and dependent on the current CPU/WQ
contention then we should address that. The original intention was that
having a dedicated WQ with WQ_MEM_RECLAIM would help to isolate the
operation from the rest of WQ activity. Maybe we need to fine tune
mm_percpu_wq. If that doesn't help then we should revise the WQ model
and use something else. Memory reclaim shouldn't really get stuck behind
other unrelated work.
Petr Mladek Feb. 21, 2022, 10:41 a.m. UTC | #4
On Mon 2022-02-21 09:55:12, Michal Hocko wrote:
> On Sat 19-02-22 09:49:40, Suren Baghdasaryan wrote:
> > When page allocation in direct reclaim path fails, the system will
> > make one attempt to shrink per-cpu page lists and free pages from
> > high alloc reserves. Draining per-cpu pages into buddy allocator can
> > be a very slow operation because it's done using workqueues and the
> > task in direct reclaim waits for all of them to finish before
> > proceeding. Currently this time is not accounted as psi memory stall.
> > 
> > While testing mobile devices under extreme memory pressure, when
> > allocations are failing during direct reclaim, we notices that psi
> > events which would be expected in such conditions were not triggered.
> > After profiling these cases it was determined that the reason for
> > missing psi events was that a big chunk of time spent in direct
> > reclaim is not accounted as memory stall, therefore psi would not
> > reach the levels at which an event is generated. Further investigation
> > revealed that the bulk of that unaccounted time was spent inside
> > drain_all_pages call.
> 
> It would be cool to have some numbers here.
> 
> > Annotate drain_all_pages and unreserve_highatomic_pageblock during
> > page allocation failure in the direct reclaim path so that delays
> > caused by these calls are accounted as memory stall.
> 
> If the draining is too slow and dependent on the current CPU/WQ
> contention then we should address that. The original intention was that
> having a dedicated WQ with WQ_MEM_RECLAIM would help to isolate the
> operation from the rest of WQ activity. Maybe we need to fine tune
> mm_percpu_wq. If that doesn't help then we should revise the WQ model
> and use something else. Memory reclaim shouldn't really get stuck behind
> other unrelated work.

WQ_MEM_RECLAIM causes one special worker (the rescuer) to be created
for the workqueue. It is used _only_ when new workers could not be
created for some reason, typically when there is not enough memory.
It is just a fallback, a last resort. It does _not_ speed up processing.

Otherwise, "mm_percpu_wq" is a normal CPU-bound wq. It uses the shared
per-CPU worker pools. They serialize all work items on a single
worker. Another worker is used only when a work item goes to sleep and
waits for something.

It means that "drain" work is blocked by other work items that are
using the same worker pool and were queued earlier.


You might try to allocate "mm_percpu_wq" with the WQ_HIGHPRI flag. It
will use other shared per-CPU worker pools where the workers have nice
-20. The "drain" work still might be blocked by other work items
using the same pool. But it should be faster because the workers
have higher priority.
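Concretely, the experiment suggested above might look like the sketch
below (hedged: it assumes mm_percpu_wq is allocated in mm/vmstat.c's
init_mm_internals(), as in kernels around v5.17; this is not an actual
posted patch):

```c
/* mm/vmstat.c, init_mm_internals() -- sketch of the WQ_HIGHPRI idea:
 * adding WQ_HIGHPRI queues the drain work on the per-CPU highpri
 * worker pool, whose kworkers run at nice -20, instead of the
 * normal-priority pool.
 */
mm_percpu_wq = alloc_workqueue("mm_percpu_wq",
			       WQ_MEM_RECLAIM | WQ_HIGHPRI, 0);
```

The drain work can still queue behind other highpri work items, so this
reduces, but does not bound, the scheduling delay.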


Dedicated kthreads might be needed when the "draining" should not be
blocked by anything. If you go this way then I suggest using
the kthread_worker API, see "linux/kthread.h". It is very similar
to the workqueues API but it always creates new kthreads.

Just note that the kthread_worker API does not maintain per-CPU workers
on its own. If you need per-CPU workers then you need to call
kthread_create_worker_on_cpu() for_each_online_cpu().
And you would need cpu hotplug callbacks to create/destroy the
kthreads. For example, see start_power_clamp_worker().
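A minimal sketch of that shape, using the kthread_worker and cpu
hotplug APIs referenced above (all names such as pcp_drain_worker and
pcp_drain_online are hypothetical, chosen for illustration; hotplug
registration is reduced to a single cpuhp_setup_state() call):

```c
#include <linux/kthread.h>
#include <linux/cpuhotplug.h>
#include <linux/percpu.h>

/* One dedicated worker kthread per CPU (hypothetical example). */
static DEFINE_PER_CPU(struct kthread_worker *, pcp_drain_worker);

static int pcp_drain_online(unsigned int cpu)
{
	struct kthread_worker *w;

	/* Create a worker bound to this CPU; unlike a wq kworker pool,
	 * this kthread serves only drain work. */
	w = kthread_create_worker_on_cpu(cpu, 0, "pcp_drain/%u", cpu);
	if (IS_ERR(w))
		return PTR_ERR(w);
	/* Priority could be raised here, e.g. sched_set_fifo(w->task),
	 * if the drain must not wait behind CFS tasks. */
	per_cpu(pcp_drain_worker, cpu) = w;
	return 0;
}

static int pcp_drain_offline(unsigned int cpu)
{
	kthread_destroy_worker(per_cpu(pcp_drain_worker, cpu));
	per_cpu(pcp_drain_worker, cpu) = NULL;
	return 0;
}

static int __init pcp_drain_init(void)
{
	/* Hotplug callbacks create/destroy the per-CPU kthreads as
	 * CPUs come and go, and run for already-online CPUs now. */
	return cpuhp_setup_state(CPUHP_AP_ONLINE_DYN, "mm/pcp_drain:online",
				 pcp_drain_online, pcp_drain_offline);
}
```

Drain requests would then be queued with kthread_queue_work() on the
target CPU's worker instead of queue_work_on() on mm_percpu_wq.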

HTH,
Petr
Suren Baghdasaryan Feb. 21, 2022, 7:09 p.m. UTC | #5
On Mon, Feb 21, 2022 at 12:55 AM Michal Hocko <mhocko@suse.com> wrote:
>
> On Sat 19-02-22 09:49:40, Suren Baghdasaryan wrote:
> > When page allocation in direct reclaim path fails, the system will
> > make one attempt to shrink per-cpu page lists and free pages from
> > high alloc reserves. Draining per-cpu pages into buddy allocator can
> > be a very slow operation because it's done using workqueues and the
> > task in direct reclaim waits for all of them to finish before
> > proceeding. Currently this time is not accounted as psi memory stall.
> >
> > While testing mobile devices under extreme memory pressure, when
> > allocations are failing during direct reclaim, we notices that psi
> > events which would be expected in such conditions were not triggered.
> > After profiling these cases it was determined that the reason for
> > missing psi events was that a big chunk of time spent in direct
> > reclaim is not accounted as memory stall, therefore psi would not
> > reach the levels at which an event is generated. Further investigation
> > revealed that the bulk of that unaccounted time was spent inside
> > drain_all_pages call.
>
> It would be cool to have some numbers here.

A typical case I was able to record when drain_all_pages path gets activated:

__alloc_pages_slowpath took 44,644,613ns
    __perform_reclaim took 751,668ns (1.7%)
    drain_all_pages took 43,887,167ns (98.3%)

PSI in this case records the time spent in __perform_reclaim but
ignores drain_all_pages, IOW it misses 98.3% of the time spent in
__alloc_pages_slowpath. Sure, normally it's not often that this path
is activated, but when it is, we miss reporting most of the stall.

>
> > Annotate drain_all_pages and unreserve_highatomic_pageblock during
> > page allocation failure in the direct reclaim path so that delays
> > caused by these calls are accounted as memory stall.
>
> If the draining is too slow and dependent on the current CPU/WQ
> contention then we should address that. The original intention was that
> having a dedicated WQ with WQ_MEM_RECLAIM would help to isolate the
> operation from the rest of WQ activity. Maybe we need to fine tune
> mm_percpu_wq. If that doesn't help then we should revise the WQ model
> and use something else. Memory reclaim shouldn't really get stuck behind
> other unrelated work.

Agree. However even after improving this I think we should record the
time spent in drain_all_pages as psi memstall. So, this patch I
believe is still relevant.
Thanks,
Suren.

> --
> Michal Hocko
> SUSE Labs
Suren Baghdasaryan Feb. 21, 2022, 7:13 p.m. UTC | #6
On Mon, Feb 21, 2022 at 2:41 AM 'Petr Mladek' via kernel-team
<kernel-team@android.com> wrote:
>
> On Mon 2022-02-21 09:55:12, Michal Hocko wrote:
> > On Sat 19-02-22 09:49:40, Suren Baghdasaryan wrote:
> > > When page allocation in direct reclaim path fails, the system will
> > > make one attempt to shrink per-cpu page lists and free pages from
> > > high alloc reserves. Draining per-cpu pages into buddy allocator can
> > > be a very slow operation because it's done using workqueues and the
> > > task in direct reclaim waits for all of them to finish before
> > > proceeding. Currently this time is not accounted as psi memory stall.
> > >
> > > While testing mobile devices under extreme memory pressure, when
> > > allocations are failing during direct reclaim, we notices that psi
> > > events which would be expected in such conditions were not triggered.
> > > After profiling these cases it was determined that the reason for
> > > missing psi events was that a big chunk of time spent in direct
> > > reclaim is not accounted as memory stall, therefore psi would not
> > > reach the levels at which an event is generated. Further investigation
> > > revealed that the bulk of that unaccounted time was spent inside
> > > drain_all_pages call.
> >
> > It would be cool to have some numbers here.
> >
> > > Annotate drain_all_pages and unreserve_highatomic_pageblock during
> > > page allocation failure in the direct reclaim path so that delays
> > > caused by these calls are accounted as memory stall.
> >
> > If the draining is too slow and dependent on the current CPU/WQ
> > contention then we should address that. The original intention was that
> > having a dedicated WQ with WQ_MEM_RECLAIM would help to isolate the
> > operation from the rest of WQ activity. Maybe we need to fine tune
> > mm_percpu_wq. If that doesn't help then we should revise the WQ model
> > and use something else. Memory reclaim shouldn't really get stuck behind
> > other unrelated work.
>
> WQ_MEM_RECLAIM causes that one special worker (rescuer) is created for
> the workqueue. It is used _only_ when new workers could not be created
> for some, typically when there is non enough memory. It is just
> a fallback, last resort. It does _not_ speedup processing.
>
> Otherwise, "mm_percpu_wq" is a normal CPU-bound wq. It uses the shared
> per-CPU worker pools. They serialize all work items on a single
> worker. Another worker is used only when a work goes asleep and waits
> for something.
>
> It means that "drain" work is blocked by other work items that are
> using the same worker pool and were queued earlier.

Thanks for the valuable information!

>
>
> You might try to allocate "mm_percpu_wq" with WQ_HIGHPRI flag. It will
> use another shared per-CPU worker pools where the workers have nice
> -20. The "drain" work still might be blocked by another work items
> using the same pool. But it should be faster because the workers
> have higher priority.

This seems like a good first step to try. I'll make this change and
rerun the tests to see how useful this would be.

>
>
> Dedicated kthreads might be needed when the "draining" should not be
> blocked by anything. If you go this way then I suggest to use
> the kthread_worker API, see "linux/kthread.h". It is very similar
> to the workqueues API but it always creates new kthreads.
>
> Just note that kthread_worker API does not maintain per-CPU workers
> on its own. If you need per-CPU workers than you need to
> use kthread_create_worker_on_cpu() for_each_online_cpu().
> And you would need cpu hotplug callbacks to create/destroy
> ktheads. For example, see start_power_clamp_worker().

Got it. Let me try the WQ_HIGHPRI approach first. Let's see if we can
fix this with minimal changes to the current mechanisms.
Thanks,
Suren.

>
> HTH,
> Petr
>
>
Tim Murray Feb. 22, 2022, 7:47 p.m. UTC | #7
On Mon, Feb 21, 2022 at 12:55 AM Michal Hocko <mhocko@suse.com> wrote:
> It would be cool to have some numbers here.

Are there any numbers beyond what Suren mentioned that would be
useful? As one example, in a trace of a camera workload that I opened
at random to check for drain_local_pages stalls, I saw the kworker
that ran drain_local_pages stay at runnable for 68ms before getting
any CPU time. I could try to query our trace corpus to find more
examples, but they're not hard to find in individual traces already.

> If the draining is too slow and dependent on the current CPU/WQ
> contention then we should address that. The original intention was that
> having a dedicated WQ with WQ_MEM_RECLAIM would help to isolate the
> operation from the rest of WQ activity. Maybe we need to fine tune
> mm_percpu_wq. If that doesn't help then we should revise the WQ model
> and use something else. Memory reclaim shouldn't really get stuck behind
> other unrelated work.

In my experience, workqueues are easy to misuse and should be
approached with a lot of care. For many workloads, they work fine 99%+
of the time, but once you run into problems with scheduling delays for
that workqueue, the only option is to stop using workqueues. If you
have work that is system-initiated with minimal latency requirements
(eg, some driver heartbeat every so often, devfreq governors, things
like that), workqueues are great. If you have userspace-initiated work
that should respect priority (eg, GPU command buffer submission in the
critical path of UI) or latency-critical system-initiated work (eg,
display synchronization around panel refresh), workqueues are the
wrong choice because there is no RT capability. WQ_HIGHPRI has a minor
impact, but it won't solve the fundamental problem if the system is
under heavy enough load or if RT threads are involved. As Petr
mentioned, the best solution for those cases seems to be "convert the
workqueue to an RT kthread_worker." I've done that many times on many
different Android devices over the years for latency-critical work,
especially around GPU, display, and camera.

In the drain_local_pages case, I think it is triggered by userspace
work and should respect priority; I don't think a prio 50 RT task
should be blocked waiting on a prio 120 (or prio 100 if WQ_HIGHPRI)
kworker to be scheduled so it can run drain_local_pages. If that's a
reasonable claim, then I think moving drain_local_pages away from
workqueues is the best choice.
Suren Baghdasaryan Feb. 23, 2022, 12:15 a.m. UTC | #8
On Tue, Feb 22, 2022 at 11:47 AM Tim Murray <timmurray@google.com> wrote:
>
> On Mon, Feb 21, 2022 at 12:55 AM Michal Hocko <mhocko@suse.com> wrote:
> > It would be cool to have some numbers here.
>
> Are there any numbers beyond what Suren mentioned that would be
> useful? As one example, in a trace of a camera workload that I opened
> at random to check for drain_local_pages stalls, I saw the kworker
> that ran drain_local_pages stay at runnable for 68ms before getting
> any CPU time. I could try to query our trace corpus to find more
> examples, but they're not hard to find in individual traces already.
>
> > If the draining is too slow and dependent on the current CPU/WQ
> > contention then we should address that. The original intention was that
> > having a dedicated WQ with WQ_MEM_RECLAIM would help to isolate the
> > operation from the rest of WQ activity. Maybe we need to fine tune
> > mm_percpu_wq. If that doesn't help then we should revise the WQ model
> > and use something else. Memory reclaim shouldn't really get stuck behind
> > other unrelated work.
>
> In my experience, workqueues are easy to misuse and should be
> approached with a lot of care. For many workloads, they work fine 99%+
> of the time, but once you run into problems with scheduling delays for
> that workqueue, the only option is to stop using workqueues. If you
> have work that is system-initiated with minimal latency requirements
> (eg, some driver heartbeat every so often, devfreq governors, things
> like that), workqueues are great. If you have userspace-initiated work
> that should respect priority (eg, GPU command buffer submission in the
> critical path of UI) or latency-critical system-initiated work (eg,
> display synchronization around panel refresh), workqueues are the
> wrong choice because there is no RT capability. WQ_HIGHPRI has a minor
> impact, but it won't solve the fundamental problem if the system is
> under heavy enough load or if RT threads are involved. As Petr
> mentioned, the best solution for those cases seems to be "convert the
> workqueue to an RT kthread_worker." I've done that many times on many
> different Android devices over the years for latency-critical work,
> especially around GPU, display, and camera.
>
> In the drain_local_pages case, I think it is triggered by userspace
> work and should respect priority; I don't think a prio 50 RT task
> should be blocked waiting on a prio 120 (or prio 100 if WQ_HIGHPRI)
> kworker to be scheduled so it can run drain_local_pages. If that's a
> reasonable claim, then I think moving drain_local_pages away from
> workqueues is the best choice.

Ok, sounds like I should not spend time on WQ_HIGHPRI and go directly
to kthread_create_worker_on_cpu approach suggested by Petr.
Johannes Weiner Feb. 23, 2022, 6:54 p.m. UTC | #9
On Sun, Feb 20, 2022 at 08:52:38AM -0800, Suren Baghdasaryan wrote:
> On Sat, Feb 19, 2022 at 4:40 PM Minchan Kim <minchan@kernel.org> wrote:
> >
> > On Sat, Feb 19, 2022 at 09:49:40AM -0800, Suren Baghdasaryan wrote:
> > > When page allocation in direct reclaim path fails, the system will
> > > make one attempt to shrink per-cpu page lists and free pages from
> > > high alloc reserves. Draining per-cpu pages into buddy allocator can
> > > be a very slow operation because it's done using workqueues and the
> > > task in direct reclaim waits for all of them to finish before
> >
> > Yes, drain_all_pages is serious slow(100ms - 150ms on Android)
> > especially when CPUs are fully packed. It was also spotted in CMA
> > allocation even when there was on no memory pressure.
> 
> Thanks for the input, Minchan!
> In my tests I've seen 50-60ms delays in a single drain_all_pages but I
> can imagine there are cases worse than these.
> 
> >
> > > proceeding. Currently this time is not accounted as psi memory stall.
> >
> > Good spot.
> >
> > >
> > > While testing mobile devices under extreme memory pressure, when
> > > allocations are failing during direct reclaim, we notices that psi
> > > events which would be expected in such conditions were not triggered.
> > > After profiling these cases it was determined that the reason for
> > > missing psi events was that a big chunk of time spent in direct
> > > reclaim is not accounted as memory stall, therefore psi would not
> > > reach the levels at which an event is generated. Further investigation
> > > revealed that the bulk of that unaccounted time was spent inside
> > > drain_all_pages call.
> > >
> > > Annotate drain_all_pages and unreserve_highatomic_pageblock during
> > > page allocation failure in the direct reclaim path so that delays
> > > caused by these calls are accounted as memory stall.
> > >
> > > Reported-by: Tim Murray <timmurray@google.com>
> > > Signed-off-by: Suren Baghdasaryan <surenb@google.com>
> > > ---
> > >  mm/page_alloc.c | 4 ++++
> > >  1 file changed, 4 insertions(+)
> > >
> > > diff --git a/mm/page_alloc.c b/mm/page_alloc.c
> > > index 3589febc6d31..7fd0d392b39b 100644
> > > --- a/mm/page_alloc.c
> > > +++ b/mm/page_alloc.c
> > > @@ -4639,8 +4639,12 @@ __alloc_pages_direct_reclaim(gfp_t gfp_mask, unsigned int order,
> > >        * Shrink them and try again
> > >        */
> > >       if (!page && !drained) {
> > > +             unsigned long pflags;
> > > +
> > > +             psi_memstall_enter(&pflags);
> > >               unreserve_highatomic_pageblock(ac, false);
> > >               drain_all_pages(NULL);
> > > +             psi_memstall_leave(&pflags);
> >
> > Instead of annotating the specific drain_all_pages, how about
> > moving the annotation from __perform_reclaim to
> > __alloc_pages_direct_reclaim?
> 
> I'm fine with that approach too. Let's wait for Johannes' input before
> I make any changes.

I think the change makes sense, even if the workqueue fix speeds up
the drain. I agree with Minchan about moving the annotation upward.

With it moved, please feel free to add
Acked-by: Johannes Weiner <hannes@cmpxchg.org>
Suren Baghdasaryan Feb. 23, 2022, 7:06 p.m. UTC | #10
On Wed, Feb 23, 2022 at 10:54 AM Johannes Weiner <hannes@cmpxchg.org> wrote:
>
> On Sun, Feb 20, 2022 at 08:52:38AM -0800, Suren Baghdasaryan wrote:
> > On Sat, Feb 19, 2022 at 4:40 PM Minchan Kim <minchan@kernel.org> wrote:
> > >
> > > On Sat, Feb 19, 2022 at 09:49:40AM -0800, Suren Baghdasaryan wrote:
> > > > When page allocation in direct reclaim path fails, the system will
> > > > make one attempt to shrink per-cpu page lists and free pages from
> > > > high alloc reserves. Draining per-cpu pages into buddy allocator can
> > > > be a very slow operation because it's done using workqueues and the
> > > > task in direct reclaim waits for all of them to finish before
> > >
> > > Yes, drain_all_pages is serious slow(100ms - 150ms on Android)
> > > especially when CPUs are fully packed. It was also spotted in CMA
> > > allocation even when there was on no memory pressure.
> >
> > Thanks for the input, Minchan!
> > In my tests I've seen 50-60ms delays in a single drain_all_pages but I
> > can imagine there are cases worse than these.
> >
> > >
> > > > proceeding. Currently this time is not accounted as psi memory stall.
> > >
> > > Good spot.
> > >
> > > >
> > > > While testing mobile devices under extreme memory pressure, when
> > > > allocations are failing during direct reclaim, we notices that psi
> > > > events which would be expected in such conditions were not triggered.
> > > > After profiling these cases it was determined that the reason for
> > > > missing psi events was that a big chunk of time spent in direct
> > > > reclaim is not accounted as memory stall, therefore psi would not
> > > > reach the levels at which an event is generated. Further investigation
> > > > revealed that the bulk of that unaccounted time was spent inside
> > > > drain_all_pages call.
> > > >
> > > > Annotate drain_all_pages and unreserve_highatomic_pageblock during
> > > > page allocation failure in the direct reclaim path so that delays
> > > > caused by these calls are accounted as memory stall.
> > > >
> > > > Reported-by: Tim Murray <timmurray@google.com>
> > > > Signed-off-by: Suren Baghdasaryan <surenb@google.com>
> > > > ---
> > > >  mm/page_alloc.c | 4 ++++
> > > >  1 file changed, 4 insertions(+)
> > > >
> > > > diff --git a/mm/page_alloc.c b/mm/page_alloc.c
> > > > index 3589febc6d31..7fd0d392b39b 100644
> > > > --- a/mm/page_alloc.c
> > > > +++ b/mm/page_alloc.c
> > > > @@ -4639,8 +4639,12 @@ __alloc_pages_direct_reclaim(gfp_t gfp_mask, unsigned int order,
> > > >        * Shrink them and try again
> > > >        */
> > > >       if (!page && !drained) {
> > > > +             unsigned long pflags;
> > > > +
> > > > +             psi_memstall_enter(&pflags);
> > > >               unreserve_highatomic_pageblock(ac, false);
> > > >               drain_all_pages(NULL);
> > > > +             psi_memstall_leave(&pflags);
> > >
> > > Instead of annotating the specific drain_all_pages, how about
> > > moving the annotation from __perform_reclaim to
> > > __alloc_pages_direct_reclaim?
> >
> > I'm fine with that approach too. Let's wait for Johannes' input before
> > I make any changes.
>
> I think the change makes sense, even if the workqueue fix speeds up
> the drain. I agree with Minchan about moving the annotation upward.
>
> With it moved, please feel free to add
> Acked-by: Johannes Weiner <hannes@cmpxchg.org>

Thanks Johannes!
I'll move psi_memstall_enter/psi_memstall_leave from __perform_reclaim
into __alloc_pages_direct_reclaim to cover it completely. After that
will continue on fixing the workqueue issue.
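Based on the ~v5.17 mm/page_alloc.c context quoted in this thread, the
agreed direction might look roughly like the sketch below (hedged: this
is an illustration of moving the annotation upward, not the actual v2
patch; error-handling details may differ):

```c
/* Sketch: psi_memstall_enter/leave moved out of __perform_reclaim so
 * the stall accounting covers the whole direct reclaim attempt,
 * including unreserve_highatomic_pageblock() and drain_all_pages().
 */
static inline struct page *
__alloc_pages_direct_reclaim(gfp_t gfp_mask, unsigned int order,
			     unsigned int alloc_flags,
			     const struct alloc_context *ac,
			     unsigned long *did_some_progress)
{
	struct page *page = NULL;
	unsigned long pflags;
	bool drained = false;

	psi_memstall_enter(&pflags);
	*did_some_progress = __perform_reclaim(gfp_mask, order, ac);
	if (unlikely(!(*did_some_progress)))
		goto out;

retry:
	page = get_page_from_freelist(gfp_mask, order, alloc_flags, ac);

	/*
	 * If an allocation failed after direct reclaim, it could be
	 * because pages are pinned on the per-cpu lists or in high
	 * alloc reserves. Shrink them and try again.
	 */
	if (!page && !drained) {
		unreserve_highatomic_pageblock(ac, false);
		drain_all_pages(NULL);
		drained = true;
		goto retry;
	}
out:
	psi_memstall_leave(&pflags);
	return page;
}
```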
Suren Baghdasaryan Feb. 23, 2022, 7:42 p.m. UTC | #11
On Wed, Feb 23, 2022 at 11:06 AM Suren Baghdasaryan <surenb@google.com> wrote:
>
> On Wed, Feb 23, 2022 at 10:54 AM Johannes Weiner <hannes@cmpxchg.org> wrote:
> >
> > On Sun, Feb 20, 2022 at 08:52:38AM -0800, Suren Baghdasaryan wrote:
> > > On Sat, Feb 19, 2022 at 4:40 PM Minchan Kim <minchan@kernel.org> wrote:
> > > >
> > > > On Sat, Feb 19, 2022 at 09:49:40AM -0800, Suren Baghdasaryan wrote:
> > > > > When page allocation in direct reclaim path fails, the system will
> > > > > make one attempt to shrink per-cpu page lists and free pages from
> > > > > high alloc reserves. Draining per-cpu pages into buddy allocator can
> > > > > be a very slow operation because it's done using workqueues and the
> > > > > task in direct reclaim waits for all of them to finish before
> > > >
> > > > Yes, drain_all_pages is serious slow(100ms - 150ms on Android)
> > > > especially when CPUs are fully packed. It was also spotted in CMA
> > > > allocation even when there was on no memory pressure.
> > >
> > > Thanks for the input, Minchan!
> > > In my tests I've seen 50-60ms delays in a single drain_all_pages but I
> > > can imagine there are cases worse than these.
> > >
> > > >
> > > > > proceeding. Currently this time is not accounted as psi memory stall.
> > > >
> > > > Good spot.
> > > >
> > > > >
> > > > > While testing mobile devices under extreme memory pressure, when
> > > > > allocations are failing during direct reclaim, we noticed that psi
> > > > > events which would be expected in such conditions were not triggered.
> > > > > After profiling these cases it was determined that the reason for
> > > > > missing psi events was that a big chunk of time spent in direct
> > > > > reclaim is not accounted as memory stall, therefore psi would not
> > > > > reach the levels at which an event is generated. Further investigation
> > > > > revealed that the bulk of that unaccounted time was spent inside
> > > > > drain_all_pages call.
> > > > >
> > > > > Annotate drain_all_pages and unreserve_highatomic_pageblock during
> > > > > page allocation failure in the direct reclaim path so that delays
> > > > > caused by these calls are accounted as memory stall.
> > > > >
> > > > > Reported-by: Tim Murray <timmurray@google.com>
> > > > > Signed-off-by: Suren Baghdasaryan <surenb@google.com>
> > > > > ---
> > > > >  mm/page_alloc.c | 4 ++++
> > > > >  1 file changed, 4 insertions(+)
> > > > >
> > > > > diff --git a/mm/page_alloc.c b/mm/page_alloc.c
> > > > > index 3589febc6d31..7fd0d392b39b 100644
> > > > > --- a/mm/page_alloc.c
> > > > > +++ b/mm/page_alloc.c
> > > > > @@ -4639,8 +4639,12 @@ __alloc_pages_direct_reclaim(gfp_t gfp_mask, unsigned int order,
> > > > >        * Shrink them and try again
> > > > >        */
> > > > >       if (!page && !drained) {
> > > > > +             unsigned long pflags;
> > > > > +
> > > > > +             psi_memstall_enter(&pflags);
> > > > >               unreserve_highatomic_pageblock(ac, false);
> > > > >               drain_all_pages(NULL);
> > > > > +             psi_memstall_leave(&pflags);
> > > >
> > > > Instead of annotating the specific drain_all_pages, how about
> > > > moving the annotation from __perform_reclaim to
> > > > __alloc_pages_direct_reclaim?
> > >
> > > I'm fine with that approach too. Let's wait for Johannes' input before
> > > I make any changes.
> >
> > I think the change makes sense, even if the workqueue fix speeds up
> > the drain. I agree with Minchan about moving the annotation upward.
> >
> > With it moved, please feel free to add
> > Acked-by: Johannes Weiner <hannes@cmpxchg.org>
>
> Thanks Johannes!
> I'll move psi_memstall_enter/psi_memstall_leave from __perform_reclaim
> into __alloc_pages_direct_reclaim to cover it completely. After that I
> will continue fixing the workqueue issue.

Posted v2 at https://lore.kernel.org/all/20220223194018.1296629-1-surenb@google.com/
Hillf Danton March 3, 2022, 2:59 a.m. UTC | #12
On Tue, 22 Feb 2022 11:47:01 -0800 Tim Murray  wrote:
> On Mon, Feb 21, 2022 at 12:55 AM Michal Hocko <mhocko@suse.com> wrote:
> > It would be cool to have some numbers here.
> 
> Are there any numbers beyond what Suren mentioned that would be
> useful? As one example, in a trace of a camera workload that I opened
> at random to check for drain_local_pages stalls, I saw the kworker
> that ran drain_local_pages stay at runnable for 68ms before getting
> any CPU time. I could try to query our trace corpus to find more
> examples, but they're not hard to find in individual traces already.
> 
> > If the draining is too slow and dependent on the current CPU/WQ
> > contention then we should address that. The original intention was that
> > having a dedicated WQ with WQ_MEM_RECLAIM would help to isolate the
> > operation from the rest of WQ activity. Maybe we need to fine tune
> > mm_percpu_wq. If that doesn't help then we should revise the WQ model
> > and use something else. Memory reclaim shouldn't really get stuck behind
> > other unrelated work.
> 
> In my experience, workqueues are easy to misuse and should be
> approached with a lot of care. For many workloads, they work fine 99%+
> of the time, but once you run into problems with scheduling delays for
> that workqueue, the only option is to stop using workqueues. If you
> have work that is system-initiated with minimal latency requirements
> (eg, some driver heartbeat every so often, devfreq governors, things
> like that), workqueues are great. If you have userspace-initiated work
> that should respect priority (eg, GPU command buffer submission in the
> critical path of UI) or latency-critical system-initiated work (eg,
> display synchronization around panel refresh), workqueues are the
> wrong choice because there is no RT capability. WQ_HIGHPRI has a minor
> impact, but it won't solve the fundamental problem if the system is
> under heavy enough load or if RT threads are involved. As Petr
> mentioned, the best solution for those cases seems to be "convert the
> workqueue to an RT kthread_worker." I've done that many times on many
> different Android devices over the years for latency-critical work,
> especially around GPU, display, and camera.

Feel free to list URLs for those latency-critical workloads, as I want to
understand why workqueues failed to fit in those scenarios.
> 
> In the drain_local_pages case, I think it is triggered by userspace
> work and should respect priority; I don't think a prio 50 RT task
> should be blocked waiting on a prio 120 (or prio 100 if WQ_HIGHPRI)
> kworker to be scheduled so it can run drain_local_pages. If that's a
> reasonable claim, then I think moving drain_local_pages away from
> workqueues is the best choice.

A prio-50 direct reclaimer implies a design failure in 99.1% of products.

Patch

diff --git a/mm/page_alloc.c b/mm/page_alloc.c
index 3589febc6d31..7fd0d392b39b 100644
--- a/mm/page_alloc.c
+++ b/mm/page_alloc.c
@@ -4639,8 +4639,12 @@  __alloc_pages_direct_reclaim(gfp_t gfp_mask, unsigned int order,
 	 * Shrink them and try again
 	 */
 	if (!page && !drained) {
+		unsigned long pflags;
+
+		psi_memstall_enter(&pflags);
 		unreserve_highatomic_pageblock(ac, false);
 		drain_all_pages(NULL);
+		psi_memstall_leave(&pflags);
 		drained = true;
 		goto retry;
 	}