
[v2,0/2] psi: enhance psi with the help of ebpf

Message ID 1585649077-10896-1-git-send-email-laoar.shao@gmail.com (mailing list archive)

Message

Yafang Shao March 31, 2020, 10:04 a.m. UTC
PSI gives us a powerful way to analyze memory pressure issues, but we can
make it more powerful with the help of tracepoints, kprobes, eBPF, etc.
Especially with eBPF we can flexibly get more details of the memory
pressure.

In order to achieve this goal, a new parameter is added to
psi_memstall_{enter, leave} which indicates the specific type of a
memstall. There are ten memstall types so far:
        MEMSTALL_KSWAPD
        MEMSTALL_RECLAIM_DIRECT
        MEMSTALL_RECLAIM_MEMCG
        MEMSTALL_RECLAIM_HIGH
        MEMSTALL_KCOMPACTD
        MEMSTALL_COMPACT
        MEMSTALL_WORKINGSET_REFAULT
        MEMSTALL_WORKINGSET_THRASH
        MEMSTALL_MEMDELAY
        MEMSTALL_SWAPIO
By tracing this newly added argument with a kprobe or tracepoint we
can know which type of memstall it is and then make the corresponding
improvement. It can also help us to analyze latency spikes caused by
memory pressure.

But note that we can't use it to build memory pressure for a specific type
of memstall, e.g. memcg pressure, compaction pressure, etc., because it
doesn't implement the various types of task->in_memstall, e.g.
task->in_memcgstall, task->in_compactionstall, etc.

Although there are already some tracepoints that can help us to achieve this
goal, e.g.
        vmscan:mm_vmscan_kswapd_{wake, sleep}
        vmscan:mm_vmscan_direct_reclaim_{begin, end}
        vmscan:mm_vmscan_memcg_reclaim_{begin, end}
        /* no tracepoint for memcg high reclaim */
        compaction:mm_compaction_kcompactd_{wake, sleep}
        compaction:mm_compaction_{begin, end}
        /* no tracepoint for workingset refault */
        /* no tracepoint for workingset thrashing */
        /* no tracepoint for memdelay */
        /* no tracepoint for swapio */
psi_memstall_{enter, leave} gives us a unified entry point for all
types of memstall, so we don't need to add the many begin and end
tracepoints that haven't been implemented yet.

Patch #2 gives an example of how to use it with eBPF. With the help of
eBPF we can trace a specific task, application, container, etc. It can
also help us to analyze the spread of latencies and whether they were
clustered at a point in time or spread out over long periods of time.

To summarize: with the pressure data in /proc/pressure/memory we know that
the system is under memory pressure, then with the newly added tracing
facility in this patchset we can identify the cause of that pressure,
and then think about how to improve it.
The workflow can be illustrated as below.

                   REASON         ACTION
                 | compaction   | improve compaction   |
                 | vmscan       | improve vmscan       |
Memory pressure -| workingset   | improve workingset   |
                 | etc          | ...                  |

Yafang Shao (2):
  psi: introduce various types of memstall
  psi, tracepoint: introduce tracepoints for psi_memstall_{enter, leave}

 block/blk-cgroup.c           |  4 ++--
 block/blk-core.c             |  4 ++--
 include/linux/psi.h          | 15 +++++++++++----
 include/linux/psi_types.h    | 13 +++++++++++++
 include/trace/events/sched.h | 41 +++++++++++++++++++++++++++++++++++++++++
 kernel/sched/psi.c           | 14 ++++++++++++--
 mm/compaction.c              |  4 ++--
 mm/filemap.c                 |  4 ++--
 mm/memcontrol.c              |  4 ++--
 mm/page_alloc.c              |  8 ++++----
 mm/page_io.c                 |  4 ++--
 mm/vmscan.c                  |  8 ++++----
 12 files changed, 97 insertions(+), 26 deletions(-)

Comments

Shakeel Butt July 15, 2020, 4:36 p.m. UTC | #1
Hi Yafang,

On Tue, Mar 31, 2020 at 3:05 AM Yafang Shao <laoar.shao@gmail.com> wrote:
>
> [...]
>

I have not looked at the patch series in detail but I wanted to get
your thoughts on whether it is possible to achieve what I am trying
to do with this patch series.

At the moment I am only interested in global reclaim, and I want to
enable alerts like "alert if there is a process stuck in global reclaim
for x seconds in the last y seconds window" or "alert if all the
processes are stuck in global reclaim for some z seconds".

I see that using this series I can identify global reclaim, but I am
wondering if alerts or notifications are possible. Android is using psi
monitors for such alerts, but it does not use cgroups, so most of its
memstalls are related to global reclaim. For a cgroup environment, do
we need to add support to the psi monitor similar to this patch
series?

thanks,
Shakeel
Yafang Shao July 16, 2020, 3:18 a.m. UTC | #2
On Thu, Jul 16, 2020 at 12:36 AM Shakeel Butt <shakeelb@google.com> wrote:
>
> Hi Yafang,
>
> On Tue, Mar 31, 2020 at 3:05 AM Yafang Shao <laoar.shao@gmail.com> wrote:
> >
> > [...]
> >
>
> I have not looked at the patch series in detail but I wanted to get
> your thoughts on whether it is possible to achieve what I am trying
> to do with this patch series.
>
> At the moment I am only interested in global reclaim, and I want to
> enable alerts like "alert if there is a process stuck in global reclaim
> for x seconds in the last y seconds window" or "alert if all the
> processes are stuck in global reclaim for some z seconds".
>
> I see that using this series I can identify global reclaim, but I am
> wondering if alerts or notifications are possible. Android is using psi
> monitors for such alerts, but it does not use cgroups, so most of its
> memstalls are related to global reclaim. For a cgroup environment, do
> we need to add support to the psi monitor similar to this patch
> series?
>

Hi Shakeel,

We use the PSI tracepoints in our kernel to analyze the individual
latency caused by memory pressure, but our PSI tracepoints are
implemented in a different form, as below:
    trace_psi_memstall_enter(_RET_IP_);
    trace_psi_memstall_leave(_RET_IP_);
We then use the _RET_IP_ to identify the specific PSI type.

If the _RET_IP_ is in try_to_free_mem_cgroup_pages(), it means the
pressure is caused by the memory cgroup, IOW, the limit of the memcg is
reached and it has to do memcg reclaim. Otherwise we can consider it
global memory pressure.
try_to_free_mem_cgroup_pages
    psi_memstall_enter
        if (static_branch_likely(&psi_disabled))
            return;
        *flags = current->in_memstall;
        if (*flags)
            return;
        trace_psi_memstall_enter(_RET_IP_);  <<<<< memcg pressure
Shakeel Butt July 16, 2020, 5:04 p.m. UTC | #3
On Wed, Jul 15, 2020 at 8:19 PM Yafang Shao <laoar.shao@gmail.com> wrote:
>
> On Thu, Jul 16, 2020 at 12:36 AM Shakeel Butt <shakeelb@google.com> wrote:
> >
> > Hi Yafang,
> >
> > On Tue, Mar 31, 2020 at 3:05 AM Yafang Shao <laoar.shao@gmail.com> wrote:
> > >
> > > [...]
> > >
> >
> > [...]
> >
>
> Hi Shakeel,
>
> We use the PSI tracepoints in our kernel to analyze the individual
> latency caused by memory pressure, but our PSI tracepoints are
> implemented in a different form, as below:
>     trace_psi_memstall_enter(_RET_IP_);
>     trace_psi_memstall_leave(_RET_IP_);
> We then use the _RET_IP_ to identify the specific PSI type.
>
> If the _RET_IP_ is in try_to_free_mem_cgroup_pages(), it means the
> pressure is caused by the memory cgroup, IOW, the limit of the memcg is
> reached and it has to do memcg reclaim. Otherwise we can consider it
> global memory pressure.
> try_to_free_mem_cgroup_pages
>     psi_memstall_enter
>         if (static_branch_likely(&psi_disabled))
>             return;
>         *flags = current->in_memstall;
>         if (*flags)
>             return;
>         trace_psi_memstall_enter(_RET_IP_);  <<<<< memcg pressure
>

Thanks for the response. I am looking for 'always on' monitoring, more
specifically defining system-level SLIs based on PSI. My concern
with ftrace is its global shared state, and also that it is not really
meant for 'always on' monitoring. You have mentioned ebpf. Is ebpf fine
for 'always on' monitoring, and is it possible for ebpf to notify user
space on specific conditions (e.g. a process stuck in global reclaim
for 60 seconds)?

thanks,
Shakeel
Yafang Shao July 17, 2020, 1:43 a.m. UTC | #4
On Fri, Jul 17, 2020 at 1:04 AM Shakeel Butt <shakeelb@google.com> wrote:
>
> On Wed, Jul 15, 2020 at 8:19 PM Yafang Shao <laoar.shao@gmail.com> wrote:
> >
> > [...]
> >
>
> Thanks for the response. I am looking for 'always on' monitoring, more
> specifically defining system-level SLIs based on PSI. My concern
> with ftrace is its global shared state, and also that it is not really
> meant for 'always on' monitoring. You have mentioned ebpf. Is ebpf fine
> for 'always on' monitoring, and is it possible for ebpf to notify user
> space on specific conditions (e.g. a process stuck in global reclaim
> for 60 seconds)?
>

ebpf is fine for 'always on' monitoring in my experience, but I'm
not sure whether it is possible to notify user space on specific
conditions.
Notifying user space would be a useful feature, so I think we can give it a try.