
mm/memcg: add allocstall to memory.stat

Message ID 1554983991-16769-1-git-send-email-laoar.shao@gmail.com (mailing list archive)
State New, archived
Series mm/memcg: add allocstall to memory.stat

Commit Message

Yafang Shao April 11, 2019, 11:59 a.m. UTC
The current item 'pgscan' is for pages in the memcg,
which indicates how many pages owned by this memcg are scanned.
However, these pages may not be scanned by the tasks in this memcg, even for
PGSCAN_DIRECT.

Sometimes we need an item to indicate whether the tasks in this memcg
are under memory pressure or not.
So this new item, allocstall, is added to memory.stat.

Signed-off-by: Yafang Shao <laoar.shao@gmail.com>
---
 Documentation/admin-guide/cgroup-v2.rst |  3 +++
 include/linux/memcontrol.h              | 18 ++++++++++++++++++
 mm/memcontrol.c                         | 18 +-----------------
 mm/vmscan.c                             |  2 ++
 4 files changed, 24 insertions(+), 17 deletions(-)

Comments

Michal Hocko April 11, 2019, 12:26 p.m. UTC | #1
On Thu 11-04-19 19:59:51, Yafang Shao wrote:
> The current item 'pgscan' is for pages in the memcg,
> which indicates how many pages owned by this memcg are scanned.
> However, these pages may not be scanned by the tasks in this memcg, even for
> PGSCAN_DIRECT.

> Sometimes we need an item to indicate whether the tasks in this memcg
> are under memory pressure or not.
> So this new item, allocstall, is added to memory.stat.

We do have memcg events for that purpose and those can even tell whether
the pressure is a result of high or hard limit. Why is this not
sufficient?
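
(For context, the events referred to here are the per-cgroup memory.events
counters of cgroup v2. A rough, abridged sketch of the backing enum,
paraphrased from include/linux/memcontrol.h rather than quoted verbatim,
with the documented meaning of each field summarized in the comments:)

enum memcg_memory_event {
	MEMCG_LOW,		/* reclaimed although usage was under the low protection */
	MEMCG_HIGH,		/* usage went over the high boundary */
	MEMCG_MAX,		/* usage was about to exceed the hard limit, direct reclaim ran */
	MEMCG_OOM,		/* the limit was hit and reclaim could not make progress */
	MEMCG_OOM_KILL,		/* an OOM kill happened in this cgroup */
	MEMCG_NR_MEMORY_EVENTS,
};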

> Signed-off-by: Yafang Shao <laoar.shao@gmail.com>
Yafang Shao April 11, 2019, 12:41 p.m. UTC | #2
On Thu, Apr 11, 2019 at 8:27 PM Michal Hocko <mhocko@kernel.org> wrote:
>
> On Thu 11-04-19 19:59:51, Yafang Shao wrote:
> > The current item 'pgscan' is for pages in the memcg,
> > which indicates how many pages owned by this memcg are scanned.
> > However, these pages may not be scanned by the tasks in this memcg, even for
> > PGSCAN_DIRECT.
> >
> > Sometimes we need an item to indicate whether the tasks in this memcg
> > are under memory pressure or not.
> > So this new item, allocstall, is added to memory.stat.
>
> We do have memcg events for that purpose and those can even tell whether
> the pressure is a result of high or hard limit. Why is this not
> sufficient?
>

The MEMCG_HIGH and MEMCG_LOW events may not be triggered by the tasks in this
memcg either.
They all reflect the memory status of a memcg, rather than the activity of
the tasks in this memcg.

> > Signed-off-by: Yafang Shao <laoar.shao@gmail.com>
> --
> Michal Hocko
> SUSE Labs
Michal Hocko April 11, 2019, 1:39 p.m. UTC | #3
On Thu 11-04-19 20:41:32, Yafang Shao wrote:
> On Thu, Apr 11, 2019 at 8:27 PM Michal Hocko <mhocko@kernel.org> wrote:
> >
> > On Thu 11-04-19 19:59:51, Yafang Shao wrote:
> > > The current item 'pgscan' is for pages in the memcg,
> > > which indicates how many pages owned by this memcg are scanned.
> > > However, these pages may not be scanned by the tasks in this memcg, even for
> > > PGSCAN_DIRECT.
> > >
> > > Sometimes we need an item to indicate whether the tasks in this memcg
> > > are under memory pressure or not.
> > > So this new item, allocstall, is added to memory.stat.
> >
> > We do have memcg events for that purpose and those can even tell whether
> > the pressure is a result of high or hard limit. Why is this not
> > sufficient?
> >
> 
> The MEMCG_HIGH and MEMCG_LOW events may not be triggered by the tasks in this
> memcg either.
> They all reflect the memory status of a memcg, rather than the activity of
> the tasks in this memcg.

I do not follow. Can you give me an example of when this matters? I
thought it is more important to see that there is reclaim activity
for a specific memcg, as you account it to that memcg.
If you want to see/measure a reclaim-imposed latency on a task then
the counter doesn't make much sense, as you have no way to match that
to a task. We have tracepoints for that purpose.
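
(Presumably the tracepoints meant here are the vmscan ones defined in
include/trace/events/vmscan.h, which bracket a task's stay in direct
reclaim; sketched below with argument lists omitted, since those have
changed between kernel releases:)

/*
 * Reclaim latency of a given task can be measured by pairing up the
 * existing tracepoints (names are real, arguments omitted):
 *
 *   mm_vmscan_direct_reclaim_begin / mm_vmscan_direct_reclaim_end
 *       - the allocating task enters/leaves global direct reclaim
 *   mm_vmscan_memcg_reclaim_begin / mm_vmscan_memcg_reclaim_end
 *       - direct reclaim triggered by hitting a memcg limit
 */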
Yafang Shao April 11, 2019, 1:54 p.m. UTC | #4
On Thu, Apr 11, 2019 at 9:39 PM Michal Hocko <mhocko@kernel.org> wrote:
>
> On Thu 11-04-19 20:41:32, Yafang Shao wrote:
> > On Thu, Apr 11, 2019 at 8:27 PM Michal Hocko <mhocko@kernel.org> wrote:
> > >
> > > On Thu 11-04-19 19:59:51, Yafang Shao wrote:
> > > > The current item 'pgscan' is for pages in the memcg,
> > > > which indicates how many pages owned by this memcg are scanned.
> > > > However, these pages may not be scanned by the tasks in this memcg, even for
> > > > PGSCAN_DIRECT.
> > > >
> > > > Sometimes we need an item to indicate whether the tasks in this memcg
> > > > are under memory pressure or not.
> > > > So this new item, allocstall, is added to memory.stat.
> > >
> > > We do have memcg events for that purpose and those can even tell whether
> > > the pressure is a result of high or hard limit. Why is this not
> > > sufficient?
> > >
> >
> > The MEMCG_HIGH and MEMCG_LOW events may not be triggered by the tasks in this
> > memcg either.
> > They all reflect the memory status of a memcg, rather than the activity of
> > the tasks in this memcg.
>
> I do not follow. Can you give me an example of when this matters? I

For example, the tasks in this memcg may encounter direct page reclaim
due to system memory pressure,
meaning they are stalling in the page allocation slow path.
At the same time, maybe there's no memory pressure in this memcg itself;
I mean, charging the memcg can still succeed.


> thought it is more important to see that there is reclaim activity
> for a specific memcg, as you account it to that memcg.
> If you want to see/measure a reclaim-imposed latency on a task then
> the counter doesn't make much sense, as you have no way to match that
> to a task. We have tracepoints for that purpose.

By the way, I submitted a patch to enhance the memcg
tracepoints several days ago;
please help take a look.

Thanks
Yafang
Michal Hocko April 11, 2019, 3:10 p.m. UTC | #5
On Thu 11-04-19 21:54:22, Yafang Shao wrote:
> On Thu, Apr 11, 2019 at 9:39 PM Michal Hocko <mhocko@kernel.org> wrote:
> >
> > On Thu 11-04-19 20:41:32, Yafang Shao wrote:
> > > On Thu, Apr 11, 2019 at 8:27 PM Michal Hocko <mhocko@kernel.org> wrote:
> > > >
> > > > On Thu 11-04-19 19:59:51, Yafang Shao wrote:
> > > > > The current item 'pgscan' is for pages in the memcg,
> > > > > which indicates how many pages owned by this memcg are scanned.
> > > > > However, these pages may not be scanned by the tasks in this memcg, even for
> > > > > PGSCAN_DIRECT.
> > > > >
> > > > > Sometimes we need an item to indicate whether the tasks in this memcg
> > > > > are under memory pressure or not.
> > > > > So this new item, allocstall, is added to memory.stat.
> > > >
> > > > We do have memcg events for that purpose and those can even tell whether
> > > > the pressure is a result of high or hard limit. Why is this not
> > > > sufficient?
> > > >
> > >
> > > The MEMCG_HIGH and MEMCG_LOW events may not be triggered by the tasks in this
> > > memcg either.
> > > They all reflect the memory status of a memcg, rather than the activity of
> > > the tasks in this memcg.
> >
> > I do not follow. Can you give me an example of when this matters? I
> 
> For example, the tasks in this memcg may encounter direct page reclaim
> due to system memory pressure,
> meaning they are stalling in the page allocation slow path.
> At the same time, maybe there's no memory pressure in this memcg itself;
> I mean, charging the memcg can still succeed.

And that is exactly what those events aim for. They are measuring
_where_ the memory pressure comes from.

Can you please try to explain what you want to achieve again?
Yafang Shao April 12, 2019, 1:32 a.m. UTC | #6
On Thu, Apr 11, 2019 at 11:10 PM Michal Hocko <mhocko@kernel.org> wrote:
>
> On Thu 11-04-19 21:54:22, Yafang Shao wrote:
> > On Thu, Apr 11, 2019 at 9:39 PM Michal Hocko <mhocko@kernel.org> wrote:
> > >
> > > On Thu 11-04-19 20:41:32, Yafang Shao wrote:
> > > > On Thu, Apr 11, 2019 at 8:27 PM Michal Hocko <mhocko@kernel.org> wrote:
> > > > >
> > > > > On Thu 11-04-19 19:59:51, Yafang Shao wrote:
> > > > > > The current item 'pgscan' is for pages in the memcg,
> > > > > > which indicates how many pages owned by this memcg are scanned.
> > > > > > However, these pages may not be scanned by the tasks in this memcg, even for
> > > > > > PGSCAN_DIRECT.
> > > > > >
> > > > > > Sometimes we need an item to indicate whether the tasks in this memcg
> > > > > > are under memory pressure or not.
> > > > > > So this new item, allocstall, is added to memory.stat.
> > > > >
> > > > > We do have memcg events for that purpose and those can even tell whether
> > > > > the pressure is a result of high or hard limit. Why is this not
> > > > > sufficient?
> > > > >
> > > >
> > > > The MEMCG_HIGH and MEMCG_LOW events may not be triggered by the tasks in this
> > > > memcg either.
> > > > They all reflect the memory status of a memcg, rather than the activity of
> > > > the tasks in this memcg.
> > >
> > > I do not follow. Can you give me an example of when this matters? I
> >
> > For example, the tasks in this memcg may encounter direct page reclaim
> > due to system memory pressure,
> > meaning they are stalling in the page allocation slow path.
> > At the same time, maybe there's no memory pressure in this memcg itself;
> > I mean, charging the memcg can still succeed.
>
> And that is exactly what those events aim for. They are measuring
> _where_ the memory pressure comes from.
>
> Can you please try to explain what you want to achieve again?

To know the impact of this memory pressure.
The current events can tell us the source of this pressure, but can't
tell us the impact of this pressure.
The tracepoints are always off until we hit some issue and then turn them on;
while the event counter is more lightweight, and with it we can know
which memcg is suffering this pressure.

Thanks
Yafang
Michal Hocko April 12, 2019, 6:34 a.m. UTC | #7
On Fri 12-04-19 09:32:55, Yafang Shao wrote:
> On Thu, Apr 11, 2019 at 11:10 PM Michal Hocko <mhocko@kernel.org> wrote:
> >
> > On Thu 11-04-19 21:54:22, Yafang Shao wrote:
> > > On Thu, Apr 11, 2019 at 9:39 PM Michal Hocko <mhocko@kernel.org> wrote:
> > > >
> > > > On Thu 11-04-19 20:41:32, Yafang Shao wrote:
> > > > > On Thu, Apr 11, 2019 at 8:27 PM Michal Hocko <mhocko@kernel.org> wrote:
> > > > > >
> > > > > > On Thu 11-04-19 19:59:51, Yafang Shao wrote:
> > > > > > > The current item 'pgscan' is for pages in the memcg,
> > > > > > > which indicates how many pages owned by this memcg are scanned.
> > > > > > > However, these pages may not be scanned by the tasks in this memcg, even for
> > > > > > > PGSCAN_DIRECT.
> > > > > > >
> > > > > > > Sometimes we need an item to indicate whether the tasks in this memcg
> > > > > > > are under memory pressure or not.
> > > > > > > So this new item, allocstall, is added to memory.stat.
> > > > > >
> > > > > > We do have memcg events for that purpose and those can even tell whether
> > > > > > the pressure is a result of high or hard limit. Why is this not
> > > > > > sufficient?
> > > > > >
> > > > >
> > > > > The MEMCG_HIGH and MEMCG_LOW events may not be triggered by the tasks in this
> > > > > memcg either.
> > > > > They all reflect the memory status of a memcg, rather than the activity of
> > > > > the tasks in this memcg.
> > > >
> > > > I do not follow. Can you give me an example of when this matters? I
> > >
> > > For example, the tasks in this memcg may encounter direct page reclaim
> > > due to system memory pressure,
> > > meaning they are stalling in the page allocation slow path.
> > > At the same time, maybe there's no memory pressure in this memcg itself;
> > > I mean, charging the memcg can still succeed.
> >
> > And that is exactly what those events aim for. They are measuring
> > _where_ the memory pressure comes from.
> >
> > Can you please try to explain what you want to achieve again?
> 
> To know the impact of this memory pressure.
> The current events can tell us the source of this pressure, but can't
> tell us the impact of this pressure.

Can you give me a more specific example of how you are going to use this
counter in real life, please?
Yafang Shao April 12, 2019, 8:10 a.m. UTC | #8
On Fri, Apr 12, 2019 at 2:34 PM Michal Hocko <mhocko@kernel.org> wrote:
>
> On Fri 12-04-19 09:32:55, Yafang Shao wrote:
> > On Thu, Apr 11, 2019 at 11:10 PM Michal Hocko <mhocko@kernel.org> wrote:
> > >
> > > On Thu 11-04-19 21:54:22, Yafang Shao wrote:
> > > > On Thu, Apr 11, 2019 at 9:39 PM Michal Hocko <mhocko@kernel.org> wrote:
> > > > >
> > > > > On Thu 11-04-19 20:41:32, Yafang Shao wrote:
> > > > > > On Thu, Apr 11, 2019 at 8:27 PM Michal Hocko <mhocko@kernel.org> wrote:
> > > > > > >
> > > > > > > On Thu 11-04-19 19:59:51, Yafang Shao wrote:
> > > > > > > > The current item 'pgscan' is for pages in the memcg,
> > > > > > > > which indicates how many pages owned by this memcg are scanned.
> > > > > > > > However, these pages may not be scanned by the tasks in this memcg, even for
> > > > > > > > PGSCAN_DIRECT.
> > > > > > > >
> > > > > > > > Sometimes we need an item to indicate whether the tasks in this memcg
> > > > > > > > are under memory pressure or not.
> > > > > > > > So this new item, allocstall, is added to memory.stat.
> > > > > > >
> > > > > > > We do have memcg events for that purpose and those can even tell whether
> > > > > > > the pressure is a result of high or hard limit. Why is this not
> > > > > > > sufficient?
> > > > > > >
> > > > > >
> > > > > > The MEMCG_HIGH and MEMCG_LOW events may not be triggered by the tasks in this
> > > > > > memcg either.
> > > > > > They all reflect the memory status of a memcg, rather than the activity of
> > > > > > the tasks in this memcg.
> > > > >
> > > > > I do not follow. Can you give me an example of when this matters? I
> > > >
> > > > For example, the tasks in this memcg may encounter direct page reclaim
> > > > due to system memory pressure,
> > > > meaning they are stalling in the page allocation slow path.
> > > > At the same time, maybe there's no memory pressure in this memcg itself;
> > > > I mean, charging the memcg can still succeed.
> > >
> > > And that is exactly what those events aim for. They are measuring
> > > _where_ the memory pressure comes from.
> > >
> > > Can you please try to explain what you want to achieve again?
> >
> > To know the impact of this memory pressure.
> > The current events can tell us the source of this pressure, but can't
> > tell us the impact of this pressure.
>
> Can you give me a more specific example of how you are going to use this
> counter in real life, please?

When we find this counter increasing, we know that the applications in
this memcg are suffering memory pressure.
Then we can do some tracing for this memcg, e.g. to trace how long the
applications may stall via tracepoints.
(But the current tracepoints can't trace a specified cgroup only; that's
another point to be improved.)
I'm not sure whether it is a good practice, but it can help us.

Thanks
Yafang
Michal Hocko April 12, 2019, 9:09 a.m. UTC | #9
On Fri 12-04-19 16:10:29, Yafang Shao wrote:
> On Fri, Apr 12, 2019 at 2:34 PM Michal Hocko <mhocko@kernel.org> wrote:
> >
> > On Fri 12-04-19 09:32:55, Yafang Shao wrote:
> > > On Thu, Apr 11, 2019 at 11:10 PM Michal Hocko <mhocko@kernel.org> wrote:
> > > >
> > > > On Thu 11-04-19 21:54:22, Yafang Shao wrote:
> > > > > On Thu, Apr 11, 2019 at 9:39 PM Michal Hocko <mhocko@kernel.org> wrote:
> > > > > >
> > > > > > On Thu 11-04-19 20:41:32, Yafang Shao wrote:
> > > > > > > On Thu, Apr 11, 2019 at 8:27 PM Michal Hocko <mhocko@kernel.org> wrote:
> > > > > > > >
> > > > > > > > On Thu 11-04-19 19:59:51, Yafang Shao wrote:
> > > > > > > > > The current item 'pgscan' is for pages in the memcg,
> > > > > > > > > which indicates how many pages owned by this memcg are scanned.
> > > > > > > > > However, these pages may not be scanned by the tasks in this memcg, even for
> > > > > > > > > PGSCAN_DIRECT.
> > > > > > > > >
> > > > > > > > > Sometimes we need an item to indicate whether the tasks in this memcg
> > > > > > > > > are under memory pressure or not.
> > > > > > > > > So this new item, allocstall, is added to memory.stat.
> > > > > > > >
> > > > > > > > We do have memcg events for that purpose and those can even tell whether
> > > > > > > > the pressure is a result of high or hard limit. Why is this not
> > > > > > > > sufficient?
> > > > > > > >
> > > > > > >
> > > > > > > The MEMCG_HIGH and MEMCG_LOW events may not be triggered by the tasks in this
> > > > > > > memcg either.
> > > > > > > They all reflect the memory status of a memcg, rather than the activity of
> > > > > > > the tasks in this memcg.
> > > > > >
> > > > > > I do not follow. Can you give me an example of when this matters? I
> > > > >
> > > > > For example, the tasks in this memcg may encounter direct page reclaim
> > > > > due to system memory pressure,
> > > > > meaning they are stalling in the page allocation slow path.
> > > > > At the same time, maybe there's no memory pressure in this memcg itself;
> > > > > I mean, charging the memcg can still succeed.
> > > >
> > > > And that is exactly what those events aim for. They are measuring
> > > > _where_ the memory pressure comes from.
> > > >
> > > > Can you please try to explain what you want to achieve again?
> > >
> > > To know the impact of this memory pressure.
> > > The current events can tell us the source of this pressure, but can't
> > > tell us the impact of this pressure.
> >
> > Can you give me a more specific example of how you are going to use this
> > counter in real life, please?
> 
> When we find this counter increasing, we know that the applications in
> this memcg are suffering memory pressure.

We do have pgscan/pgsteal counters that tell you that the memcg is being
reclaimed. If you see those numbers increasing then you know there is
memory pressure. Along with the reclaim events you can tell whether this is
internal or external memory pressure. Sure, you cannot distinguish
kswapd from the direct reclaim, but is this really so important? You have
other means to find out that direct reclaim is happening, and more
importantly a higher latency might be a result of kswapd reclaiming
memory as well (swap-in or an expensive page-in from remote storage,
etc.).

The reason why I do not really like the new counter as you implemented
it is that it mixes task/memcg scopes. Say you are hitting the memcg
direct reclaim in a memcg A but the task is deeper in A's hierarchy, in
a memcg B. Unless I have misread your patch, it will be B that accounts
for allocstall while it is A's hierarchy that gets directly reclaimed.
B doesn't even have to be reclaimed at all if we manage to reclaim
others. So this is really confusing.
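
(To make the scope mix concrete, this is the relevant part of the vmscan.c
hunk at the end of this thread, annotated; the A/B naming follows the
example above and the comment is an interpretation, not part of the patch:)

static unsigned long do_try_to_free_pages(struct zonelist *zonelist,
					  struct scan_control *sc)
{
	...
	if (global_reclaim(sc))
		__count_zid_vm_events(ALLOCSTALL, sc->reclaim_idx, 1);

	/*
	 * sc->target_mem_cgroup is the memcg whose limit triggered this
	 * reclaim (A in the example above), while get_mem_cgroup_from_current()
	 * returns the memcg the stalling task itself lives in (B, deeper in
	 * A's hierarchy).  The proposed counter is charged to B, not to A.
	 */
	count_memcg_events(get_mem_cgroup_from_current(), MEMCG_ALLOCSTALL, 1);
	...
}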

> Then we can do some tracing for this memcg, e.g. to trace how long the
> applications may stall via tracepoints.
> (But the current tracepoints can't trace a specified cgroup only; that's
> another point to be improved.)

It is a task that is stalled, not a cgroup.
Yafang Shao April 12, 2019, 9:29 a.m. UTC | #10
On Fri, Apr 12, 2019 at 5:09 PM Michal Hocko <mhocko@kernel.org> wrote:
>
> On Fri 12-04-19 16:10:29, Yafang Shao wrote:
> > On Fri, Apr 12, 2019 at 2:34 PM Michal Hocko <mhocko@kernel.org> wrote:
> > >
> > > On Fri 12-04-19 09:32:55, Yafang Shao wrote:
> > > > On Thu, Apr 11, 2019 at 11:10 PM Michal Hocko <mhocko@kernel.org> wrote:
> > > > >
> > > > > On Thu 11-04-19 21:54:22, Yafang Shao wrote:
> > > > > > On Thu, Apr 11, 2019 at 9:39 PM Michal Hocko <mhocko@kernel.org> wrote:
> > > > > > >
> > > > > > > On Thu 11-04-19 20:41:32, Yafang Shao wrote:
> > > > > > > > On Thu, Apr 11, 2019 at 8:27 PM Michal Hocko <mhocko@kernel.org> wrote:
> > > > > > > > >
> > > > > > > > > On Thu 11-04-19 19:59:51, Yafang Shao wrote:
> > > > > > > > > > The current item 'pgscan' is for pages in the memcg,
> > > > > > > > > > which indicates how many pages owned by this memcg are scanned.
> > > > > > > > > > However, these pages may not be scanned by the tasks in this memcg, even for
> > > > > > > > > > PGSCAN_DIRECT.
> > > > > > > > > >
> > > > > > > > > > Sometimes we need an item to indicate whether the tasks in this memcg
> > > > > > > > > > are under memory pressure or not.
> > > > > > > > > > So this new item, allocstall, is added to memory.stat.
> > > > > > > > >
> > > > > > > > > We do have memcg events for that purpose and those can even tell whether
> > > > > > > > > the pressure is a result of high or hard limit. Why is this not
> > > > > > > > > sufficient?
> > > > > > > > >
> > > > > > > >
> > > > > > > > The MEMCG_HIGH and MEMCG_LOW events may not be triggered by the tasks in this
> > > > > > > > memcg either.
> > > > > > > > They all reflect the memory status of a memcg, rather than the activity of
> > > > > > > > the tasks in this memcg.
> > > > > > >
> > > > > > > I do not follow. Can you give me an example of when this matters? I
> > > > > >
> > > > > > For example, the tasks in this memcg may encounter direct page reclaim
> > > > > > due to system memory pressure,
> > > > > > meaning they are stalling in the page allocation slow path.
> > > > > > At the same time, maybe there's no memory pressure in this memcg itself;
> > > > > > I mean, charging the memcg can still succeed.
> > > > >
> > > > > And that is exactly what those events aim for. They are measuring
> > > > > _where_ the memory pressure comes from.
> > > > >
> > > > > Can you please try to explain what you want to achieve again?
> > > >
> > > > To know the impact of this memory pressure.
> > > > The current events can tell us the source of this pressure, but can't
> > > > tell us the impact of this pressure.
> > >
> > > Can you give me a more specific example of how you are going to use this
> > > counter in real life, please?
> >
> > When we find this counter increasing, we know that the applications in
> > this memcg are suffering memory pressure.
>
> We do have pgscan/pgsteal counters that tell you that the memcg is being
> reclaimed. If you see those numbers increasing then you know there is
> memory pressure. Along with the reclaim events you can tell whether this is
> internal or external memory pressure. Sure, you cannot distinguish
> kswapd from the direct reclaim, but is this really so important? You have
> other means to find out that direct reclaim is happening, and more
> importantly a higher latency might be a result of kswapd reclaiming
> memory as well (swap-in or an expensive page-in from remote storage,
> etc.).
>
> The reason why I do not really like the new counter as you implemented
> it is that it mixes task/memcg scopes. Say you are hitting the memcg
> direct reclaim in a memcg A but the task is deeper in A's hierarchy, in
> a memcg B. Unless I have misread your patch, it will be B that accounts
> for allocstall while it is A's hierarchy that gets directly reclaimed.
> B doesn't even have to be reclaimed at all if we manage to reclaim
> others. So this is really confusing.
>

I have to admit that it really mixes task/memcg scopes,
so let's drop this part.

> > Then we can do some tracing for this memcg, e.g. to trace how long the
> > applications may stall via tracepoints.
> > (But the current tracepoints can't trace a specified cgroup only; that's
> > another point to be improved.)
>
> It is a task that is stalled, not a cgroup.
>

But these tracepoints can't filter a specified task either.

Thanks
Yafang
Michal Hocko April 12, 2019, 9:36 a.m. UTC | #11
On Fri 12-04-19 17:29:04, Yafang Shao wrote:
> On Fri, Apr 12, 2019 at 5:09 PM Michal Hocko <mhocko@kernel.org> wrote:
[...]
> > > Then we can do some tracing for this memcg, e.g. to trace how long the
> > > applications may stall via tracepoints.
> > > (But the current tracepoints can't trace a specified cgroup only; that's
> > > another point to be improved.)
> >
> > It is a task that is stalled, not a cgroup.
> 
> But these tracepoints can't filter a specified task either.

each trace line output should contain a pid, no?
Yafang Shao April 12, 2019, 9:48 a.m. UTC | #12
On Fri, Apr 12, 2019 at 5:36 PM Michal Hocko <mhocko@kernel.org> wrote:
>
> On Fri 12-04-19 17:29:04, Yafang Shao wrote:
> > On Fri, Apr 12, 2019 at 5:09 PM Michal Hocko <mhocko@kernel.org> wrote:
> [...]
> > > > Then we can do some tracing for this memcg, e.g. to trace how long the
> > > > applications may stall via tracepoints.
> > > > (But the current tracepoints can't trace a specified cgroup only; that's
> > > > another point to be improved.)
> > >
> > > It is a task that is stalled, not a cgroup.
> >
> > But these tracepoints can't filter a specified task either.
>
> each trace line output should contain a pid, no?

But that's not enough.

Some drawbacks:
- The PID is variable, and it is not so convenient to get the task
from this PID,
  e.g. when you use the PID to find the task, it may have already exited and
you get nothing.
- The trace lines don't always contain the task names.
- If we don't filter the tasks with a tracepoint filter, there may be
lots of output.
  E.g. we always deploy lots of cgroups on a single host, but only some
of them are important,
  while the others are not. So we limit the unimportant
cgroups to a low memory limit,
  and then the tasks in them may do frequent memcg reclaim, but we don't care.

Thanks

Yafang

Patch

diff --git a/Documentation/admin-guide/cgroup-v2.rst b/Documentation/admin-guide/cgroup-v2.rst
index 19c4e78..a06f17a 100644
--- a/Documentation/admin-guide/cgroup-v2.rst
+++ b/Documentation/admin-guide/cgroup-v2.rst
@@ -1221,6 +1221,9 @@  PAGE_SIZE multiple when read back.
 		Part of "slab" that cannot be reclaimed on memory
 		pressure.
 
+          allocstall
+                The number of times the tasks in this memcg entered direct reclaim
+
 	  pgfault
 		Total number of page faults incurred
 
diff --git a/include/linux/memcontrol.h b/include/linux/memcontrol.h
index 1565831..7fe9c57 100644
--- a/include/linux/memcontrol.h
+++ b/include/linux/memcontrol.h
@@ -45,6 +45,7 @@  enum memcg_stat_item {
 	MEMCG_SOCK,
 	/* XXX: why are these zone and not node counters? */
 	MEMCG_KERNEL_STACK_KB,
+	MEMCG_ALLOCSTALL,
 	MEMCG_NR_STAT,
 };
 
@@ -412,6 +413,23 @@  static inline struct lruvec *mem_cgroup_lruvec(struct pglist_data *pgdat,
 
 struct mem_cgroup *get_mem_cgroup_from_page(struct page *page);
 
+/**
+ * If current->active_memcg is non-NULL, do not fallback to current->mm->memcg.
+ */
+static __always_inline struct mem_cgroup *get_mem_cgroup_from_current(void)
+{
+	if (unlikely(current->active_memcg)) {
+		struct mem_cgroup *memcg = root_mem_cgroup;
+
+		rcu_read_lock();
+		if (css_tryget_online(&current->active_memcg->css))
+			memcg = current->active_memcg;
+		rcu_read_unlock();
+		return memcg;
+	}
+	return get_mem_cgroup_from_mm(current->mm);
+}
+
 static inline
 struct mem_cgroup *mem_cgroup_from_css(struct cgroup_subsys_state *css){
 	return css ? container_of(css, struct mem_cgroup, css) : NULL;
diff --git a/mm/memcontrol.c b/mm/memcontrol.c
index 10af4dd..780659f9 100644
--- a/mm/memcontrol.c
+++ b/mm/memcontrol.c
@@ -853,23 +853,6 @@  struct mem_cgroup *get_mem_cgroup_from_page(struct page *page)
 EXPORT_SYMBOL(get_mem_cgroup_from_page);
 
 /**
- * If current->active_memcg is non-NULL, do not fallback to current->mm->memcg.
- */
-static __always_inline struct mem_cgroup *get_mem_cgroup_from_current(void)
-{
-	if (unlikely(current->active_memcg)) {
-		struct mem_cgroup *memcg = root_mem_cgroup;
-
-		rcu_read_lock();
-		if (css_tryget_online(&current->active_memcg->css))
-			memcg = current->active_memcg;
-		rcu_read_unlock();
-		return memcg;
-	}
-	return get_mem_cgroup_from_mm(current->mm);
-}
-
-/**
  * mem_cgroup_iter - iterate over memory cgroup hierarchy
  * @root: hierarchy root
  * @prev: previously returned memcg, NULL on first invocation
@@ -5624,6 +5607,7 @@  static int memory_stat_show(struct seq_file *m, void *v)
 
 	/* Accumulated memory events */
 
+	seq_printf(m, "allocstall %lu\n", acc.vmevents[MEMCG_ALLOCSTALL]);
 	seq_printf(m, "pgfault %lu\n", acc.vmevents[PGFAULT]);
 	seq_printf(m, "pgmajfault %lu\n", acc.vmevents[PGMAJFAULT]);
 
diff --git a/mm/vmscan.c b/mm/vmscan.c
index 347c9b3..3ff8b1b 100644
--- a/mm/vmscan.c
+++ b/mm/vmscan.c
@@ -3024,6 +3024,8 @@  static unsigned long do_try_to_free_pages(struct zonelist *zonelist,
 	if (global_reclaim(sc))
 		__count_zid_vm_events(ALLOCSTALL, sc->reclaim_idx, 1);
 
+	count_memcg_events(get_mem_cgroup_from_current(), MEMCG_ALLOCSTALL, 1);
+
 	do {
 		vmpressure_prio(sc->gfp_mask, sc->target_mem_cgroup,
 				sc->priority);