mbox series

[RFC,0/2] Remove shrinker's nr_deferred

Message ID 20200916185823.5347-1-shy828301@gmail.com (mailing list archive)
Headers show
Series Remove shrinker's nr_deferred | expand

Message

Yang Shi Sept. 16, 2020, 6:58 p.m. UTC
Recently huge amount one-off slab drop was seen on some vfs metadata heavy workloads,
it turned out there were huge amount accumulated nr_deferred objects seen by the
shrinker.

I managed to reproduce this problem with kernel build workload plus negative dentry
generator.

First step, run the below kernel build test script:

NR_CPUS=`cat /proc/cpuinfo | grep -e processor | wc -l`

cd /root/Buildarea/linux-stable

for i in `seq 1500`; do
        cgcreate -g memory:kern_build
        echo 4G > /sys/fs/cgroup/memory/kern_build/memory.limit_in_bytes

        echo 3 > /proc/sys/vm/drop_caches
        cgexec -g memory:kern_build make clean > /dev/null 2>&1
        cgexec -g memory:kern_build make -j$NR_CPUS > /dev/null 2>&1

        cgdelete -g memory:kern_build
done

That would generate huge amount deferred objects due to __GFP_NOFS allocations.

Then run the below negative dentry generator script:

NR_CPUS=`cat /proc/cpuinfo | grep -e processor | wc -l`

mkdir /sys/fs/cgroup/memory/test
echo $$ > /sys/fs/cgroup/memory/test/tasks

for i in `seq $NR_CPUS`; do
        while true; do
                FILE=`head /dev/urandom | tr -dc A-Za-z0-9 | head -c 64`
                cat $FILE 2>/dev/null
        done &
done

Then kswapd will shrink half of dentry cache in just one loop as the below tracing result
showed:

	kswapd0-475   [028] .... 305968.252561: mm_shrink_slab_start: super_cache_scan+0x0/0x190 0000000024acf00c: nid: 0
objects to shrink 4994376020 gfp_flags GFP_KERNEL cache items 93689873 delta 45746 total_scan 46844936 priority 12
	kswapd0-475   [021] .... 306013.099399: mm_shrink_slab_end: super_cache_scan+0x0/0x190 0000000024acf00c: nid: 0 unused
scan count 4994376020 new scan count 4947576838 total_scan 8 last shrinker return val 46844928

There were huge deferred objects before the shrinker was called, the behavior does match the code
but it might be not desirable from the user's stand of point.

IIUC the deferred objects were used to make balance between slab and page cache, but since commit
9092c71bb724dba2ecba849eae69e5c9d39bd3d2 ("mm: use sc->priority for slab shrink targets") they
were decoupled.  And as that commit stated "these two things have nothing to do with each other".

So why do we have to still keep it around?  I can think of there might be huge slab accumulated
without taking into account deferred objects, but nowadays the most workloads are constrained by
memcg which could limit the usage of kmem (by default now), so it seems maintaining deferred
objects is not that useful anymore.  It seems we could remove it to simplify the shrinker logic
a lot.

I may overlook some other important usecases of nr_deferred, comments are much appreciated.

Comments

Dave Chinner Sept. 17, 2020, 2:37 a.m. UTC | #1
On Wed, Sep 16, 2020 at 11:58:21AM -0700, Yang Shi wrote:
> 
> Recently huge amount one-off slab drop was seen on some vfs metadata heavy workloads,
> it turned out there were huge amount accumulated nr_deferred objects seen by the
> shrinker.
> 
> I managed to reproduce this problem with kernel build workload plus negative dentry
> generator.
> 
> First step, run the below kernel build test script:
> 
> NR_CPUS=`cat /proc/cpuinfo | grep -e processor | wc -l`
> 
> cd /root/Buildarea/linux-stable
> 
> for i in `seq 1500`; do
>         cgcreate -g memory:kern_build
>         echo 4G > /sys/fs/cgroup/memory/kern_build/memory.limit_in_bytes
> 
>         echo 3 > /proc/sys/vm/drop_caches
>         cgexec -g memory:kern_build make clean > /dev/null 2>&1
>         cgexec -g memory:kern_build make -j$NR_CPUS > /dev/null 2>&1
> 
>         cgdelete -g memory:kern_build
> done
> 
> That would generate huge amount deferred objects due to __GFP_NOFS allocations.
> 
> Then run the below negative dentry generator script:
> 
> NR_CPUS=`cat /proc/cpuinfo | grep -e processor | wc -l`
> 
> mkdir /sys/fs/cgroup/memory/test
> echo $$ > /sys/fs/cgroup/memory/test/tasks
> 
> for i in `seq $NR_CPUS`; do
>         while true; do
>                 FILE=`head /dev/urandom | tr -dc A-Za-z0-9 | head -c 64`
>                 cat $FILE 2>/dev/null
>         done &
> done
> 
> Then kswapd will shrink half of dentry cache in just one loop as the below tracing result
> showed:
> 
> 	kswapd0-475   [028] .... 305968.252561: mm_shrink_slab_start: super_cache_scan+0x0/0x190 0000000024acf00c: nid: 0
> objects to shrink 4994376020 gfp_flags GFP_KERNEL cache items 93689873 delta 45746 total_scan 46844936 priority 12
> 	kswapd0-475   [021] .... 306013.099399: mm_shrink_slab_end: super_cache_scan+0x0/0x190 0000000024acf00c: nid: 0 unused
> scan count 4994376020 new scan count 4947576838 total_scan 8 last shrinker return val 46844928


You have 93M dentries and inodes in the cache, and the reclaim delta is 45746,
which is totally sane for a priority 12 reclaim priority. SO you've
basically had to do a couple of million GFP_NOFS direct reclaim
passes that were unable to reclaim anything to get to a point
where the deferred reclaim would up to 4.9 -billion- objects.

Basically, you would up the deferred work so far that it got out of
control before a GFP_KERNEL reclaim context could do anything to
bring it under control.

However, removing defered work is not the solution. If we don't
defer some of this reclaim work, then filesystem intensive workloads
-cannot reclaim memory from their own caches- when they need memory.
And when those caches largely dominate the used memory in the
machine, this will grind the filesystem workload to a halt.. Hence
this deferral mechanism is actually critical to keeping the
filesystem caches balanced with the rest of the system.

The behaviour you see is the windup clamping code triggering:

        /*
         * We need to avoid excessive windup on filesystem shrinkers
         * due to large numbers of GFP_NOFS allocations causing the
         * shrinkers to return -1 all the time. This results in a large
         * nr being built up so when a shrink that can do some work
         * comes along it empties the entire cache due to nr >>>
         * freeable. This is bad for sustaining a working set in
         * memory.
         *
         * Hence only allow the shrinker to scan the entire cache when
         * a large delta change is calculated directly.
         */
        if (delta < freeable / 4)
                total_scan = min(total_scan, freeable / 2);

It clamps the worst case freeing to half the cache, and that is
exactly what you are seeing. This, unfortunately, won't be enough to
fix the windup problem once it's spiralled out of control. It's
fairly rare for this to happen - it takes effort to find an adverse
workload that will cause windup like this.

So, with all that said, a year ago I actually fixed this problem
as part of some work I did to provide non-blocking inode reclaim
infrastructure in the shrinker for XFS inode reclaim.
See this patch:

https://lore.kernel.org/linux-xfs/20191031234618.15403-13-david@fromorbit.com/

It did two things. First it ensured all the deferred work was done
by kswapd so that some poor direct reclaim victim didn't hit a
massive reclaim latency spike because of windup. Secondly, it
clamped the maximum windup to the maximum single pass reclaim scan
limit, which is (freeable * 2) objects.

Finally it also changed the amount of deferred work a single kswapd
pass did to be directly proportional to the reclaim priority. Hence
as we get closer to OOM, kswapd tries much harder to get the
deferred work backlog down to zero. This means that a single, low
priority reclaim pass will never reclaim half the cache - only
sustained memory pressure and _reclaim priority windup_ will do
that.

You probably want to look at all the shrinker infrastructure patches
in that series as the deferred work tracking and accounting changes
span a few patches in the series:

https://lore.kernel.org/linux-xfs/20191031234618.15403-1-david@fromorbit.com/

Unfortunately, none of the MM developers showed any interest in
these patches, so when I found a different solution to the XFS
problem it got dropped on the ground.

> So why do we have to still keep it around? 

Because we need a feedback mechanism to allow us to maintain control
of the size of filesystem caches that grow via GFP_NOFS allocations.

Cheers,

Dave.
Yang Shi Sept. 18, 2020, 12:12 a.m. UTC | #2
On Wed, Sep 16, 2020 at 7:37 PM Dave Chinner <david@fromorbit.com> wrote:
>
> On Wed, Sep 16, 2020 at 11:58:21AM -0700, Yang Shi wrote:
> >
> > Recently huge amount one-off slab drop was seen on some vfs metadata heavy workloads,
> > it turned out there were huge amount accumulated nr_deferred objects seen by the
> > shrinker.
> >
> > I managed to reproduce this problem with kernel build workload plus negative dentry
> > generator.
> >
> > First step, run the below kernel build test script:
> >
> > NR_CPUS=`cat /proc/cpuinfo | grep -e processor | wc -l`
> >
> > cd /root/Buildarea/linux-stable
> >
> > for i in `seq 1500`; do
> >         cgcreate -g memory:kern_build
> >         echo 4G > /sys/fs/cgroup/memory/kern_build/memory.limit_in_bytes
> >
> >         echo 3 > /proc/sys/vm/drop_caches
> >         cgexec -g memory:kern_build make clean > /dev/null 2>&1
> >         cgexec -g memory:kern_build make -j$NR_CPUS > /dev/null 2>&1
> >
> >         cgdelete -g memory:kern_build
> > done
> >
> > That would generate huge amount deferred objects due to __GFP_NOFS allocations.
> >
> > Then run the below negative dentry generator script:
> >
> > NR_CPUS=`cat /proc/cpuinfo | grep -e processor | wc -l`
> >
> > mkdir /sys/fs/cgroup/memory/test
> > echo $$ > /sys/fs/cgroup/memory/test/tasks
> >
> > for i in `seq $NR_CPUS`; do
> >         while true; do
> >                 FILE=`head /dev/urandom | tr -dc A-Za-z0-9 | head -c 64`
> >                 cat $FILE 2>/dev/null
> >         done &
> > done
> >
> > Then kswapd will shrink half of dentry cache in just one loop as the below tracing result
> > showed:
> >
> >       kswapd0-475   [028] .... 305968.252561: mm_shrink_slab_start: super_cache_scan+0x0/0x190 0000000024acf00c: nid: 0
> > objects to shrink 4994376020 gfp_flags GFP_KERNEL cache items 93689873 delta 45746 total_scan 46844936 priority 12
> >       kswapd0-475   [021] .... 306013.099399: mm_shrink_slab_end: super_cache_scan+0x0/0x190 0000000024acf00c: nid: 0 unused
> > scan count 4994376020 new scan count 4947576838 total_scan 8 last shrinker return val 46844928
>
>
> You have 93M dentries and inodes in the cache, and the reclaim delta is 45746,
> which is totally sane for a priority 12 reclaim priority. SO you've
> basically had to do a couple of million GFP_NOFS direct reclaim
> passes that were unable to reclaim anything to get to a point
> where the deferred reclaim would up to 4.9 -billion- objects.
>
> Basically, you would up the deferred work so far that it got out of
> control before a GFP_KERNEL reclaim context could do anything to
> bring it under control.
>
> However, removing defered work is not the solution. If we don't
> defer some of this reclaim work, then filesystem intensive workloads
> -cannot reclaim memory from their own caches- when they need memory.
> And when those caches largely dominate the used memory in the
> machine, this will grind the filesystem workload to a halt.. Hence
> this deferral mechanism is actually critical to keeping the
> filesystem caches balanced with the rest of the system.

Yes, I agree there might be imbalance if vfs caches shrinkers can't
keep up due to excessive __GFP_NOFS allocations.

>
> The behaviour you see is the windup clamping code triggering:
>
>         /*
>          * We need to avoid excessive windup on filesystem shrinkers
>          * due to large numbers of GFP_NOFS allocations causing the
>          * shrinkers to return -1 all the time. This results in a large
>          * nr being built up so when a shrink that can do some work
>          * comes along it empties the entire cache due to nr >>>
>          * freeable. This is bad for sustaining a working set in
>          * memory.
>          *
>          * Hence only allow the shrinker to scan the entire cache when
>          * a large delta change is calculated directly.
>          */
>         if (delta < freeable / 4)
>                 total_scan = min(total_scan, freeable / 2);
>
> It clamps the worst case freeing to half the cache, and that is
> exactly what you are seeing. This, unfortunately, won't be enough to
> fix the windup problem once it's spiralled out of control. It's
> fairly rare for this to happen - it takes effort to find an adverse
> workload that will cause windup like this.

I'm not sure if it is very rare, but my reproducer definitely could
generate huge amount of deferred objects easily. In addition it might
be easier to run into this case with hundreds of memcgs. Just imaging
hundreds memcgs run limit reclaims with __GFP_NOFS, the amount of
deferred objects can be built up easily.

On our production machine, I saw much more absurd deferred objects,
check the below tracing result out:

<...>-48776 [032] .... 27970562.458916: mm_shrink_slab_start:
super_cache_scan+0x0/0x1a0 ffff9a83046f3458: nid: 0 objects to shrink
2531805877005 gfp_flags GFP_HIGHUSER_MOVABLE pgs_scanned 32 lru_pgs
9300 cache items 1667 delta 11 total_scan 833

There are 2.5 trillion deferred objects on one node! So total > 5 trillion!

> So, with all that said, a year ago I actually fixed this problem
> as part of some work I did to provide non-blocking inode reclaim
> infrastructure in the shrinker for XFS inode reclaim.
> See this patch:
>
> https://lore.kernel.org/linux-xfs/20191031234618.15403-13-david@fromorbit.com/

Thanks for this. I remembered the patches, but I admitted I was not
aware deferred objects could go wild like that.

>
> It did two things. First it ensured all the deferred work was done
> by kswapd so that some poor direct reclaim victim didn't hit a
> massive reclaim latency spike because of windup. Secondly, it
> clamped the maximum windup to the maximum single pass reclaim scan
> limit, which is (freeable * 2) objects.
>
> Finally it also changed the amount of deferred work a single kswapd
> pass did to be directly proportional to the reclaim priority. Hence
> as we get closer to OOM, kswapd tries much harder to get the
> deferred work backlog down to zero. This means that a single, low
> priority reclaim pass will never reclaim half the cache - only
> sustained memory pressure and _reclaim priority windup_ will do
> that.

Other than these, there are more problems:

- The amount of deferred objects seem get significantly overestimated
and unbounded. For example, if one lru has 1000 objects, the amount of
reclaimables is bounded to 1000, but the amount of deferred is not. It
may go much bigger than 1000, right? As the above tracing result
shows, 2.5 trillion deferred objects on one node, assuming all of them
are dentry (192 bytes per object), so the total size of deferred on
one node is ~480TB! Or this is a bug?

- The deferred will be reset by the reclaimer who gets there first,
then other concurrent reclaimers just see 0 or very few deferred
objects. So the clamp may not happen on the lrus which have most
objects. For example, memcg A's dentry lru has 1000 objects, memcg B's
dentry lru has 1 million objects, but memcg A's limit reclaim is run
first, then just 500 was clamped.

- Currently the deferred objects are account per shrinker, it sounds
not very fair, particularly given the environment with hundreds of
memcgs. Some memcgs may not do a lot __GFP_NOFS allocations, but the
clamp may hit them. So, off the top of my head, I'm wondering whether
it sounds better to have deferred per-memcg, so who generates deferred
who gets punished.

- Some workloads, i.e. git server, don't want that clamp behavior or
wish it goes more mild. For example, the ratio between vfs caches and
page caches is 10:1 on some our production servers.

- Waiting for kswapd to clamp those deferred may be too late, and it
may not be able to drive deferred down to a reasonable number at all.
IMHO avoiding the amount of deferred objects goes out of control at
the first place may be much more important.

>
> You probably want to look at all the shrinker infrastructure patches
> in that series as the deferred work tracking and accounting changes
> span a few patches in the series:
>
> https://lore.kernel.org/linux-xfs/20191031234618.15403-1-david@fromorbit.com/

Yes, I will take a closer look. I do appreciate all the comments.

>
> Unfortunately, none of the MM developers showed any interest in
> these patches, so when I found a different solution to the XFS
> problem it got dropped on the ground.
>
> > So why do we have to still keep it around?
>
> Because we need a feedback mechanism to allow us to maintain control
> of the size of filesystem caches that grow via GFP_NOFS allocations.
>
> Cheers,
>
> Dave.
> --
> Dave Chinner
> david@fromorbit.com
Dave Chinner Sept. 21, 2020, 12:32 a.m. UTC | #3
On Thu, Sep 17, 2020 at 05:12:08PM -0700, Yang Shi wrote:
> On Wed, Sep 16, 2020 at 7:37 PM Dave Chinner <david@fromorbit.com> wrote:
> > On Wed, Sep 16, 2020 at 11:58:21AM -0700, Yang Shi wrote:
> > It clamps the worst case freeing to half the cache, and that is
> > exactly what you are seeing. This, unfortunately, won't be enough to
> > fix the windup problem once it's spiralled out of control. It's
> > fairly rare for this to happen - it takes effort to find an adverse
> > workload that will cause windup like this.
> 
> I'm not sure if it is very rare, but my reproducer definitely could
> generate huge amount of deferred objects easily. In addition it might
> be easier to run into this case with hundreds of memcgs. Just imaging
> hundreds memcgs run limit reclaims with __GFP_NOFS, the amount of
> deferred objects can be built up easily.

This is the first time I've seen a report that indicates excessive
wind-up is occurring in years. That definitely makes it a rare
problem in the real world.

> On our production machine, I saw much more absurd deferred objects,
> check the below tracing result out:
> 
> <...>-48776 [032] .... 27970562.458916: mm_shrink_slab_start:
> super_cache_scan+0x0/0x1a0 ffff9a83046f3458: nid: 0 objects to shrink
> 2531805877005 gfp_flags GFP_HIGHUSER_MOVABLE pgs_scanned 32 lru_pgs
> 9300 cache items 1667 delta 11 total_scan 833
> 
> There are 2.5 trillion deferred objects on one node! So total > 5 trillion!

Sure, I'm not saying it's impossible to trigger, just that there are
not many common workloads that actually cause it to occur. And,
really, if it's wound up that far before you've noticed a problem,
then wind-up itself isn't typically a serious problem for
systems....

> > So, with all that said, a year ago I actually fixed this problem
> > as part of some work I did to provide non-blocking inode reclaim
> > infrastructure in the shrinker for XFS inode reclaim.
> > See this patch:
> >
> > https://lore.kernel.org/linux-xfs/20191031234618.15403-13-david@fromorbit.com/
> 
> Thanks for this. I remembered the patches, but I admitted I was not
> aware deferred objects could go wild like that.

Not many people are....

> > It did two things. First it ensured all the deferred work was done
> > by kswapd so that some poor direct reclaim victim didn't hit a
> > massive reclaim latency spike because of windup. Secondly, it
> > clamped the maximum windup to the maximum single pass reclaim scan
> > limit, which is (freeable * 2) objects.
> >
> > Finally it also changed the amount of deferred work a single kswapd
> > pass did to be directly proportional to the reclaim priority. Hence
> > as we get closer to OOM, kswapd tries much harder to get the
> > deferred work backlog down to zero. This means that a single, low
> > priority reclaim pass will never reclaim half the cache - only
> > sustained memory pressure and _reclaim priority windup_ will do
> > that.
> 
> Other than these, there are more problems:
> 
> - The amount of deferred objects seem get significantly overestimated
> and unbounded. For example, if one lru has 1000 objects, the amount of
> reclaimables is bounded to 1000, but the amount of deferred is not. It
> may go much bigger than 1000, right? As the above tracing result
> shows, 2.5 trillion deferred objects on one node, assuming all of them
> are dentry (192 bytes per object), so the total size of deferred on
> one node is ~480TB! Or this is a bug?

As the above patchset points out: it can get out of control because
it is unbounded. The above patchset bounds the deferred work to (2 *
current cache item count) and so it cannot ever spiral out of
control like this.

> - The deferred will be reset by the reclaimer who gets there first,
> then other concurrent reclaimers just see 0 or very few deferred
> objects.

No, not exactly.

The current behaviour is that the deferred count is drained by the
current shrinker context, then it does whatever work it can, then it
puts the remainder of the work that was not done back on the
deferred count. This was done so that only a single reclaim context
tried to execute the deferred work (i.e. to prevent the deferred
work being run multiple times by concurrent reclaim contexts), but
if the work didn't get done it was still accounted and would get
done later.

A side effect of this was that nothing ever zeros the deferred
count, however, because there is no serialisation between concurrent
shrinker contexts. That's why it can wind up if the number of
GFP_NOFS reclaim contexts greatly exceeds the number of GFP_KERNEL
reclaim contexts.

This is what the above patchset fixes - deferred work is only ever
done by kswapd(), which means it doesn't have to care about multiple
reclaim contexts doing deferred work. This simplifies it right down,
and it allows us to bound the quantity of deferred work as a single
reclaimer will be doing it all...

> So the clamp may not happen on the lrus which have most
> objects. For example, memcg A's dentry lru has 1000 objects, memcg B's
> dentry lru has 1 million objects, but memcg A's limit reclaim is run
> first, then just 500 was clamped.

Yup, that's a memcg bug. memcg's were grafted onto the side of the
shrinker infrastructure, and one of the parts of the shrinker
behaviour that was not made per-memcg was the amount of work
deferred from the per-memcg shrinker invocation. If you want memcgs
to behave correctly w.r.t. work deferred inside a specific memcg
shrinker context, then the deferred work accounting needs to be made
per-memcg, not just a global per-node count.

The first step to doing this, however, is fixing up the problems we
currently have with deferred work, and that is the patchset I
pointed you to above. We have to push the defered work to the kswapd
context so that it can process all the deferred work for all of the
memcgs in the system in a single reclaim context; if the memcg is
just doing GFP_NOFS allocations, then just deferring the work to the
next GFP_KERNEL direct reclaim that specific memcg enters is not
going to be sufficient.

> - Currently the deferred objects are account per shrinker, it sounds
> not very fair, particularly given the environment with hundreds of
> memcgs. Some memcgs may not do a lot __GFP_NOFS allocations, but the
> clamp may hit them. So, off the top of my head, I'm wondering whether
> it sounds better to have deferred per-memcg, so who generates deferred
> who gets punished.

Yup, see above.

> - Some workloads, i.e. git server, don't want that clamp behavior or
> wish it goes more mild. For example, the ratio between vfs caches and
> page caches is 10:1 on some our production servers.

The overall system cache balancing has nothing to do with deferred
work clamping. The deferred work mechanism is there to make sure
unrealised reclaim pressure is fed back into the reclaim subsystem
to tell it it needs to do more work...

> - Waiting for kswapd to clamp those deferred may be too late, and it
> may not be able to drive deferred down to a reasonable number at all.
> IMHO avoiding the amount of deferred objects goes out of control at
> the first place may be much more important.

kswapd is the only guaranteed reclaim context that can make
progress on deferred work. Windup is an indications that it hasn't
been kicked soon enough. One of the advantages of deferring work to
kswapd is that now we have a -algorithmic trigger- for shrinker
reclaim contexts kicking kswapd sooner than we currently do. e.g. if
the deferred work reaches 1/4 the size of the current cache, kick
kswapd to start doing the work we are deferring. This might require
marking memcgs and shrinkers as "needing deferred work" similar to
how we currently mark memcg shrinkers as "containing shrinkable
items" so that we can run kswapd quickly on just the memcgs/shrinker
contexts that need deferred work to be done....

Cheers,

Dave.
Yang Shi Sept. 22, 2020, 11:45 p.m. UTC | #4
On Sun, Sep 20, 2020 at 5:32 PM Dave Chinner <david@fromorbit.com> wrote:
>
> On Thu, Sep 17, 2020 at 05:12:08PM -0700, Yang Shi wrote:
> > On Wed, Sep 16, 2020 at 7:37 PM Dave Chinner <david@fromorbit.com> wrote:
> > > On Wed, Sep 16, 2020 at 11:58:21AM -0700, Yang Shi wrote:
> > > It clamps the worst case freeing to half the cache, and that is
> > > exactly what you are seeing. This, unfortunately, won't be enough to
> > > fix the windup problem once it's spiralled out of control. It's
> > > fairly rare for this to happen - it takes effort to find an adverse
> > > workload that will cause windup like this.
> >
> > I'm not sure if it is very rare, but my reproducer definitely could
> > generate huge amount of deferred objects easily. In addition it might
> > be easier to run into this case with hundreds of memcgs. Just imaging
> > hundreds memcgs run limit reclaims with __GFP_NOFS, the amount of
> > deferred objects can be built up easily.
>
> This is the first time I've seen a report that indicates excessive
> wind-up is occurring in years. That definitely makes it a rare
> problem in the real world.
>
> > On our production machine, I saw much more absurd deferred objects,
> > check the below tracing result out:
> >
> > <...>-48776 [032] .... 27970562.458916: mm_shrink_slab_start:
> > super_cache_scan+0x0/0x1a0 ffff9a83046f3458: nid: 0 objects to shrink
> > 2531805877005 gfp_flags GFP_HIGHUSER_MOVABLE pgs_scanned 32 lru_pgs
> > 9300 cache items 1667 delta 11 total_scan 833
> >
> > There are 2.5 trillion deferred objects on one node! So total > 5 trillion!
>
> Sure, I'm not saying it's impossible to trigger, just that there are
> not many common workloads that actually cause it to occur. And,
> really, if it's wound up that far before you've noticed a problem,
> then wind-up itself isn't typically a serious problem for
> systems....

Actually the problem was observed some time ago, I just got some time
to look into the root cause.

This kind of problem may be more common with memcg environment. For
example, a misconfigured memcg may incur excessive __GFP_NOFS limit
reclaims.

>
> > > So, with all that said, a year ago I actually fixed this problem
> > > as part of some work I did to provide non-blocking inode reclaim
> > > infrastructure in the shrinker for XFS inode reclaim.
> > > See this patch:
> > >
> > > https://lore.kernel.org/linux-xfs/20191031234618.15403-13-david@fromorbit.com/
> >
> > Thanks for this. I remembered the patches, but I admitted I was not
> > aware deferred objects could go wild like that.
>
> Not many people are....
>
> > > It did two things. First it ensured all the deferred work was done
> > > by kswapd so that some poor direct reclaim victim didn't hit a
> > > massive reclaim latency spike because of windup. Secondly, it
> > > clamped the maximum windup to the maximum single pass reclaim scan
> > > limit, which is (freeable * 2) objects.
> > >
> > > Finally it also changed the amount of deferred work a single kswapd
> > > pass did to be directly proportional to the reclaim priority. Hence
> > > as we get closer to OOM, kswapd tries much harder to get the
> > > deferred work backlog down to zero. This means that a single, low
> > > priority reclaim pass will never reclaim half the cache - only
> > > sustained memory pressure and _reclaim priority windup_ will do
> > > that.
> >
> > Other than these, there are more problems:
> >
> > - The amount of deferred objects seem get significantly overestimated
> > and unbounded. For example, if one lru has 1000 objects, the amount of
> > reclaimables is bounded to 1000, but the amount of deferred is not. It
> > may go much bigger than 1000, right? As the above tracing result
> > shows, 2.5 trillion deferred objects on one node, assuming all of them
> > are dentry (192 bytes per object), so the total size of deferred on
> > one node is ~480TB! Or this is a bug?
>
> As the above patchset points out: it can get out of control because
> it is unbounded. The above patchset bounds the deferred work to (2 *
> current cache item count) and so it cannot ever spiral out of
> control like this.

I was thinking about cap it to (2 * freeable) too before I looked into
your patches :-)

>
> > - The deferred will be reset by the reclaimer who gets there first,
> > then other concurrent reclaimers just see 0 or very few deferred
> > objects.
>
> No, not exactly.
>
> The current behaviour is that the deferred count is drained by the
> current shrinker context, then it does whatever work it can, then it
> puts the remainder of the work that was not done back on the
> deferred count. This was done so that only a single reclaim context
> tried to execute the deferred work (i.e. to prevent the deferred
> work being run multiple times by concurrent reclaim contexts), but
> if the work didn't get done it was still accounted and would get
> done later.

Yes, definitely. I should articulated it at the first place.

>
> A side effect of this was that nothing ever zeros the deferred
> count, however, because there is no serialisation between concurrent
> shrinker contexts. That's why it can wind up if the number of
> GFP_NOFS reclaim contexts greatly exceeds the number of GFP_KERNEL
> reclaim contexts.
>
> This is what the above patchset fixes - deferred work is only ever
> done by kswapd(), which means it doesn't have to care about multiple
> reclaim contexts doing deferred work. This simplifies it right down,
> and it allows us to bound the quantity of deferred work as a single
> reclaimer will be doing it all...
>
> > So the clamp may not happen on the lrus which have most
> > objects. For example, memcg A's dentry lru has 1000 objects, memcg B's
> > dentry lru has 1 million objects, but memcg A's limit reclaim is run
> > first, then just 500 was clamped.
>
> Yup, that's a memcg bug. memcg's were grafted onto the side of the
> shrinker infrastructure, and one of the parts of the shrinker
> behaviour that was not made per-memcg was the amount of work
> deferred from the per-memcg shrinker invocation. If you want memcgs
> to behave correctly w.r.t. work deferred inside a specific memcg
> shrinker context, then the deferred work accounting needs to be made
> per-memcg, not just a global per-node count.
>
> The first step to doing this, however, is fixing up the problems we
> currently have with deferred work, and that is the patchset I
> pointed you to above. We have to push the defered work to the kswapd
> context so that it can process all the deferred work for all of the
> memcgs in the system in a single reclaim context; if the memcg is
> just doing GFP_NOFS allocations, then just deferring the work to the
> next GFP_KERNEL direct reclaim that specific memcg enters is not
> going to be sufficient.

But kswapd may be not called in some cases at all. For example, the
system may have some memcgs configured, every memcg reaches its limit
and does limit reclaim, but the global memory usage is not high enough
to wake up kswapd. The deferred objects can get windup, and limit
reclaim can't bring it down under control.

By making nr_deferred per memcg, memcg limit reclaim could bring the
deferred objects under control.

>
> > - Currently the deferred objects are account per shrinker, it sounds
> > not very fair, particularly given the environment with hundreds of
> > memcgs. Some memcgs may not do a lot __GFP_NOFS allocations, but the
> > clamp may hit them. So, off the top of my head, I'm wondering whether
> > it sounds better to have deferred per-memcg, so who generates deferred
> > who gets punished.
>
> Yup, see above.
>
> > - Some workloads, i.e. git server, don't want that clamp behavior or
> > wish it goes more mild. For example, the ratio between vfs caches and
> > page caches is 10:1 on some our production servers.
>
> The overall system cache balancing has nothing to do with deferred
> work clamping. The deferred work mechanism is there to make sure
> unrealised reclaim pressure is fed back into the reclaim subsystem
> to tell it it needs to do more work...
>
> > - Waiting for kswapd to clamp those deferred may be too late, and it
> > may not be able to drive deferred down to a reasonable number at all.
> > IMHO avoiding the amount of deferred objects goes out of control at
> > the first place may be much more important.
>
> kswapd is the only guaranteed reclaim context that can make
> progress on deferred work. Windup is an indications that it hasn't
> been kicked soon enough. One of the advantages of deferring work to
> kswapd is that now we have a -algorithmic trigger- for shrinker
> reclaim contexts kicking kswapd sooner than we currently do. e.g. if
> the deferred work reaches 1/4 the size of the current cache, kick
> kswapd to start doing the work we are deferring. This might require
> marking memcgs and shrinkers as "needing deferred work" similar to
> how we currently mark memcg shrinkers as "containing shrinkable
> items" so that we can run kswapd quickly on just the memcgs/shrinker
> contexts that need deferred work to be done....

This seems feasible, but it sounds like we need introduce another "watermark".

IMHO we could make shrinker behave more fair among memcgs and keep
deferred objects under control just by making nr_deferred per memcg
and capping nr_deferred to (2 * freeable) or whatever reasonable
number.

Both kswapd and global direct reclaim would traverse all memcgs and
they get nr_deferred from each memcg, they can guarantee all memcgs
get shrunk at a fair rate. We could shrink harder in kswapd, but both
kswapd and direct reclaim should do shrink according to priority. This
should be able to mitigate direct reclaim latency.

Limit reclaim would traverse all memcgs under reclaim root, it could
help keep deferred objects under control for "limit reclaim only"
case.

>
> Cheers,
>
> Dave.
> --
> Dave Chinner
> david@fromorbit.com
Yang Shi Sept. 26, 2020, 8:31 p.m. UTC | #5
Hi Dave,

I was exploring to make the "nr_deferred" per memcg. I looked into and
had some prototypes for two approaches so far:
1. Have per memcg data structure for each memcg aware shrinker, just
like what shrinker_map does.
2. Have "nr_deferred" on list_lru_one for memcg aware lists.

Both seem feasible, however the latter one looks much cleaner, I just
need to add two new APIs for shrinker which gets and sets
"nr_deferred" respectively. And, just memcg aware shrinkers need
define those two callouts. We just need to care about memcg aware
shrinkers, and the most memcg aware shrinkers (inode/dentry, nfs and
workingset) use list_lru, so I'd prefer the latter one.

But there are two memcg aware shrinkers are not that straightforward
to deal with:
1. The deferred split THP. It doesn't use list_lru, but actually I
don't worry about this one, since it is not cache just some partial
unmapped THPs. I could try to convert it to use list_lru later on or
just kill deferred split by making vmscan split partial unmapped THPs.
So TBH I don't think it is a blocker.
2. The fs_objects. This one looks weird. It shares the same shrinker
with inode/dentry. The only user is XFS currently. But it looks it is
not really memcg aware at all, right? They are managed by radix tree
which is not per memcg by looking into xfs code, so the "per list_lru
nr_deferred" can't work for it. I thought of a couple of ways to
tackle it off the top of my head:
    A. Just ignore it. If the amount of fs_objects are negligible
comparing to inode/dentry, then I think it can be just ignored and
kept it as is.
    B. Move it out of inode/dentry shrinker. Add a dedicated shrinker
for it, for example, sb->s_fs_obj_shrink.
    C. Make it really memcg aware and use list_lru.

I don't have any experience on XFS code, #C seems the most optimal,
but should be the most time consuming, I'm not sure if it is worth it
or not. So, #B sounds more preferred IMHO.

Advice is much appreciated. Thanks.


On Tue, Sep 22, 2020 at 4:45 PM Yang Shi <shy828301@gmail.com> wrote:
>
> On Sun, Sep 20, 2020 at 5:32 PM Dave Chinner <david@fromorbit.com> wrote:
> >
> > On Thu, Sep 17, 2020 at 05:12:08PM -0700, Yang Shi wrote:
> > > On Wed, Sep 16, 2020 at 7:37 PM Dave Chinner <david@fromorbit.com> wrote:
> > > > On Wed, Sep 16, 2020 at 11:58:21AM -0700, Yang Shi wrote:
> > > > It clamps the worst case freeing to half the cache, and that is
> > > > exactly what you are seeing. This, unfortunately, won't be enough to
> > > > fix the windup problem once it's spiralled out of control. It's
> > > > fairly rare for this to happen - it takes effort to find an adverse
> > > > workload that will cause windup like this.
> > >
> > > I'm not sure if it is very rare, but my reproducer definitely could
> > > generate huge amount of deferred objects easily. In addition it might
> > > be easier to run into this case with hundreds of memcgs. Just imaging
> > > hundreds memcgs run limit reclaims with __GFP_NOFS, the amount of
> > > deferred objects can be built up easily.
> >
> > This is the first time I've seen a report that indicates excessive
> > wind-up is occurring in years. That definitely makes it a rare
> > problem in the real world.
> >
> > > On our production machine, I saw much more absurd deferred objects,
> > > check the below tracing result out:
> > >
> > > <...>-48776 [032] .... 27970562.458916: mm_shrink_slab_start:
> > > super_cache_scan+0x0/0x1a0 ffff9a83046f3458: nid: 0 objects to shrink
> > > 2531805877005 gfp_flags GFP_HIGHUSER_MOVABLE pgs_scanned 32 lru_pgs
> > > 9300 cache items 1667 delta 11 total_scan 833
> > >
> > > There are 2.5 trillion deferred objects on one node! So total > 5 trillion!
> >
> > Sure, I'm not saying it's impossible to trigger, just that there are
> > not many common workloads that actually cause it to occur. And,
> > really, if it's wound up that far before you've noticed a problem,
> > then wind-up itself isn't typically a serious problem for
> > systems....
>
> Actually the problem was observed some time ago, I just got some time
> to look into the root cause.
>
> This kind of problem may be more common with memcg environment. For
> example, a misconfigured memcg may incur excessive __GFP_NOFS limit
> reclaims.
>
> >
> > > > So, with all that said, a year ago I actually fixed this problem
> > > > as part of some work I did to provide non-blocking inode reclaim
> > > > infrastructure in the shrinker for XFS inode reclaim.
> > > > See this patch:
> > > >
> > > > https://lore.kernel.org/linux-xfs/20191031234618.15403-13-david@fromorbit.com/
> > >
> > > Thanks for this. I remembered the patches, but I admitted I was not
> > > aware deferred objects could go wild like that.
> >
> > Not many people are....
> >
> > > > It did two things. First it ensured all the deferred work was done
> > > > by kswapd so that some poor direct reclaim victim didn't hit a
> > > > massive reclaim latency spike because of windup. Secondly, it
> > > > clamped the maximum windup to the maximum single pass reclaim scan
> > > > limit, which is (freeable * 2) objects.
> > > >
> > > > Finally it also changed the amount of deferred work a single kswapd
> > > > pass did to be directly proportional to the reclaim priority. Hence
> > > > as we get closer to OOM, kswapd tries much harder to get the
> > > > deferred work backlog down to zero. This means that a single, low
> > > > priority reclaim pass will never reclaim half the cache - only
> > > > sustained memory pressure and _reclaim priority windup_ will do
> > > > that.
> > >
> > > Other than these, there are more problems:
> > >
> > > - The amount of deferred objects seem get significantly overestimated
> > > and unbounded. For example, if one lru has 1000 objects, the amount of
> > > reclaimables is bounded to 1000, but the amount of deferred is not. It
> > > may go much bigger than 1000, right? As the above tracing result
> > > shows, 2.5 trillion deferred objects on one node, assuming all of them
> > > are dentry (192 bytes per object), so the total size of deferred on
> > > one node is ~480TB! Or this is a bug?
> >
> > As the above patchset points out: it can get out of control because
> > it is unbounded. The above patchset bounds the deferred work to (2 *
> > current cache item count) and so it cannot ever spiral out of
> > control like this.
>
> I was thinking about cap it to (2 * freeable) too before I looked into
> your patches :-)
>
> >
> > > - The deferred will be reset by the reclaimer who gets there first,
> > > then other concurrent reclaimers just see 0 or very few deferred
> > > objects.
> >
> > No, not exactly.
> >
> > The current behaviour is that the deferred count is drained by the
> > current shrinker context, then it does whatever work it can, then it
> > puts the remainder of the work that was not done back on the
> > deferred count. This was done so that only a single reclaim context
> > tried to execute the deferred work (i.e. to prevent the deferred
> > work being run multiple times by concurrent reclaim contexts), but
> > if the work didn't get done it was still accounted and would get
> > done later.
>
> Yes, definitely. I should articulated it at the first place.
>
> >
> > A side effect of this was that nothing ever zeros the deferred
> > count, however, because there is no serialisation between concurrent
> > shrinker contexts. That's why it can wind up if the number of
> > GFP_NOFS reclaim contexts greatly exceeds the number of GFP_KERNEL
> > reclaim contexts.
> >
> > This is what the above patchset fixes - deferred work is only ever
> > done by kswapd(), which means it doesn't have to care about multiple
> > reclaim contexts doing deferred work. This simplifies it right down,
> > and it allows us to bound the quantity of deferred work as a single
> > reclaimer will be doing it all...
> >
> > > So the clamp may not happen on the lrus which have most
> > > objects. For example, memcg A's dentry lru has 1000 objects, memcg B's
> > > dentry lru has 1 million objects, but memcg A's limit reclaim is run
> > > first, then just 500 was clamped.
> >
> > Yup, that's a memcg bug. memcg's were grafted onto the side of the
> > shrinker infrastructure, and one of the parts of the shrinker
> > behaviour that was not made per-memcg was the amount of work
> > deferred from the per-memcg shrinker invocation. If you want memcgs
> > to behave correctly w.r.t. work deferred inside a specific memcg
> > shrinker context, then the deferred work accounting needs to be made
> > per-memcg, not just a global per-node count.
> >
> > The first step to doing this, however, is fixing up the problems we
> > currently have with deferred work, and that is the patchset I
> > pointed you to above. We have to push the defered work to the kswapd
> > context so that it can process all the deferred work for all of the
> > memcgs in the system in a single reclaim context; if the memcg is
> > just doing GFP_NOFS allocations, then just deferring the work to the
> > next GFP_KERNEL direct reclaim that specific memcg enters is not
> > going to be sufficient.
>
> But kswapd may be not called in some cases at all. For example, the
> system may have some memcgs configured, every memcg reaches its limit
> and does limit reclaim, but the global memory usage is not high enough
> to wake up kswapd. The deferred objects can get windup, and limit
> reclaim can't bring it down under control.
>
> By making nr_deferred per memcg, memcg limit reclaim could bring the
> deferred objects under control.
>
> >
> > > - Currently the deferred objects are account per shrinker, it sounds
> > > not very fair, particularly given the environment with hundreds of
> > > memcgs. Some memcgs may not do a lot __GFP_NOFS allocations, but the
> > > clamp may hit them. So, off the top of my head, I'm wondering whether
> > > it sounds better to have deferred per-memcg, so who generates deferred
> > > who gets punished.
> >
> > Yup, see above.
> >
> > > - Some workloads, i.e. git server, don't want that clamp behavior or
> > > wish it goes more mild. For example, the ratio between vfs caches and
> > > page caches is 10:1 on some our production servers.
> >
> > The overall system cache balancing has nothing to do with deferred
> > work clamping. The deferred work mechanism is there to make sure
> > unrealised reclaim pressure is fed back into the reclaim subsystem
> > to tell it it needs to do more work...
> >
> > > - Waiting for kswapd to clamp those deferred may be too late, and it
> > > may not be able to drive deferred down to a reasonable number at all.
> > > IMHO avoiding the amount of deferred objects goes out of control at
> > > the first place may be much more important.
> >
> > kswapd is the only guaranteed reclaim context that can make
> > progress on deferred work. Windup is an indications that it hasn't
> > been kicked soon enough. One of the advantages of deferring work to
> > kswapd is that now we have a -algorithmic trigger- for shrinker
> > reclaim contexts kicking kswapd sooner than we currently do. e.g. if
> > the deferred work reaches 1/4 the size of the current cache, kick
> > kswapd to start doing the work we are deferring. This might require
> > marking memcgs and shrinkers as "needing deferred work" similar to
> > how we currently mark memcg shrinkers as "containing shrinkable
> > items" so that we can run kswapd quickly on just the memcgs/shrinker
> > contexts that need deferred work to be done....
>
> This seems feasible, but it sounds like we need introduce another "watermark".
>
> IMHO we could make shrinker behave more fair among memcgs and keep
> deferred objects under control just by making nr_deferred per memcg
> and capping nr_deferred to (2 * freeable) or whatever reasonable
> number.
>
> Both kswapd and global direct reclaim would traverse all memcgs and
> they get nr_deferred from each memcg, they can guarantee all memcgs
> get shrunk at a fair rate. We could shrink harder in kswapd, but both
> kswapd and direct reclaim should do shrink according to priority. This
> should be able to mitigate direct reclaim latency.
>
> Limit reclaim would traverse all memcgs under reclaim root, it could
> help keep deferred objects under control for "limit reclaim only"
> case.
>
> >
> > Cheers,
> >
> > Dave.
> > --
> > Dave Chinner
> > david@fromorbit.com
Dave Chinner Sept. 30, 2020, 7:31 a.m. UTC | #6
On Sat, Sep 26, 2020 at 01:31:36PM -0700, Yang Shi wrote:
> Hi Dave,
> 
> I was exploring to make the "nr_deferred" per memcg. I looked into and
> had some prototypes for two approaches so far:
> 1. Have per memcg data structure for each memcg aware shrinker, just
> like what shrinker_map does.
> 2. Have "nr_deferred" on list_lru_one for memcg aware lists.
> 
> Both seem feasible, however the latter one looks much cleaner, I just
> need to add two new APIs for shrinker which gets and sets
> "nr_deferred" respectively. And, just memcg aware shrinkers need
> define those two callouts. We just need to care about memcg aware
> shrinkers, and the most memcg aware shrinkers (inode/dentry, nfs and
> workingset) use list_lru, so I'd prefer the latter one.

The list_lru is completely separate from the shrinker context. The
structure that tracks objects in a subsystem is not visible or aware
of how the high level shrinker scanning algorithms work. Not to
mention that subsystem shrinkers can be memcg aware without using
list_lru structures to index objects owned by a given memcg. Hence I
really don't like the idea of tying the shrinker control data deeply
into subsystem cache indexing....


> But there are two memcg aware shrinkers are not that straightforward
> to deal with:
> 1. The deferred split THP. It doesn't use list_lru, but actually I
> don't worry about this one, since it is not cache just some partial
> unmapped THPs. I could try to convert it to use list_lru later on or
> just kill deferred split by making vmscan split partial unmapped THPs.
> So TBH I don't think it is a blocker.

What a fantastic abuse of the reclaim infrastructure. :/

First it was just defered work. Then it became NUMA_AWARE. THen it
became MEMCG_AWARE and....

Oh, man what a nasty hack that SHRINKER_NONSLAB flag is so that it
runs through shrink_slab_memcg() even when memcgs are configured in
but kmem tracking disabled. We have heaps of shrinkers that reclaim
from things that aren't slab caches, but this one is just nuts.

> 2. The fs_objects. This one looks weird. It shares the same shrinker
> with inode/dentry. The only user is XFS currently. But it looks it is
> not really memcg aware at all, right?

It most definitely is.

The VFS dentry and inode cache reclaim are memcg aware. The
fs_objects callout is for filesystem level object garbage collection
that can be done as a result of the dentry and inode caches being
reclaimed.

i.e. once the VFS has reclaimed the inode attached to the memcg, it
is no longer attached and accounted to the memcg anymore. It is
owned by the filesystem at this point, and it is entirely up to the
filesytem to when it can then be freed. Most filesystems do it in
the inode cache reclaim via the ->destroy method. XFS, OTOH, tags
freeable inodes in it's internal radix trees rather than freeing
them immediately because it still may have to clean the inode before
it can be freed. Hence we defer freeing of inodes until the
->fs_objects pass....

> They are managed by radix tree
> which is not per memcg by looking into xfs code, so the "per list_lru
> nr_deferred" can't work for it.  I thought of a couple of ways to
> tackle it off the top of my head:
>     A. Just ignore it. If the amount of fs_objects are negligible
> comparing to inode/dentry, then I think it can be just ignored and
> kept it as is.

Ah, no, they are not negliable. Under memory pressure, the number of
objects is typically 1/3rd dentries, 1/3rd VFS inodes, 1/3rd fs
objects to be reclaimed. The dentries and VFS inodes are owned by
VFS level caches and associated with memcgs, the fs_objects are only
visible to the filesystem.

>     B. Move it out of inode/dentry shrinker. Add a dedicated shrinker
> for it, for example, sb->s_fs_obj_shrink.

No, they are there because the reclaim has to be kept in exact
proportion to the dentry and inode reclaim quantities. That's the
reason I put that code there in the first place: a separate inode
filesystem cache shrinker just didn't work well at all.

>     C. Make it really memcg aware and use list_lru.

Two things. Firstly, objects are owned by the filesystem at this
point, not memcgs. Memcgs were detatched at the VFS inode reclaim
layer.

Secondly, list-lru does not scale well enough for the use needed by
XFS. We use radix trees so we can do lockless batch lookups and
IO-efficient inode-order reclaim passes. We also have concurrent
reclaim capabilities because of the lockless tag lookup walks.
Using a list_lru for this substantially reduces reclaim performance
and greatly increases CPU usage of reclaim because of contention on
the internal list lru locks. Been there, measured that....

> I don't have any experience on XFS code, #C seems the most optimal,
> but should be the most time consuming, I'm not sure if it is worth it
> or not. So, #B sounds more preferred IMHO.

I think you're going completely in the wrong direction. The problem
that needs solving is integrating shrinker scanning control state
with memcgs more tightly, not force every memcg aware shrinker to
use list_lru for their subsystem shrinker implementations....

Cheers,

Dave.
Yang Shi Sept. 30, 2020, 5:23 p.m. UTC | #7
On Wed, Sep 30, 2020 at 12:31 AM Dave Chinner <david@fromorbit.com> wrote:
>
> On Sat, Sep 26, 2020 at 01:31:36PM -0700, Yang Shi wrote:
> > Hi Dave,
> >
> > I was exploring to make the "nr_deferred" per memcg. I looked into and
> > had some prototypes for two approaches so far:
> > 1. Have per memcg data structure for each memcg aware shrinker, just
> > like what shrinker_map does.
> > 2. Have "nr_deferred" on list_lru_one for memcg aware lists.
> >
> > Both seem feasible, however the latter one looks much cleaner, I just
> > need to add two new APIs for shrinker which gets and sets
> > "nr_deferred" respectively. And, just memcg aware shrinkers need
> > define those two callouts. We just need to care about memcg aware
> > shrinkers, and the most memcg aware shrinkers (inode/dentry, nfs and
> > workingset) use list_lru, so I'd prefer the latter one.
>
> The list_lru is completely separate from the shrinker context. The
> structure that tracks objects in a subsystem is not visible or aware
> of how the high level shrinker scanning algorithms work. Not to
> mention that subsystem shrinkers can be memcg aware without using
> list_lru structures to index objects owned by a given memcg. Hence I
> really don't like the idea of tying the shrinker control data deeply
> into subsystem cache indexing....

I see your points. Yes, makes sense to me. The list_lru is a common
data structure and could be used by other subsystems, not only memcg
aware shrinkers.

Putting shrinker control data in list_lru seems break the layer. So,
option #1 might be more appropriate. The change looks like:

struct mem_cgroup_per_node {
...
        struct memcg_shrinker_map __rcu *shrinker_map;
+       struct memcg_shrinker_deferred __rcu    *shrinker_deferred;
...
}

>
>
> > But there are two memcg aware shrinkers are not that straightforward
> > to deal with:
> > 1. The deferred split THP. It doesn't use list_lru, but actually I
> > don't worry about this one, since it is not cache just some partial
> > unmapped THPs. I could try to convert it to use list_lru later on or
> > just kill deferred split by making vmscan split partial unmapped THPs.
> > So TBH I don't think it is a blocker.
>
> What a fantastic abuse of the reclaim infrastructure. :/
>
> First it was just defered work. Then it became NUMA_AWARE. THen it
> became MEMCG_AWARE and....
>
> Oh, man what a nasty hack that SHRINKER_NONSLAB flag is so that it
> runs through shrink_slab_memcg() even when memcgs are configured in
> but kmem tracking disabled. We have heaps of shrinkers that reclaim
> from things that aren't slab caches, but this one is just nuts.
>
> > 2. The fs_objects. This one looks weird. It shares the same shrinker
> > with inode/dentry. The only user is XFS currently. But it looks it is
> > not really memcg aware at all, right?
>
> It most definitely is.
>
> The VFS dentry and inode cache reclaim are memcg aware. The
> fs_objects callout is for filesystem level object garbage collection
> that can be done as a result of the dentry and inode caches being
> reclaimed.
>
> i.e. once the VFS has reclaimed the inode attached to the memcg, it
> is no longer attached and accounted to the memcg anymore. It is
> owned by the filesystem at this point, and it is entirely up to the
> filesytem to when it can then be freed. Most filesystems do it in
> the inode cache reclaim via the ->destroy method. XFS, OTOH, tags
> freeable inodes in it's internal radix trees rather than freeing
> them immediately because it still may have to clean the inode before
> it can be freed. Hence we defer freeing of inodes until the
> ->fs_objects pass....

Aha, thanks for elaborating. Now I see what it is doing for...

>
> > They are managed by radix tree
> > which is not per memcg by looking into xfs code, so the "per list_lru
> > nr_deferred" can't work for it.  I thought of a couple of ways to
> > tackle it off the top of my head:
> >     A. Just ignore it. If the amount of fs_objects are negligible
> > comparing to inode/dentry, then I think it can be just ignored and
> > kept it as is.
>
> Ah, no, they are not negliable. Under memory pressure, the number of
> objects is typically 1/3rd dentries, 1/3rd VFS inodes, 1/3rd fs
> objects to be reclaimed. The dentries and VFS inodes are owned by
> VFS level caches and associated with memcgs, the fs_objects are only
> visible to the filesystem.
>
> >     B. Move it out of inode/dentry shrinker. Add a dedicated shrinker
> > for it, for example, sb->s_fs_obj_shrink.
>
> No, they are there because the reclaim has to be kept in exact
> proportion to the dentry and inode reclaim quantities. That's the
> reason I put that code there in the first place: a separate inode
> filesystem cache shrinker just didn't work well at all.
>
> >     C. Make it really memcg aware and use list_lru.
>
> Two things. Firstly, objects are owned by the filesystem at this
> point, not memcgs. Memcgs were detatched at the VFS inode reclaim
> layer.
>
> Secondly, list-lru does not scale well enough for the use needed by
> XFS. We use radix trees so we can do lockless batch lookups and
> IO-efficient inode-order reclaim passes. We also have concurrent
> reclaim capabilities because of the lockless tag lookup walks.
> Using a list_lru for this substantially reduces reclaim performance
> and greatly increases CPU usage of reclaim because of contention on
> the internal list lru locks. Been there, measured that....
>
> > I don't have any experience on XFS code, #C seems the most optimal,
> > but should be the most time consuming, I'm not sure if it is worth it
> > or not. So, #B sounds more preferred IMHO.
>
> I think you're going completely in the wrong direction. The problem
> that needs solving is integrating shrinker scanning control state
> with memcgs more tightly, not force every memcg aware shrinker to
> use list_lru for their subsystem shrinker implementations....

Thanks a lot for all the elaboration and advice. Integrating shrinker
scanning control state with memcgs more tightly makes sense to me.

>
> Cheers,
>
> Dave.
> --
> Dave Chinner
> david@fromorbit.com