Message ID: 20231108065818.19932-1-link@vivo.com (mailing list archive)
Series: Introduce unbalance proactive reclaim
Huan Yang <link@vivo.com> writes: > In some cases, we need to selectively reclaim file pages or anonymous > pages in an unbalanced manner. > > For example, when an application is pushed to the background and frozen, > it may not be opened for a long time, and we can safely reclaim the > application's anonymous pages, but we do not want to touch the file pages. > > This patchset extends the proactive reclaim interface to achieve > unbalanced reclamation. Users can control the reclamation tendency by > inputting swappiness under the original interface. Specifically, users > can input special values to extremely reclaim specific pages. From mem_cgroup_swappiness(), cgroupv2 doesn't have per-cgroup swappiness. So you need to add that firstly? > Example: > echo "1G" 200 > memory.reclaim (only reclaim anon) > echo "1G" 0 > memory.reclaim (only reclaim file) > echo "1G" 1 > memory.reclaim (only reclaim file) > > Note that when performing unbalanced reclamation, the cgroup swappiness > will be temporarily adjusted dynamically to the input value. Therefore, > if the cgroup swappiness is further modified during runtime, there may > be some errors. If cgroup swappiness will be adjusted temporarily, why not just change it via a script before/after proactive reclaiming? > However, this is acceptable because the interface is dynamically called > by the user and the timing should be controlled by the user. > > This patchset did not implement the type-based reclamation as expected > in the documentation.(anon or file) Because in addition to extreme unbalanced > reclamation, this patchset can also adapt to the reclamation tendency > allocated according to swappiness, which is more flexible. > > Self test > ======== > After applying the following patches and myself debug patch, my self-test > results are as follows: > > 1. LRU test > =========== > a. Anon unbalance reclaim > ``` > cat memory.stat | grep anon > inactive_anon 7634944 > active_anon 7741440 > > echo "200M" 200 > memory.reclaim > > cat memory.stat | grep anon > inactive_anon 0 > active_anon 0 > > cat memory.reclaim_stat_summary(self debug interface) > [22368]sh total reclaimed 0 file, 3754 anon, covered item=0 > ``` > > b. File unbalance reclaim > ``` > cat memory.stat | grep file > inactive_file 82862080 > active_file 48664576 > > echo "100M" 0 > memory.reclaim > cat memory.stat | grep file > inactive_file 34164736 > active_file 18370560 > > cat memory.reclaim_stat_summary(self debug interface) > [22368]sh total reclaimed 13732 file, 0 anon, covered item=0 > ``` > > 2. MGLRU test > ============ > a. Anon unbalance reclaim > ``` > echo y > /sys/kernel/mm/lru_gen/enabled > cat /sys/kernel/mm/lru_gen/enabled > 0x0003 > > cat memory.stat | grep anon > inactive_anon 17653760 > active_anon 1740800 > > echo "100M" 200 > memory.reclaim > > cat memory.reclaim_stat_summary > [8251]sh total reclaimed 0 file, 5393 anon, covered item=0 > ``` > > b. File unbalance reclaim > ``` > cat memory.stat | grep file > inactive_file 17858560 > active_file 5943296 > > echo "100M" 0 > memory.reclaim > > cat memory.stat | grep file > inactive_file 491520 > active_file 2764800 > cat memory.reclaim_stat_summary > [8251]sh total reclaimed 5230 file, 0 anon, covered item=0 > ``` > > Patch 1-3 implement the functionality described above. > Patch 4 aims to implement proactive reclamation to the cgroupv1 interface > for use on Android. 
> > Huan Yang (4): > mm: vmscan: LRU unbalance cgroup reclaim > mm: multi-gen LRU: MGLRU unbalance reclaim > mm: memcg: implement unbalance proactive reclaim > mm: memcg: apply proactive reclaim into cgroupv1 We will not add new features to cgroupv1 in upstream. > .../admin-guide/cgroup-v1/memory.rst | 38 +++++- > Documentation/admin-guide/cgroup-v2.rst | 16 ++- > include/linux/swap.h | 1 + > mm/memcontrol.c | 126 ++++++++++++------ > mm/vmscan.c | 38 +++++- > 5 files changed, 169 insertions(+), 50 deletions(-) -- Best Regards, Huang, Ying
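For reference, the self-test flow shown in the cover letter (sample the anon counters from memory.stat, write a request to memory.reclaim, sample again) can be driven from a small userspace helper. The sketch below makes assumptions: the cgroup path is made up for illustration, and the two-value `"<size> <swappiness>"` request form exists only with this patchset applied; it is not upstream syntax.

```
/*
 * Sketch of the self-test flow above: sample the anon counters from
 * memory.stat, write a reclaim request, then sample again.  The cgroup
 * path is an assumption, and the "<size> <swappiness>" request form
 * exists only with this patchset applied; it is not upstream syntax.
 */
#include <stdio.h>
#include <string.h>

#define CG "/sys/fs/cgroup/test"	/* assumed cgroup for the test */

static void print_anon_counters(void)
{
	char line[256];
	FILE *f = fopen(CG "/memory.stat", "r");

	if (!f)
		return;
	while (fgets(line, sizeof(line), f))
		if (!strncmp(line, "inactive_anon ", 14) ||
		    !strncmp(line, "active_anon ", 12))
			fputs(line, stdout);
	fclose(f);
}

int main(void)
{
	FILE *f;

	print_anon_counters();

	/* Proposed form: reclaim 200M, biased entirely towards anon. */
	f = fopen(CG "/memory.reclaim", "w");
	if (!f) {
		perror("memory.reclaim");
		return 1;
	}
	fprintf(f, "200M 200\n");
	fclose(f);	/* the write completes when the reclaim attempt does */

	print_anon_counters();
	return 0;
}
```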
HI Huang, Ying Thanks for reply. 在 2023/11/8 15:35, Huang, Ying 写道: > Huan Yang <link@vivo.com> writes: > >> In some cases, we need to selectively reclaim file pages or anonymous >> pages in an unbalanced manner. >> >> For example, when an application is pushed to the background and frozen, >> it may not be opened for a long time, and we can safely reclaim the >> application's anonymous pages, but we do not want to touch the file pages. >> >> This patchset extends the proactive reclaim interface to achieve >> unbalanced reclamation. Users can control the reclamation tendency by >> inputting swappiness under the original interface. Specifically, users >> can input special values to extremely reclaim specific pages. > From mem_cgroup_swappiness(), cgroupv2 doesn't have per-cgroup > swappiness. So you need to add that firstly? Sorry for this mistake, we always work on cgroupv1, so, not notice this commit 4550c4e, thank your for point that. I see this commit comment that `that's a different discussion`, but, to implements this, I will try add. > >> Example: >> echo "1G" 200 > memory.reclaim (only reclaim anon) >> echo "1G" 0 > memory.reclaim (only reclaim file) >> echo "1G" 1 > memory.reclaim (only reclaim file) >> >> Note that when performing unbalanced reclamation, the cgroup swappiness >> will be temporarily adjusted dynamically to the input value. Therefore, >> if the cgroup swappiness is further modified during runtime, there may >> be some errors. > If cgroup swappiness will be adjusted temporarily, why not just change > it via a script before/after proactive reclaiming? IMO, this unbalance reclaim only takes effect for a single command, so if it is pre-set using a script, the judgment of the reclamation tendency may become complicated. So, do you mean avoid use cgroup swappiness, just type anon or file to control this extreme unbalanced reclamation? > >> However, this is acceptable because the interface is dynamically called >> by the user and the timing should be controlled by the user. >> >> This patchset did not implement the type-based reclamation as expected >> in the documentation.(anon or file) Because in addition to extreme unbalanced >> reclamation, this patchset can also adapt to the reclamation tendency >> allocated according to swappiness, which is more flexible. >> >> Self test >> ======== >> After applying the following patches and myself debug patch, my self-test >> results are as follows: >> >> 1. LRU test >> =========== >> a. Anon unbalance reclaim >> ``` >> cat memory.stat | grep anon >> inactive_anon 7634944 >> active_anon 7741440 >> >> echo "200M" 200 > memory.reclaim >> >> cat memory.stat | grep anon >> inactive_anon 0 >> active_anon 0 >> >> cat memory.reclaim_stat_summary(self debug interface) >> [22368]sh total reclaimed 0 file, 3754 anon, covered item=0 >> ``` >> >> b. File unbalance reclaim >> ``` >> cat memory.stat | grep file >> inactive_file 82862080 >> active_file 48664576 >> >> echo "100M" 0 > memory.reclaim >> cat memory.stat | grep file >> inactive_file 34164736 >> active_file 18370560 >> >> cat memory.reclaim_stat_summary(self debug interface) >> [22368]sh total reclaimed 13732 file, 0 anon, covered item=0 >> ``` >> >> 2. MGLRU test >> ============ >> a. 
Anon unbalance reclaim >> ``` >> echo y > /sys/kernel/mm/lru_gen/enabled >> cat /sys/kernel/mm/lru_gen/enabled >> 0x0003 >> >> cat memory.stat | grep anon >> inactive_anon 17653760 >> active_anon 1740800 >> >> echo "100M" 200 > memory.reclaim >> >> cat memory.reclaim_stat_summary >> [8251]sh total reclaimed 0 file, 5393 anon, covered item=0 >> ``` >> >> b. File unbalance reclaim >> ``` >> cat memory.stat | grep file >> inactive_file 17858560 >> active_file 5943296 >> >> echo "100M" 0 > memory.reclaim >> >> cat memory.stat | grep file >> inactive_file 491520 >> active_file 2764800 >> cat memory.reclaim_stat_summary >> [8251]sh total reclaimed 5230 file, 0 anon, covered item=0 >> ``` >> >> Patch 1-3 implement the functionality described above. >> Patch 4 aims to implement proactive reclamation to the cgroupv1 interface >> for use on Android. >> >> Huan Yang (4): >> mm: vmscan: LRU unbalance cgroup reclaim >> mm: multi-gen LRU: MGLRU unbalance reclaim >> mm: memcg: implement unbalance proactive reclaim >> mm: memcg: apply proactive reclaim into cgroupv1 > We will not add new features to cgroupv1 in upstream. Thx for point that. If this feature is worth further updating, the next patchset will remove this patch. > >> .../admin-guide/cgroup-v1/memory.rst | 38 +++++- >> Documentation/admin-guide/cgroup-v2.rst | 16 ++- >> include/linux/swap.h | 1 + >> mm/memcontrol.c | 126 ++++++++++++------ >> mm/vmscan.c | 38 +++++- >> 5 files changed, 169 insertions(+), 50 deletions(-) > -- > Best Regards, > Huang, Ying Thanks, Huan Yang
+Wei Xu +David Rientjes On Tue, Nov 7, 2023 at 10:59 PM Huan Yang <link@vivo.com> wrote: > > In some cases, we need to selectively reclaim file pages or anonymous > pages in an unbalanced manner. > > For example, when an application is pushed to the background and frozen, > it may not be opened for a long time, and we can safely reclaim the > application's anonymous pages, but we do not want to touch the file pages. > > This patchset extends the proactive reclaim interface to achieve > unbalanced reclamation. Users can control the reclamation tendency by > inputting swappiness under the original interface. Specifically, users > can input special values to extremely reclaim specific pages. I proposed this a while back: https://lore.kernel.org/linux-mm/CAJD7tkbDpyoODveCsnaqBBMZEkDvshXJmNdbk51yKSNgD7aGdg@mail.gmail.com/ The takeaway from the discussion was that swappiness is not the right way to do this. We can add separate arguments to specify types of memory to reclaim, as Roman suggested in that thread. I had some patches lying around to do that at some point, I can dig them up if that's helpful, but they are probably based on a very old kernel now, and before MGLRU landed. IIRC it wasn't very difficult, I think I added anon/file/shrinkers bits to struct scan_control and then plumbed them through to memory.reclaim. > > Example: > echo "1G" 200 > memory.reclaim (only reclaim anon) > echo "1G" 0 > memory.reclaim (only reclaim file) > echo "1G" 1 > memory.reclaim (only reclaim file) The type of interface here is nested-keyed, so if we add arguments they need to be in key=value format. Example: echo 1G swappiness=200 > memory.reclaim As I mentioned above though, I don't think swappiness is the right way of doing this. Also, without swappiness, I don't think there's a v1 vs v2 dilemma here. memory.reclaim can work as-is in cgroup v1, it just needs to be exposed there.
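To make the nested-keyed point concrete, here is a minimal sketch of how a userspace agent might issue such a request. The cgroup path is an assumption, and `swappiness=` is only the key=value shape Yosry says such an argument would have to take; it is not implied here to exist upstream. memory.reclaim itself returns -EAGAIN when less than the requested amount could be reclaimed.

```
/*
 * Sketch of issuing a nested-keyed request to memory.reclaim.  The
 * cgroup path and the "swappiness=" key are assumptions for
 * illustration; -EAGAIN from the write means only part of the
 * requested amount was reclaimed.
 */
#include <errno.h>
#include <fcntl.h>
#include <stdio.h>
#include <string.h>
#include <unistd.h>

static int memcg_reclaim(const char *cgroup, const char *request)
{
	char path[512];
	int fd, ret = 0;

	snprintf(path, sizeof(path), "%s/memory.reclaim", cgroup);
	fd = open(path, O_WRONLY);
	if (fd < 0)
		return -errno;

	if (write(fd, request, strlen(request)) < 0)
		ret = -errno;	/* -EAGAIN: request only partially satisfied */

	close(fd);
	return ret;
}

int main(void)
{
	/* Hypothetical cgroup and request string. */
	int err = memcg_reclaim("/sys/fs/cgroup/frozen-app", "1G swappiness=200");

	if (err)
		fprintf(stderr, "reclaim incomplete: %s\n", strerror(-err));
	return err ? 1 : 0;
}
```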
Huan Yang <link@vivo.com> writes: > HI Huang, Ying > > Thanks for reply. > > 在 2023/11/8 15:35, Huang, Ying 写道: >> Huan Yang <link@vivo.com> writes: >> >>> In some cases, we need to selectively reclaim file pages or anonymous >>> pages in an unbalanced manner. >>> >>> For example, when an application is pushed to the background and frozen, >>> it may not be opened for a long time, and we can safely reclaim the >>> application's anonymous pages, but we do not want to touch the file pages. >>> >>> This patchset extends the proactive reclaim interface to achieve >>> unbalanced reclamation. Users can control the reclamation tendency by >>> inputting swappiness under the original interface. Specifically, users >>> can input special values to extremely reclaim specific pages. >> From mem_cgroup_swappiness(), cgroupv2 doesn't have per-cgroup >> swappiness. So you need to add that firstly? > Sorry for this mistake, we always work on cgroupv1, so, not notice > this commit 4550c4e, thank your for point that. > > I see this commit comment that `that's a different discussion`, but, > to implements this, I will try add. > >> >>> Example: >>> echo "1G" 200 > memory.reclaim (only reclaim anon) >>> echo "1G" 0 > memory.reclaim (only reclaim file) >>> echo "1G" 1 > memory.reclaim (only reclaim file) >>> >>> Note that when performing unbalanced reclamation, the cgroup swappiness >>> will be temporarily adjusted dynamically to the input value. Therefore, >>> if the cgroup swappiness is further modified during runtime, there may >>> be some errors. >> If cgroup swappiness will be adjusted temporarily, why not just change >> it via a script before/after proactive reclaiming? > IMO, this unbalance reclaim only takes effect for a single command, > so if it is pre-set using a script, the judgment of the reclamation tendency > may become complicated. If swappiness == 0, then we will only reclaim file pages. If swappiness == 200, then we may still reclaim file pages. So you need a way to reclaim only anon pages? If so, can we use some special swappiness value to specify that? I don't know whether use 200 will cause regression. If so, we may need some other value, e.g. >= 65536. > So, do you mean avoid use cgroup swappiness, just type anon or file to > control > this extreme unbalanced reclamation? > >> >>> However, this is acceptable because the interface is dynamically called >>> by the user and the timing should be controlled by the user. >>> >>> This patchset did not implement the type-based reclamation as expected >>> in the documentation.(anon or file) Because in addition to extreme unbalanced >>> reclamation, this patchset can also adapt to the reclamation tendency >>> allocated according to swappiness, which is more flexible. >>> -- Best Regards, Huang, Ying
On Wed, Nov 8, 2023 at 12:11 AM Huang, Ying <ying.huang@intel.com> wrote: > > Huan Yang <link@vivo.com> writes: > > > HI Huang, Ying > > > > Thanks for reply. > > > > 在 2023/11/8 15:35, Huang, Ying 写道: > >> Huan Yang <link@vivo.com> writes: > >> > >>> In some cases, we need to selectively reclaim file pages or anonymous > >>> pages in an unbalanced manner. > >>> > >>> For example, when an application is pushed to the background and frozen, > >>> it may not be opened for a long time, and we can safely reclaim the > >>> application's anonymous pages, but we do not want to touch the file pages. > >>> > >>> This patchset extends the proactive reclaim interface to achieve > >>> unbalanced reclamation. Users can control the reclamation tendency by > >>> inputting swappiness under the original interface. Specifically, users > >>> can input special values to extremely reclaim specific pages. > >> From mem_cgroup_swappiness(), cgroupv2 doesn't have per-cgroup > >> swappiness. So you need to add that firstly? > > Sorry for this mistake, we always work on cgroupv1, so, not notice > > this commit 4550c4e, thank your for point that. > > > > I see this commit comment that `that's a different discussion`, but, > > to implements this, I will try add. > > > >> > >>> Example: > >>> echo "1G" 200 > memory.reclaim (only reclaim anon) > >>> echo "1G" 0 > memory.reclaim (only reclaim file) > >>> echo "1G" 1 > memory.reclaim (only reclaim file) > >>> > >>> Note that when performing unbalanced reclamation, the cgroup swappiness > >>> will be temporarily adjusted dynamically to the input value. Therefore, > >>> if the cgroup swappiness is further modified during runtime, there may > >>> be some errors. > >> If cgroup swappiness will be adjusted temporarily, why not just change > >> it via a script before/after proactive reclaiming? > > IMO, this unbalance reclaim only takes effect for a single command, > > so if it is pre-set using a script, the judgment of the reclamation tendency > > may become complicated. > > If swappiness == 0, then we will only reclaim file pages. If swappiness > == 200, then we may still reclaim file pages. So you need a way to > reclaim only anon pages? > > If so, can we use some special swappiness value to specify that? I > don't know whether use 200 will cause regression. If so, we may need > some other value, e.g. >= 65536. I don't think swappiness is the answer here. This has been discussed a while back, please see my response. As you mentioned, swappiness may be ignored by the kernel in some cases, and its behavior has historically changed before.
在 2023/11/8 16:14, Yosry Ahmed 写道: > On Wed, Nov 8, 2023 at 12:11 AM Huang, Ying <ying.huang@intel.com> wrote: >> Huan Yang <link@vivo.com> writes: >> >>> HI Huang, Ying >>> >>> Thanks for reply. >>> >>> 在 2023/11/8 15:35, Huang, Ying 写道: >>>> Huan Yang <link@vivo.com> writes: >>>> >>>>> In some cases, we need to selectively reclaim file pages or anonymous >>>>> pages in an unbalanced manner. >>>>> >>>>> For example, when an application is pushed to the background and frozen, >>>>> it may not be opened for a long time, and we can safely reclaim the >>>>> application's anonymous pages, but we do not want to touch the file pages. >>>>> >>>>> This patchset extends the proactive reclaim interface to achieve >>>>> unbalanced reclamation. Users can control the reclamation tendency by >>>>> inputting swappiness under the original interface. Specifically, users >>>>> can input special values to extremely reclaim specific pages. >>>> From mem_cgroup_swappiness(), cgroupv2 doesn't have per-cgroup >>>> swappiness. So you need to add that firstly? >>> Sorry for this mistake, we always work on cgroupv1, so, not notice >>> this commit 4550c4e, thank your for point that. >>> >>> I see this commit comment that `that's a different discussion`, but, >>> to implements this, I will try add. >>> >>>>> Example: >>>>> echo "1G" 200 > memory.reclaim (only reclaim anon) >>>>> echo "1G" 0 > memory.reclaim (only reclaim file) >>>>> echo "1G" 1 > memory.reclaim (only reclaim file) >>>>> >>>>> Note that when performing unbalanced reclamation, the cgroup swappiness >>>>> will be temporarily adjusted dynamically to the input value. Therefore, >>>>> if the cgroup swappiness is further modified during runtime, there may >>>>> be some errors. >>>> If cgroup swappiness will be adjusted temporarily, why not just change >>>> it via a script before/after proactive reclaiming? >>> IMO, this unbalance reclaim only takes effect for a single command, >>> so if it is pre-set using a script, the judgment of the reclamation tendency >>> may become complicated. >> If swappiness == 0, then we will only reclaim file pages. If swappiness >> == 200, then we may still reclaim file pages. So you need a way to >> reclaim only anon pages? >> >> If so, can we use some special swappiness value to specify that? I >> don't know whether use 200 will cause regression. If so, we may need >> some other value, e.g. >= 65536. > I don't think swappiness is the answer here. This has been discussed a > while back, please see my response. As you mentioned, swappiness may > be ignored by the kernel in some cases, and its behavior has > historically changed before. For type base, reclaim can have direct tendencies as well. It's good. But, what if we only want to make small adjustments to the reclamation ratio? Of course, sometimes swappiness may become ineffective.
On 2023/11/8 16:00, Yosry Ahmed wrote:
> +Wei Xu +David Rientjes
>
> On Tue, Nov 7, 2023 at 10:59 PM Huan Yang <link@vivo.com> wrote:
>> In some cases, we need to selectively reclaim file pages or anonymous
>> pages in an unbalanced manner.
>>
>> For example, when an application is pushed to the background and frozen,
>> it may not be opened for a long time, and we can safely reclaim the
>> application's anonymous pages, but we do not want to touch the file pages.
>>
>> This patchset extends the proactive reclaim interface to achieve
>> unbalanced reclamation. Users can control the reclamation tendency by
>> inputting swappiness under the original interface. Specifically, users
>> can input special values to extremely reclaim specific pages.
> I proposed this a while back:
>
> https://lore.kernel.org/linux-mm/CAJD7tkbDpyoODveCsnaqBBMZEkDvshXJmNdbk51yKSNgD7aGdg@mail.gmail.com/

Good to know this; proactive reclaim of a single type is useful in our production too.

> The takeaway from the discussion was that swappiness is not the right
> way to do this. We can add separate arguments to specify types of
> memory to reclaim, as Roman suggested in that thread. I had some
> patches lying around to do that at some point, I can dig them up if
> that's helpful, but they are probably based on a very old kernel now,
> and before MGLRU landed. IIRC it wasn't very difficult, I think I
> added anon/file/shrinkers bits to struct scan_control and then plumbed
> them through to memory.reclaim.
>
>> Example:
>> echo "1G" 200 > memory.reclaim (only reclaim anon)
>> echo "1G" 0 > memory.reclaim (only reclaim file)
>> echo "1G" 1 > memory.reclaim (only reclaim file)
> The type of interface here is nested-keyed, so if we add arguments
> they need to be in key=value format. Example:
>
> echo 1G swappiness=200 > memory.reclaim

Yes, this is better.

> As I mentioned above though, I don't think swappiness is the right way
> of doing this. Also, without swappiness, I don't think there's a v1 vs
> v2 dilemma here. memory.reclaim can work as-is in cgroup v1, it just
> needs to be exposed there.

cgroup v1 can't use memory.reclaim today, so how should it be exposed there? By passing the memcg's ID to trigger the reclaim?
On Wed, Nov 8, 2023 at 12:26 AM Huan Yang <link@vivo.com> wrote: > > > 在 2023/11/8 16:00, Yosry Ahmed 写道: > > +Wei Xu +David Rientjes > > > > On Tue, Nov 7, 2023 at 10:59 PM Huan Yang <link@vivo.com> wrote: > >> In some cases, we need to selectively reclaim file pages or anonymous > >> pages in an unbalanced manner. > >> > >> For example, when an application is pushed to the background and frozen, > >> it may not be opened for a long time, and we can safely reclaim the > >> application's anonymous pages, but we do not want to touch the file pages. > >> > >> This patchset extends the proactive reclaim interface to achieve > >> unbalanced reclamation. Users can control the reclamation tendency by > >> inputting swappiness under the original interface. Specifically, users > >> can input special values to extremely reclaim specific pages. > > I proposed this a while back: > > > > https://lore.kernel.org/linux-mm/CAJD7tkbDpyoODveCsnaqBBMZEkDvshXJmNdbk51yKSNgD7aGdg@mail.gmail.com/ > Well to know this, proactive reclaim single type is usefull in our > production too. > > > > The takeaway from the discussion was that swappiness is not the right > > way to do this. We can add separate arguments to specify types of > > memory to reclaim, as Roman suggested in that thread. I had some > > patches lying around to do that at some point, I can dig them up if > > that's helpful, but they are probably based on a very old kernel now, > > and before MGLRU landed. IIRC it wasn't very difficult, I think I > > added anon/file/shrinkers bits to struct scan_control and then plumbed > > them through to memory.reclaim. > > > >> Example: > >> echo "1G" 200 > memory.reclaim (only reclaim anon) > >> echo "1G" 0 > memory.reclaim (only reclaim file) > >> echo "1G" 1 > memory.reclaim (only reclaim file) > > The type of interface here is nested-keyed, so if we add arguments > > they need to be in key=value format. Example: > > > > echo 1G swappiness=200 > memory.reclaim > Yes, this is better. > > > > As I mentioned above though, I don't think swappiness is the right way > > of doing this. Also, without swappiness, I don't think there's a v1 vs > > v2 dilemma here. memory.reclaim can work as-is in cgroup v1, it just > > needs to be exposed there. > Cgroupv1 can't use memory.reclaim, so, how to exposed it? Reclaim this by > pass memcg's ID? That was mainly about the idea that cgroup v2 does not have per-memcg swappiness, so this proposal seems to be inclined towards v1, at least conceptually. Either way, we need to add memory.reclaim to the v1 files to get it to work on v1. Whether this is acceptable or not is up to the maintainers. I personally don't think it's a problem, it should work as-is for v1.
On Wed, Nov 8, 2023 at 12:22 AM Huan Yang <link@vivo.com> wrote: > > > 在 2023/11/8 16:14, Yosry Ahmed 写道: > > On Wed, Nov 8, 2023 at 12:11 AM Huang, Ying <ying.huang@intel.com> wrote: > >> Huan Yang <link@vivo.com> writes: > >> > >>> HI Huang, Ying > >>> > >>> Thanks for reply. > >>> > >>> 在 2023/11/8 15:35, Huang, Ying 写道: > >>>> Huan Yang <link@vivo.com> writes: > >>>> > >>>>> In some cases, we need to selectively reclaim file pages or anonymous > >>>>> pages in an unbalanced manner. > >>>>> > >>>>> For example, when an application is pushed to the background and frozen, > >>>>> it may not be opened for a long time, and we can safely reclaim the > >>>>> application's anonymous pages, but we do not want to touch the file pages. > >>>>> > >>>>> This patchset extends the proactive reclaim interface to achieve > >>>>> unbalanced reclamation. Users can control the reclamation tendency by > >>>>> inputting swappiness under the original interface. Specifically, users > >>>>> can input special values to extremely reclaim specific pages. > >>>> From mem_cgroup_swappiness(), cgroupv2 doesn't have per-cgroup > >>>> swappiness. So you need to add that firstly? > >>> Sorry for this mistake, we always work on cgroupv1, so, not notice > >>> this commit 4550c4e, thank your for point that. > >>> > >>> I see this commit comment that `that's a different discussion`, but, > >>> to implements this, I will try add. > >>> > >>>>> Example: > >>>>> echo "1G" 200 > memory.reclaim (only reclaim anon) > >>>>> echo "1G" 0 > memory.reclaim (only reclaim file) > >>>>> echo "1G" 1 > memory.reclaim (only reclaim file) > >>>>> > >>>>> Note that when performing unbalanced reclamation, the cgroup swappiness > >>>>> will be temporarily adjusted dynamically to the input value. Therefore, > >>>>> if the cgroup swappiness is further modified during runtime, there may > >>>>> be some errors. > >>>> If cgroup swappiness will be adjusted temporarily, why not just change > >>>> it via a script before/after proactive reclaiming? > >>> IMO, this unbalance reclaim only takes effect for a single command, > >>> so if it is pre-set using a script, the judgment of the reclamation tendency > >>> may become complicated. > >> If swappiness == 0, then we will only reclaim file pages. If swappiness > >> == 200, then we may still reclaim file pages. So you need a way to > >> reclaim only anon pages? > >> > >> If so, can we use some special swappiness value to specify that? I > >> don't know whether use 200 will cause regression. If so, we may need > >> some other value, e.g. >= 65536. > > I don't think swappiness is the answer here. This has been discussed a > > while back, please see my response. As you mentioned, swappiness may > > be ignored by the kernel in some cases, and its behavior has > > historically changed before. > > For type base, reclaim can have direct tendencies as well. It's good. > But, what if > we only want to make small adjustments to the reclamation ratio? > Of course, sometimes swappiness may become ineffective. > Is there a real use case for this? I think it's difficult to reason about swappiness and make small adjustments to the file/anon ratio based on it. I'd prefer a more concrete implementation.
在 2023/11/8 17:00, Yosry Ahmed 写道: > On Wed, Nov 8, 2023 at 12:22 AM Huan Yang <link@vivo.com> wrote: >> >> 在 2023/11/8 16:14, Yosry Ahmed 写道: >>> On Wed, Nov 8, 2023 at 12:11 AM Huang, Ying <ying.huang@intel.com> wrote: >>>> Huan Yang <link@vivo.com> writes: >>>> >>>>> HI Huang, Ying >>>>> >>>>> Thanks for reply. >>>>> >>>>> 在 2023/11/8 15:35, Huang, Ying 写道: >>>>>> Huan Yang <link@vivo.com> writes: >>>>>> >>>>>>> In some cases, we need to selectively reclaim file pages or anonymous >>>>>>> pages in an unbalanced manner. >>>>>>> >>>>>>> For example, when an application is pushed to the background and frozen, >>>>>>> it may not be opened for a long time, and we can safely reclaim the >>>>>>> application's anonymous pages, but we do not want to touch the file pages. >>>>>>> >>>>>>> This patchset extends the proactive reclaim interface to achieve >>>>>>> unbalanced reclamation. Users can control the reclamation tendency by >>>>>>> inputting swappiness under the original interface. Specifically, users >>>>>>> can input special values to extremely reclaim specific pages. >>>>>> From mem_cgroup_swappiness(), cgroupv2 doesn't have per-cgroup >>>>>> swappiness. So you need to add that firstly? >>>>> Sorry for this mistake, we always work on cgroupv1, so, not notice >>>>> this commit 4550c4e, thank your for point that. >>>>> >>>>> I see this commit comment that `that's a different discussion`, but, >>>>> to implements this, I will try add. >>>>> >>>>>>> Example: >>>>>>> echo "1G" 200 > memory.reclaim (only reclaim anon) >>>>>>> echo "1G" 0 > memory.reclaim (only reclaim file) >>>>>>> echo "1G" 1 > memory.reclaim (only reclaim file) >>>>>>> >>>>>>> Note that when performing unbalanced reclamation, the cgroup swappiness >>>>>>> will be temporarily adjusted dynamically to the input value. Therefore, >>>>>>> if the cgroup swappiness is further modified during runtime, there may >>>>>>> be some errors. >>>>>> If cgroup swappiness will be adjusted temporarily, why not just change >>>>>> it via a script before/after proactive reclaiming? >>>>> IMO, this unbalance reclaim only takes effect for a single command, >>>>> so if it is pre-set using a script, the judgment of the reclamation tendency >>>>> may become complicated. >>>> If swappiness == 0, then we will only reclaim file pages. If swappiness >>>> == 200, then we may still reclaim file pages. So you need a way to >>>> reclaim only anon pages? >>>> >>>> If so, can we use some special swappiness value to specify that? I >>>> don't know whether use 200 will cause regression. If so, we may need >>>> some other value, e.g. >= 65536. >>> I don't think swappiness is the answer here. This has been discussed a >>> while back, please see my response. As you mentioned, swappiness may >>> be ignored by the kernel in some cases, and its behavior has >>> historically changed before. >> For type base, reclaim can have direct tendencies as well. It's good. >> But, what if >> we only want to make small adjustments to the reclamation ratio? >> Of course, sometimes swappiness may become ineffective. >> > Is there a real use case for this? I think it's difficult to reason > about swappiness and make small adjustments to the file/anon ratio > based on it. I'd prefer a more concrete implementation. For example, swappiness=170 to try hard reclaim anon, a little pressure to reclaim file(expect reclaim clean file). In theory, this method can help reduce memory pressure. 
Or, for example, reclaiming 80% of the anon pages while trimming only 5% of the code (file) pages is a useful degree of control once an application has been detected as frozen for a period of time.
在 2023/11/8 16:59, Yosry Ahmed 写道: > On Wed, Nov 8, 2023 at 12:26 AM Huan Yang <link@vivo.com> wrote: >> >> 在 2023/11/8 16:00, Yosry Ahmed 写道: >>> +Wei Xu +David Rientjes >>> >>> On Tue, Nov 7, 2023 at 10:59 PM Huan Yang <link@vivo.com> wrote: >>>> In some cases, we need to selectively reclaim file pages or anonymous >>>> pages in an unbalanced manner. >>>> >>>> For example, when an application is pushed to the background and frozen, >>>> it may not be opened for a long time, and we can safely reclaim the >>>> application's anonymous pages, but we do not want to touch the file pages. >>>> >>>> This patchset extends the proactive reclaim interface to achieve >>>> unbalanced reclamation. Users can control the reclamation tendency by >>>> inputting swappiness under the original interface. Specifically, users >>>> can input special values to extremely reclaim specific pages. >>> I proposed this a while back: >>> >>> https://lore.kernel.org/linux-mm/CAJD7tkbDpyoODveCsnaqBBMZEkDvshXJmNdbk51yKSNgD7aGdg@mail.gmail.com/ >> Well to know this, proactive reclaim single type is usefull in our >> production too. >>> The takeaway from the discussion was that swappiness is not the right >>> way to do this. We can add separate arguments to specify types of >>> memory to reclaim, as Roman suggested in that thread. I had some >>> patches lying around to do that at some point, I can dig them up if >>> that's helpful, but they are probably based on a very old kernel now, >>> and before MGLRU landed. IIRC it wasn't very difficult, I think I >>> added anon/file/shrinkers bits to struct scan_control and then plumbed >>> them through to memory.reclaim. >>> >>>> Example: >>>> echo "1G" 200 > memory.reclaim (only reclaim anon) >>>> echo "1G" 0 > memory.reclaim (only reclaim file) >>>> echo "1G" 1 > memory.reclaim (only reclaim file) >>> The type of interface here is nested-keyed, so if we add arguments >>> they need to be in key=value format. Example: >>> >>> echo 1G swappiness=200 > memory.reclaim >> Yes, this is better. >>> As I mentioned above though, I don't think swappiness is the right way >>> of doing this. Also, without swappiness, I don't think there's a v1 vs >>> v2 dilemma here. memory.reclaim can work as-is in cgroup v1, it just >>> needs to be exposed there. >> Cgroupv1 can't use memory.reclaim, so, how to exposed it? Reclaim this by >> pass memcg's ID? > That was mainly about the idea that cgroup v2 does not have per-memcg > swappiness, so this proposal seems to be inclined towards v1, at least I seem current comments of mem_cgroup_swappiness it is believed that per-memcg swappiness can be added. But, we first need to explain that using swappiness is a very useful way. And in the discussions of your patchset, end that not use it. > conceptually. Either way, we need to add memory.reclaim to the v1 > files to get it to work on v1. Whether this is acceptable or not is up > to the maintainers. I personally don't think it's a problem, it should Yes, but, I understand that cgroup v2 is a trend, so it is understandable that no new interfaces are added to v1. :) Maybe you can promoting the use of cgroupv2 on Android? > work as-is for v1.
On Wed 08-11-23 14:58:11, Huan Yang wrote: > In some cases, we need to selectively reclaim file pages or anonymous > pages in an unbalanced manner. > > For example, when an application is pushed to the background and frozen, > it may not be opened for a long time, and we can safely reclaim the > application's anonymous pages, but we do not want to touch the file pages. Could you explain why? And also why do you need to swap out in that case? > This patchset extends the proactive reclaim interface to achieve > unbalanced reclamation. Users can control the reclamation tendency by > inputting swappiness under the original interface. Specifically, users > can input special values to extremely reclaim specific pages. Other have already touched on this in other replies but v2 doesn't have a per-memcg swappiness > Example: > echo "1G" 200 > memory.reclaim (only reclaim anon) > echo "1G" 0 > memory.reclaim (only reclaim file) > echo "1G" 1 > memory.reclaim (only reclaim file) > > Note that when performing unbalanced reclamation, the cgroup swappiness > will be temporarily adjusted dynamically to the input value. Therefore, > if the cgroup swappiness is further modified during runtime, there may > be some errors. In general this is a bad semantic. The operation shouldn't have side effect that are potentially visible for another operation.
On Wed, 8 Nov 2023 14:58:11 +0800 Huan Yang <link@vivo.com> wrote: > For example, when an application is pushed to the background and frozen, > it may not be opened for a long time, and we can safely reclaim the > application's anonymous pages, but we do not want to touch the file pages. This paragraph is key to the entire patchset and it would benefit from some expanding upon. If the application is dormant, why on earth would we want to evict its text but keep its data around?
On 2023/11/8 22:06, Michal Hocko wrote:
> On Wed 08-11-23 14:58:11, Huan Yang wrote:
>> In some cases, we need to selectively reclaim file pages or anonymous
>> pages in an unbalanced manner.
>>
>> For example, when an application is pushed to the background and frozen,
>> it may not be opened for a long time, and we can safely reclaim the
>> application's anonymous pages, but we do not want to touch the file pages.
> Could you explain why? And also why do you need to swap out in that
> case?

When an application is frozen, it usually means we predict that it will not be
used for a long time. To proactively save some memory, our policy compresses
the application's private data into zram, and it also selects some of the cold
data that is already in zram and swaps it out.

The above operations assume that anonymous pages are private to the
application. After the application is frozen, compressing these pages into
zram saves memory to some extent without a risk of frequent refaults, and the
cost of a refault from zram is lower than that of IO.

>> This patchset extends the proactive reclaim interface to achieve
>> unbalanced reclamation. Users can control the reclamation tendency by
>> inputting swappiness under the original interface. Specifically, users
>> can input special values to extremely reclaim specific pages.
> Other have already touched on this in other replies but v2 doesn't have
> a per-memcg swappiness
>
>> Example:
>> echo "1G" 200 > memory.reclaim (only reclaim anon)
>> echo "1G" 0 > memory.reclaim (only reclaim file)
>> echo "1G" 1 > memory.reclaim (only reclaim file)
>>
>> Note that when performing unbalanced reclamation, the cgroup swappiness
>> will be temporarily adjusted dynamically to the input value. Therefore,
>> if the cgroup swappiness is further modified during runtime, there may
>> be some errors.
> In general this is a bad semantic. The operation shouldn't have side
> effect that are potentially visible for another operation.

So maybe we should pass the swappiness into struct scan_control and keep it
local to a single reclaim pass, so that the cgroup's swappiness itself is
never changed? Or perhaps it is simply a bad idea to use swappiness to
control unbalanced reclaim.

> --
> Michal Hocko
> SUSE Labs
On 2023/11/9 0:14, Andrew Morton wrote:
> On Wed, 8 Nov 2023 14:58:11 +0800 Huan Yang <link@vivo.com> wrote:
>
>> For example, when an application is pushed to the background and frozen,
>> it may not be opened for a long time, and we can safely reclaim the
>> application's anonymous pages, but we do not want to touch the file pages.
> This paragraph is key to the entire patchset and it would benefit from
> some expanding upon.
>
> If the application is dormant, why on earth would we want to evict its
> text but keep its data around?

In fact, we currently use this method to reclaim only an application's
anonymous pages, because we believe the refault cost of reclaiming anonymous
pages is relatively small, especially when zram is used and only the anonymous
pages of frozen applications are reclaimed proactively.
Huan Yang <link@vivo.com> writes: > 在 2023/11/8 22:06, Michal Hocko 写道: >> [Some people who received this message don't often get email from mhocko@suse.com. Learn why this is important at https://aka.ms/LearnAboutSenderIdentification ] >> >> On Wed 08-11-23 14:58:11, Huan Yang wrote: >>> In some cases, we need to selectively reclaim file pages or anonymous >>> pages in an unbalanced manner. >>> >>> For example, when an application is pushed to the background and frozen, >>> it may not be opened for a long time, and we can safely reclaim the >>> application's anonymous pages, but we do not want to touch the file pages. >> Could you explain why? And also why do you need to swap out in that >> case? > When an application is frozen, it usually means that we predict that > it will not be > used for a long time. In order to proactively save some memory, our > strategy will > choose to compress the application's private data into zram. And we > will also > select some of the cold application data that we think is in zram and > swap it out. > > The above operations assume that anonymous pages are private to the > application. If so, is it better only to reclaim private anonymous pages explicitly? Add another option for that? > After the application is frozen, compressing these pages into zram can > save memory > to some extent without worrying about frequent refaults. > > And the cost of refaults on zram is lower than that of IO. If so, swappiness should be high system-wise? -- Best Regards, Huang, Ying > >> >>> This patchset extends the proactive reclaim interface to achieve >>> unbalanced reclamation. Users can control the reclamation tendency by >>> inputting swappiness under the original interface. Specifically, users >>> can input special values to extremely reclaim specific pages. >> Other have already touched on this in other replies but v2 doesn't have >> a per-memcg swappiness >> >>> Example: >>> echo "1G" 200 > memory.reclaim (only reclaim anon) >>> echo "1G" 0 > memory.reclaim (only reclaim file) >>> echo "1G" 1 > memory.reclaim (only reclaim file) >>> >>> Note that when performing unbalanced reclamation, the cgroup swappiness >>> will be temporarily adjusted dynamically to the input value. Therefore, >>> if the cgroup swappiness is further modified during runtime, there may >>> be some errors. >> In general this is a bad semantic. The operation shouldn't have side >> effect that are potentially visible for another operation. > So, maybe pass swappiness into sc and keep a single reclamation ensure that > swappiness is not changed? > Or, it's a bad idea that use swappiness to control unbalance reclaim. >> -- >> Michal Hocko >> SUSE Labs
在 2023/11/9 11:15, Huang, Ying 写道: > [Some people who received this message don't often get email from ying.huang@intel.com. Learn why this is important at https://aka.ms/LearnAboutSenderIdentification ] > > Huan Yang <link@vivo.com> writes: > >> 在 2023/11/8 22:06, Michal Hocko 写道: >>> [Some people who received this message don't often get email from mhocko@suse.com. Learn why this is important at https://aka.ms/LearnAboutSenderIdentification ] >>> >>> On Wed 08-11-23 14:58:11, Huan Yang wrote: >>>> In some cases, we need to selectively reclaim file pages or anonymous >>>> pages in an unbalanced manner. >>>> >>>> For example, when an application is pushed to the background and frozen, >>>> it may not be opened for a long time, and we can safely reclaim the >>>> application's anonymous pages, but we do not want to touch the file pages. >>> Could you explain why? And also why do you need to swap out in that >>> case? >> When an application is frozen, it usually means that we predict that >> it will not be >> used for a long time. In order to proactively save some memory, our >> strategy will >> choose to compress the application's private data into zram. And we >> will also >> select some of the cold application data that we think is in zram and >> swap it out. >> >> The above operations assume that anonymous pages are private to the >> application. > If so, is it better only to reclaim private anonymous pages explicitly? Yes, in practice, we only proactively compress anonymous pages and do not want to touch file pages. However, I like the phrase "Provide mechanisms, not strategies". Maybe letter zcache can use well, we can also proactively compress certain file pages at a lower cost. So, maybe give a way to only reclaim page cache is good? > Add another option for that? But, yes, I also believe that providing a way to specify the tendency to reclaim anonymous and file types can achieve a certain degree of flexibility. And swappiness-based control is currently not very accurate. > >> After the application is frozen, compressing these pages into zram can >> save memory >> to some extent without worrying about frequent refaults. >> >> And the cost of refaults on zram is lower than that of IO. > If so, swappiness should be high system-wise? Yes, I agree. Swappiness should not be used to control unbalanced reclamation. Moreover, this patchset actually use flags to control unbalanced reclamation. Therefore, the proactive reclamation interface should receive additional options (such as anon, file) instead of swappiness. > > -- > Best Regards, > Huang, Ying > >>>> This patchset extends the proactive reclaim interface to achieve >>>> unbalanced reclamation. Users can control the reclamation tendency by >>>> inputting swappiness under the original interface. Specifically, users >>>> can input special values to extremely reclaim specific pages. >>> Other have already touched on this in other replies but v2 doesn't have >>> a per-memcg swappiness >>> >>>> Example: >>>> echo "1G" 200 > memory.reclaim (only reclaim anon) >>>> echo "1G" 0 > memory.reclaim (only reclaim file) >>>> echo "1G" 1 > memory.reclaim (only reclaim file) >>>> >>>> Note that when performing unbalanced reclamation, the cgroup swappiness >>>> will be temporarily adjusted dynamically to the input value. Therefore, >>>> if the cgroup swappiness is further modified during runtime, there may >>>> be some errors. >>> In general this is a bad semantic. 
The operation shouldn't have side >>> effect that are potentially visible for another operation. >> So, maybe pass swappiness into sc and keep a single reclamation ensure that >> swappiness is not changed? >> Or, it's a bad idea that use swappiness to control unbalance reclaim. >>> -- >>> Michal Hocko >>> SUSE Labs
On Thu 09-11-23 09:56:46, Huan Yang wrote: > > 在 2023/11/8 22:06, Michal Hocko 写道: > > [Some people who received this message don't often get email from mhocko@suse.com. Learn why this is important at https://aka.ms/LearnAboutSenderIdentification ] > > > > On Wed 08-11-23 14:58:11, Huan Yang wrote: > > > In some cases, we need to selectively reclaim file pages or anonymous > > > pages in an unbalanced manner. > > > > > > For example, when an application is pushed to the background and frozen, > > > it may not be opened for a long time, and we can safely reclaim the > > > application's anonymous pages, but we do not want to touch the file pages. > > Could you explain why? And also why do you need to swap out in that > > case? > > When an application is frozen, it usually means that we predict that > it will not be used for a long time. In order to proactively save some > memory, our strategy will choose to compress the application's private > data into zram. And we will also select some of the cold application > data that we think is in zram and swap it out. > > The above operations assume that anonymous pages are private to the > application. After the application is frozen, compressing these pages > into zram can save memory to some extent without worrying about > frequent refaults. Why don't you rely on the default reclaim heuristics? In other words do you have any numbers showing that a selective reclaim results in a much better behavior? How do you evaluate that? > > And the cost of refaults on zram is lower than that of IO. > > > > > > > This patchset extends the proactive reclaim interface to achieve > > > unbalanced reclamation. Users can control the reclamation tendency by > > > inputting swappiness under the original interface. Specifically, users > > > can input special values to extremely reclaim specific pages. > > Other have already touched on this in other replies but v2 doesn't have > > a per-memcg swappiness > > > > > Example: > > > echo "1G" 200 > memory.reclaim (only reclaim anon) > > > echo "1G" 0 > memory.reclaim (only reclaim file) > > > echo "1G" 1 > memory.reclaim (only reclaim file) > > > > > > Note that when performing unbalanced reclamation, the cgroup swappiness > > > will be temporarily adjusted dynamically to the input value. Therefore, > > > if the cgroup swappiness is further modified during runtime, there may > > > be some errors. > > In general this is a bad semantic. The operation shouldn't have side > > effect that are potentially visible for another operation. > So, maybe pass swappiness into sc and keep a single reclamation ensure that > swappiness is not changed? That would be a much saner approach. > Or, it's a bad idea that use swappiness to control unbalance reclaim. Memory reclaim is not really obliged to consider swappiness. In fact the actual behavior has changed several times in the past and it is safer to assume this might change in the future again.
On Thu 09-11-23 11:38:56, Huan Yang wrote: [...] > > If so, is it better only to reclaim private anonymous pages explicitly? > Yes, in practice, we only proactively compress anonymous pages and do not > want to touch file pages. If that is the case and this is mostly application centric (which you seem to be suggesting) then why don't you use madvise(MADV_PAGEOUT) instead.
HI Michal Hocko, Thanks for your suggestion. 在 2023/11/9 17:57, Michal Hocko 写道: > [Some people who received this message don't often get email from mhocko@suse.com. Learn why this is important at https://aka.ms/LearnAboutSenderIdentification ] > > On Thu 09-11-23 11:38:56, Huan Yang wrote: > [...] >>> If so, is it better only to reclaim private anonymous pages explicitly? >> Yes, in practice, we only proactively compress anonymous pages and do not >> want to touch file pages. > If that is the case and this is mostly application centric (which you > seem to be suggesting) then why don't you use madvise(MADV_PAGEOUT) > instead. Madvise may not be applicable in this scenario.(IMO) This feature is aimed at a core goal, which is to compress the anonymous pages of frozen applications. How to detect that an application is frozen and determine which pages can be safely reclaimed is the responsibility of the policy part. Setting madvise for an application is an active behavior, while the above policy is a passive approach.(If I misunderstood, please let me know if there is a better way to set madvise.) > -- > Michal Hocko > SUSE Labs
On Thu 09-11-23 18:29:03, Huan Yang wrote: > HI Michal Hocko, > > Thanks for your suggestion. > > 在 2023/11/9 17:57, Michal Hocko 写道: > > [Some people who received this message don't often get email from mhocko@suse.com. Learn why this is important at https://aka.ms/LearnAboutSenderIdentification ] > > > > On Thu 09-11-23 11:38:56, Huan Yang wrote: > > [...] > > > > If so, is it better only to reclaim private anonymous pages explicitly? > > > Yes, in practice, we only proactively compress anonymous pages and do not > > > want to touch file pages. > > If that is the case and this is mostly application centric (which you > > seem to be suggesting) then why don't you use madvise(MADV_PAGEOUT) > > instead. > Madvise may not be applicable in this scenario.(IMO) > > This feature is aimed at a core goal, which is to compress the anonymous > pages > of frozen applications. > > How to detect that an application is frozen and determine which pages can be > safely reclaimed is the responsibility of the policy part. > > Setting madvise for an application is an active behavior, while the above > policy > is a passive approach.(If I misunderstood, please let me know if there is a > better > way to set madvise.) You are proposing an extension to the pro-active reclaim interface so this is an active behavior pretty much by definition. So I am really not following you here. Your agent can simply scan the address space of the application it is going to "freeze" and call pidfd_madvise(MADV_PAGEOUT) on the private memory is that is really what you want/need.
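For context, the interface Michal refers to here is process_madvise(2), which takes a pidfd and applies advice such as MADV_PAGEOUT to another process's address ranges. Below is a minimal, hedged sketch of that route: a real agent would walk /proc/<pid>/maps and build iovecs for the private anonymous mappings, whereas this example takes a single range on the command line for brevity. The fallback constants are assumptions for architectures that use the unified syscall table.

```
/*
 * Sketch of the process_madvise(2) route ("pidfd_madvise(MADV_PAGEOUT)"):
 * ask the kernel to page out one address range of another process.
 * Needs Linux >= 5.10 and suitable privileges (e.g. CAP_SYS_NICE).
 */
#define _GNU_SOURCE
#include <stdio.h>
#include <stdlib.h>
#include <sys/mman.h>
#include <sys/syscall.h>
#include <sys/types.h>
#include <sys/uio.h>
#include <unistd.h>

#ifndef MADV_PAGEOUT
#define MADV_PAGEOUT 21			/* reclaim these pages */
#endif
#ifndef __NR_pidfd_open
#define __NR_pidfd_open 434
#endif
#ifndef __NR_process_madvise
#define __NR_process_madvise 440
#endif

int main(int argc, char **argv)
{
	if (argc != 4) {
		fprintf(stderr, "usage: %s <pid> <hex-start> <length>\n", argv[0]);
		return 1;
	}

	pid_t pid = (pid_t)atoi(argv[1]);
	struct iovec iov = {
		.iov_base = (void *)strtoull(argv[2], NULL, 16),
		.iov_len  = strtoull(argv[3], NULL, 0),
	};

	int pidfd = (int)syscall(__NR_pidfd_open, pid, 0);
	if (pidfd < 0) {
		perror("pidfd_open");
		return 1;
	}

	/* Ask reclaim to swap out (or compress to zram) just this range. */
	ssize_t done = syscall(__NR_process_madvise, pidfd, &iov, 1,
			       MADV_PAGEOUT, 0);
	if (done < 0)
		perror("process_madvise");
	else
		printf("advised %zd bytes\n", done);

	close(pidfd);
	return done < 0 ? 1 : 0;
}
```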
在 2023/11/9 18:39, Michal Hocko 写道: > [Some people who received this message don't often get email from mhocko@suse.com. Learn why this is important at https://aka.ms/LearnAboutSenderIdentification ] > > On Thu 09-11-23 18:29:03, Huan Yang wrote: >> HI Michal Hocko, >> >> Thanks for your suggestion. >> >> 在 2023/11/9 17:57, Michal Hocko 写道: >>> [Some people who received this message don't often get email from mhocko@suse.com. Learn why this is important at https://aka.ms/LearnAboutSenderIdentification ] >>> >>> On Thu 09-11-23 11:38:56, Huan Yang wrote: >>> [...] >>>>> If so, is it better only to reclaim private anonymous pages explicitly? >>>> Yes, in practice, we only proactively compress anonymous pages and do not >>>> want to touch file pages. >>> If that is the case and this is mostly application centric (which you >>> seem to be suggesting) then why don't you use madvise(MADV_PAGEOUT) >>> instead. >> Madvise may not be applicable in this scenario.(IMO) >> >> This feature is aimed at a core goal, which is to compress the anonymous >> pages >> of frozen applications. >> >> How to detect that an application is frozen and determine which pages can be >> safely reclaimed is the responsibility of the policy part. >> >> Setting madvise for an application is an active behavior, while the above >> policy >> is a passive approach.(If I misunderstood, please let me know if there is a >> better >> way to set madvise.) > You are proposing an extension to the pro-active reclaim interface so > this is an active behavior pretty much by definition. So I am really not > following you here. Your agent can simply scan the address space of the > application it is going to "freeze" and call pidfd_madvise(MADV_PAGEOUT) > on the private memory is that is really what you want/need. There is a key point here. We want to use the grouping policy of memcg to perform proactive reclamation with certain tendencies. Your suggestion is to reclaim memory by scanning the task process space. However, in the mobile field, memory is usually viewed at the granularity of an APP. Therefore, after an APP is frozen, we hope to reclaim memory uniformly according to the pre-grouped APP processes. Of course, as you suggested, madvise can also achieve this, but implementing it in the agent may be more complex.(In terms of achieving the same goal, using memcg to group all the processes of an APP and perform proactive reclamation is simpler than using madvise and scanning multiple processes of an application using an agent?) > > -- > Michal Hocko > SUSE Labs
在 2023/11/9 17:53, Michal Hocko 写道: > [Some people who received this message don't often get email from mhocko@suse.com. Learn why this is important at https://aka.ms/LearnAboutSenderIdentification ] > > On Thu 09-11-23 09:56:46, Huan Yang wrote: >> 在 2023/11/8 22:06, Michal Hocko 写道: >>> [Some people who received this message don't often get email from mhocko@suse.com. Learn why this is important at https://aka.ms/LearnAboutSenderIdentification ] >>> >>> On Wed 08-11-23 14:58:11, Huan Yang wrote: >>>> In some cases, we need to selectively reclaim file pages or anonymous >>>> pages in an unbalanced manner. >>>> >>>> For example, when an application is pushed to the background and frozen, >>>> it may not be opened for a long time, and we can safely reclaim the >>>> application's anonymous pages, but we do not want to touch the file pages. >>> Could you explain why? And also why do you need to swap out in that >>> case? >> When an application is frozen, it usually means that we predict that >> it will not be used for a long time. In order to proactively save some >> memory, our strategy will choose to compress the application's private >> data into zram. And we will also select some of the cold application >> data that we think is in zram and swap it out. >> >> The above operations assume that anonymous pages are private to the >> application. After the application is frozen, compressing these pages >> into zram can save memory to some extent without worrying about >> frequent refaults. > Why don't you rely on the default reclaim heuristics? In other words do As I mentioned earlier, the madvise approach may not be suitable for my needs. > you have any numbers showing that a selective reclaim results in a much In the mobile field, we have a core metric called application residency. This mechanism can help us improve the application residency if we can provide a good freeze detection and proactive reclamation policy. I can only provide specific data from our internal tests, and it may be older data, and it tested using cgroup v1: In 12G ram phone, app residency improve from 29 to 38. > better behavior? How do you evaluate that? > >> And the cost of refaults on zram is lower than that of IO. >> >> >>>> This patchset extends the proactive reclaim interface to achieve >>>> unbalanced reclamation. Users can control the reclamation tendency by >>>> inputting swappiness under the original interface. Specifically, users >>>> can input special values to extremely reclaim specific pages. >>> Other have already touched on this in other replies but v2 doesn't have >>> a per-memcg swappiness >>> >>>> Example: >>>> echo "1G" 200 > memory.reclaim (only reclaim anon) >>>> echo "1G" 0 > memory.reclaim (only reclaim file) >>>> echo "1G" 1 > memory.reclaim (only reclaim file) >>>> >>>> Note that when performing unbalanced reclamation, the cgroup swappiness >>>> will be temporarily adjusted dynamically to the input value. Therefore, >>>> if the cgroup swappiness is further modified during runtime, there may >>>> be some errors. >>> In general this is a bad semantic. The operation shouldn't have side >>> effect that are potentially visible for another operation. >> So, maybe pass swappiness into sc and keep a single reclamation ensure that >> swappiness is not changed? > That would be a much saner approach. > >> Or, it's a bad idea that use swappiness to control unbalance reclaim. > Memory reclaim is not really obliged to consider swappiness. 
> In fact the actual behavior has changed several times in the past and
> it is safer to assume this might change in the future again.

Thank you for the guidance.

> --
> Michal Hocko
> SUSE Labs
On Thu 09-11-23 18:50:36, Huan Yang wrote: > > 在 2023/11/9 18:39, Michal Hocko 写道: > > [Some people who received this message don't often get email from mhocko@suse.com. Learn why this is important at https://aka.ms/LearnAboutSenderIdentification ] > > > > On Thu 09-11-23 18:29:03, Huan Yang wrote: > > > HI Michal Hocko, > > > > > > Thanks for your suggestion. > > > > > > 在 2023/11/9 17:57, Michal Hocko 写道: > > > > [Some people who received this message don't often get email from mhocko@suse.com. Learn why this is important at https://aka.ms/LearnAboutSenderIdentification ] > > > > > > > > On Thu 09-11-23 11:38:56, Huan Yang wrote: > > > > [...] > > > > > > If so, is it better only to reclaim private anonymous pages explicitly? > > > > > Yes, in practice, we only proactively compress anonymous pages and do not > > > > > want to touch file pages. > > > > If that is the case and this is mostly application centric (which you > > > > seem to be suggesting) then why don't you use madvise(MADV_PAGEOUT) > > > > instead. > > > Madvise may not be applicable in this scenario.(IMO) > > > > > > This feature is aimed at a core goal, which is to compress the anonymous > > > pages > > > of frozen applications. > > > > > > How to detect that an application is frozen and determine which pages can be > > > safely reclaimed is the responsibility of the policy part. > > > > > > Setting madvise for an application is an active behavior, while the above > > > policy > > > is a passive approach.(If I misunderstood, please let me know if there is a > > > better > > > way to set madvise.) > > You are proposing an extension to the pro-active reclaim interface so > > this is an active behavior pretty much by definition. So I am really not > > following you here. Your agent can simply scan the address space of the > > application it is going to "freeze" and call pidfd_madvise(MADV_PAGEOUT) > > on the private memory is that is really what you want/need. > > There is a key point here. We want to use the grouping policy of memcg > to perform proactive reclamation with certain tendencies. Your > suggestion is to reclaim memory by scanning the task process space. > However, in the mobile field, memory is usually viewed at the > granularity of an APP. OK, sthis is likely a terminology gap on my end. By application you do not really mean a process but rather a whole cgroup. That would have been really useful to be explicit about. > Therefore, after an APP is frozen, we hope to reclaim memory uniformly > according to the pre-grouped APP processes. > > Of course, as you suggested, madvise can also achieve this, but > implementing it in the agent may be more complex.(In terms of > achieving the same goal, using memcg to group all the processes of an > APP and perform proactive reclamation is simpler than using madvise > and scanning multiple processes of an application using an agent?) It might be more involved but the primary question is whether it is usable for the specific use case. Madvise interface is not LRU aware but you are not really talking about that to be a requirement? So it would really help if you go deeper into details on how is the interface actually supposed to be used in your case. Also make sure to exaplain why you cannot use other existing interfaces. For example, why you simply don't decrease the limit of the frozen cgroup and rely on the normal reclaim process to evict the most cold memory? What are you basing your anon vs. file proportion decision on? 
In other words more details, ideally with some numbers and make sure to describe why existing APIs cannot be used.
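As an aside, the alternative raised above (shrinking the frozen cgroup's limit and letting normal reclaim pick the coldest pages) could be sketched roughly as below. The cgroup path and the 25% squeeze are illustrative assumptions; when a value below current usage is written to `memory.high`, the kernel tries to reclaim down toward it at write time and whenever the group allocates above it.

```c
/*
 * Hedged sketch of the "just lower the limit" alternative: temporarily
 * reduce memory.high on the frozen app's cgroup so normal reclaim
 * evicts its coldest pages, then lift the limit again.
 */
#include <stdio.h>

static long long read_ll(const char *path)
{
	long long v = -1;
	FILE *f = fopen(path, "r");

	if (f) {
		if (fscanf(f, "%lld", &v) != 1)
			v = -1;
		fclose(f);
	}
	return v;
}

static int write_str(const char *path, const char *val)
{
	FILE *f = fopen(path, "w");

	if (!f)
		return -1;
	fprintf(f, "%s\n", val);
	return fclose(f);
}

int main(void)
{
	const char *cg = "/sys/fs/cgroup/apps/com.example.app"; /* assumed */
	char cur_p[256], high_p[256], buf[32];
	long long cur;

	snprintf(cur_p, sizeof(cur_p), "%s/memory.current", cg);
	snprintf(high_p, sizeof(high_p), "%s/memory.high", cg);

	cur = read_ll(cur_p);
	if (cur < 0)
		return 1;

	/* Squeeze the frozen cgroup to ~75% of its current usage. */
	snprintf(buf, sizeof(buf), "%lld", cur * 3 / 4);
	write_str(high_p, buf);

	/*
	 * The write above already tried to reclaim down to the new value;
	 * lift the limit again so the app is not throttled when it thaws.
	 */
	write_str(high_p, "max");
	return 0;
}
```

As the thread goes on to discuss, this reclaims whatever the normal heuristics pick (file and anon alike), which is exactly the behavior the patchset is trying to bias.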
On Thu 09-11-23 18:55:09, Huan Yang wrote: > > 在 2023/11/9 17:53, Michal Hocko 写道: > > [Some people who received this message don't often get email from mhocko@suse.com. Learn why this is important at https://aka.ms/LearnAboutSenderIdentification ] > > > > On Thu 09-11-23 09:56:46, Huan Yang wrote: > > > 在 2023/11/8 22:06, Michal Hocko 写道: > > > > [Some people who received this message don't often get email from mhocko@suse.com. Learn why this is important at https://aka.ms/LearnAboutSenderIdentification ] > > > > > > > > On Wed 08-11-23 14:58:11, Huan Yang wrote: > > > > > In some cases, we need to selectively reclaim file pages or anonymous > > > > > pages in an unbalanced manner. > > > > > > > > > > For example, when an application is pushed to the background and frozen, > > > > > it may not be opened for a long time, and we can safely reclaim the > > > > > application's anonymous pages, but we do not want to touch the file pages. > > > > Could you explain why? And also why do you need to swap out in that > > > > case? > > > When an application is frozen, it usually means that we predict that > > > it will not be used for a long time. In order to proactively save some > > > memory, our strategy will choose to compress the application's private > > > data into zram. And we will also select some of the cold application > > > data that we think is in zram and swap it out. > > > > > > The above operations assume that anonymous pages are private to the > > > application. After the application is frozen, compressing these pages > > > into zram can save memory to some extent without worrying about > > > frequent refaults. > > Why don't you rely on the default reclaim heuristics? In other words do > As I mentioned earlier, the madvise approach may not be suitable for my > needs. I was asking about default reclaim behavior not madvise here. > > you have any numbers showing that a selective reclaim results in a much > > In the mobile field, we have a core metric called application residency. As already pointed out in other reply, make sure you explain this so that we, who are not active in mobile field, can understand the metric, how it is affected by the tooling relying on this interface. > This mechanism can help us improve the application residency if we can > provide a good freeze detection and proactive reclamation policy. > > I can only provide specific data from our internal tests, and it may > be older data, and it tested using cgroup v1: > > In 12G ram phone, app residency improve from 29 to 38. cgroup v1 is in maintenance mode and new extension would need to pass even a higher feasibility test than v2 based interface. Also make sure that you are testing the current upstream kernel. Also let me stress out that you are proposing an extension to the user visible API and we will have to maintain that for ever. So make sure your justification is solid and understandable.
HI, 在 2023/11/9 20:40, Michal Hocko 写道: > [Some people who received this message don't often get email from mhocko@suse.com. Learn why this is important at https://aka.ms/LearnAboutSenderIdentification ] > > On Thu 09-11-23 18:50:36, Huan Yang wrote: >> 在 2023/11/9 18:39, Michal Hocko 写道: >>> [Some people who received this message don't often get email from mhocko@suse.com. Learn why this is important at https://aka.ms/LearnAboutSenderIdentification ] >>> >>> On Thu 09-11-23 18:29:03, Huan Yang wrote: >>>> HI Michal Hocko, >>>> >>>> Thanks for your suggestion. >>>> >>>> 在 2023/11/9 17:57, Michal Hocko 写道: >>>>> [Some people who received this message don't often get email from mhocko@suse.com. Learn why this is important at https://aka.ms/LearnAboutSenderIdentification ] >>>>> >>>>> On Thu 09-11-23 11:38:56, Huan Yang wrote: >>>>> [...] >>>>>>> If so, is it better only to reclaim private anonymous pages explicitly? >>>>>> Yes, in practice, we only proactively compress anonymous pages and do not >>>>>> want to touch file pages. >>>>> If that is the case and this is mostly application centric (which you >>>>> seem to be suggesting) then why don't you use madvise(MADV_PAGEOUT) >>>>> instead. >>>> Madvise may not be applicable in this scenario.(IMO) >>>> >>>> This feature is aimed at a core goal, which is to compress the anonymous >>>> pages >>>> of frozen applications. >>>> >>>> How to detect that an application is frozen and determine which pages can be >>>> safely reclaimed is the responsibility of the policy part. >>>> >>>> Setting madvise for an application is an active behavior, while the above >>>> policy >>>> is a passive approach.(If I misunderstood, please let me know if there is a >>>> better >>>> way to set madvise.) >>> You are proposing an extension to the pro-active reclaim interface so >>> this is an active behavior pretty much by definition. So I am really not >>> following you here. Your agent can simply scan the address space of the >>> application it is going to "freeze" and call pidfd_madvise(MADV_PAGEOUT) >>> on the private memory is that is really what you want/need. >> There is a key point here. We want to use the grouping policy of memcg >> to perform proactive reclamation with certain tendencies. Your >> suggestion is to reclaim memory by scanning the task process space. >> However, in the mobile field, memory is usually viewed at the >> granularity of an APP. > OK, sthis is likely a terminology gap on my end. By application you do > not really mean a process but rather a whole cgroup. That would have > been really useful to be explicit about. I'm sorry for the confusion. But, in reality, the example I gave was just the one we use here. In terms of policy, any reasonable method can be chosen to organize cgroups and reclaim memory with certain tendencies. But, let's continue the discussion assuming that memcg is grouped by application to avoid confusion. > >> Therefore, after an APP is frozen, we hope to reclaim memory uniformly >> according to the pre-grouped APP processes. >> >> Of course, as you suggested, madvise can also achieve this, but >> implementing it in the agent may be more complex.(In terms of >> achieving the same goal, using memcg to group all the processes of an >> APP and perform proactive reclamation is simpler than using madvise >> and scanning multiple processes of an application using an agent?) > It might be more involved but the primary question is whether it is > usable for the specific use case. 
> Madvise interface is not LRU aware but you are not really talking about
> that to be a requirement? So it would really help if you go deeper into
> details on how is the interface actually supposed to be used in your
> case.

In the mobile field we usually configure zram to compress anonymous
pages. With zram we can, in effect, expand the usable memory of a device
with limited hardware memory: with proper strategies, an 8GB RAM phone
can approximate the usage of a 12GB phone (or more).

In our strategy, we group memcgs by application. When the agent detects
that an application has entered the background, been frozen, and has not
been used for a long time, it slowly issues commands to reclaim that
application's anonymous pages through this interface, e.g.
`echo memory anon > memory.reclaim`.

> Also make sure to exaplain why you cannot use other existing interfaces.
> For example, why you simply don't decrease the limit of the frozen
> cgroup and rely on the normal reclaim process to evict the most cold

This is a question of reclamation tendency; simply decreasing the limit
of the frozen cgroup cannot achieve it.

> memory? What are you basing your anon vs. file proportion decision on?

When zram is configured and anonymous pages are reclaimed proactively,
the refault probability of anonymous pages is low while an application
stays frozen and is not reopened, and the cost of refaulting from zram is
relatively low.

However, file pages usually have shared properties, so even if an
application is frozen, other processes may still access its file pages.
If a limit is set and the reclamation hits file pages, it will cause a
certain amount of refault I/O, which is costly on mobile devices.

Therefore, we want a proactive reclamation interface that tends to
reclaim only anonymous pages rather than file pages. That way, more
application data can stay resident in the background, and a cold start
can be avoided when the application is reopened. (A cold start means the
application has to reload the required data and reinitialize its running
logic.)

> In other words more details, ideally with some numbers and make sure to
> describe why existing APIs cannot be used.
> --
> Michal Hocko
> SUSE Labs
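A rough sketch of the agent flow described in this message, assuming a hypothetical per-app cgroup path. Note that the `"<size> <swappiness>"` string written to `memory.reclaim` is the syntax proposed by this patchset, not something upstream `memory.reclaim` accepts today; `cgroup.freeze` is the standard cgroup v2 freezer knob.

```c
/*
 * Hedged sketch of the agent flow: freeze the app's cgroup, then ask
 * for anon-biased proactive reclaim using the proposed interface.
 */
#include <stdio.h>

static int cg_write(const char *cg, const char *file, const char *val)
{
	char path[256];
	FILE *f;

	snprintf(path, sizeof(path), "%s/%s", cg, file);
	f = fopen(path, "w");
	if (!f)
		return -1;
	fprintf(f, "%s\n", val);
	return fclose(f);
}

int main(void)
{
	const char *cg = "/sys/fs/cgroup/apps/com.example.app"; /* assumed */

	/* The app went to the background: freeze the whole cgroup. */
	if (cg_write(cg, "cgroup.freeze", "1"))
		perror("cgroup.freeze");

	/*
	 * Some time later: reclaim 100M with swappiness 200, i.e. anon
	 * only under the semantics proposed in this patchset.
	 */
	if (cg_write(cg, "memory.reclaim", "100M 200"))
		perror("memory.reclaim");

	return 0;
}
```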
在 2023/11/9 20:45, Michal Hocko 写道: > [Some people who received this message don't often get email from mhocko@suse.com. Learn why this is important at https://aka.ms/LearnAboutSenderIdentification ] > > On Thu 09-11-23 18:55:09, Huan Yang wrote: >> 在 2023/11/9 17:53, Michal Hocko 写道: >>> [Some people who received this message don't often get email from mhocko@suse.com. Learn why this is important at https://aka.ms/LearnAboutSenderIdentification ] >>> >>> On Thu 09-11-23 09:56:46, Huan Yang wrote: >>>> 在 2023/11/8 22:06, Michal Hocko 写道: >>>>> [Some people who received this message don't often get email from mhocko@suse.com. Learn why this is important at https://aka.ms/LearnAboutSenderIdentification ] >>>>> >>>>> On Wed 08-11-23 14:58:11, Huan Yang wrote: >>>>>> In some cases, we need to selectively reclaim file pages or anonymous >>>>>> pages in an unbalanced manner. >>>>>> >>>>>> For example, when an application is pushed to the background and frozen, >>>>>> it may not be opened for a long time, and we can safely reclaim the >>>>>> application's anonymous pages, but we do not want to touch the file pages. >>>>> Could you explain why? And also why do you need to swap out in that >>>>> case? >>>> When an application is frozen, it usually means that we predict that >>>> it will not be used for a long time. In order to proactively save some >>>> memory, our strategy will choose to compress the application's private >>>> data into zram. And we will also select some of the cold application >>>> data that we think is in zram and swap it out. >>>> >>>> The above operations assume that anonymous pages are private to the >>>> application. After the application is frozen, compressing these pages >>>> into zram can save memory to some extent without worrying about >>>> frequent refaults. >>> Why don't you rely on the default reclaim heuristics? In other words do >> As I mentioned earlier, the madvise approach may not be suitable for my >> needs. > I was asking about default reclaim behavior not madvise here. Sorry for the misunderstand. > >>> you have any numbers showing that a selective reclaim results in a much >> In the mobile field, we have a core metric called application residency. > As already pointed out in other reply, make sure you explain this so > that we, who are not active in mobile field, can understand the metric, > how it is affected by the tooling relying on this interface. OK. > >> This mechanism can help us improve the application residency if we can >> provide a good freeze detection and proactive reclamation policy. >> >> I can only provide specific data from our internal tests, and it may >> be older data, and it tested using cgroup v1: >> >> In 12G ram phone, app residency improve from 29 to 38. > cgroup v1 is in maintenance mode and new extension would need to pass > even a higher feasibility test than v2 based interface. Also make sure > that you are testing the current upstream kernel. OK, if patchset v2 expect, I will change work into cgroup v2 and give test data. > > Also let me stress out that you are proposing an extension to the user > visible API and we will have to maintain that for ever. So make sure > your justification is solid and understandable. Thank you very much for your explanation. Let's focus on these discussions in another email. > -- > Michal Hocko > SUSE Labs
On Thu 09-11-23 21:07:29, Huan Yang wrote: [...] > > > Of course, as you suggested, madvise can also achieve this, but > > > implementing it in the agent may be more complex.(In terms of > > > achieving the same goal, using memcg to group all the processes of an > > > APP and perform proactive reclamation is simpler than using madvise > > > and scanning multiple processes of an application using an agent?) > > It might be more involved but the primary question is whether it is > > usable for the specific use case. Madvise interface is not LRU aware but > > you are not really talking about that to be a requirement? So it would > > really help if you go deeper into details on how is the interface > > actually supposed to be used in your case. > In mobile field, we usually configure zram to compress anonymous page. > We can approximate to expand memory usage with limited hardware memory > by using zram. > > With proper strategies, an 8GB RAM phone can approximate the usage of a 12GB > phone > (or more). > > In our strategy, we group memcg by application. When the agent detects that > an > application has entered the background, then frozen, and has not been used > for a long time, > the agent will slowly issue commands to reclaim the anonymous page of that > application. > > With this interface, `echo memory anon > memory.reclaim` This doesn't really answer my questions above. > > Also make sure to exaplain why you cannot use other existing interfaces. > > For example, why you simply don't decrease the limit of the frozen > > cgroup and rely on the normal reclaim process to evict the most cold > This is a question of reclamation tendency, and simply decreasing the limit > of the frozen cgroup cannot achieve this. Why? > > memory? What are you basing your anon vs. file proportion decision on? > When zram is configured and anonymous pages are reclaimed proactively, the > refault > probability of anonymous pages is low when an application is frozen and not > reopened. > Also, the cost of refaulting from zram is relatively low. > > However, file pages usually have shared properties, so even if an > application is frozen, > other processes may still access the file pages. If a limit is set and the > reclamation encounters > file pages, it will cause a certain amount of refault I/O, which is costly > for mobile devices. Two points here (and the reason why I am repeatedly asking for some data) 1) are you really seeing shared and actively used page cache pages being reclaimed? 2) Is the refault IO really a problem. What kind of storage those phone have that this is more significant than potentially GB of compressed anonymous memory which would need CPU to refaulted back. I mean do you have any actual numbers to show that the default reclaim strategy would lead to a less utilized or less performant system?
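One way to produce the numbers asked for above would be to sample the per-cgroup refault counters around a reclaim pass, roughly as below. The cgroup path is an assumption; `workingset_refault_anon` and `workingset_refault_file` are the cgroup v2 `memory.stat` counters (present on reasonably recent kernels).

```c
/*
 * Hedged sketch: sample workingset refault counters from the app
 * cgroup's memory.stat before and after a proactive reclaim pass plus
 * an app-restart test, so file vs. anon refaults can be compared.
 */
#include <stdio.h>
#include <string.h>

struct refaults {
	unsigned long long anon;
	unsigned long long file;
};

static int read_refaults(const char *cg, struct refaults *r)
{
	char path[256], name[64];
	unsigned long long val;
	FILE *f;

	snprintf(path, sizeof(path), "%s/memory.stat", cg);
	f = fopen(path, "r");
	if (!f)
		return -1;
	while (fscanf(f, "%63s %llu", name, &val) == 2) {
		if (!strcmp(name, "workingset_refault_anon"))
			r->anon = val;
		else if (!strcmp(name, "workingset_refault_file"))
			r->file = val;
	}
	fclose(f);
	return 0;
}

int main(void)
{
	const char *cg = "/sys/fs/cgroup/apps/com.example.app"; /* assumed */
	struct refaults before = {0}, after = {0};

	read_refaults(cg, &before);
	/* ... run the proactive reclaim pass and the app-restart test ... */
	read_refaults(cg, &after);

	printf("refaults: file +%llu, anon +%llu\n",
	       after.file - before.file, after.anon - before.anon);
	return 0;
}
```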
Huan Yang <link@vivo.com> writes: > 在 2023/11/9 18:39, Michal Hocko 写道: >> [Some people who received this message don't often get email from mhocko@suse.com. Learn why this is important at https://aka.ms/LearnAboutSenderIdentification ] >> >> On Thu 09-11-23 18:29:03, Huan Yang wrote: >>> HI Michal Hocko, >>> >>> Thanks for your suggestion. >>> >>> 在 2023/11/9 17:57, Michal Hocko 写道: >>>> [Some people who received this message don't often get email from mhocko@suse.com. Learn why this is important at https://aka.ms/LearnAboutSenderIdentification ] >>>> >>>> On Thu 09-11-23 11:38:56, Huan Yang wrote: >>>> [...] >>>>>> If so, is it better only to reclaim private anonymous pages explicitly? >>>>> Yes, in practice, we only proactively compress anonymous pages and do not >>>>> want to touch file pages. >>>> If that is the case and this is mostly application centric (which you >>>> seem to be suggesting) then why don't you use madvise(MADV_PAGEOUT) >>>> instead. >>> Madvise may not be applicable in this scenario.(IMO) >>> >>> This feature is aimed at a core goal, which is to compress the anonymous >>> pages >>> of frozen applications. >>> >>> How to detect that an application is frozen and determine which pages can be >>> safely reclaimed is the responsibility of the policy part. >>> >>> Setting madvise for an application is an active behavior, while the above >>> policy >>> is a passive approach.(If I misunderstood, please let me know if there is a >>> better >>> way to set madvise.) >> You are proposing an extension to the pro-active reclaim interface so >> this is an active behavior pretty much by definition. So I am really not >> following you here. Your agent can simply scan the address space of the >> application it is going to "freeze" and call pidfd_madvise(MADV_PAGEOUT) >> on the private memory is that is really what you want/need. > There is a key point here. We want to use the grouping policy of memcg > to perform > proactive reclamation with certain tendencies. Your suggestion is to > reclaim memory > by scanning the task process space. However, in the mobile field, > memory is usually > viewed at the granularity of an APP. > > Therefore, after an APP is frozen, we hope to reclaim memory uniformly > according > to the pre-grouped APP processes. > > Of course, as you suggested, madvise can also achieve this, but > implementing it in > the agent may be more complex.(In terms of achieving the same goal, > using memcg > to group all the processes of an APP and perform proactive reclamation > is simpler > than using madvise and scanning multiple processes of an application > using an agent?) I still think that it's not too complex to use process_madvise() to do this. For each process of the application, the agent can read /proc/PID/maps to get all anonymous address ranges, then call process_madvise(MADV_PAGEOUT) to reclaim pages. This can even filter out shared anonymous pages. Does this work for you? -- Best Regards, Huang, Ying
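For concreteness, here is a minimal sketch of the suggestion above, assuming headers that expose `SYS_pidfd_open` and `SYS_process_madvise` (Linux 5.10+). It keeps only private mappings with no backing path, which also skips `[heap]`, `[stack]` and shared anonymous segments; a real agent would need to be more careful.

```c
/*
 * Hedged sketch, not the patchset's code: walk /proc/PID/maps, keep
 * only private mappings with no backing file, and page them out with
 * process_madvise(MADV_PAGEOUT).
 */
#include <stdio.h>
#include <stdlib.h>
#include <sys/syscall.h>
#include <sys/types.h>
#include <sys/uio.h>
#include <unistd.h>

#ifndef MADV_PAGEOUT
#define MADV_PAGEOUT 21			/* <asm-generic/mman-common.h> */
#endif

/* Collect private anonymous ranges of @pid into @iov, return the count. */
size_t collect_private_anon_ranges(pid_t pid, struct iovec *iov, size_t max)
{
	char path[64], line[512], perms[8], file[256];
	unsigned long start, end;
	size_t n = 0;
	FILE *f;

	snprintf(path, sizeof(path), "/proc/%d/maps", pid);
	f = fopen(path, "r");
	if (!f)
		return 0;
	while (n < max && fgets(line, sizeof(line), f)) {
		file[0] = '\0';
		if (sscanf(line, "%lx-%lx %4s %*s %*s %*s %255s",
			   &start, &end, perms, file) < 3)
			continue;
		if (perms[3] != 'p' || file[0] != '\0')
			continue;	/* not a private anon mapping */
		iov[n].iov_base = (void *)start;
		iov[n].iov_len = end - start;
		n++;
	}
	fclose(f);
	return n;
}

int main(int argc, char **argv)
{
	struct iovec iov[512];
	pid_t pid;
	int pidfd;
	size_t n, i;

	if (argc < 2)
		return 1;
	pid = atoi(argv[1]);
	pidfd = syscall(SYS_pidfd_open, pid, 0);
	n = collect_private_anon_ranges(pid, iov, 512);
	if (pidfd < 0 || n == 0)
		return 1;
	for (i = 0; i < n; i++)		/* one range per call, for clarity */
		if (syscall(SYS_process_madvise, pidfd, &iov[i], 1,
			    MADV_PAGEOUT, 0) < 0)
			perror("process_madvise");
	return 0;
}
```

Each call here passes a single iovec for clarity; more ranges can be batched per call up to the usual iovec limit.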
在 2023/11/10 9:19, Huang, Ying 写道: > [Some people who received this message don't often get email from ying.huang@intel.com. Learn why this is important at https://aka.ms/LearnAboutSenderIdentification ] > > Huan Yang <link@vivo.com> writes: > >> 在 2023/11/9 18:39, Michal Hocko 写道: >>> [Some people who received this message don't often get email from mhocko@suse.com. Learn why this is important at https://aka.ms/LearnAboutSenderIdentification ] >>> >>> On Thu 09-11-23 18:29:03, Huan Yang wrote: >>>> HI Michal Hocko, >>>> >>>> Thanks for your suggestion. >>>> >>>> 在 2023/11/9 17:57, Michal Hocko 写道: >>>>> [Some people who received this message don't often get email from mhocko@suse.com. Learn why this is important at https://aka.ms/LearnAboutSenderIdentification ] >>>>> >>>>> On Thu 09-11-23 11:38:56, Huan Yang wrote: >>>>> [...] >>>>>>> If so, is it better only to reclaim private anonymous pages explicitly? >>>>>> Yes, in practice, we only proactively compress anonymous pages and do not >>>>>> want to touch file pages. >>>>> If that is the case and this is mostly application centric (which you >>>>> seem to be suggesting) then why don't you use madvise(MADV_PAGEOUT) >>>>> instead. >>>> Madvise may not be applicable in this scenario.(IMO) >>>> >>>> This feature is aimed at a core goal, which is to compress the anonymous >>>> pages >>>> of frozen applications. >>>> >>>> How to detect that an application is frozen and determine which pages can be >>>> safely reclaimed is the responsibility of the policy part. >>>> >>>> Setting madvise for an application is an active behavior, while the above >>>> policy >>>> is a passive approach.(If I misunderstood, please let me know if there is a >>>> better >>>> way to set madvise.) >>> You are proposing an extension to the pro-active reclaim interface so >>> this is an active behavior pretty much by definition. So I am really not >>> following you here. Your agent can simply scan the address space of the >>> application it is going to "freeze" and call pidfd_madvise(MADV_PAGEOUT) >>> on the private memory is that is really what you want/need. >> There is a key point here. We want to use the grouping policy of memcg >> to perform >> proactive reclamation with certain tendencies. Your suggestion is to >> reclaim memory >> by scanning the task process space. However, in the mobile field, >> memory is usually >> viewed at the granularity of an APP. >> >> Therefore, after an APP is frozen, we hope to reclaim memory uniformly >> according >> to the pre-grouped APP processes. >> >> Of course, as you suggested, madvise can also achieve this, but >> implementing it in >> the agent may be more complex.(In terms of achieving the same goal, >> using memcg >> to group all the processes of an APP and perform proactive reclamation >> is simpler >> than using madvise and scanning multiple processes of an application >> using an agent?) > I still think that it's not too complex to use process_madvise() to do > this. For each process of the application, the agent can read > /proc/PID/maps to get all anonymous address ranges, then call > process_madvise(MADV_PAGEOUT) to reclaim pages. This can even filter > out shared anonymous pages. Does this work for you? Thanks for this suggestion. This way can avoid touch shared anonymous, it's pretty well. But, I have some doubts about this, CPU resources are usually limited in embedded devices, and power consumption must also be taken into consideration. 
If this approach is adopted, the agent has to periodically scan each
frozen application and issue pageout on its address space. Wouldn't the
frequency of that active scanning make it more complex, and less suitable
for embedded devices, than reclaim based on memcg grouping?

In addition, without the LRU it is difficult to reclaim only the
partially cold anonymous data of a frozen application. For example, if I
only want to proactively reclaim 100MB of anonymous pages, the proactive
reclaim interface can rely on the LRU to pick the 100MB of coldest
anonymous pages, whereas madvise cannot do that. (If I have misunderstood
something, please correct me.)

>
> --
> Best Regards,
> Huang, Ying
在 2023/11/9 21:46, Michal Hocko 写道: > On Thu 09-11-23 21:07:29, Huan Yang wrote: > [...] >>>> Of course, as you suggested, madvise can also achieve this, but >>>> implementing it in the agent may be more complex.(In terms of >>>> achieving the same goal, using memcg to group all the processes of an >>>> APP and perform proactive reclamation is simpler than using madvise >>>> and scanning multiple processes of an application using an agent?) >>> It might be more involved but the primary question is whether it is >>> usable for the specific use case. Madvise interface is not LRU aware but >>> you are not really talking about that to be a requirement? So it would >>> really help if you go deeper into details on how is the interface >>> actually supposed to be used in your case. >> In mobile field, we usually configure zram to compress anonymous page. >> We can approximate to expand memory usage with limited hardware memory >> by using zram. >> >> With proper strategies, an 8GB RAM phone can approximate the usage of a 12GB >> phone >> (or more). >> >> In our strategy, we group memcg by application. When the agent detects that >> an >> application has entered the background, then frozen, and has not been used >> for a long time, >> the agent will slowly issue commands to reclaim the anonymous page of that >> application. >> >> With this interface, `echo memory anon > memory.reclaim` > This doesn't really answer my questions above. > >>> Also make sure to exaplain why you cannot use other existing interfaces. >>> For example, why you simply don't decrease the limit of the frozen >>> cgroup and rely on the normal reclaim process to evict the most cold >> This is a question of reclamation tendency, and simply decreasing the limit >> of the frozen cgroup cannot achieve this. > Why? Can I ask how to limit the reclamation to only anonymous pages using the limit? >>> memory? What are you basing your anon vs. file proportion decision on? >> When zram is configured and anonymous pages are reclaimed proactively, the >> refault >> probability of anonymous pages is low when an application is frozen and not >> reopened. >> Also, the cost of refaulting from zram is relatively low. >> >> However, file pages usually have shared properties, so even if an >> application is frozen, >> other processes may still access the file pages. If a limit is set and the >> reclamation encounters >> file pages, it will cause a certain amount of refault I/O, which is costly >> for mobile devices. > Two points here (and the reason why I am repeatedly asking for some > data) 1) are you really seeing shared and actively used page cache pages When we call the current proactive reclamation interface to actively reclaim memory, the debug program can usually observe that file pages are partially reclaimed. However, when we start other APPs for testing(the current reclaimed APP is in the background), the trace shows that there is a lot of block I/O for the background application. > being reclaimed? 2) Is the refault IO really a problem. What kind of > storage those phone have that this is more significant than potentially > GB of compressed anonymous memory which would need CPU to refaulted Phone typically use UFS. > back. I mean do you have any actual numbers to show that the default > reclaim strategy would lead to a less utilized or less performant > system? Also, When the application enters the foreground, the startup speed may be slower. Also trace show that here are a lot of block I/O. 
(usually 1000+ I/O requests and 200+ ms of I/O time). We usually observe
very little block I/O caused by zram refaults; zram is much faster
(read: 1698.39 MB/s, write: 995.109 MB/s) than random reads from storage
(read: 48.1907 MB/s, write: 49.1654 MB/s). These numbers come from
zram-perf, which I modified slightly to also test UFS.

Therefore, if proactive reclamation touches many file pages, the
application may become slow when it is reopened.
Huan Yang <link@vivo.com> writes: > 在 2023/11/10 9:19, Huang, Ying 写道: >> [Some people who received this message don't often get email from ying.huang@intel.com. Learn why this is important at https://aka.ms/LearnAboutSenderIdentification ] >> >> Huan Yang <link@vivo.com> writes: >> >>> 在 2023/11/9 18:39, Michal Hocko 写道: >>>> [Some people who received this message don't often get email from mhocko@suse.com. Learn why this is important at https://aka.ms/LearnAboutSenderIdentification ] >>>> >>>> On Thu 09-11-23 18:29:03, Huan Yang wrote: >>>>> HI Michal Hocko, >>>>> >>>>> Thanks for your suggestion. >>>>> >>>>> 在 2023/11/9 17:57, Michal Hocko 写道: >>>>>> [Some people who received this message don't often get email from mhocko@suse.com. Learn why this is important at https://aka.ms/LearnAboutSenderIdentification ] >>>>>> >>>>>> On Thu 09-11-23 11:38:56, Huan Yang wrote: >>>>>> [...] >>>>>>>> If so, is it better only to reclaim private anonymous pages explicitly? >>>>>>> Yes, in practice, we only proactively compress anonymous pages and do not >>>>>>> want to touch file pages. >>>>>> If that is the case and this is mostly application centric (which you >>>>>> seem to be suggesting) then why don't you use madvise(MADV_PAGEOUT) >>>>>> instead. >>>>> Madvise may not be applicable in this scenario.(IMO) >>>>> >>>>> This feature is aimed at a core goal, which is to compress the anonymous >>>>> pages >>>>> of frozen applications. >>>>> >>>>> How to detect that an application is frozen and determine which pages can be >>>>> safely reclaimed is the responsibility of the policy part. >>>>> >>>>> Setting madvise for an application is an active behavior, while the above >>>>> policy >>>>> is a passive approach.(If I misunderstood, please let me know if there is a >>>>> better >>>>> way to set madvise.) >>>> You are proposing an extension to the pro-active reclaim interface so >>>> this is an active behavior pretty much by definition. So I am really not >>>> following you here. Your agent can simply scan the address space of the >>>> application it is going to "freeze" and call pidfd_madvise(MADV_PAGEOUT) >>>> on the private memory is that is really what you want/need. >>> There is a key point here. We want to use the grouping policy of memcg >>> to perform >>> proactive reclamation with certain tendencies. Your suggestion is to >>> reclaim memory >>> by scanning the task process space. However, in the mobile field, >>> memory is usually >>> viewed at the granularity of an APP. >>> >>> Therefore, after an APP is frozen, we hope to reclaim memory uniformly >>> according >>> to the pre-grouped APP processes. >>> >>> Of course, as you suggested, madvise can also achieve this, but >>> implementing it in >>> the agent may be more complex.(In terms of achieving the same goal, >>> using memcg >>> to group all the processes of an APP and perform proactive reclamation >>> is simpler >>> than using madvise and scanning multiple processes of an application >>> using an agent?) >> I still think that it's not too complex to use process_madvise() to do >> this. For each process of the application, the agent can read >> /proc/PID/maps to get all anonymous address ranges, then call >> process_madvise(MADV_PAGEOUT) to reclaim pages. This can even filter >> out shared anonymous pages. Does this work for you? > > Thanks for this suggestion. This way can avoid touch shared anonymous, it's > pretty well. 
> But, I have some doubts about this, CPU resources are usually limited
> in embedded devices, and power consumption must also be taken into
> consideration.
>
> If this approach is adopted, the agent needs to periodically scan
> frozen applications and set pageout for the address space. Is the
> frequency of this active operation more complex and unsuitable for
> embedded devices compared to reclamation based on memcg grouping
> features?

In a memcg based solution, when will you start the proactive reclaiming?
You can just replace the reclaiming part of the solution from memcg
proactive reclaiming to process_madvise(MADV_PAGEOUT), because you can
get the PIDs in a memcg. Is it possible?

> In addition, without LRU, it is difficult to control the reclamation
> of only partially cold anonymous page data of frozen applications. For
> example, if I only want to proactively reclaim 100MB of anonymous pages
> and issue the proactive reclamation interface, we can use the LRU
> feature to only reclaim 100MB of cold anonymous pages. However, this
> cannot be achieved through madvise.(If I have misunderstood something,
> please correct me.)

IIUC, it should be OK to reclaim all private anonymous pages of an
application in your specific use case? If you really want to restrict
the number of pages reclaimed, that is possible too: you can restrict the
size of the address range passed to process_madvise(MADV_PAGEOUT) and
check the RSS of the application. The accuracy of the number reclaimed
isn't good, but I think it should be OK in practice.

BTW: how do you know the number of pages to be reclaimed proactively in
the memcg proactive reclaiming based solution?

--
Best Regards,
Huang, Ying
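A sketch of the budgeted variant outlined above, reusing the hypothetical `collect_private_anon_ranges()` helper from the earlier process_madvise sketch and checking VmRSS before and after. The accuracy caveat stands: RSS moves for many reasons, so this only gives a rough figure.

```c
/*
 * Hedged sketch: hand at most @budget_bytes of private anonymous
 * ranges to process_madvise(MADV_PAGEOUT) in one pass, then read VmRSS
 * to estimate the effect.
 */
#include <stdio.h>
#include <sys/syscall.h>
#include <sys/types.h>
#include <sys/uio.h>
#include <unistd.h>

#ifndef MADV_PAGEOUT
#define MADV_PAGEOUT 21
#endif

/* Hypothetical helper from the earlier /proc/PID/maps sketch. */
extern size_t collect_private_anon_ranges(pid_t pid, struct iovec *iov,
					  size_t max);

static long read_vmrss_kb(pid_t pid)
{
	char path[64], line[256];
	long kb = -1;
	FILE *f;

	snprintf(path, sizeof(path), "/proc/%d/status", pid);
	f = fopen(path, "r");
	if (!f)
		return -1;
	while (fgets(line, sizeof(line), f))
		if (sscanf(line, "VmRSS: %ld kB", &kb) == 1)
			break;
	fclose(f);
	return kb;
}

int pageout_budget(int pidfd, pid_t pid, size_t budget_bytes)
{
	struct iovec iov[512];
	size_t n = collect_private_anon_ranges(pid, iov, 512);
	size_t i, used = 0;
	long before = read_vmrss_kb(pid), after;

	for (i = 0; i < n && used < budget_bytes; i++) {
		if (used + iov[i].iov_len > budget_bytes)
			iov[i].iov_len = budget_bytes - used; /* trim last */
		used += iov[i].iov_len;
		if (syscall(SYS_process_madvise, pidfd, &iov[i], 1,
			    MADV_PAGEOUT, 0) < 0)
			return -1;
	}
	after = read_vmrss_kb(pid);
	printf("paged out ~%zu bytes, VmRSS %ld kB -> %ld kB\n",
	       used, before, after);
	return 0;
}
```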
在 2023/11/10 12:00, Huang, Ying 写道: > [Some people who received this message don't often get email from ying.huang@intel.com. Learn why this is important at https://aka.ms/LearnAboutSenderIdentification ] > > Huan Yang <link@vivo.com> writes: > >> 在 2023/11/10 9:19, Huang, Ying 写道: >>> [Some people who received this message don't often get email from ying.huang@intel.com. Learn why this is important at https://aka.ms/LearnAboutSenderIdentification ] >>> >>> Huan Yang <link@vivo.com> writes: >>> >>>> 在 2023/11/9 18:39, Michal Hocko 写道: >>>>> [Some people who received this message don't often get email from mhocko@suse.com. Learn why this is important at https://aka.ms/LearnAboutSenderIdentification ] >>>>> >>>>> On Thu 09-11-23 18:29:03, Huan Yang wrote: >>>>>> HI Michal Hocko, >>>>>> >>>>>> Thanks for your suggestion. >>>>>> >>>>>> 在 2023/11/9 17:57, Michal Hocko 写道: >>>>>>> [Some people who received this message don't often get email from mhocko@suse.com. Learn why this is important at https://aka.ms/LearnAboutSenderIdentification ] >>>>>>> >>>>>>> On Thu 09-11-23 11:38:56, Huan Yang wrote: >>>>>>> [...] >>>>>>>>> If so, is it better only to reclaim private anonymous pages explicitly? >>>>>>>> Yes, in practice, we only proactively compress anonymous pages and do not >>>>>>>> want to touch file pages. >>>>>>> If that is the case and this is mostly application centric (which you >>>>>>> seem to be suggesting) then why don't you use madvise(MADV_PAGEOUT) >>>>>>> instead. >>>>>> Madvise may not be applicable in this scenario.(IMO) >>>>>> >>>>>> This feature is aimed at a core goal, which is to compress the anonymous >>>>>> pages >>>>>> of frozen applications. >>>>>> >>>>>> How to detect that an application is frozen and determine which pages can be >>>>>> safely reclaimed is the responsibility of the policy part. >>>>>> >>>>>> Setting madvise for an application is an active behavior, while the above >>>>>> policy >>>>>> is a passive approach.(If I misunderstood, please let me know if there is a >>>>>> better >>>>>> way to set madvise.) >>>>> You are proposing an extension to the pro-active reclaim interface so >>>>> this is an active behavior pretty much by definition. So I am really not >>>>> following you here. Your agent can simply scan the address space of the >>>>> application it is going to "freeze" and call pidfd_madvise(MADV_PAGEOUT) >>>>> on the private memory is that is really what you want/need. >>>> There is a key point here. We want to use the grouping policy of memcg >>>> to perform >>>> proactive reclamation with certain tendencies. Your suggestion is to >>>> reclaim memory >>>> by scanning the task process space. However, in the mobile field, >>>> memory is usually >>>> viewed at the granularity of an APP. >>>> >>>> Therefore, after an APP is frozen, we hope to reclaim memory uniformly >>>> according >>>> to the pre-grouped APP processes. >>>> >>>> Of course, as you suggested, madvise can also achieve this, but >>>> implementing it in >>>> the agent may be more complex.(In terms of achieving the same goal, >>>> using memcg >>>> to group all the processes of an APP and perform proactive reclamation >>>> is simpler >>>> than using madvise and scanning multiple processes of an application >>>> using an agent?) >>> I still think that it's not too complex to use process_madvise() to do >>> this. For each process of the application, the agent can read >>> /proc/PID/maps to get all anonymous address ranges, then call >>> process_madvise(MADV_PAGEOUT) to reclaim pages. 
This can even filter >>> out shared anonymous pages. Does this work for you? >> Thanks for this suggestion. This way can avoid touch shared anonymous, it's >> pretty well. But, I have some doubts about this, CPU resources are >> usually limited in >> embedded devices, and power consumption must also be taken into >> consideration. >> >> If this approach is adopted, the agent needs to periodically scan >> frozen applications >> and set pageout for the address space. Is the frequency of this active >> operation more >> complex and unsuitable for embedded devices compared to reclamation based on >> memcg grouping features? > In memcg based solution, when will you start the proactive reclaiming? > You can just replace the reclaiming part of the solution from memcg > proactive reclaiming to process_madvise(MADV_PAGEOUT). Because you can > get PIDs in a memcg. Is it possible? > >> In addition, without LRU, it is difficult to control the reclamation >> of only partially cold >> anonymous page data of frozen applications. For example, if I only >> want to proactively >> reclaim 100MB of anonymous pages and issue the proactive reclamation >> interface, >> we can use the LRU feature to only reclaim 100MB of cold anonymous pages. >> However, this cannot be achieved through madvise.(If I have >> misunderstood something, >> please correct me.) > IIUC, it should be OK to reclaim all private anonymous pages of an > application in your specific use case? If you really want to restrict This is a gradual process, It will not reclaim all anonymous pages at once. > the number of pages reclaimed, it's possible too. You can restrict the > size of address range to call process_madvise(MADV_PAGEOUT), and check > the RSS of the application. The accuracy of the number reclaimed isn't > good. But I think that it should OK in practice? If you only want to reclaim all anonymous memory, this can indeed be done, and fast. :) > > BTW: how do you know the number of pages to be reclaimed proactively in > memcg proactive reclaiming based solution? One point here is that we are not sure how long the frozen application will be opened, it could be 10 minutes, an hour, or even days. So we need to predict and try, gradually reclaim anonymous pages in proportion, preferably based on the LRU algorithm. For example, if the application has been frozen for 10 minutes, reclaim 5% of anonymous pages; 30min:25%anon, 1hour:75%, 1day:100%. It is even more complicated as it requires adding a mechanism for predicting failure penalties. > > -- > Best Regards, > Huang, Ying
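The escalating schedule sketched in the mail above could look roughly like this on the agent side, assuming the hypothetical per-app cgroup path again. The `anon` counter is read from cgroup v2 `memory.stat`, and the trailing swappiness value in the `memory.reclaim` write is the extension proposed by this patchset, not upstream behavior.

```c
/*
 * Hedged sketch of the time-based schedule: the longer the app has
 * been frozen, the larger the fraction of its anonymous memory the
 * agent asks the kernel to reclaim.
 */
#include <stdio.h>
#include <string.h>

static unsigned long long read_anon_bytes(const char *cg)
{
	char path[256], name[64];
	unsigned long long val, anon = 0;
	FILE *f;

	snprintf(path, sizeof(path), "%s/memory.stat", cg);
	f = fopen(path, "r");
	if (!f)
		return 0;
	while (fscanf(f, "%63s %llu", name, &val) == 2)
		if (!strcmp(name, "anon")) {
			anon = val;
			break;
		}
	fclose(f);
	return anon;
}

/* Thresholds taken from the example in the mail above. */
static int frozen_fraction_pct(unsigned long frozen_secs)
{
	if (frozen_secs >= 24 * 3600)
		return 100;
	if (frozen_secs >= 3600)
		return 75;
	if (frozen_secs >= 30 * 60)
		return 25;
	if (frozen_secs >= 10 * 60)
		return 5;
	return 0;
}

int reclaim_step(const char *cg, unsigned long frozen_secs)
{
	unsigned long long target =
		read_anon_bytes(cg) * frozen_fraction_pct(frozen_secs) / 100;
	char path[256];
	FILE *f;

	if (!target)
		return 0;
	snprintf(path, sizeof(path), "%s/memory.reclaim", cg);
	f = fopen(path, "w");
	if (!f)
		return -1;
	/* Proposed syntax: "<size> <swappiness>"; 200 means anon only. */
	fprintf(f, "%llu 200\n", target);
	return fclose(f);
}
```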
On Fri 10-11-23 11:48:49, Huan Yang wrote: [...] > Also, When the application enters the foreground, the startup speed > may be slower. Also trace show that here are a lot of block I/O. > (usually 1000+ IO count and 200+ms IO Time) We usually observe very > little block I/O caused by zram refault.(read: 1698.39MB/s, write: > 995.109MB/s), usually, it is faster than random disk reads.(read: > 48.1907MB/s write: 49.1654MB/s). This test by zram-perf and I change a > little to test UFS. > > Therefore, if the proactive reclamation encounters many file pages, > the application may become slow when it is opened. OK, this is an interesting information. From the above it seems that storage based IO refaults are order of magnitude more expensive than swap (zram in this case). That means that the memory reclaim should _in general_ prefer anonymous memory reclaim over refaulted page cache, right? Or is there any reason why "frozen" applications are any different in this case? Our traditional interface to control the anon vs. file balance has been swappiness. It is not the best interface and it has its flaws but have you experimented with the global swappiness to express that preference? What were your observations? Please note that the behavior might be really different with different kernel versions so I would really stress out that testing with the current Linus (or akpm) tree is necessary. Anyway, the more I think about that the more I am convinced that explicit anon/file extension for the memory.reclaim interface is just a wrong way to address a more fundamental underlying problem. That is, the default reclaim choice over anon vs file preference should consider the cost of the refaulting IO. This is more a property of the underlying storage than a global characteristic. In other words, say you have mutlitple storages, one that is a network based with a high latency and other that is a local fast SSD. Reclaiming a page backed by the slower storage is going to be more expensive to refault than the one backed by the fast storage. So even page cache pages are not really all the same. It is quite likely that a IO cost aspect is not really easy to integrate into the memory reclaim but it seems to me this is a better way to focus on for a better long term solution. Our existing refaulting infrastructure should help in that respect. Also MGLRU could fit for that purpose better than the traditional LRU based reclaim as the higher generations could be used for more more expensive pages.
On Fri 10-11-23 14:21:17, Huan Yang wrote:
[...]
> > BTW: how do you know the number of pages to be reclaimed proactively in
> > memcg proactive reclaiming based solution?
>
> One point here is that we are not sure how long the frozen application
> will be opened, it could be 10 minutes, an hour, or even days. So we
> need to predict and try, gradually reclaim anonymous pages in
> proportion, preferably based on the LRU algorithm. For example, if
> the application has been frozen for 10 minutes, reclaim 5% of
> anonymous pages; 30min:25%anon, 1hour:75%, 1day:100%. It is even more
> complicated as it requires adding a mechanism for predicting failure
> penalties.

Why would you make your reclaiming decisions based on time rather than
the actual memory demand? I can see how pro-active reclaim could make
head room for an unexpected memory pressure, but applying more pressure
just because of inactivity sounds rather dubious to me TBH. Why can't
you simply wait for the external memory pressure (e.g. from kswapd) to
deal with that based on the demand?
在 2023/11/10 20:32, Michal Hocko 写道: > On Fri 10-11-23 14:21:17, Huan Yang wrote: > [...] >>> BTW: how do you know the number of pages to be reclaimed proactively in >>> memcg proactive reclaiming based solution? >> One point here is that we are not sure how long the frozen application >> will be opened, it could be 10 minutes, an hour, or even days. So we >> need to predict and try, gradually reclaim anonymous pages in >> proportion, preferably based on the LRU algorithm. For example, if >> the application has been frozen for 10 minutes, reclaim 5% of >> anonymous pages; 30min:25%anon, 1hour:75%, 1day:100%. It is even more >> complicated as it requires adding a mechanism for predicting failure >> penalties. > Why would make your reclaiming decisions based on time rather than the > actual memory demand? I can see how a pro-active reclaim could make a > head room for an unexpected memory pressure but applying more pressure > just because of inactivity sound rather dubious to me TBH. Why cannot > you simply wait for the external memory pressure (e.g. from kswapd) to > deal with that based on the demand? Because the current kswapd and direct memory reclamation are a passive memory reclamation based on the watermark, and in the event of triggering these reclamation scenarios, the smoothness of the phone application cannot be guaranteed. (We often observe that when the above reclamation is triggered, there is a delay in the application startup, usually accompanied by block I/O, and some concurrency issues caused by lock design.) To ensure the smoothness of application startup, we have a module in Android called LMKD (formerly known as lowmemorykiller). Based on a certain algorithm, LMKD detects if application startup may be delayed and proactively kills inactive applications. (For example, based on factors such as refault IO and swap usage.) However, this behavior may cause the applications we want to protect to be killed, which will result in users having to wait for them to restart when they are reopened, which may affect the user experience.(For example, if the user wants to reopen the application interface they are working on, or re-enter the order interface they were viewing.) Therefore, the above proactive reclamation interface is designed to compress memory types with minimal cost for upper-layer applications based on reasonable strategies, in order to avoid triggering LMKD or memory reclamation as much as possible, even if it is not balanced.
在 2023/11/10 20:24, Michal Hocko 写道: > On Fri 10-11-23 11:48:49, Huan Yang wrote: > [...] >> Also, When the application enters the foreground, the startup speed >> may be slower. Also trace show that here are a lot of block I/O. >> (usually 1000+ IO count and 200+ms IO Time) We usually observe very >> little block I/O caused by zram refault.(read: 1698.39MB/s, write: >> 995.109MB/s), usually, it is faster than random disk reads.(read: >> 48.1907MB/s write: 49.1654MB/s). This test by zram-perf and I change a >> little to test UFS. >> >> Therefore, if the proactive reclamation encounters many file pages, >> the application may become slow when it is opened. > OK, this is an interesting information. From the above it seems that > storage based IO refaults are order of magnitude more expensive than > swap (zram in this case). That means that the memory reclaim should > _in general_ prefer anonymous memory reclaim over refaulted page cache, > right? Or is there any reason why "frozen" applications are any > different in this case? Frozen applications mean that the application process is no longer active, so once its private anonymous page data is swapped out, the anonymous pages will not be refaulted until the application becomes active again. On the contrary, page caches are usually shared. Even if the application that first read the file is no longer active, other processes may still read the file. Therefore, it is not reasonable to use the proactive reclamation interface to reclaim page caches without considering memory pressure. Then, considering the recycling cost of anonymous pages and page cache, the idea of unbalanced recycling as described above is generated. > > Our traditional interface to control the anon vs. file balance has been > swappiness. It is not the best interface and it has its flaws but > have you experimented with the global swappiness to express that > preference? What were your observations? Please note that the behavior We have tested this part and found that no version of the code has the priority control over swappiness. This means that even if we modify swappiness to 0 or 200, we cannot achieve the goal of unbalanced reclaim if some conditions are not met during the reclaim process. Under certain conditions, we may mistakenly reclaim file pages, and since we usually trigger active reclaim when there is sufficient memory(before LMKD trigger), this will cause higher block IO. This RFC code provide some flags with the highest priority to set reclaim tendencies. Currently, it can only be triggered by the active reclaim interface. > might be really different with different kernel versions so I would > really stress out that testing with the current Linus (or akpm) tree is > necessary. OK, thank you for the reminder. > > Anyway, the more I think about that the more I am convinced that > explicit anon/file extension for the memory.reclaim interface is just a > wrong way to address a more fundamental underlying problem. That is, the > default reclaim choice over anon vs file preference should consider the > cost of the refaulting IO. This is more a property of the underlying > storage than a global characteristic. In other words, say you have > mutlitple storages, one that is a network based with a high latency and > other that is a local fast SSD. Reclaiming a page backed by the slower > storage is going to be more expensive to refault than the one backed by > the fast storage. So even page cache pages are not really all the same. 
>
> It is quite likely that a IO cost aspect is not really easy to integrate
> into the memory reclaim but it seems to me this is a better way to focus
> on for a better long term solution. Our existing refaulting
> infrastructure should help in that respect. Also MGLRU could fit for
> that purpose better than the traditional LRU based reclaim as the higher
> generations could be used for more more expensive pages.

Yes, your insights are very informative. However, until such an algorithm
is in place, I think it is reasonable for the proactive reclaim interface
to offer different reclaim tendencies; that gives the policy layer more
flexibility. For example, on mobile phones we can weigh the combined
impact of refault I/O overhead and LMKD kills when choosing a reclaim
tendency for the proactive reclaim interface.
Huan Yang <link@vivo.com> writes: > 在 2023/11/10 20:24, Michal Hocko 写道: >> On Fri 10-11-23 11:48:49, Huan Yang wrote: >> [...] >>> Also, When the application enters the foreground, the startup speed >>> may be slower. Also trace show that here are a lot of block I/O. >>> (usually 1000+ IO count and 200+ms IO Time) We usually observe very >>> little block I/O caused by zram refault.(read: 1698.39MB/s, write: >>> 995.109MB/s), usually, it is faster than random disk reads.(read: >>> 48.1907MB/s write: 49.1654MB/s). This test by zram-perf and I change a >>> little to test UFS. >>> >>> Therefore, if the proactive reclamation encounters many file pages, >>> the application may become slow when it is opened. >> OK, this is an interesting information. From the above it seems that >> storage based IO refaults are order of magnitude more expensive than >> swap (zram in this case). That means that the memory reclaim should >> _in general_ prefer anonymous memory reclaim over refaulted page cache, >> right? Or is there any reason why "frozen" applications are any >> different in this case? > Frozen applications mean that the application process is no longer active, > so once its private anonymous page data is swapped out, the anonymous > pages will not be refaulted until the application becomes active again. > > On the contrary, page caches are usually shared. Even if the > application that > first read the file is no longer active, other processes may still > read the file. > Therefore, it is not reasonable to use the proactive reclamation > interface to > reclaim page caches without considering memory pressure. No. Not all page caches are shared. For example, the page caches used for use-once streaming IO. And, they should be reclaimed firstly. So, your solution may work good for your specific use cases, but it's not a general solution. Per my understanding, you want to reclaim only private pages to avoid impact the performance of other applications. Privately mapped anonymous pages is easy to be identified (And I suggest that you can find a way to avoid reclaim shared mapped anonymous pages). There's some heuristics to identify use-once page caches in reclaiming code. Why doesn't it work for your situation? [snip] -- Best Regards, Huang, Ying
在 2023/11/13 14:10, Huang, Ying 写道: > Huan Yang <link@vivo.com> writes: > >> 在 2023/11/10 20:24, Michal Hocko 写道: >>> On Fri 10-11-23 11:48:49, Huan Yang wrote: >>> [...] >>>> Also, When the application enters the foreground, the startup speed >>>> may be slower. Also trace show that here are a lot of block I/O. >>>> (usually 1000+ IO count and 200+ms IO Time) We usually observe very >>>> little block I/O caused by zram refault.(read: 1698.39MB/s, write: >>>> 995.109MB/s), usually, it is faster than random disk reads.(read: >>>> 48.1907MB/s write: 49.1654MB/s). This test by zram-perf and I change a >>>> little to test UFS. >>>> >>>> Therefore, if the proactive reclamation encounters many file pages, >>>> the application may become slow when it is opened. >>> OK, this is an interesting information. From the above it seems that >>> storage based IO refaults are order of magnitude more expensive than >>> swap (zram in this case). That means that the memory reclaim should >>> _in general_ prefer anonymous memory reclaim over refaulted page cache, >>> right? Or is there any reason why "frozen" applications are any >>> different in this case? >> Frozen applications mean that the application process is no longer active, >> so once its private anonymous page data is swapped out, the anonymous >> pages will not be refaulted until the application becomes active again. >> >> On the contrary, page caches are usually shared. Even if the >> application that >> first read the file is no longer active, other processes may still >> read the file. >> Therefore, it is not reasonable to use the proactive reclamation >> interface to >> reclaim page caches without considering memory pressure. > No. Not all page caches are shared. For example, the page caches used > for use-once streaming IO. And, they should be reclaimed firstly. Yes, but this part is done very well in MGLRU and does not require our intervention. Moreover, the reclaim speed of clean files is very fast, but compared to it, the reclaim speed of anonymous pages is a bit slower. > > So, your solution may work good for your specific use cases, but it's Yes, this approach is not universal. > not a general solution. Per my understanding, you want to reclaim only > private pages to avoid impact the performance of other applications. > Privately mapped anonymous pages is easy to be identified (And I suggest > that you can find a way to avoid reclaim shared mapped anonymous pages). Yes, it is not good to reclaim shared anonymous pages, and it needs to be identified. In the future, we will consider how to filter them. Thanks. > There's some heuristics to identify use-once page caches in reclaiming > code. Why doesn't it work for your situation? As mentioned above, the default reclaim algorithm is suitable for recycling file pages, but we do not need to intervene in it. Direct reclaim or kswapd of these use-once file pages is very fast and will not cause lag or other effects. Our overall goal is to actively and reasonably compress unused anonymous pages based on certain strategies, in order to increase available memory to a certain extent, avoid lag, and prevent applications from being killed. Therefore, using the proactive reclaim interface, combined with LRU algorithm and reclaim tendencies, is a good way to achieve our goal. > > [snip] > > -- > Best Regards, > Huang, Ying
Huan Yang <link@vivo.com> writes: > 在 2023/11/13 14:10, Huang, Ying 写道: >> Huan Yang <link@vivo.com> writes: >> >>> 在 2023/11/10 20:24, Michal Hocko 写道: >>>> On Fri 10-11-23 11:48:49, Huan Yang wrote: >>>> [...] >>>>> Also, When the application enters the foreground, the startup speed >>>>> may be slower. Also trace show that here are a lot of block I/O. >>>>> (usually 1000+ IO count and 200+ms IO Time) We usually observe very >>>>> little block I/O caused by zram refault.(read: 1698.39MB/s, write: >>>>> 995.109MB/s), usually, it is faster than random disk reads.(read: >>>>> 48.1907MB/s write: 49.1654MB/s). This test by zram-perf and I change a >>>>> little to test UFS. >>>>> >>>>> Therefore, if the proactive reclamation encounters many file pages, >>>>> the application may become slow when it is opened. >>>> OK, this is an interesting information. From the above it seems that >>>> storage based IO refaults are order of magnitude more expensive than >>>> swap (zram in this case). That means that the memory reclaim should >>>> _in general_ prefer anonymous memory reclaim over refaulted page cache, >>>> right? Or is there any reason why "frozen" applications are any >>>> different in this case? >>> Frozen applications mean that the application process is no longer active, >>> so once its private anonymous page data is swapped out, the anonymous >>> pages will not be refaulted until the application becomes active again. >>> >>> On the contrary, page caches are usually shared. Even if the >>> application that >>> first read the file is no longer active, other processes may still >>> read the file. >>> Therefore, it is not reasonable to use the proactive reclamation >>> interface to >>> reclaim page caches without considering memory pressure. >> No. Not all page caches are shared. For example, the page caches used >> for use-once streaming IO. And, they should be reclaimed firstly. > Yes, but this part is done very well in MGLRU and does not require our > intervention. > Moreover, the reclaim speed of clean files is very fast, but compared to it, > the reclaim speed of anonymous pages is a bit slower. >> >> So, your solution may work good for your specific use cases, but it's > Yes, this approach is not universal. >> not a general solution. Per my understanding, you want to reclaim only >> private pages to avoid impact the performance of other applications. >> Privately mapped anonymous pages is easy to be identified (And I suggest >> that you can find a way to avoid reclaim shared mapped anonymous pages). > Yes, it is not good to reclaim shared anonymous pages, and it needs to be > identified. In the future, we will consider how to filter them. > Thanks. >> There's some heuristics to identify use-once page caches in reclaiming >> code. Why doesn't it work for your situation? > As mentioned above, the default reclaim algorithm is suitable for recycling > file pages, but we do not need to intervene in it. > Direct reclaim or kswapd of these use-once file pages is very fast and will > not cause lag or other effects. > Our overall goal is to actively and reasonably compress unused anonymous > pages based on certain strategies, in order to increase available memory to > a certain extent, avoid lag, and prevent applications from being killed. > Therefore, using the proactive reclaim interface, combined with LRU > algorithm > and reclaim tendencies, is a good way to achieve our goal. If so, why can't you just use the proactive reclaim with some large enough swappiness? 
That will reclaim use-once page caches and compress anonymous pages. So, more applications can be kept in memory before passive reclaiming or killing background applications? -- Best Regards, Huang, Ying
在 2023/11/13 16:05, Huang, Ying 写道: > Huan Yang <link@vivo.com> writes: > >> 在 2023/11/13 14:10, Huang, Ying 写道: >>> Huan Yang <link@vivo.com> writes: >>> >>>> 在 2023/11/10 20:24, Michal Hocko 写道: >>>>> On Fri 10-11-23 11:48:49, Huan Yang wrote: >>>>> [...] >>>>>> Also, When the application enters the foreground, the startup speed >>>>>> may be slower. Also trace show that here are a lot of block I/O. >>>>>> (usually 1000+ IO count and 200+ms IO Time) We usually observe very >>>>>> little block I/O caused by zram refault.(read: 1698.39MB/s, write: >>>>>> 995.109MB/s), usually, it is faster than random disk reads.(read: >>>>>> 48.1907MB/s write: 49.1654MB/s). This test by zram-perf and I change a >>>>>> little to test UFS. >>>>>> >>>>>> Therefore, if the proactive reclamation encounters many file pages, >>>>>> the application may become slow when it is opened. >>>>> OK, this is an interesting information. From the above it seems that >>>>> storage based IO refaults are order of magnitude more expensive than >>>>> swap (zram in this case). That means that the memory reclaim should >>>>> _in general_ prefer anonymous memory reclaim over refaulted page cache, >>>>> right? Or is there any reason why "frozen" applications are any >>>>> different in this case? >>>> Frozen applications mean that the application process is no longer active, >>>> so once its private anonymous page data is swapped out, the anonymous >>>> pages will not be refaulted until the application becomes active again. >>>> >>>> On the contrary, page caches are usually shared. Even if the >>>> application that >>>> first read the file is no longer active, other processes may still >>>> read the file. >>>> Therefore, it is not reasonable to use the proactive reclamation >>>> interface to >>>> reclaim page caches without considering memory pressure. >>> No. Not all page caches are shared. For example, the page caches used >>> for use-once streaming IO. And, they should be reclaimed firstly. >> Yes, but this part is done very well in MGLRU and does not require our >> intervention. >> Moreover, the reclaim speed of clean files is very fast, but compared to it, >> the reclaim speed of anonymous pages is a bit slower. >>> So, your solution may work good for your specific use cases, but it's >> Yes, this approach is not universal. >>> not a general solution. Per my understanding, you want to reclaim only >>> private pages to avoid impact the performance of other applications. >>> Privately mapped anonymous pages is easy to be identified (And I suggest >>> that you can find a way to avoid reclaim shared mapped anonymous pages). >> Yes, it is not good to reclaim shared anonymous pages, and it needs to be >> identified. In the future, we will consider how to filter them. >> Thanks. >>> There's some heuristics to identify use-once page caches in reclaiming >>> code. Why doesn't it work for your situation? >> As mentioned above, the default reclaim algorithm is suitable for recycling >> file pages, but we do not need to intervene in it. >> Direct reclaim or kswapd of these use-once file pages is very fast and will >> not cause lag or other effects. >> Our overall goal is to actively and reasonably compress unused anonymous >> pages based on certain strategies, in order to increase available memory to >> a certain extent, avoid lag, and prevent applications from being killed. >> Therefore, using the proactive reclaim interface, combined with LRU >> algorithm >> and reclaim tendencies, is a good way to achieve our goal. 
> If so, why can't you just use the proactive reclaim with some large > enough swappiness? That will reclaim use-once page caches and compress This works very well for proactive memory reclaim that is only executed once. However, we need to perform proactive reclaim in batches: suppose that the reclaimable use-once page cache in this memcg amounts to only 5%, while we call proactive memory reclaim step by step, such as 5%, 10%, 15% ... 100%. Then, even after that 5% of use-once pages has been reclaimed, more page cache may be reclaimed due to the balancing adjustment of reclamation, and we may still touch shared file pages. (If I misunderstood anything, please correct me.) We previously used the two swappiness values 200 and 0 to adjust reclaim tendencies. However, the debug interface showed that some file pages were still reclaimed, and after the proactive reclaim, some of the reclaimed applications showed block IO and startup lag when reopened. Such incomplete control over the process may not be suitable for proactive memory reclaim. Instead, with a proactive reclaim interface with tendencies, we can issue a 5% page cache trim once and then gradually reclaim anonymous pages. > anonymous pages. So, more applications can be kept in memory before > passive reclaiming or killing background applications? > > -- > Best Regards, > Huang, Ying
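To make the staged usage described above concrete, here is a minimal userspace sketch. It assumes the "<size> <swappiness>" request format proposed by this patchset (not the stock memory.reclaim syntax), and the cgroup path and step sizes are hypothetical.

```c
/*
 * Minimal sketch of staged proactive reclaim with tendencies.
 * Assumption: the "<size> <swappiness>" request format is the extension
 * proposed by this patchset; /sys/fs/cgroup/frozen-app is a made-up memcg.
 */
#include <fcntl.h>
#include <stdio.h>
#include <string.h>
#include <unistd.h>

static int reclaim(const char *memcg, const char *request)
{
	char path[256];
	int fd, ret = 0;

	snprintf(path, sizeof(path), "%s/memory.reclaim", memcg);
	fd = open(path, O_WRONLY);
	if (fd < 0)
		return -1;
	/* memory.reclaim fails the write (EAGAIN) if the full amount
	 * could not be reclaimed. */
	if (write(fd, request, strlen(request)) < 0)
		ret = -1;
	close(fd);
	return ret;
}

int main(void)
{
	const char *memcg = "/sys/fs/cgroup/frozen-app";

	reclaim(memcg, "16M 0");	/* one small file-only trim */
	reclaim(memcg, "32M 200");	/* then staged anon-only passes */
	reclaim(memcg, "64M 200");
	return 0;
}
```

A policy daemon could scale the per-pass sizes with how long the application has been frozen, matching the 5%/10%/15% stepping described in the message above.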
On Mon 13-11-23 10:17:57, Huan Yang wrote: > > 在 2023/11/10 20:24, Michal Hocko 写道: > > On Fri 10-11-23 11:48:49, Huan Yang wrote: > > [...] > > > Also, When the application enters the foreground, the startup speed > > > may be slower. Also trace show that here are a lot of block I/O. > > > (usually 1000+ IO count and 200+ms IO Time) We usually observe very > > > little block I/O caused by zram refault.(read: 1698.39MB/s, write: > > > 995.109MB/s), usually, it is faster than random disk reads.(read: > > > 48.1907MB/s write: 49.1654MB/s). This test by zram-perf and I change a > > > little to test UFS. > > > > > > Therefore, if the proactive reclamation encounters many file pages, > > > the application may become slow when it is opened. > > OK, this is an interesting information. From the above it seems that > > storage based IO refaults are order of magnitude more expensive than > > swap (zram in this case). That means that the memory reclaim should > > _in general_ prefer anonymous memory reclaim over refaulted page cache, > > right? Or is there any reason why "frozen" applications are any > > different in this case? > Frozen applications mean that the application process is no longer active, > so once its private anonymous page data is swapped out, the anonymous > pages will not be refaulted until the application becomes active again. I was probably not clear in my question. It is quite clear that frozen applications are inactive. It is not really clear why they should be treated any differently though. Their memory will be naturally cold as the memory is not in use so why cannot we realy on the standard memory reclaim to deal with the implicit inactivity and you need to handle that explicitly? [...] > > Our traditional interface to control the anon vs. file balance has been > > swappiness. It is not the best interface and it has its flaws but > > have you experimented with the global swappiness to express that > > preference? What were your observations? Please note that the behavior > We have tested this part and found that no version of the code has the > priority control over swappiness. > > This means that even if we modify swappiness to 0 or 200, > we cannot achieve the goal of unbalanced reclaim if some conditions > are not met during the reclaim process. Under certain conditions, > we may mistakenly reclaim file pages, and since we usually trigger > active reclaim when there is sufficient memory(before LMKD trigger), > this will cause higher block IO. Yes there are heuristics which might override the global swappinness but have you investigated those cases and can show that those heuristics could be changed? [...] > > It is quite likely that a IO cost aspect is not really easy to integrate > > into the memory reclaim but it seems to me this is a better way to focus > > on for a better long term solution. Our existing refaulting > > infrastructure should help in that respect. Also MGLRU could fit for > > that purpose better than the traditional LRU based reclaim as the higher > > generations could be used for more more expensive pages. > > Yes, your insights are very informative. > > However, before our algorithm is perfected, I think it is reasonable > to provide different reclaim tendencies for the active reclaim > interface. This will provide greater flexibility for the strategy > layer. Flexibility is really nice but it comes with a price and interface cost can be really high. 
There were several attempts to make memory reclaim LRU-type specific, but I still maintain my opinion that this is not really a good abstraction. As stated above, even page cache is not all the same. A more future-proof interface should really consider the IO refault cost rather than a flat anon/file distinction.
On Mon 13-11-23 16:26:00, Huan Yang wrote: [...] > However, considering that we need to perform proactive reclaim in batches, > suppose that only 5% of the use-once page cache in this memcg can be > reclaimed, > but we need to call proactive memory reclaim step by step, such as 5%, 10%, > 15% ... 100%. You haven't really explained this, and I have asked several times IIRC. Why do you even need to do those batches? Why can't you simply rely on memory pressure triggering the memory reclaim? Do you have any actual numbers showing that being pro-active results in smaller latencies, or anything else that would show this is actually needed?
On Tue 14-11-23 10:54:05, Michal Hocko wrote: > On Mon 13-11-23 16:26:00, Huan Yang wrote: > [...] > > However, considering that we need to perform proactive reclaim in batches, > > suppose that only 5% of the use-once page cache in this memcg can be > > reclaimed, > > but we need to call proactive memory reclaim step by step, such as 5%, 10%, > > 15% ... 100%. > > You haven't really explained this and I have asked several times IIRC. > Why do you even need to do those batches? Why cannot you simply relly on > the memory pressure triggering the memory reclaim? Do you have any > actual numbers showing that being pro-active results in smaller > latencies or anything that would show this is actually needed? Just noticed dcd2eff8-400b-4ade-a5b2-becfe26b437b@vivo.com, will reply there.
On Mon 13-11-23 09:54:55, Huan Yang wrote: > > 在 2023/11/10 20:32, Michal Hocko 写道: > > On Fri 10-11-23 14:21:17, Huan Yang wrote: > > [...] > > > > BTW: how do you know the number of pages to be reclaimed proactively in > > > > memcg proactive reclaiming based solution? > > > One point here is that we are not sure how long the frozen application > > > will be opened, it could be 10 minutes, an hour, or even days. So we > > > need to predict and try, gradually reclaim anonymous pages in > > > proportion, preferably based on the LRU algorithm. For example, if > > > the application has been frozen for 10 minutes, reclaim 5% of > > > anonymous pages; 30min:25%anon, 1hour:75%, 1day:100%. It is even more > > > complicated as it requires adding a mechanism for predicting failure > > > penalties. > > Why would make your reclaiming decisions based on time rather than the > > actual memory demand? I can see how a pro-active reclaim could make a > > head room for an unexpected memory pressure but applying more pressure > > just because of inactivity sound rather dubious to me TBH. Why cannot > > you simply wait for the external memory pressure (e.g. from kswapd) to > > deal with that based on the demand? > Because the current kswapd and direct memory reclamation are a passive > memory reclamation based on the watermark, and in the event of triggering > these reclamation scenarios, the smoothness of the phone application cannot > be guaranteed. OK, so you are worried about latencies on spike memory usage. > (We often observe that when the above reclamation is triggered, there > is a delay in the application startup, usually accompanied by block > I/O, and some concurrency issues caused by lock design.) Does that mean you do not have enough head room for kswapd to keep with the memory demand? It is really hard to discuss this without some actual numbers or more specifics. > To ensure the smoothness of application startup, we have a module in > Android called LMKD (formerly known as lowmemorykiller). Based on a > certain algorithm, LMKD detects if application startup may be delayed > and proactively kills inactive applications. (For example, based on > factors such as refault IO and swap usage.) > > However, this behavior may cause the applications we want to protect > to be killed, which will result in users having to wait for them to > restart when they are reopened, which may affect the user > experience.(For example, if the user wants to reopen the application > interface they are working on, or re-enter the order interface they > were viewing.) This suggests that your LMKD doesn't pick up the right victim to kill. And I suspect this is a fundamental problem of those pro-active oom killer solutions. > Therefore, the above proactive reclamation interface is designed to > compress memory types with minimal cost for upper-layer applications > based on reasonable strategies, in order to avoid triggering LMKD or > memory reclamation as much as possible, even if it is not balanced. This would suggest that MADV_PAGEOUT is really what you are looking for. If you really aim at compressing a specific type of memory then tweking reclaim to achieve that sounds like a shortcut because madvise based solution is more involved. But that is not a solid justification for adding a new interface.
在 2023/11/14 18:04, Michal Hocko 写道: > On Mon 13-11-23 09:54:55, Huan Yang wrote: >> 在 2023/11/10 20:32, Michal Hocko 写道: >>> On Fri 10-11-23 14:21:17, Huan Yang wrote: >>> [...] >>>>> BTW: how do you know the number of pages to be reclaimed proactively in >>>>> memcg proactive reclaiming based solution? >>>> One point here is that we are not sure how long the frozen application >>>> will be opened, it could be 10 minutes, an hour, or even days. So we >>>> need to predict and try, gradually reclaim anonymous pages in >>>> proportion, preferably based on the LRU algorithm. For example, if >>>> the application has been frozen for 10 minutes, reclaim 5% of >>>> anonymous pages; 30min:25%anon, 1hour:75%, 1day:100%. It is even more >>>> complicated as it requires adding a mechanism for predicting failure >>>> penalties. >>> Why would make your reclaiming decisions based on time rather than the >>> actual memory demand? I can see how a pro-active reclaim could make a >>> head room for an unexpected memory pressure but applying more pressure >>> just because of inactivity sound rather dubious to me TBH. Why cannot >>> you simply wait for the external memory pressure (e.g. from kswapd) to >>> deal with that based on the demand? >> Because the current kswapd and direct memory reclamation are a passive >> memory reclamation based on the watermark, and in the event of triggering >> these reclamation scenarios, the smoothness of the phone application cannot >> be guaranteed. > OK, so you are worried about latencies on spike memory usage. > >> (We often observe that when the above reclamation is triggered, there >> is a delay in the application startup, usually accompanied by block >> I/O, and some concurrency issues caused by lock design.) > Does that mean you do not have enough head room for kswapd to keep with Yes, but if set high watermark a little high, the power consumption will be very high. We usually observe that kswapd will run frequently. Even if we have set a low kswapd water level, kswapd CPU usage can still be high in some extreme scenarios.(For example, when starting a large application that needs to acquire a large amount of memory in a short period of time. )However, we will not discuss it in detail here, the reasons are quite complex, and we have not yet sorted out a complete understanding of them. > the memory demand? It is really hard to discuss this without some actual > numbers or more specifics. > >> To ensure the smoothness of application startup, we have a module in >> Android called LMKD (formerly known as lowmemorykiller). Based on a >> certain algorithm, LMKD detects if application startup may be delayed >> and proactively kills inactive applications. (For example, based on >> factors such as refault IO and swap usage.) >> >> However, this behavior may cause the applications we want to protect >> to be killed, which will result in users having to wait for them to >> restart when they are reopened, which may affect the user >> experience.(For example, if the user wants to reopen the application >> interface they are working on, or re-enter the order interface they >> were viewing.) > This suggests that your LMKD doesn't pick up the right victim to kill. > And I suspect this is a fundamental problem of those pro-active oom Yes, but, our current LMKD configuration is already very conservative, which can cause lag in some scenarios, but we will not analyze the reasons in detail here. > killer solutions. 
> >> Therefore, the above proactive reclamation interface is designed to >> compress memory types with minimal cost for upper-layer applications >> based on reasonable strategies, in order to avoid triggering LMKD or >> memory reclamation as much as possible, even if it is not balanced. > This would suggest that MADV_PAGEOUT is really what you are looking for. Yes, I agree, especially since it would avoid reclaiming shared anonymous pages. However, I did some shallow research and found that MADV_PAGEOUT does not reclaim pages with mapcount != 1. Our applications are usually composed of multiple processes, and some anonymous pages are shared among them. When the application is frozen, the memory that is shared only among the processes within the application should be released, but MADV_PAGEOUT seems not to be suitable for this scenario? (If I misunderstood anything, please correct me.) In addition, I still suspect that this approach will consume a lot of resources in the strategy layer, but it is worth studying. Thanks. > If you really aim at compressing a specific type of memory then tweking > reclaim to achieve that sounds like a shortcut because madvise based > solution is more involved. But that is not a solid justification for > adding a new interface. Yes, but this RFC just adds an additional configuration option to the proactive reclaim interface, and the reclaim path gives priority to requests that carry a reclaim tendency. The `unlikely` checks this adds should not have much impact.
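For context, a minimal sketch of the madvise(MADV_PAGEOUT) route being discussed, applied by a process to its own private anonymous mapping; the mapping size is arbitrary and this is an illustration, not the patchset's code.

```c
/* Minimal sketch of madvise(MADV_PAGEOUT) on a private anonymous range.
 * Error handling is trimmed for brevity; the 64 MiB size is arbitrary. */
#include <stdio.h>
#include <string.h>
#include <sys/mman.h>

#ifndef MADV_PAGEOUT
#define MADV_PAGEOUT 21		/* reclaim these pages (Linux 5.4+) */
#endif

int main(void)
{
	size_t len = 64 << 20;	/* 64 MiB of private anonymous memory */
	char *buf;

	buf = mmap(NULL, len, PROT_READ | PROT_WRITE,
		   MAP_PRIVATE | MAP_ANONYMOUS, -1, 0);
	if (buf == MAP_FAILED) {
		perror("mmap");
		return 1;
	}
	memset(buf, 0x5a, len);	/* fault the pages in */

	/*
	 * Ask the kernel to reclaim (swap out) this range. As noted in the
	 * thread, pages mapped into more than one process are skipped, so
	 * only this process's private copies go out.
	 */
	if (madvise(buf, len, MADV_PAGEOUT))
		perror("madvise(MADV_PAGEOUT)");
	return 0;
}
```

In the frozen-application case a manager such as LMKD would more realistically use process_madvise(2) (Linux 5.10+) on ranges read from the target's /proc/<pid>/maps, which needs additional privileges (CAP_SYS_NICE) over the target; per the discussion above, pages mapped by more than one process would still be skipped.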
On Tue 14-11-23 20:37:07, Huan Yang wrote: > > 在 2023/11/14 18:04, Michal Hocko 写道: > > On Mon 13-11-23 09:54:55, Huan Yang wrote: > > > 在 2023/11/10 20:32, Michal Hocko 写道: > > > > On Fri 10-11-23 14:21:17, Huan Yang wrote: > > > > [...] > > > > > > BTW: how do you know the number of pages to be reclaimed proactively in > > > > > > memcg proactive reclaiming based solution? > > > > > One point here is that we are not sure how long the frozen application > > > > > will be opened, it could be 10 minutes, an hour, or even days. So we > > > > > need to predict and try, gradually reclaim anonymous pages in > > > > > proportion, preferably based on the LRU algorithm. For example, if > > > > > the application has been frozen for 10 minutes, reclaim 5% of > > > > > anonymous pages; 30min:25%anon, 1hour:75%, 1day:100%. It is even more > > > > > complicated as it requires adding a mechanism for predicting failure > > > > > penalties. > > > > Why would make your reclaiming decisions based on time rather than the > > > > actual memory demand? I can see how a pro-active reclaim could make a > > > > head room for an unexpected memory pressure but applying more pressure > > > > just because of inactivity sound rather dubious to me TBH. Why cannot > > > > you simply wait for the external memory pressure (e.g. from kswapd) to > > > > deal with that based on the demand? > > > Because the current kswapd and direct memory reclamation are a passive > > > memory reclamation based on the watermark, and in the event of triggering > > > these reclamation scenarios, the smoothness of the phone application cannot > > > be guaranteed. > > OK, so you are worried about latencies on spike memory usage. > > > > > (We often observe that when the above reclamation is triggered, there > > > is a delay in the application startup, usually accompanied by block > > > I/O, and some concurrency issues caused by lock design.) > > Does that mean you do not have enough head room for kswapd to keep with > > Yes, but if set high watermark a little high, the power consumption > will be very high. We usually observe that kswapd will run > frequently. Even if we have set a low kswapd water level, kswapd CPU > usage can still be high in some extreme scenarios.(For example, when > starting a large application that needs to acquire a large amount of > memory in a short period of time.)However, we will not discuss it in > detail here, the reasons are quite complex, and we have not yet sorted > out a complete understanding of them. This is definitely worth investigating further before resorting to proposing a new interface. If the kswapd consumes CPU cycles unproductively then we should look into why. If there is a big peak memory demand then that surely requires CPU capacity for the memory reclaim. The work has to be done, whether that is in kswapd or the pro-active reclaimer context. I can imagine the latter one could be invoked with a better timing in mind but that is not a trivial thing to do. There are examples where this could be driven by PSI feedback loop but from what you have mention earlier you are doing a idle time based reclaim. Anyway, this is mostly a tuning related discussion. I wanted to learn more about what you are trying to achieve and so far it seems to me you are trying to workaround some issues and a) we would like to learn about those issues and b) a new interface is unlikely a good fit to paper over a suboptimal behavior. > > This would suggest that MADV_PAGEOUT is really what you are looking > > for. 
> > Yes, I agree, especially to avoid reclaiming shared anonymous pages. > > However, I did some shallow research and found that MADV_PAGEOUT does > not reclaim pages with mapcount != 1. Our applications are usually > composed of multiple processes, and some anonymous pages are shared > among them. When the application is frozen, the memory that is only > shared among the processes within the application should be released, > but MADV_PAGEOUT seems not to be suitable for this scenario?(If I > misunderstood anything, please correct me.) Hmm, OK it seems that we are hitting some terminology problems. The discussion was about private memory so far (essentially MAP_PRIVATE); now you are talking about shared anonymous memory. That would imply shmem, and that is indeed not supported by MADV_PAGEOUT. The reason is that it poses a security risk for time-based attacks. I can imagine, though, that we could extend the behavior to support shared mappings if they do not cross a security boundary (e.g. mapped by the same user). This would require some analysis though. > In addition, I still have doubts that this approach will consume a lot > of strategy resources, but it is worth studying. > > If you really aim at compressing a specific type of memory then > > tweking reclaim to achieve that sounds like a shortcut because > > madvise based solution is more involved. But that is not a solid > > justification for adding a new interface. > Yes, but this RFC is just adding an additional configuration option to > the proactive reclaim interface. And in the reclaim path, prioritize > processing these requests with reclaim tendencies. However, using > `unlikely` judgment should not have much impact. Just adding a configuration option means a user interface contract that needs to be maintained forever. Our future reclaim algorithm might change (and in fact it has already changed quite a bit with MGLRU), and an explicit request for LRU-type-specific reclaim might not even make any sense then. See that point?
在 2023/11/14 21:03, Michal Hocko 写道: > On Tue 14-11-23 20:37:07, Huan Yang wrote: >> 在 2023/11/14 18:04, Michal Hocko 写道: >>> On Mon 13-11-23 09:54:55, Huan Yang wrote: >>>> 在 2023/11/10 20:32, Michal Hocko 写道: >>>>> On Fri 10-11-23 14:21:17, Huan Yang wrote: >>>>> [...] >>>>>>> BTW: how do you know the number of pages to be reclaimed proactively in >>>>>>> memcg proactive reclaiming based solution? >>>>>> One point here is that we are not sure how long the frozen application >>>>>> will be opened, it could be 10 minutes, an hour, or even days. So we >>>>>> need to predict and try, gradually reclaim anonymous pages in >>>>>> proportion, preferably based on the LRU algorithm. For example, if >>>>>> the application has been frozen for 10 minutes, reclaim 5% of >>>>>> anonymous pages; 30min:25%anon, 1hour:75%, 1day:100%. It is even more >>>>>> complicated as it requires adding a mechanism for predicting failure >>>>>> penalties. >>>>> Why would make your reclaiming decisions based on time rather than the >>>>> actual memory demand? I can see how a pro-active reclaim could make a >>>>> head room for an unexpected memory pressure but applying more pressure >>>>> just because of inactivity sound rather dubious to me TBH. Why cannot >>>>> you simply wait for the external memory pressure (e.g. from kswapd) to >>>>> deal with that based on the demand? >>>> Because the current kswapd and direct memory reclamation are a passive >>>> memory reclamation based on the watermark, and in the event of triggering >>>> these reclamation scenarios, the smoothness of the phone application cannot >>>> be guaranteed. >>> OK, so you are worried about latencies on spike memory usage. >>> >>>> (We often observe that when the above reclamation is triggered, there >>>> is a delay in the application startup, usually accompanied by block >>>> I/O, and some concurrency issues caused by lock design.) >>> Does that mean you do not have enough head room for kswapd to keep with >> Yes, but if set high watermark a little high, the power consumption >> will be very high. We usually observe that kswapd will run >> frequently. Even if we have set a low kswapd water level, kswapd CPU >> usage can still be high in some extreme scenarios.(For example, when >> starting a large application that needs to acquire a large amount of >> memory in a short period of time.)However, we will not discuss it in >> detail here, the reasons are quite complex, and we have not yet sorted >> out a complete understanding of them. > This is definitely worth investigating further before resorting to > proposing a new interface. If the kswapd consumes CPU cycles > unproductively then we should look into why. Yes, this is my current research objective. > > If there is a big peak memory demand then that surely requires CPU > capacity for the memory reclaim. The work has to be done, whether that > is in kswapd or the pro-active reclaimer context. I can imagine the > latter one could be invoked with a better timing in mind but that is not > a trivial thing to do. There are examples where this could be driven by > PSI feedback loop but from what you have mention earlier you are doing a > idle time based reclaim. Anyway, this is mostly a tuning related > discussion. I wanted to learn more about what you are trying to achieve > and so far it seems to me you are trying to workaround some issues and > a) we would like to learn about those issues and b) a new interface is > unlikely a good fit to paper over a suboptimal behavior. 
Our current research goal is to find a workable dynamic balance between the latency cost of passive memory reclamation and the application deaths caused by proactively killing processes. The current strategy is to use proactive memory reclamation to intervene in this process. As mentioned earlier, by proactively reclaiming anonymous pages that are deemed safe to reclaim, we can increase the currently available memory, avoid lag when starting new applications, and prevent the death of resident applications. Through the previous discussions, it seems that we have reached a consensus: although the proactive memory reclamation interface can achieve this goal, it is not the best approach. Using madvise can both build on existing mechanisms to achieve this goal and decide whether to reclaim based on the characteristics of the anon VMA, especially the anon_vma name that has been set. Therefore, I will also push for internal research on this approach. > >>> This would suggest that MADV_PAGEOUT is really what you are looking >>> for. >> Yes, I agree, especially to avoid reclaiming shared anonymous pages. >> >> However, I did some shallow research and found that MADV_PAGEOUT does >> not reclaim pages with mapcount != 1. Our applications are usually >> composed of multiple processes, and some anonymous pages are shared >> among them. When the application is frozen, the memory that is only >> shared among the processes within the application should be released, >> but MADV_PAGEOUT seems not to be suitable for this scenario?(If I >> misunderstood anything, please correct me.) > Hmm, OK it seems that we are hitting some terminology problems. The > discussion was about private memory so far (essentially MAP_PRIVATE) > now you are talking about a shared anonymous memory. That would imply > shmem and that is indeed not supported by MADV_PAGEOUT. The reason for > that is that this poses a security risk for time based attacks. I can > imagine, though, that we could extend the behavior to support shared > mappings if they do not cross a security boundary (e.g. mapped by the > same user). This would require some analysis though. OK, thanks. I have communicated with our internal team and found that this part of the memory usage will not be particularly large. > >> In addition, I still have doubts that this approach will consume a lot >> of strategy resources, but it is worth studying. >>> If you really aim at compressing a specific type of memory then >>> tweking reclaim to achieve that sounds like a shortcut because >>> madvise based solution is more involved. But that is not a solid >>> justification for adding a new interface. >> Yes, but this RFC is just adding an additional configuration option to >> the proactive reclaim interface. And in the reclaim path, prioritize >> processing these requests with reclaim tendencies. However, using >> `unlikely` judgment should not have much impact. > Just adding an adding configuration option means user interface contract > that needs to be maintained for ever. Our future reclaim algorithm migh > change (and in fact it has already changed quite a bit with MGLRU) and > explicit request for LRU type specific reclaim might not even have any > sense. See that point? Yes, I get it. This also means that if the reclaim algorithm changes, the current implementation of tendencies will need to be modified accordingly, which carries an ongoing maintenance cost. If the current implementation of tendencies cannot prove its necessity, it should remain a topic for deeper research.
This solution may be the simpler way for me to achieve our internal goals, but it may not be the best one. So, MADV_PAGEOUT is worth researching. This conversation was very beneficial for me. Thank you all very much.
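As a side note on the anon_vma name idea mentioned in the message above: anonymous VMAs can be labeled from userspace so a policy daemon can later find them in /proc/<pid>/maps and target them with (process_)madvise(MADV_PAGEOUT). A minimal sketch, assuming a kernel with CONFIG_ANON_VMA_NAME (Linux 5.17+); the region size and the name "app_cache_heap" are made up for illustration.

```c
/* Minimal sketch: name a private anonymous region so it can be picked
 * out of /proc/<pid>/maps later. Assumes CONFIG_ANON_VMA_NAME. */
#include <stdio.h>
#include <sys/mman.h>
#include <sys/prctl.h>

#ifndef PR_SET_VMA
#define PR_SET_VMA		0x53564d41
#define PR_SET_VMA_ANON_NAME	0
#endif

int main(void)
{
	size_t len = 16 << 20;
	void *p = mmap(NULL, len, PROT_READ | PROT_WRITE,
		       MAP_PRIVATE | MAP_ANONYMOUS, -1, 0);
	if (p == MAP_FAILED) {
		perror("mmap");
		return 1;
	}
	if (prctl(PR_SET_VMA, PR_SET_VMA_ANON_NAME,
		  (unsigned long)p, len, (unsigned long)"app_cache_heap"))
		perror("prctl(PR_SET_VMA)");
	/* The range now shows up as [anon:app_cache_heap] in
	 * /proc/self/maps, so a policy daemon can decide whether this
	 * particular range is safe to page out. */
	return 0;
}
```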
Huan Yang <link@vivo.com> writes: > 在 2023/11/13 16:05, Huang, Ying 写道: >> Huan Yang <link@vivo.com> writes: >> >>> 在 2023/11/13 14:10, Huang, Ying 写道: >>>> Huan Yang <link@vivo.com> writes: >>>> >>>>> 在 2023/11/10 20:24, Michal Hocko 写道: >>>>>> On Fri 10-11-23 11:48:49, Huan Yang wrote: >>>>>> [...] >>>>>>> Also, When the application enters the foreground, the startup speed >>>>>>> may be slower. Also trace show that here are a lot of block I/O. >>>>>>> (usually 1000+ IO count and 200+ms IO Time) We usually observe very >>>>>>> little block I/O caused by zram refault.(read: 1698.39MB/s, write: >>>>>>> 995.109MB/s), usually, it is faster than random disk reads.(read: >>>>>>> 48.1907MB/s write: 49.1654MB/s). This test by zram-perf and I change a >>>>>>> little to test UFS. >>>>>>> >>>>>>> Therefore, if the proactive reclamation encounters many file pages, >>>>>>> the application may become slow when it is opened. >>>>>> OK, this is an interesting information. From the above it seems that >>>>>> storage based IO refaults are order of magnitude more expensive than >>>>>> swap (zram in this case). That means that the memory reclaim should >>>>>> _in general_ prefer anonymous memory reclaim over refaulted page cache, >>>>>> right? Or is there any reason why "frozen" applications are any >>>>>> different in this case? >>>>> Frozen applications mean that the application process is no longer active, >>>>> so once its private anonymous page data is swapped out, the anonymous >>>>> pages will not be refaulted until the application becomes active again. >>>>> >>>>> On the contrary, page caches are usually shared. Even if the >>>>> application that >>>>> first read the file is no longer active, other processes may still >>>>> read the file. >>>>> Therefore, it is not reasonable to use the proactive reclamation >>>>> interface to >>>>> reclaim page caches without considering memory pressure. >>>> No. Not all page caches are shared. For example, the page caches used >>>> for use-once streaming IO. And, they should be reclaimed firstly. >>> Yes, but this part is done very well in MGLRU and does not require our >>> intervention. >>> Moreover, the reclaim speed of clean files is very fast, but compared to it, >>> the reclaim speed of anonymous pages is a bit slower. >>>> So, your solution may work good for your specific use cases, but it's >>> Yes, this approach is not universal. >>>> not a general solution. Per my understanding, you want to reclaim only >>>> private pages to avoid impact the performance of other applications. >>>> Privately mapped anonymous pages is easy to be identified (And I suggest >>>> that you can find a way to avoid reclaim shared mapped anonymous pages). >>> Yes, it is not good to reclaim shared anonymous pages, and it needs to be >>> identified. In the future, we will consider how to filter them. >>> Thanks. >>>> There's some heuristics to identify use-once page caches in reclaiming >>>> code. Why doesn't it work for your situation? >>> As mentioned above, the default reclaim algorithm is suitable for recycling >>> file pages, but we do not need to intervene in it. >>> Direct reclaim or kswapd of these use-once file pages is very fast and will >>> not cause lag or other effects. >>> Our overall goal is to actively and reasonably compress unused anonymous >>> pages based on certain strategies, in order to increase available memory to >>> a certain extent, avoid lag, and prevent applications from being killed. 
>>> Therefore, using the proactive reclaim interface, combined with LRU >>> algorithm >>> and reclaim tendencies, is a good way to achieve our goal. >> If so, why can't you just use the proactive reclaim with some large >> enough swappiness? That will reclaim use-once page caches and compress > This works very well for proactive memory reclaim that is only > executed once. > However, considering that we need to perform proactive reclaim in batches, > suppose that only 5% of the use-once page cache in this memcg can be > reclaimed, > but we need to call proactive memory reclaim step by step, such as 5%, > 10%, 15% ... 100%. > Then, the page cache may be reclaimed due to the balancing adjustment > of reclamation, > even if the 5% of use-once pages are reclaimed. We may still touch on > shared file pages. > (If I misunderstood anything, please correct me.) If the proactive reclaim amount is less than the size of anonymous pages, I think that you are safe. For example, suppose the size of anonymous pages is 100MB, the size of use-once file pages is 10MB, and the size of shared file pages is 20MB. Then, if you reclaim 100MB proactively with swappiness=200, you will reclaim the 10MB of use-once file pages and 90MB of anonymous pages. The next time, if you reclaim 10MB proactively, you will still not reclaim shared file pages. > We previously used the two values of modifying swappiness to 200 and 0 > to adjust reclaim > tendencies. However, the debug interface showed that some file pages > were reclaimed, > and after being actively reclaimed, some applications and the reopened > applications that were > reclaimed had some block IO and startup lag. If so, please research why the use-once file page heuristics do not work, and try to fix it or raise the issue. > This way of having incomplete control over the process maybe is not > suitable for proactive memory > reclaim. Instead, with an proactive reclaim interface with tendencies, > we can issue a > 5% page cache trim once and then gradually reclaim anonymous pages. >> anonymous pages. So, more applications can be kept in memory before >> passive reclaiming or killing background applications? -- Best Regards, Huang, Ying