Message ID | 20200701152623.384AF0A7@viggo.jf.intel.com (mailing list archive) |
---|---|
State | New, archived |
Series | Repair and clean up vm.zone_reclaim_mode sysctl ABI |
On Wed, 1 Jul 2020, Dave Hansen wrote:

>
> From: Dave Hansen <dave.hansen@linux.intel.com>
>
> I went to go add a new RECLAIM_* mode for the zone_reclaim_mode
> sysctl. Like a good kernel developer, I also went to go update the
> documentation. I noticed that the bits in the documentation didn't
> match the bits in the #defines.
>
> The VM never explicitly checks the RECLAIM_ZONE bit. The bit is,
> however implicitly checked when checking 'node_reclaim_mode==0'.
> The RECLAIM_ZONE #define was removed in a cleanup. That, by itself
> is fine.
>
> But, when the bit was removed (bit 0) the _other_ bit locations also
> got changed. That's not OK because the bit values are documented to
> mean one specific thing and users surely rely on them meaning that one
> thing and not changing from kernel to kernel. The end result is that
> if someone had a script that did:
>
>     sysctl vm.zone_reclaim_mode=1
>
> That script went from doing nothing to writing out pages during
> node reclaim after the commit in question. That's not great.
>
> Put the bits back the way they were and add a comment so something
> like this is a bit harder to do again. Update the documentation to
> make it clear that the first bit is ignored.
>
> Signed-off-by: Dave Hansen <dave.hansen@linux.intel.com>
> Fixes: 648b5cf368e0 ("mm/vmscan: remove unused RECLAIM_OFF/RECLAIM_ZONE")
> Cc: Ben Widawsky <ben.widawsky@intel.com>
> Cc: Alex Shi <alex.shi@linux.alibaba.com>
> Cc: Daniel Wagner <dwagner@suse.de>
> Cc: "Tobin C. Harding" <tobin@kernel.org>
> Cc: Christoph Lameter <cl@linux.com>
> Cc: Andrew Morton <akpm@linux-foundation.org>
> Cc: Huang Ying <ying.huang@intel.com>
> Cc: Dan Williams <dan.j.williams@intel.com>
> Cc: Qian Cai <cai@lca.pw>
> Cc: Daniel Wagner <dwagner@suse.de>
> Cc: stable@vger.kernel.org

Acked-by: David Rientjes <rientjes@google.com>
Dave Hansen <dave.hansen@linux.intel.com> writes:

> From: Dave Hansen <dave.hansen@linux.intel.com>
>
> I went to go add a new RECLAIM_* mode for the zone_reclaim_mode
> sysctl. Like a good kernel developer, I also went to go update the
> documentation. I noticed that the bits in the documentation didn't
> match the bits in the #defines.
>
> The VM never explicitly checks the RECLAIM_ZONE bit. The bit is,
> however implicitly checked when checking 'node_reclaim_mode==0'.
> The RECLAIM_ZONE #define was removed in a cleanup. That, by itself
> is fine.
>
> But, when the bit was removed (bit 0) the _other_ bit locations also
> got changed. That's not OK because the bit values are documented to
> mean one specific thing and users surely rely on them meaning that one
> thing and not changing from kernel to kernel. The end result is that
> if someone had a script that did:
>
>     sysctl vm.zone_reclaim_mode=1
>
> That script went from doing nothing

Per my understanding, this script would have enabled node reclaim for
clean unmapped pages before commit 648b5cf368e0 ("mm/vmscan: remove
unused RECLAIM_OFF/RECLAIM_ZONE"). So we should revise the description
here?

> to writing out pages during
> node reclaim after the commit in question. That's not great.
>
> Put the bits back the way they were and add a comment so something
> like this is a bit harder to do again. Update the documentation to
> make it clear that the first bit is ignored.
>

Best Regards,
Huang, Ying
On 7/2/20 4:28 AM, Huang, Ying wrote:
>> But, when the bit was removed (bit 0) the _other_ bit locations also
>> got changed. That's not OK because the bit values are documented to
>> mean one specific thing and users surely rely on them meaning that one
>> thing and not changing from kernel to kernel. The end result is that
>> if someone had a script that did:
>>
>>     sysctl vm.zone_reclaim_mode=1
>>
>> That script went from doing nothing
> Per my understanding, this script would have enabled node reclaim for
> clean unmapped pages before commit 648b5cf368e0 ("mm/vmscan: remove
> unused RECLAIM_OFF/RECLAIM_ZONE"). So we should revise the description
> here?

Yes, you're right. I updated the patch with the updated understanding
about the implicit use of the bit but didn't update the changelog.

I'll do that for v3.
diff -puN Documentation/admin-guide/sysctl/vm.rst~mm-vmscan-restore-old-zone_reclaim_mode-abi Documentation/admin-guide/sysctl/vm.rst
--- a/Documentation/admin-guide/sysctl/vm.rst~mm-vmscan-restore-old-zone_reclaim_mode-abi 2020-07-01 08:22:11.354955336 -0700
+++ b/Documentation/admin-guide/sysctl/vm.rst 2020-07-01 08:22:11.360955336 -0700
@@ -948,11 +948,11 @@ that benefit from having their data cach
 left disabled as the caching effect is likely to be more important than
 data locality.
 
-zone_reclaim may be enabled if it's known that the workload is partitioned
-such that each partition fits within a NUMA node and that accessing remote
-memory would cause a measurable performance reduction. The page allocator
-will then reclaim easily reusable pages (those page cache pages that are
-currently not used) before allocating off node pages.
+Consider enabling one or more zone_reclaim mode bits if it's known that the
+workload is partitioned such that each partition fits within a NUMA node
+and that accessing remote memory would cause a measurable performance
+reduction. The page allocator will take additional actions before
+allocating off node pages.
 
 Allowing zone reclaim to write out pages stops processes that are
 writing large amounts of data from dirtying pages on other nodes. Zone
diff -puN mm/vmscan.c~mm-vmscan-restore-old-zone_reclaim_mode-abi mm/vmscan.c
--- a/mm/vmscan.c~mm-vmscan-restore-old-zone_reclaim_mode-abi 2020-07-01 08:22:11.356955336 -0700
+++ b/mm/vmscan.c 2020-07-01 08:22:11.362955336 -0700
@@ -4090,8 +4090,13 @@ module_init(kswapd_init)
  */
 int node_reclaim_mode __read_mostly;
 
-#define RECLAIM_WRITE (1<<0) /* Writeout pages during reclaim */
-#define RECLAIM_UNMAP (1<<1) /* Unmap pages during reclaim */
+/*
+ * These bit locations are exposed in the vm.zone_reclaim_mode sysctl
+ * ABI. New bits are OK, but existing bits can never change.
+ */
+#define RECLAIM_ZONE (1<<0) /* Run shrink_inactive_list on the zone */
+#define RECLAIM_WRITE (1<<1) /* Writeout pages during reclaim */
+#define RECLAIM_UNMAP (1<<2) /* Unmap pages during reclaim */
 
 /*
  * Priority for NODE_RECLAIM. This determines the fraction of pages
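
For readers following the bit arithmetic discussed in the thread, here is a
minimal userspace sketch (not part of the patch) of how the value 1 decodes
under the documented layout versus the shifted layout left by 648b5cf368e0.
The SHIFTED_* names, struct reclaim_effect and decode_mode() are hypothetical
and exist only for this illustration; the may_writepage/may_unmap field names
are assumed to mirror how the kernel derives its scan settings from the mode.
The static_assert is one possible compile-time guard in addition to the
comment the patch adds, not something the patch itself does.

```c
#include <assert.h>   /* static_assert (C11) */
#include <stdbool.h>

/* Documented vm.zone_reclaim_mode bit values, as restored by the patch. */
#define RECLAIM_ZONE  (1 << 0) /* never tested directly, only via mode != 0 */
#define RECLAIM_WRITE (1 << 1) /* write out pages during reclaim */
#define RECLAIM_UNMAP (1 << 2) /* unmap pages during reclaim */

/* Hypothetical names for the shifted layout, for comparison only. */
#define SHIFTED_RECLAIM_WRITE (1 << 0)
#define SHIFTED_RECLAIM_UNMAP (1 << 1)

/* Assumption: a build-time check that fails if an ABI value is renumbered. */
static_assert(RECLAIM_ZONE == 1 && RECLAIM_WRITE == 2 && RECLAIM_UNMAP == 4,
              "vm.zone_reclaim_mode bit values are ABI");

/* Illustrative summary of what a given sysctl value turns on. */
struct reclaim_effect {
    bool enabled;       /* node reclaim runs at all (mode != 0) */
    bool may_writepage; /* dirty pages may be written out */
    bool may_unmap;     /* mapped pages may be unmapped */
};

/* Decode a sysctl value against a given placement of the two tested bits. */
struct reclaim_effect decode_mode(int mode, int write_bit, int unmap_bit)
{
    struct reclaim_effect e = {
        .enabled       = mode != 0,
        .may_writepage = (mode & write_bit) != 0,
        .may_unmap     = (mode & unmap_bit) != 0,
    };
    return e;
}
```

With the documented layout, decode_mode(1, RECLAIM_WRITE, RECLAIM_UNMAP)
reports reclaim enabled with neither writeout nor unmapping, which matches
Huang Ying's reading of the old behavior. With the shifted layout, the same
value 1 also sets may_writepage, which is the regression the patch fixes.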