Message ID | 1565075940-23121-1-git-send-email-laoar.shao@gmail.com
---|---
State | New, archived
Series | [v2] mm/vmscan: shrink slab in node reclaim

On Tue 06-08-19 03:19:00, Yafang Shao wrote:
> In node reclaim, may_shrinkslab is 0 by default, hence shrink_slab will
> never be performed there, while shrink_slab should be performed if the
> reclaimable slab is over the min slab limit.
>
> Add scan_control::no_pagecache so shrink_node can decide to reclaim page
> cache, slab, or both, as dictated by min_unmapped_pages and min_slab_pages.
> shrink_node will do at least one of the two because otherwise node_reclaim
> returns early.
>
> __node_reclaim can detect when enough slab has been reclaimed because
> sc.reclaim_state.reclaimed_slab will tell us how many pages were
> reclaimed in shrink_slab.
>
> This issue is very easy to reproduce: first continuously cat a random
> non-existent file to produce more and more dentries, then read a big file
> to produce page cache. You will find that the dentries are never shrunk
> in node reclaim (they can only be shrunk by kswapd once the watermark is
> reached).
>
> Regarding vm.zone_reclaim_mode, we always set it to zero to disable node
> reclaim. Someone may prefer to enable it if their different workloads
> work on different nodes.

Considering that this is the long-term behavior of the rarely used node
reclaim, I would rather not touch it unless some _real_ workload suffers
from this behavior. Or is there any reason to fix this even though there
is no evidence of real workloads suffering from the current behavior?

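For reference, a minimal sketch of the reproduction described in the
changelog above; the file names, sizes, and monitored counters are
illustrative assumptions, not taken from the thread:

  # Generate an ever-growing pile of (negative) dentries:
  while true; do cat /tmp/no-such-file-$RANDOM 2>/dev/null; done &

  # Produce page cache by reading a big file:
  dd if=/dev/zero of=/tmp/bigfile bs=1M count=4096
  cat /tmp/bigfile > /dev/null

  # Watch the dentry counts; per the report, they keep growing under
  # node reclaim until kswapd kicks in at the watermark:
  watch -n1 cat /proc/sys/fs/dentry-state
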
On Tue 06-08-19 09:35:25, Michal Hocko wrote:
> On Tue 06-08-19 03:19:00, Yafang Shao wrote:
> > [...]
>
> Considering that this is the long-term behavior of the rarely used node
> reclaim, I would rather not touch it unless some _real_ workload suffers
> from this behavior. Or is there any reason to fix this even though there
> is no evidence of real workloads suffering from the current behavior?

I have only now noticed that you have added
Fixes: 1c30844d2dfe ("mm: reclaim small amounts of memory when an external fragmentation event occurs")

Could you be more specific about how that commit introduced a bug in node
reclaim? It introduced may_shrinkslab, but the direct reclaim path seems
to set it to 1.

On Tue, Aug 6, 2019 at 3:35 PM Michal Hocko <mhocko@kernel.org> wrote:
> On Tue 06-08-19 03:19:00, Yafang Shao wrote:
> > [...]
>
> Considering that this is the long-term behavior of the rarely used node
> reclaim, I would rather not touch it unless some _real_ workload suffers
> from this behavior. Or is there any reason to fix this even though there
> is no evidence of real workloads suffering from the current behavior?

When we do performance tuning on some workloads (especially if the
workload is NUMA sensitive), we sometimes enable it in our test
environment and then run benchmarks to decide whether or not to apply it
in the production environment. Although the results were not as good as
expected, it really is a performance-tuning knob.

Thanks
Yafang

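For context, the knob being discussed is vm.zone_reclaim_mode. A tuning
session of the kind Yafang describes might look like the following; the
values are illustrative, not recommendations from the thread:

  # Enable node-local reclaim before benchmarking:
  sysctl vm.zone_reclaim_mode=1    # bit 0: reclaim from the local node first
  sysctl vm.min_unmapped_ratio=1   # page-cache floor, percent of node pages
  sysctl vm.min_slab_ratio=5       # slab ceiling, percent of node pages

  # ... run the NUMA-sensitive benchmark ...

  # Disable it again if the numbers do not justify it:
  sysctl vm.zone_reclaim_mode=0
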
On Tue, Aug 6, 2019 at 3:41 PM Michal Hocko <mhocko@kernel.org> wrote:
> On Tue 06-08-19 09:35:25, Michal Hocko wrote:
> > [...]
>
> I have only now noticed that you have added
> Fixes: 1c30844d2dfe ("mm: reclaim small amounts of memory when an external fragmentation event occurs")
>
> Could you be more specific about how that commit introduced a bug in node
> reclaim? It introduced may_shrinkslab, but the direct reclaim path seems
> to set it to 1.

As you said, the direct reclaim path sets it to 1, but __node_reclaim()
forgot to set may_shrinkslab.

Thanks
Yafang

On Tue 06-08-19 16:57:22, Yafang Shao wrote:
> On Tue, Aug 6, 2019 at 3:41 PM Michal Hocko <mhocko@kernel.org> wrote:
> > [...]
> > Could you be more specific about how that commit introduced a bug in node
> > reclaim? It introduced may_shrinkslab, but the direct reclaim path seems
> > to set it to 1.
>
> As you said, the direct reclaim path sets it to 1, but __node_reclaim()
> forgot to set may_shrinkslab.

OK, I am obviously blind. Sorry about that. Anyway, why can't we simply
get back to the original behavior by setting may_shrinkslab in that path
as well?

On Tue, Aug 6, 2019 at 5:05 PM Michal Hocko <mhocko@kernel.org> wrote:
> On Tue 06-08-19 16:57:22, Yafang Shao wrote:
> > [...]
> > As you said, the direct reclaim path sets it to 1, but __node_reclaim()
> > forgot to set may_shrinkslab.
>
> OK, I am obviously blind. Sorry about that. Anyway, why can't we simply
> get back to the original behavior by setting may_shrinkslab in that path
> as well?

You mean do it as commit 0ff38490c836 did before? I haven't checked in
which commit shrink_slab() was removed from __node_reclaim(). But I think
introducing a flag makes it more robust; otherwise we would have to modify
shrink_node() and there would be more changes.

Thanks
Yafang

On Tue 06-08-19 17:15:05, Yafang Shao wrote:
> On Tue, Aug 6, 2019 at 5:05 PM Michal Hocko <mhocko@kernel.org> wrote:
> [...]
> > > As you said, the direct reclaim path sets it to 1, but
> > > __node_reclaim() forgot to set may_shrinkslab.
> >
> > OK, I am obviously blind. Sorry about that. Anyway, why can't we simply
> > get back to the original behavior by setting may_shrinkslab in that
> > path as well?
>
> You mean do it as commit 0ff38490c836 did before?
> I haven't checked in which commit shrink_slab() was removed from

What I had in mind was essentially this:

diff --git a/mm/vmscan.c b/mm/vmscan.c
index 7889f583ced9..8011288a80e2 100644
--- a/mm/vmscan.c
+++ b/mm/vmscan.c
@@ -4088,6 +4093,7 @@ static int __node_reclaim(struct pglist_data *pgdat, gfp_t gfp_mask, unsigned in
 		.may_unmap = !!(node_reclaim_mode & RECLAIM_UNMAP),
 		.may_swap = 1,
 		.reclaim_idx = gfp_zone(gfp_mask),
+		.may_shrinkslab = 1,
 	};
 
 	trace_mm_vmscan_node_reclaim_begin(pgdat->node_id, order,

The shrink_node path already shrinks slab when the flag allows it. In
other words, this gets us back to before 1c30844d2dfe, which clearly
changed the long-term node reclaim behavior just recently.

On Tue, Aug 6, 2019 at 5:25 PM Michal Hocko <mhocko@kernel.org> wrote:
> On Tue 06-08-19 17:15:05, Yafang Shao wrote:
> > [...]
>
> What I had in mind was essentially this:
>
> diff --git a/mm/vmscan.c b/mm/vmscan.c
> [...]
> +		.may_shrinkslab = 1,
>
> The shrink_node path already shrinks slab when the flag allows it. In
> other words, this gets us back to before 1c30844d2dfe, which clearly
> changed the long-term node reclaim behavior just recently.

If we do it like this, then vm.min_slab_ratio will not take effect if
there is enough reclaimable page cache. So there seem to be bugs in the
original behavior as well.

Thanks
Yafang

On Tue, Aug 06, 2019 at 11:25:31AM +0200, Michal Hocko wrote:
> On Tue 06-08-19 17:15:05, Yafang Shao wrote:
> > [...]
>
> What I had in mind was essentially this:
>
> diff --git a/mm/vmscan.c b/mm/vmscan.c
> [...]
> +		.may_shrinkslab = 1,
>
> The shrink_node path already shrinks slab when the flag allows it. In
> other words, this gets us back to before 1c30844d2dfe, which clearly
> changed the long-term node reclaim behavior just recently.

I'd be fine with this change. It was not intentional to significantly
change the behaviour of node reclaim in that patch.

On Tue, Aug 6, 2019 at 5:50 PM Mel Gorman <mgorman@techsingularity.net> wrote:
> On Tue, Aug 06, 2019 at 11:25:31AM +0200, Michal Hocko wrote:
> > [...]
>
> I'd be fine with this change. It was not intentional to significantly
> change the behaviour of node reclaim in that patch.

But if we do it like this, there will be a bug in the vm.min_slab_ratio
knob, right?

Thanks
Yafang

On Tue 06-08-19 17:54:02, Yafang Shao wrote:
> On Tue, Aug 6, 2019 at 5:50 PM Mel Gorman <mgorman@techsingularity.net> wrote:
> > [...]
> > I'd be fine with this change. It was not intentional to significantly
> > change the behaviour of node reclaim in that patch.
>
> But if we do it like this, there will be a bug in the vm.min_slab_ratio
> knob, right?

Yes, and the answer to that is a question: why do we even care? Which
real-life workload suffers from the min_slab_ratio misbehavior? Also, it
is much preferable to fix an obvious bug/omission, which the lack of
may_shrinkslab in node reclaim seems to be, rather than do a larger
rewrite with harder-to-see changes.

Really, I wouldn't be opposing normally, but node_reclaim is an odd ball
that is rarely used, and changing its behavior based on some trivial
testing doesn't sound very convincing to me.

On Tue, Aug 6, 2019 at 6:28 PM Michal Hocko <mhocko@kernel.org> wrote:
> On Tue 06-08-19 17:54:02, Yafang Shao wrote:
> > [...]
> > But if we do it like this, there will be a bug in the vm.min_slab_ratio
> > knob, right?
>
> Yes, and the answer to that is a question: why do we even care? Which
> real-life workload suffers from the min_slab_ratio misbehavior? Also, it
> is much preferable to fix an obvious bug/omission, which the lack of
> may_shrinkslab in node reclaim seems to be, rather than do a larger
> rewrite with harder-to-see changes.

Fixing the bug in min_slab_ratio doesn't require much change: it just
introduces a new bit in scan_control, which doesn't require more space,
and an if-branch in shrink_node(), which doesn't take many CPU cycles
either. It will not take much maintenance either, as no_pagecache is 0 by
default, so we don't need to worry about forgetting it.

> Really, I wouldn't be opposing normally, but node_reclaim is an odd ball
> that is rarely used, and changing its behavior based on some trivial
> testing doesn't sound very convincing to me.

Well, I won't insist if Andrew prefers your change.

Thanks
Yafang

On Tue 06-08-19 18:59:52, Yafang Shao wrote:
> On Tue, Aug 6, 2019 at 6:28 PM Michal Hocko <mhocko@kernel.org> wrote:
> > [...]
>
> Fixing the bug in min_slab_ratio doesn't require much change: it just
> introduces a new bit in scan_control, which doesn't require more space,
> and an if-branch in shrink_node(), which doesn't take many CPU cycles
> either. It will not take much maintenance either, as no_pagecache is 0
> by default, so we don't need to worry about forgetting it.

You are still missing my point, I am afraid. I am not saying your change
is wrong or complex. I am saying that there is an established behavior
(even if wrong) that node-reclaim-dependent loads might rely on. Your
testing doesn't really suggest you have done much testing beyond the
targeted one, which is quite artificial to say the least.

Maybe there are workloads which do depend on proper min_slab_ratio
behavior, but it would be much more preferable to hear from them rather
than change the behavior based on code inspection and a microbenchmark.

Is my thinking more clear now?

On Tue, Aug 06, 2019 at 05:32:54PM +0800, Yafang Shao wrote:
> On Tue, Aug 6, 2019 at 5:25 PM Michal Hocko <mhocko@kernel.org> wrote:
> > [...]
> > The shrink_node path already shrinks slab when the flag allows it. In
> > other words, this gets us back to before 1c30844d2dfe, which clearly
> > changed the long-term node reclaim behavior just recently.
>
> If we do it like this, then vm.min_slab_ratio will not take effect if
> there is enough reclaimable page cache. So there seem to be bugs in the
> original behavior as well.

Typically that would be done as a separate patch with a standalone
justification for it. The first patch should simply restore expected
behaviour with a Fixes: tag noting that the change in behaviour was
unintentional.

On Tue, Aug 6, 2019 at 7:09 PM Michal Hocko <mhocko@kernel.org> wrote:
> On Tue 06-08-19 18:59:52, Yafang Shao wrote:
> > [...]
>
> You are still missing my point, I am afraid. I am not saying your change
> is wrong or complex. I am saying that there is an established behavior
> (even if wrong) that node-reclaim-dependent loads might rely on. Your
> testing doesn't really suggest you have done much testing beyond the
> targeted one, which is quite artificial to say the least.
>
> Maybe there are workloads which do depend on proper min_slab_ratio
> behavior, but it would be much more preferable to hear from them rather
> than change the behavior based on code inspection and a microbenchmark.
>
> Is my thinking more clear now?

Thanks for your clarification. I get your point now.

Thanks
Yafang

On Tue, Aug 6, 2019 at 7:15 PM Mel Gorman <mgorman@techsingularity.net> wrote:
> On Tue, Aug 06, 2019 at 05:32:54PM +0800, Yafang Shao wrote:
> > [...]
> > If we do it like this, then vm.min_slab_ratio will not take effect if
> > there is enough reclaimable page cache. So there seem to be bugs in
> > the original behavior as well.
>
> Typically that would be done as a separate patch with a standalone
> justification for it. The first patch should simply restore expected
> behaviour with a Fixes: tag noting that the change in behaviour was
> unintentional.

Sure, I will do it.

Thanks
Yafang

On Tue 06-08-19 18:59:52, Yafang Shao wrote:
> On Tue, Aug 6, 2019 at 6:28 PM Michal Hocko <mhocko@kernel.org> wrote:
> [...]
> > Really, I wouldn't be opposing normally, but node_reclaim is an odd
> > ball that is rarely used, and changing its behavior based on some
> > trivial testing doesn't sound very convincing to me.
>
> Well, I won't insist if Andrew prefers your change.

Btw. feel free to use the diff I've sent for the real patch with the
changelog. Thanks!

On Tue, Aug 06, 2019 at 04:23:29PM +0800, Yafang Shao wrote:
> On Tue, Aug 6, 2019 at 3:35 PM Michal Hocko <mhocko@kernel.org> wrote:
> > Considering that this is the long-term behavior of the rarely used
> > node reclaim, I would rather not touch it unless some _real_ workload
> > suffers from this behavior. Or is there any reason to fix this even
> > though there is no evidence of real workloads suffering from the
> > current behavior?
>
> When we do performance tuning on some workloads (especially if the
> workload is NUMA sensitive), we sometimes enable it in our test
> environment and then run benchmarks to decide whether or not to apply it
> in the production environment. Although the results were not as good as
> expected, it really is a performance-tuning knob.

So am I understanding correctly that you sometimes enable node reclaim in
production workloads when you find the numbers justify it? If so, which
ones?

On Tue, Aug 06, 2019 at 07:35:29PM +0800, Yafang Shao wrote:
> On Tue, Aug 6, 2019 at 7:15 PM Mel Gorman <mgorman@techsingularity.net> wrote:
> > [...]
> > Typically that would be done as a separate patch with a standalone
> > justification for it. The first patch should simply restore expected
> > behaviour with a Fixes: tag noting that the change in behaviour was
> > unintentional.
>
> Sure, I will do it.

Do you plan to send the second patch? If not, I think we should at least
update the documentation for the admittedly obscure vm.min_slab_ratio to
reflect its effect on node reclaim, which is currently none.

On Tue, Aug 6, 2019 at 11:29 PM Daniel Jordan <daniel.m.jordan@oracle.com> wrote:
> On Tue, Aug 06, 2019 at 04:23:29PM +0800, Yafang Shao wrote:
> > [...]
> > When we do performance tuning on some workloads (especially if the
> > workload is NUMA sensitive), we sometimes enable it in our test
> > environment and then run benchmarks to decide whether or not to apply
> > it in the production environment.
>
> So am I understanding correctly that you sometimes enable node reclaim
> in production workloads when you find the numbers justify it? If so,
> which ones?

We used to enable it in the production environment, but it caused some
latency spikes (because of memory pressure), so we had to disable it.
BTW, do you plan to enable it for your workload?

Thanks
Yafang

On Tue, Aug 6, 2019 at 11:59 PM Daniel Jordan <daniel.m.jordan@oracle.com> wrote:
> On Tue, Aug 06, 2019 at 07:35:29PM +0800, Yafang Shao wrote:
> > [...]
> > Sure, I will do it.
>
> Do you plan to send the second patch? If not, I think we should at least
> update the documentation for the admittedly obscure vm.min_slab_ratio to
> reflect its effect on node reclaim, which is currently none.

I don't have an explicit plan for when to post the second patch because
I'm not sure when it will be ready. If your workload depends on
vm.min_slab_ratio, you could post a fix for it if you would like to; I
would appreciate it.

I don't think it is a good idea to document it, because this is not a
limitation: it is really an issue.

Thanks
Yafang

On 8/6/19 9:00 PM, Yafang Shao wrote:
> We used to enable it in the production environment, but it caused some
> latency spikes (because of memory pressure), so we had to disable it.
> BTW, do you plan to enable it for your workload?

No, I haven't experimented with it.

On 8/6/19 9:03 PM, Yafang Shao wrote:
> On Tue, Aug 6, 2019 at 11:59 PM Daniel Jordan wrote:
> > Do you plan to send the second patch? If not, I think we should at
> > least update the documentation for the admittedly obscure
> > vm.min_slab_ratio to reflect its effect on node reclaim, which is
> > currently none.
>
> I don't have an explicit plan for when to post the second patch because
> I'm not sure when it will be ready. If your workload depends on
> vm.min_slab_ratio, you could post a fix for it if you would like to; I
> would appreciate it.

We have no workloads that depend on this, so I'll leave it to you to post.

> I don't think it is a good idea to document it, because this is not a
> limitation: it is really an issue.

Sure, if it's going to be fixed, no point in doing this.

diff --git a/mm/vmscan.c b/mm/vmscan.c
index 47aa215..7e2a8ac 100644
--- a/mm/vmscan.c
+++ b/mm/vmscan.c
@@ -91,6 +91,9 @@ struct scan_control {
 	/* e.g. boosted watermark reclaim leaves slabs alone */
 	unsigned int may_shrinkslab:1;
 
+	/* In node reclaim mode, we may shrink slab only */
+	unsigned int no_pagecache:1;
+
 	/*
 	 * Cgroups are not reclaimed below their configured memory.low,
 	 * unless we threaten to OOM. If any cgroups are skipped due to
@@ -2831,7 +2834,9 @@ static bool shrink_node(pg_data_t *pgdat, struct scan_control *sc)
 			reclaimed = sc->nr_reclaimed;
 			scanned = sc->nr_scanned;
-			shrink_node_memcg(pgdat, memcg, sc);
+
+			if (!sc->no_pagecache)
+				shrink_node_memcg(pgdat, memcg, sc);
 
 			if (sc->may_shrinkslab) {
 				shrink_slab(sc->gfp_mask, pgdat->node_id,
@@ -4268,6 +4273,10 @@ static int __node_reclaim(struct pglist_data *pgdat, gfp_t gfp_mask, unsigned in
 		.may_writepage = !!(node_reclaim_mode & RECLAIM_WRITE),
 		.may_unmap = !!(node_reclaim_mode & RECLAIM_UNMAP),
 		.may_swap = 1,
+		.may_shrinkslab = (node_page_state(pgdat, NR_SLAB_RECLAIMABLE) >
+				   pgdat->min_slab_pages),
+		.no_pagecache = (node_pagecache_reclaimable(pgdat) <=
+				 pgdat->min_unmapped_pages),
 		.reclaim_idx = gfp_zone(gfp_mask),
 	};
@@ -4285,15 +4294,13 @@ static int __node_reclaim(struct pglist_data *pgdat, gfp_t gfp_mask, unsigned in
 	p->flags |= PF_SWAPWRITE;
 	set_task_reclaim_state(p, &sc.reclaim_state);
 
-	if (node_pagecache_reclaimable(pgdat) > pgdat->min_unmapped_pages) {
-		/*
-		 * Free memory by calling shrink node with increasing
-		 * priorities until we have enough memory freed.
-		 */
-		do {
-			shrink_node(pgdat, &sc);
-		} while (sc.nr_reclaimed < nr_pages && --sc.priority >= 0);
-	}
+	/*
+	 * Free memory by calling shrink node with increasing
+	 * priorities until we have enough memory freed.
+	 */
+	do {
+		shrink_node(pgdat, &sc);
+	} while (sc.nr_reclaimed < nr_pages && --sc.priority >= 0);
 
 	set_task_reclaim_state(p, NULL);
 	current->flags &= ~PF_SWAPWRITE;

In node reclaim, may_shrinkslab is 0 by default, hence shrink_slab will
never be performed there, while shrink_slab should be performed if the
reclaimable slab is over the min slab limit.

Add scan_control::no_pagecache so shrink_node can decide to reclaim page
cache, slab, or both, as dictated by min_unmapped_pages and min_slab_pages.
shrink_node will do at least one of the two because otherwise node_reclaim
returns early.

__node_reclaim can detect when enough slab has been reclaimed because
sc.reclaim_state.reclaimed_slab will tell us how many pages were
reclaimed in shrink_slab.

This issue is very easy to reproduce: first continuously cat a random
non-existent file to produce more and more dentries, then read a big file
to produce page cache. You will find that the dentries are never shrunk
in node reclaim (they can only be shrunk by kswapd once the watermark is
reached).

Regarding vm.zone_reclaim_mode, we always set it to zero to disable node
reclaim. Someone may prefer to enable it if their different workloads
work on different nodes.

[Daniel improved the changelog]

Fixes: 1c30844d2dfe ("mm: reclaim small amounts of memory when an external fragmentation event occurs")
Signed-off-by: Yafang Shao <laoar.shao@gmail.com>
Cc: Daniel Jordan <daniel.m.jordan@oracle.com>
Cc: Mel Gorman <mgorman@techsingularity.net>
Cc: Christoph Lameter <cl@linux.com>
Cc: Michal Hocko <mhocko@kernel.org>
Cc: Yafang Shao <shaoyafang@didiglobal.com>
---
 mm/vmscan.c | 27 +++++++++++++++++----------
 1 file changed, 17 insertions(+), 10 deletions(-)
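
A rough sketch for checking the patch's effect, assuming
vm.zone_reclaim_mode is enabled and the reproduction shown earlier in the
thread is running; the node number and counters are illustrative
assumptions:

  # Reclaimable slab on the node under memory pressure:
  grep SReclaimable /sys/devices/system/node/node0/meminfo

  # Dentry cache population; with the patch applied, nr_dentry is
  # expected to drop once node reclaim runs while the node's reclaimable
  # slab is over min_slab_pages:
  cat /proc/sys/fs/dentry-state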