Message ID | 20221202223533.1785418-1-almasrymina@google.com (mailing list archive)
---|---
State | New
Series | [v3] mm: Add nodes= arg to memory.reclaim
On Fri, Dec 2, 2022 at 2:37 PM Mina Almasry <almasrymina@google.com> wrote: > > The nodes= arg instructs the kernel to only scan the given nodes for > proactive reclaim. For example use cases, consider a 2 tier memory system: > > nodes 0,1 -> top tier > nodes 2,3 -> second tier > > $ echo "1m nodes=0" > memory.reclaim > > This instructs the kernel to attempt to reclaim 1m memory from node 0. > Since node 0 is a top tier node, demotion will be attempted first. This > is useful to direct proactive reclaim to specific nodes that are under > pressure. > > $ echo "1m nodes=2,3" > memory.reclaim > > This instructs the kernel to attempt to reclaim 1m memory in the second tier, > since this tier of memory has no demotion targets the memory will be > reclaimed. > > $ echo "1m nodes=0,1" > memory.reclaim > > Instructs the kernel to reclaim memory from the top tier nodes, which can > be desirable according to the userspace policy if there is pressure on > the top tiers. Since these nodes have demotion targets, the kernel will > attempt demotion first. > > Since commit 3f1509c57b1b ("Revert "mm/vmscan: never demote for memcg > reclaim""), the proactive reclaim interface memory.reclaim does both > reclaim and demotion. Reclaim and demotion incur different latency costs > to the jobs in the cgroup. Demoted memory would still be addressable > by the userspace at a higher latency, but reclaimed memory would need to > incur a pagefault. > > The 'nodes' arg is useful to allow the userspace to control demotion > and reclaim independently according to its policy: if the memory.reclaim > is called on a node with demotion targets, it will attempt demotion first; > if it is called on a node without demotion targets, it will only attempt > reclaim. > > Acked-by: Michal Hocko <mhocko@suse.com> > Signed-off-by: Mina Almasry <almasrymina@google.com> > Acked-by: Shakeel Butt <shakeelb@google.com>
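To make the plumbing concrete, here is a rough sketch of how a write handler for memory.reclaim might parse the nodes= option described above. This is an illustration, not the actual patch: reclaim_parse_nodes() is an invented name, while nodelist_parse(), NODE_MASK_ALL and nodes_empty() are existing kernel helpers.

```c
/*
 * Illustrative sketch only, not the actual patch: parse the "nodes="
 * option from a memory.reclaim write of the form
 * "<size> nodes=<nodelist>". Defaults to all nodes when the option
 * is absent.
 */
static int reclaim_parse_nodes(char *buf, nodemask_t *mask)
{
	char *tok;

	*mask = NODE_MASK_ALL;	/* no nodes= given: all nodes eligible */

	while ((tok = strsep(&buf, " ")) != NULL) {
		if (!strncmp(tok, "nodes=", 6)) {
			/* nodelist_parse() accepts "0", "2,3", "0-1", ... */
			if (nodelist_parse(tok + 6, *mask) < 0)
				return -EINVAL;
			if (nodes_empty(*mask))
				return -EINVAL;
		}
	}
	return 0;
}
```

The parsed mask would then be threaded down into the reclaim path so that only the listed nodes are scanned.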
> On Dec 3, 2022, at 06:35, Mina Almasry <almasrymina@google.com> wrote: > > The nodes= arg instructs the kernel to only scan the given nodes for > proactive reclaim. For example use cases, consider a 2 tier memory system: > > nodes 0,1 -> top tier > nodes 2,3 -> second tier > > $ echo "1m nodes=0" > memory.reclaim > > This instructs the kernel to attempt to reclaim 1m memory from node 0. > Since node 0 is a top tier node, demotion will be attempted first. This > is useful to direct proactive reclaim to specific nodes that are under > pressure. > > $ echo "1m nodes=2,3" > memory.reclaim > > This instructs the kernel to attempt to reclaim 1m memory in the second tier, > since this tier of memory has no demotion targets the memory will be > reclaimed. > > $ echo "1m nodes=0,1" > memory.reclaim > > Instructs the kernel to reclaim memory from the top tier nodes, which can > be desirable according to the userspace policy if there is pressure on > the top tiers. Since these nodes have demotion targets, the kernel will > attempt demotion first. > > Since commit 3f1509c57b1b ("Revert "mm/vmscan: never demote for memcg > reclaim""), the proactive reclaim interface memory.reclaim does both > reclaim and demotion. Reclaim and demotion incur different latency costs > to the jobs in the cgroup. Demoted memory would still be addressable > by the userspace at a higher latency, but reclaimed memory would need to > incur a pagefault. > > The 'nodes' arg is useful to allow the userspace to control demotion > and reclaim independently according to its policy: if the memory.reclaim > is called on a node with demotion targets, it will attempt demotion first; > if it is called on a node without demotion targets, it will only attempt > reclaim. > > Acked-by: Michal Hocko <mhocko@suse.com> > Signed-off-by: Mina Almasry <almasrymina@google.com> Acked-by: Muchun Song <songmuchun@bytedance.com> Thanks.
On Fri 02-12-22 14:35:31, Mina Almasry wrote: > The nodes= arg instructs the kernel to only scan the given nodes for > proactive reclaim. For example use cases, consider a 2 tier memory system: > > nodes 0,1 -> top tier > nodes 2,3 -> second tier > > $ echo "1m nodes=0" > memory.reclaim > > This instructs the kernel to attempt to reclaim 1m memory from node 0. > Since node 0 is a top tier node, demotion will be attempted first. This > is useful to direct proactive reclaim to specific nodes that are under > pressure. > > $ echo "1m nodes=2,3" > memory.reclaim > > This instructs the kernel to attempt to reclaim 1m memory in the second tier, > since this tier of memory has no demotion targets the memory will be > reclaimed. > > $ echo "1m nodes=0,1" > memory.reclaim > > Instructs the kernel to reclaim memory from the top tier nodes, which can > be desirable according to the userspace policy if there is pressure on > the top tiers. Since these nodes have demotion targets, the kernel will > attempt demotion first. > > Since commit 3f1509c57b1b ("Revert "mm/vmscan: never demote for memcg > reclaim""), the proactive reclaim interface memory.reclaim does both > reclaim and demotion. Reclaim and demotion incur different latency costs > to the jobs in the cgroup. Demoted memory would still be addressable > by the userspace at a higher latency, but reclaimed memory would need to > incur a pagefault. > > The 'nodes' arg is useful to allow the userspace to control demotion > and reclaim independently according to its policy: if the memory.reclaim > is called on a node with demotion targets, it will attempt demotion first; > if it is called on a node without demotion targets, it will only attempt > reclaim. > > Acked-by: Michal Hocko <mhocko@suse.com> > Signed-off-by: Mina Almasry <almasrymina@google.com> After discussion in [1] I have realized that I haven't really thought through all the consequences of this patch and therefore I am retracting my ack here. I am not nacking the patch at this stage but I also think this shouldn't be merged now and we should really consider all the consequences. Let me summarize my main concerns here as well. The proposed implementation doesn't apply the provided nodemask to the whole reclaim process. This means that demotion can happen outside of the mask so the user request cannot really control demotion targets and that limits the interface should there be any need for a finer grained control in the future (see an example in [2]). Another problem is that this can limit future reclaim extensions because of existing assumptions of the interface [3] - specify only a top-tier node to force the aging without actually reclaiming any charges and (ab)use the interface only for aging on a multi-tier system. A change to the reclaim to not demote in some cases could break this usecase. My counter proposal would be to define the nodemask for memory.reclaim as a domain to constrain the charge reclaim. That means both aging and reclaim including demotion, which is a part of aging. This will allow controlling where to demote for balancing purposes (e.g. demote to node 2 rather than 3), which is impossible with the proposed scheme. [1] http://lkml.kernel.org/r/20221206023406.3182800-1-almasrymina@google.com [2] http://lkml.kernel.org/r/Y5bnRtJ6sojtjgVD@dhcp22.suse.cz [3] http://lkml.kernel.org/r/CAAPL-u8rgW-JACKUT5ChmGSJiTDABcDRjNzW_QxMjCTk9zO4sg@mail.gmail.com
On Mon, Dec 12, 2022 at 12:55 AM Michal Hocko <mhocko@suse.com> wrote: > > On Fri 02-12-22 14:35:31, Mina Almasry wrote: > > The nodes= arg instructs the kernel to only scan the given nodes for > > proactive reclaim. For example use cases, consider a 2 tier memory system: > > > > nodes 0,1 -> top tier > > nodes 2,3 -> second tier > > > > $ echo "1m nodes=0" > memory.reclaim > > > > This instructs the kernel to attempt to reclaim 1m memory from node 0. > > Since node 0 is a top tier node, demotion will be attempted first. This > > is useful to direct proactive reclaim to specific nodes that are under > > pressure. > > > > $ echo "1m nodes=2,3" > memory.reclaim > > > > This instructs the kernel to attempt to reclaim 1m memory in the second tier, > > since this tier of memory has no demotion targets the memory will be > > reclaimed. > > > > $ echo "1m nodes=0,1" > memory.reclaim > > > > Instructs the kernel to reclaim memory from the top tier nodes, which can > > be desirable according to the userspace policy if there is pressure on > > the top tiers. Since these nodes have demotion targets, the kernel will > > attempt demotion first. > > > > Since commit 3f1509c57b1b ("Revert "mm/vmscan: never demote for memcg > > reclaim""), the proactive reclaim interface memory.reclaim does both > > reclaim and demotion. Reclaim and demotion incur different latency costs > > to the jobs in the cgroup. Demoted memory would still be addressable > > by the userspace at a higher latency, but reclaimed memory would need to > > incur a pagefault. > > > > The 'nodes' arg is useful to allow the userspace to control demotion > > and reclaim independently according to its policy: if the memory.reclaim > > is called on a node with demotion targets, it will attempt demotion first; > > if it is called on a node without demotion targets, it will only attempt > > reclaim. > > > > Acked-by: Michal Hocko <mhocko@suse.com> > > Signed-off-by: Mina Almasry <almasrymina@google.com> > > After discussion in [1] I have realized that I haven't really thought > through all the consequences of this patch and therefore I am retracting > my ack here. I am not nacking the patch at this statge but I also think > this shouldn't be merged now and we should really consider all the > consequences. > > Let me summarize my main concerns here as well. The proposed > implementation doesn't apply the provided nodemask to the whole reclaim > process. This means that demotion can happen outside of the mask so the > the user request cannot really control demotion targets and that limits > the interface should there be any need for a finer grained control in > the future (see an example in [2]). > Another problem is that this can limit future reclaim extensions because > of existing assumptions of the interface [3] - specify only top-tier > node to force the aging without actually reclaiming any charges and > (ab)use the interface only for aging on multi-tier system. A change to > the reclaim to not demote in some cases could break this usecase. > I think this is correct. My use case is to request from the kernel to do demotion without reclaim in the cgroup, and the reason for that is stated in the commit message: "Reclaim and demotion incur different latency costs to the jobs in the cgroup. Demoted memory would still be addressable by the userspace at a higher latency, but reclaimed memory would need to incur a pagefault." 
For jobs of some latency tiers, we would like to trigger proactive demotion (which incurs relatively low latency on the job), but not trigger proactive reclaim (which incurs a pagefault). I initially had proposed a separate interface for this, but Johannes directed me to this interface instead in [1]. In the same email Johannes also tells me that Meta's reclaim stack relies on memory.reclaim triggering demotion, so it seems that I'm not the first to take a dependency on this. Additionally in [2] Johannes also says it would be great if in the long term reclaim policy and demotion policy do not diverge. [1] https://lore.kernel.org/linux-mm/Y35fw2JSAeAddONg@cmpxchg.org/ [2] https://lore.kernel.org/linux-mm/Y36fIGFCFKiocAd6@cmpxchg.org/ > My counter proposal would be to define the nodemask for memory.reclaim > as a domain to constrain the charge reclaim. That means both aging and > reclaim including demotion which is a part of aging. This will allow > to control where to demote for balancing purposes (e.g. demote to node 2 > rather than 3) which is impossible with the proposed scheme. > My understanding is that with this interface, in order to trigger demotion, I would want to list both the top tier nodes and the bottom tier nodes on the nodemask, and since the bottom tier nodes are in the nodemask the kernel will not just trigger demotion, but will also trigger reclaim. This is very specifically not our use case and not the goal of this patch. I had also suggested adding a demotion= arg to memory.reclaim so the userspace may customize this behavior, but Johannes rejected this in [3] to adhere to the aging pipeline. All in all I like Johannes's model in [3] describing the aging pipeline and the relationship between demotion and reclaim. The nodes= arg is just a hint to the kernel that the userspace is looking for reclaim from a top tier node (which would be done by demotion according to the aging pipeline) or a bottom tier node (which would be done by reclaim according to the aging pipeline). I think this interface is aligned with this model. [3] https://lore.kernel.org/linux-mm/Y36XchdgTCsMP4jT@cmpxchg.org/ > [1] http://lkml.kernel.org/r/20221206023406.3182800-1-almasrymina@google.com > [2] http://lkml.kernel.org/r/Y5bnRtJ6sojtjgVD@dhcp22.suse.cz > [3] http://lkml.kernel.org/r/CAAPL-u8rgW-JACKUT5ChmGSJiTDABcDRjNzW_QxMjCTk9zO4sg@mail.gmail.com > -- > Michal Hocko > SUSE Labs
Mina Almasry <almasrymina@google.com> writes: > On Mon, Dec 12, 2022 at 12:55 AM Michal Hocko <mhocko@suse.com> wrote: >> >> On Fri 02-12-22 14:35:31, Mina Almasry wrote: >> > The nodes= arg instructs the kernel to only scan the given nodes for >> > proactive reclaim. For example use cases, consider a 2 tier memory system: >> > >> > nodes 0,1 -> top tier >> > nodes 2,3 -> second tier >> > >> > $ echo "1m nodes=0" > memory.reclaim >> > >> > This instructs the kernel to attempt to reclaim 1m memory from node 0. >> > Since node 0 is a top tier node, demotion will be attempted first. This >> > is useful to direct proactive reclaim to specific nodes that are under >> > pressure. >> > >> > $ echo "1m nodes=2,3" > memory.reclaim >> > >> > This instructs the kernel to attempt to reclaim 1m memory in the second tier, >> > since this tier of memory has no demotion targets the memory will be >> > reclaimed. >> > >> > $ echo "1m nodes=0,1" > memory.reclaim >> > >> > Instructs the kernel to reclaim memory from the top tier nodes, which can >> > be desirable according to the userspace policy if there is pressure on >> > the top tiers. Since these nodes have demotion targets, the kernel will >> > attempt demotion first. >> > >> > Since commit 3f1509c57b1b ("Revert "mm/vmscan: never demote for memcg >> > reclaim""), the proactive reclaim interface memory.reclaim does both >> > reclaim and demotion. Reclaim and demotion incur different latency costs >> > to the jobs in the cgroup. Demoted memory would still be addressable >> > by the userspace at a higher latency, but reclaimed memory would need to >> > incur a pagefault. >> > >> > The 'nodes' arg is useful to allow the userspace to control demotion >> > and reclaim independently according to its policy: if the memory.reclaim >> > is called on a node with demotion targets, it will attempt demotion first; >> > if it is called on a node without demotion targets, it will only attempt >> > reclaim. >> > >> > Acked-by: Michal Hocko <mhocko@suse.com> >> > Signed-off-by: Mina Almasry <almasrymina@google.com> >> >> After discussion in [1] I have realized that I haven't really thought >> through all the consequences of this patch and therefore I am retracting >> my ack here. I am not nacking the patch at this statge but I also think >> this shouldn't be merged now and we should really consider all the >> consequences. >> >> Let me summarize my main concerns here as well. The proposed >> implementation doesn't apply the provided nodemask to the whole reclaim >> process. This means that demotion can happen outside of the mask so the >> the user request cannot really control demotion targets and that limits >> the interface should there be any need for a finer grained control in >> the future (see an example in [2]). >> Another problem is that this can limit future reclaim extensions because >> of existing assumptions of the interface [3] - specify only top-tier >> node to force the aging without actually reclaiming any charges and >> (ab)use the interface only for aging on multi-tier system. A change to >> the reclaim to not demote in some cases could break this usecase. >> > > I think this is correct. My use case is to request from the kernel to > do demotion without reclaim in the cgroup, and the reason for that is > stated in the commit message: > > "Reclaim and demotion incur different latency costs to the jobs in the > cgroup. 
Demoted memory would still be addressable by the userspace at > a higher latency, but reclaimed memory would need to incur a > pagefault." > > For jobs of some latency tiers, we would like to trigger proactive > demotion (which incurs relatively low latency on the job), but not > trigger proactive reclaim (which incurs a pagefault). I initially had > proposed a separate interface for this, but Johannes directed me to > this interface instead in [1]. In the same email Johannes also tells > me that meta's reclaim stack relies on memory.reclaim triggering > demotion, so it seems that I'm not the first to take a dependency on > this. Additionally in [2] Johannes also says it would be great if in > the long term reclaim policy and demotion policy do not diverge. > > [1] https://lore.kernel.org/linux-mm/Y35fw2JSAeAddONg@cmpxchg.org/ > [2] https://lore.kernel.org/linux-mm/Y36fIGFCFKiocAd6@cmpxchg.org/ After these discussions, I think the solution may be to use different interfaces for "proactive demote" and "proactive reclaim". That is, reconsider "memory.demote". In this way, we will always uncharge the cgroup for "memory.reclaim". This avoids the possible confusion there. And, because demotion is considered aging, we don't need to disable demotion for "memory.reclaim", just don't count it. Best Regards, Huang, Ying
On Mon, Dec 12, 2022 at 10:32 PM Huang, Ying <ying.huang@intel.com> wrote: > > Mina Almasry <almasrymina@google.com> writes: > > > On Mon, Dec 12, 2022 at 12:55 AM Michal Hocko <mhocko@suse.com> wrote: > >> > >> On Fri 02-12-22 14:35:31, Mina Almasry wrote: > >> > The nodes= arg instructs the kernel to only scan the given nodes for > >> > proactive reclaim. For example use cases, consider a 2 tier memory system: > >> > > >> > nodes 0,1 -> top tier > >> > nodes 2,3 -> second tier > >> > > >> > $ echo "1m nodes=0" > memory.reclaim > >> > > >> > This instructs the kernel to attempt to reclaim 1m memory from node 0. > >> > Since node 0 is a top tier node, demotion will be attempted first. This > >> > is useful to direct proactive reclaim to specific nodes that are under > >> > pressure. > >> > > >> > $ echo "1m nodes=2,3" > memory.reclaim > >> > > >> > This instructs the kernel to attempt to reclaim 1m memory in the second tier, > >> > since this tier of memory has no demotion targets the memory will be > >> > reclaimed. > >> > > >> > $ echo "1m nodes=0,1" > memory.reclaim > >> > > >> > Instructs the kernel to reclaim memory from the top tier nodes, which can > >> > be desirable according to the userspace policy if there is pressure on > >> > the top tiers. Since these nodes have demotion targets, the kernel will > >> > attempt demotion first. > >> > > >> > Since commit 3f1509c57b1b ("Revert "mm/vmscan: never demote for memcg > >> > reclaim""), the proactive reclaim interface memory.reclaim does both > >> > reclaim and demotion. Reclaim and demotion incur different latency costs > >> > to the jobs in the cgroup. Demoted memory would still be addressable > >> > by the userspace at a higher latency, but reclaimed memory would need to > >> > incur a pagefault. > >> > > >> > The 'nodes' arg is useful to allow the userspace to control demotion > >> > and reclaim independently according to its policy: if the memory.reclaim > >> > is called on a node with demotion targets, it will attempt demotion first; > >> > if it is called on a node without demotion targets, it will only attempt > >> > reclaim. > >> > > >> > Acked-by: Michal Hocko <mhocko@suse.com> > >> > Signed-off-by: Mina Almasry <almasrymina@google.com> > >> > >> After discussion in [1] I have realized that I haven't really thought > >> through all the consequences of this patch and therefore I am retracting > >> my ack here. I am not nacking the patch at this statge but I also think > >> this shouldn't be merged now and we should really consider all the > >> consequences. > >> > >> Let me summarize my main concerns here as well. The proposed > >> implementation doesn't apply the provided nodemask to the whole reclaim > >> process. This means that demotion can happen outside of the mask so the > >> the user request cannot really control demotion targets and that limits > >> the interface should there be any need for a finer grained control in > >> the future (see an example in [2]). > >> Another problem is that this can limit future reclaim extensions because > >> of existing assumptions of the interface [3] - specify only top-tier > >> node to force the aging without actually reclaiming any charges and > >> (ab)use the interface only for aging on multi-tier system. A change to > >> the reclaim to not demote in some cases could break this usecase. > >> > > > > I think this is correct. 
My use case is to request from the kernel to > > do demotion without reclaim in the cgroup, and the reason for that is > > stated in the commit message: > > > > "Reclaim and demotion incur different latency costs to the jobs in the > > cgroup. Demoted memory would still be addressable by the userspace at > > a higher latency, but reclaimed memory would need to incur a > > pagefault." > > > > For jobs of some latency tiers, we would like to trigger proactive > > demotion (which incurs relatively low latency on the job), but not > > trigger proactive reclaim (which incurs a pagefault). I initially had > > proposed a separate interface for this, but Johannes directed me to > > this interface instead in [1]. In the same email Johannes also tells > > me that meta's reclaim stack relies on memory.reclaim triggering > > demotion, so it seems that I'm not the first to take a dependency on > > this. Additionally in [2] Johannes also says it would be great if in > > the long term reclaim policy and demotion policy do not diverge. > > > > [1] https://lore.kernel.org/linux-mm/Y35fw2JSAeAddONg@cmpxchg.org/ > > [2] https://lore.kernel.org/linux-mm/Y36fIGFCFKiocAd6@cmpxchg.org/ > > After these discussion, I think the solution maybe use different > interfaces for "proactive demote" and "proactive reclaim". That is, > reconsider "memory.demote". In this way, we will always uncharge the > cgroup for "memory.reclaim". This avoid the possible confusion there. > And, because demotion is considered aging, we don't need to disable > demotion for "memory.reclaim", just don't count it. +1 on memory.demote. > Best Regards, > Huang, Ying >
On Mon 12-12-22 16:54:27, Mina Almasry wrote: > On Mon, Dec 12, 2022 at 12:55 AM Michal Hocko <mhocko@suse.com> wrote: [...] > > Let me summarize my main concerns here as well. The proposed > > implementation doesn't apply the provided nodemask to the whole reclaim > > process. This means that demotion can happen outside of the mask so the > > the user request cannot really control demotion targets and that limits > > the interface should there be any need for a finer grained control in > > the future (see an example in [2]). > > Another problem is that this can limit future reclaim extensions because > > of existing assumptions of the interface [3] - specify only top-tier > > node to force the aging without actually reclaiming any charges and > > (ab)use the interface only for aging on multi-tier system. A change to > > the reclaim to not demote in some cases could break this usecase. > > > > I think this is correct. My use case is to request from the kernel to > do demotion without reclaim in the cgroup, and the reason for that is > stated in the commit message: > > "Reclaim and demotion incur different latency costs to the jobs in the > cgroup. Demoted memory would still be addressable by the userspace at > a higher latency, but reclaimed memory would need to incur a > pagefault." > > For jobs of some latency tiers, we would like to trigger proactive > demotion (which incurs relatively low latency on the job), but not > trigger proactive reclaim (which incurs a pagefault). I initially had > proposed a separate interface for this, but Johannes directed me to > this interface instead in [1]. In the same email Johannes also tells > me that meta's reclaim stack relies on memory.reclaim triggering > demotion, so it seems that I'm not the first to take a dependency on > this. Additionally in [2] Johannes also says it would be great if in > the long term reclaim policy and demotion policy do not diverge. I do recognize your need to control the demotion but I argue that it is a bad idea to rely on an implicit behavior of the memory reclaim and an interface which is _documented_ to primarily _reclaim_ memory. Really, consider that the current demotion implementation will change in the future and, based on a newly added heuristic, memory reclaim or compression would be preferred over migration to a different tier. This might completely break your current assumptions and break your usecase which relies on an implicit demotion behavior. Do you see that as a potential problem at all? What shall we do in that case? Special case memory.reclaim behavior? Now to your specific usecase. If there is a need to do a memory distribution balancing then fine, but this should be a well defined interface. E.g. is there a need to not only control demotion but promotions as well? I haven't heard anybody requesting that so far but I can easily imagine that, like outsourcing the memory reclaim to the userspace, someone might want to do the same thing with the numa balancing because $REASONS. Should that ever happen, I am pretty sure hooking into memory.reclaim is not really a great idea. See where I am coming from? > [1] https://lore.kernel.org/linux-mm/Y35fw2JSAeAddONg@cmpxchg.org/ > [2] https://lore.kernel.org/linux-mm/Y36fIGFCFKiocAd6@cmpxchg.org/
On Tue 13-12-22 14:30:57, Huang, Ying wrote: > Mina Almasry <almasrymina@google.com> writes: [...] > After these discussion, I think the solution maybe use different > interfaces for "proactive demote" and "proactive reclaim". That is, > reconsider "memory.demote". In this way, we will always uncharge the > cgroup for "memory.reclaim". This avoid the possible confusion there. > And, because demotion is considered aging, we don't need to disable > demotion for "memory.reclaim", just don't count it. As already pointed out in my previous email, we should really think more about future requirements. Do we add a memory.promote interface when there is a request to implement numa balancing into the userspace? Maybe yes, but maybe the node balancing should be more generic than bound to memory tiering and apply to a more fine grained nodemask control. Fundamentally we already have APIs to age (MADV_COLD, MADV_FREE), reclaim (MADV_PAGEOUT, MADV_DONTNEED) and MADV_WILLNEED to prioritize (swap in, or read ahead), which are per mm/file. Their primary usability issue is that they are process centric and that requires a very deep understanding of the process mm layout, so it is not really usable for a larger scale orchestration. The important part of those interfaces is that they do not talk about demotion because that is an implementation detail. I think we want to follow that model at least. From a higher level POV I believe we really need an interface to age&reclaim and balance memory among nodes. Are there more higher level usecases?
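The process-centric nature of the madvise(2) calls Michal lists can be seen directly: the caller must name a concrete address range inside its own mapping, which is exactly what an external orchestrator does not know. A minimal userspace sketch, assuming a kernel >= 5.4 and a libc that exposes MADV_COLD and MADV_PAGEOUT:

```c
/*
 * The per-mm APIs Michal mentions are only usable with knowledge of
 * the process's own address space layout, which is what makes them
 * hard to drive from an external orchestrator.
 */
#include <stdlib.h>
#include <sys/mman.h>

int main(void)
{
	size_t len = 64 << 20;			/* 64 MiB, page-multiple */
	void *buf = aligned_alloc(4096, len);	/* madvise wants page alignment */

	if (!buf)
		return 1;
	madvise(buf, len, MADV_COLD);		/* age: deactivate the range */
	madvise(buf, len, MADV_PAGEOUT);	/* reclaim: page the range out */
	free(buf);
	return 0;
}
```

memory.reclaim, by contrast, takes a cgroup-wide byte target and requires no knowledge of any process's mm layout.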
On Tue, Dec 13, 2022 at 02:30:57PM +0800, Huang, Ying wrote: > Mina Almasry <almasrymina@google.com> writes: > > > On Mon, Dec 12, 2022 at 12:55 AM Michal Hocko <mhocko@suse.com> wrote: > >> > >> On Fri 02-12-22 14:35:31, Mina Almasry wrote: > >> > The nodes= arg instructs the kernel to only scan the given nodes for > >> > proactive reclaim. For example use cases, consider a 2 tier memory system: > >> > > >> > nodes 0,1 -> top tier > >> > nodes 2,3 -> second tier > >> > > >> > $ echo "1m nodes=0" > memory.reclaim > >> > > >> > This instructs the kernel to attempt to reclaim 1m memory from node 0. > >> > Since node 0 is a top tier node, demotion will be attempted first. This > >> > is useful to direct proactive reclaim to specific nodes that are under > >> > pressure. > >> > > >> > $ echo "1m nodes=2,3" > memory.reclaim > >> > > >> > This instructs the kernel to attempt to reclaim 1m memory in the second tier, > >> > since this tier of memory has no demotion targets the memory will be > >> > reclaimed. > >> > > >> > $ echo "1m nodes=0,1" > memory.reclaim > >> > > >> > Instructs the kernel to reclaim memory from the top tier nodes, which can > >> > be desirable according to the userspace policy if there is pressure on > >> > the top tiers. Since these nodes have demotion targets, the kernel will > >> > attempt demotion first. > >> > > >> > Since commit 3f1509c57b1b ("Revert "mm/vmscan: never demote for memcg > >> > reclaim""), the proactive reclaim interface memory.reclaim does both > >> > reclaim and demotion. Reclaim and demotion incur different latency costs > >> > to the jobs in the cgroup. Demoted memory would still be addressable > >> > by the userspace at a higher latency, but reclaimed memory would need to > >> > incur a pagefault. > >> > > >> > The 'nodes' arg is useful to allow the userspace to control demotion > >> > and reclaim independently according to its policy: if the memory.reclaim > >> > is called on a node with demotion targets, it will attempt demotion first; > >> > if it is called on a node without demotion targets, it will only attempt > >> > reclaim. > >> > > >> > Acked-by: Michal Hocko <mhocko@suse.com> > >> > Signed-off-by: Mina Almasry <almasrymina@google.com> > >> > >> After discussion in [1] I have realized that I haven't really thought > >> through all the consequences of this patch and therefore I am retracting > >> my ack here. I am not nacking the patch at this statge but I also think > >> this shouldn't be merged now and we should really consider all the > >> consequences. > >> > >> Let me summarize my main concerns here as well. The proposed > >> implementation doesn't apply the provided nodemask to the whole reclaim > >> process. This means that demotion can happen outside of the mask so the > >> the user request cannot really control demotion targets and that limits > >> the interface should there be any need for a finer grained control in > >> the future (see an example in [2]). > >> Another problem is that this can limit future reclaim extensions because > >> of existing assumptions of the interface [3] - specify only top-tier > >> node to force the aging without actually reclaiming any charges and > >> (ab)use the interface only for aging on multi-tier system. A change to > >> the reclaim to not demote in some cases could break this usecase. > >> > > > > I think this is correct. 
My use case is to request from the kernel to > > do demotion without reclaim in the cgroup, and the reason for that is > > stated in the commit message: > > > > "Reclaim and demotion incur different latency costs to the jobs in the > > cgroup. Demoted memory would still be addressable by the userspace at > > a higher latency, but reclaimed memory would need to incur a > > pagefault." > > > > For jobs of some latency tiers, we would like to trigger proactive > > demotion (which incurs relatively low latency on the job), but not > > trigger proactive reclaim (which incurs a pagefault). I initially had > > proposed a separate interface for this, but Johannes directed me to > > this interface instead in [1]. In the same email Johannes also tells > > me that meta's reclaim stack relies on memory.reclaim triggering > > demotion, so it seems that I'm not the first to take a dependency on > > this. Additionally in [2] Johannes also says it would be great if in > > the long term reclaim policy and demotion policy do not diverge. > > > > [1] https://lore.kernel.org/linux-mm/Y35fw2JSAeAddONg@cmpxchg.org/ > > [2] https://lore.kernel.org/linux-mm/Y36fIGFCFKiocAd6@cmpxchg.org/ > > After these discussion, I think the solution maybe use different > interfaces for "proactive demote" and "proactive reclaim". That is, > reconsider "memory.demote". In this way, we will always uncharge the > cgroup for "memory.reclaim". This avoid the possible confusion there. > And, because demotion is considered aging, we don't need to disable > demotion for "memory.reclaim", just don't count it. Hm, so in summary: 1) memory.reclaim would demote and reclaim like today, but it would change to only count reclaimed pages against the goal. 2) memory.demote would only demote. a) What if the demotion targets are full? Would it reclaim or fail? 3) Would memory.reclaim and memory.demote still need nodemasks? Would they return -EINVAL if a) memory.reclaim gets passed only toptier nodes or b) memory.demote gets passed any lasttier nodes?
Michal Hocko <mhocko@suse.com> writes: > On Tue 13-12-22 14:30:57, Huang, Ying wrote: >> Mina Almasry <almasrymina@google.com> writes: > [...] >> After these discussion, I think the solution maybe use different >> interfaces for "proactive demote" and "proactive reclaim". That is, >> reconsider "memory.demote". In this way, we will always uncharge the >> cgroup for "memory.reclaim". This avoid the possible confusion there. >> And, because demotion is considered aging, we don't need to disable >> demotion for "memory.reclaim", just don't count it. > > As already pointed out in my previous email, we should really think more > about future requirements. Do we add memory.promote interface when there > is a request to implement numa balancing into the userspace? Maybe yes > but maybe the node balancing should be more generic than bound to memory > tiering and apply to a more fine grained nodemask control. > > Fundamentally we already have APIs to age (MADV_COLD, MADV_FREE), > reclaim (MADV_PAGEOUT, MADV_DONTNEED) and MADV_WILLNEED to prioritize > (swap in, or read ahead) which are per mm/file. Their primary usability > issue is that they are process centric and that requires a very deep > understanding of the process mm layout so it is not really usable for a > larger scale orchestration. > The important part of those interfaces is that they do not talk about > demotion because that is an implementation detail. I think we want to > follow that model at least. From a higher level POV I believe we really > need an interface to age&reclaim and balance memory among nodes. Are > there more higher level usecases? Yes. If the high level interfaces can satisfy the requirements, we should use or define them. But I guess Mina and Xu have some requirements at the level of memory tiers (demotion/promotion)? Best Regards, Huang, Ying
On Tue 13-12-22 14:30:40, Johannes Weiner wrote: > On Tue, Dec 13, 2022 at 02:30:57PM +0800, Huang, Ying wrote: [...] > > After these discussion, I think the solution maybe use different > > interfaces for "proactive demote" and "proactive reclaim". That is, > > reconsider "memory.demote". In this way, we will always uncharge the > > cgroup for "memory.reclaim". This avoid the possible confusion there. > > And, because demotion is considered aging, we don't need to disable > > demotion for "memory.reclaim", just don't count it. > > Hm, so in summary: > > 1) memory.reclaim would demote and reclaim like today, but it would > change to only count reclaimed pages against the goal. > > 2) memory.demote would only demote. > > a) What if the demotion targets are full? Would it reclaim or fail? > > 3) Would memory.reclaim and memory.demote still need nodemasks? Would > they return -EINVAL if a) memory.reclaim gets passed only toptier > nodes or b) memory.demote gets passed any lasttier nodes? I would also add 4) Do we want to allow controlling the demotion path (e.g. which node to demote from and to) and how to achieve that? 5) Is the demotion api restricted to multi-tier systems or any numa configuration allowed as well?
On Tue, Dec 13, 2022 at 09:33:24AM +0100, Michal Hocko wrote: > I do recognize your need to control the demotion but I argue that it is > a bad idea to rely on an implicit behavior of the memory reclaim and an > interface which is _documented_ to primarily _reclaim_ memory. I think memory.reclaim should demote as part of page aging. What I'd like to avoid is *having* to manually control the aging component in the interface (e.g. making memory.reclaim *only* reclaim, and *requiring* a coordinated use of memory.demote to ensure progress.) > Really, consider that the current demotion implementation will change > in the future and based on a newly added heuristic memory reclaim or > compression would be preferred over migration to a different tier. This > might completely break your current assumptions and break your usecase > which relies on an implicit demotion behavior. Do you see that as a > potential problem at all? What shall we do in that case? Special case > memory.reclaim behavior? Shouldn't that be derived from the distance properties in the tier configuration? I.e. if local compression is faster than demoting to a slower node, we should maybe have a separate tier for that. Ignoring proactive reclaim or demotion commands for a second: on that node, global memory pressure should always compress first, while the oldest pages from the compression cache should demote to the other node(s) - until they eventually get swapped out. However fine-grained we make proactive reclaim control over these stages, it should at least be possible for the user to request the default behavior that global pressure follows, without jumping through hoops or requiring the coordinated use of multiple knobs. So IMO there is an argument for having a singular knob that requests comprehensive aging and reclaiming across the configured hierarchy. As far as explicit control over the individual stages goes - no idea if you would call the compression stage demotion or reclaim. The distinction still does not make much sense to me, since reclaim is just another form of demotion. Sure, page faults have a different access latency than dax to slower memory. But you could also have 3 tiers of memory where the difference between tier 1 and 2 is much smaller than the difference between 2 and 3, and you might want to apply different demotion rates between them as well. The other argument is that demotion does not free cgroup memory, whereas reclaim does. But with multiple memory tiers of vastly different performance, isn't there also an argument for granting cgroups different shares of each memory? So that a higher priority group has access to a bigger share of the fastest memory, and lower prio cgroups are relegated to lower tiers. If we split those pools, then "demotion" will actually free memory in a cgroup. This is why I liked adding a nodes= argument to memory.reclaim the best. It doesn't encode a distinction that may not last for long. The problem comes from how to interpret the input argument and the return value, right? Could we solve this by requiring the passed nodes= to all be of the same memory tier? Then there is no confusion around what is requested and what the return value means. And if no nodes are passed, it means reclaim (from the lowest memory tier) X pages and demote as needed, then return the reclaimed pages. > Now to your specific usecase. If there is a need to do a memory > distribution balancing then fine but this should be a well defined > interface. E.g.
is there a need to not only control demotion but > promotions as well? I haven't heard anybody requesting that so far > but I can easily imagine that like outsourcing the memory reclaim to > the userspace someone might want to do the same thing with the numa > balancing because $REASONS. Should that ever happen, I am pretty sure > hooking into memory.reclaim is not really a great idea. Should this ever happen, it would seem fair that that be a separate knob anyway, no? One knob to move the pipeline in one direction (aging), one knob to move it the other way.
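A sketch of the same-tier validation Johannes suggests, reduced here to the two-tier case. node_is_toptier() is the existing helper from <linux/memory-tiers.h>; nodes_in_one_tier() is an invented name, and a general N-tier system would need to compare the nodes' memory tiers directly rather than a top/non-top flag:

```c
/*
 * Sketch: reject a nodes= mask that mixes memory tiers, reduced to
 * the two-tier case via node_is_toptier(). Invented helper name,
 * not an existing kernel function.
 */
static bool nodes_in_one_tier(const nodemask_t *mask)
{
	bool top;
	int nid;

	if (nodes_empty(*mask))
		return false;

	top = node_is_toptier(first_node(*mask));
	for_each_node_mask(nid, *mask) {
		if (node_is_toptier(nid) != top)
			return false;	/* mixed tiers: caller returns -EINVAL */
	}
	return true;
}
```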
On Tue, Dec 13, 2022 at 6:03 AM Michal Hocko <mhocko@suse.com> wrote: > > On Tue 13-12-22 14:30:40, Johannes Weiner wrote: > > On Tue, Dec 13, 2022 at 02:30:57PM +0800, Huang, Ying wrote: > [...] > > > After these discussion, I think the solution maybe use different > > > interfaces for "proactive demote" and "proactive reclaim". That is, > > > reconsider "memory.demote". In this way, we will always uncharge the > > > cgroup for "memory.reclaim". This avoid the possible confusion there. > > > And, because demotion is considered aging, we don't need to disable > > > demotion for "memory.reclaim", just don't count it. > > > > Hm, so in summary: > > > > 1) memory.reclaim would demote and reclaim like today, but it would > > change to only count reclaimed pages against the goal. > > > > 2) memory.demote would only demote. > > If the above 2 points are agreeable then yes, this sounds good to me and does address our use case. > > a) What if the demotion targets are full? Would it reclaim or fail? > > Wei will chime in if he disagrees, but I think we _require_ that it fails, not falls back to reclaim. The interface is asking for demotion, and is called memory.demote. For such an interface to fall back to reclaim would be very confusing to userspace and may trigger reclaim on a high priority job that we want to shield from proactive reclaim. > > 3) Would memory.reclaim and memory.demote still need nodemasks? memory.demote will need a nodemask, for sure. Today the nodemask would be useful if there is a specific node in the top tier that is overloaded and we want to reduce the pressure by demoting. In the future there will be N tiers and the nodemask says which tier to demote from. I don't think memory.reclaim would need a nodemask anymore? At least I no longer see the use for it for us. > > Would > > they return -EINVAL if a) memory.reclaim gets passed only toptier > > nodes or b) memory.demote gets passed any lasttier nodes? > Honestly it would be great if memory.reclaim can force reclaim from top tier nodes. It breaks the aging pipeline, yes, but if the user is specifically asking for that because they decided in their usecase it's a good idea then the kernel should comply IMO. Not a strict requirement for us. Wei will chime in if he disagrees. memory.demote returning -EINVAL for lasttier nodes makes sense to me. > I would also add > 4) Do we want to allow to control the demotion path (e.g. which node to > demote from and to) and how to achieve that? We care deeply about specifying which node to demote _from_. That would be some node that is approaching pressure and we're looking for proactive saving from. So far I haven't seen any reason to control which nodes to demote _to_. The kernel deciding that based on the aging pipeline and the node distances sounds good to me. Obviously someone else may find that useful. > 5) Is the demotion api restricted to multi-tier systems or any numa > configuration allowed as well? > Demotion will of course not work on single tiered systems. The interface may return some failure on such systems or not be available at all. > -- > Michal Hocko > SUSE Labs
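Under the semantics Mina describes, a hypothetical memory.demote handler could validate its nodemask as below. The interface itself does not exist; only next_demotion_node() (from <linux/memory-tiers.h>, returning NUMA_NO_NODE when a node has no demotion target) is a real helper here:

```c
/*
 * Hypothetical validation for a memory.demote nodemask: every node
 * passed must have somewhere to demote to, otherwise the write fails
 * with -EINVAL rather than falling back to reclaim.
 */
static int demote_validate_nodes(const nodemask_t *mask)
{
	int nid;

	for_each_node_mask(nid, *mask) {
		/* Last-tier node: no demotion target below it. */
		if (next_demotion_node(nid) == NUMA_NO_NODE)
			return -EINVAL;
	}
	return 0;
}
```

This only covers the -EINVAL cases; what to do when a demotion target exists but is full is the separate question raised below.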
On Tue, Dec 13, 2022 at 7:58 AM Johannes Weiner <hannes@cmpxchg.org> wrote: > > On Tue, Dec 13, 2022 at 09:33:24AM +0100, Michal Hocko wrote: > > I do recognize your need to control the demotion but I argue that it is > > a bad idea to rely on an implicit behavior of the memory reclaim and an > > interface which is _documented_ to primarily _reclaim_ memory. > > I think memory.reclaim should demote as part of page aging. What I'd > like to avoid is *having* to manually control the aging component in > the interface (e.g. making memory.reclaim *only* reclaim, and > *requiring* a coordinated use of memory.demote to ensure progress.) > > > Really, consider that the current demotion implementation will change > > in the future and based on a newly added heuristic memory reclaim or > > compression would be preferred over migration to a different tier. This > > might completely break your current assumptions and break your usecase > > which relies on an implicit demotion behavior. Do you see that as a > > potential problem at all? What shall we do in that case? Special case > > memory.reclaim behavior? > > Shouldn't that be derived from the distance propertiers in the tier > configuration? > > I.e. if local compression is faster than demoting to a slower node, we > should maybe have a separate tier for that. Ignoring proactive reclaim > or demotion commands for a second: on that node, global memory > pressure should always compress first, while the oldest pages from the > compression cache should demote to the other node(s) - until they > eventually get swapped out. > > However fine-grained we make proactive reclaim control over these > stages, it should at least be possible for the user to request the > default behavior that global pressure follows, without jumping through > hoops or requiring the coordinated use of multiple knobs. So IMO there > is an argument for having a singular knob that requests comprehensive > aging and reclaiming across the configured hierarchy. > > As far as explicit control over the individual stages goes - no idea > if you would call the compression stage demotion or reclaim. The > distinction still does not make much of sense to me, since reclaim is > just another form of demotion. Sure, page faults have a different > access latency than dax to slower memory. But you could also have 3 > tiers of memory where the difference between tier 1 and 2 is much > smaller than the difference between 2 and 3, and you might want to > apply different demotion rates between them as well. > > The other argument is that demotion does not free cgroup memory, > whereas reclaim does. But with multiple memory tiers of vastly > different performance, isn't there also an argument for granting > cgroups different shares of each memory? So that a higher priority > group has access to a bigger share of the fastest memory, and lower > prio cgroups are relegated to lower tiers. If we split those pools, > then "demotion" will actually free memory in a cgroup. > I would also like to say I implemented something in line with that in [1]. In this patch, pages demoted from inside the nodemask to outside the nodemask count as 'reclaimed'. This, in my mind, is a very generic solution to the 'should demoted pages count as reclaim?' problem, and will work in all scenarios as long as the nodemask passed to shrink_folio_list() is set correctly by the call stack. > This is why I liked adding a nodes= argument to memory.reclaim the > best. It doesn't encode a distinction that may not last for long. 
> > The problem comes from how to interpret the input argument and the > return value, right? Could we solve this by requiring the passed > nodes= to all be of the same memory tier? Then there is no confusion > around what is requested and what the return value means. > I feel like I arrived at a better solution in [1], where pages demoted from inside of the nodemask to outside count as reclaimed and the rest don't. But I think we could solve this by explicit checks that the nodes= arg is from the same tier, yes. > And if no nodes are passed, it means reclaim (from the lowest memory > tier) X pages and demote as needed, then return the reclaimed pages. > > > Now to your specific usecase. If there is a need to do a memory > > distribution balancing then fine but this should be a well defined > > interface. E.g. is there a need to not only control demotion but > > promotions as well? I haven't heard anybody requesting that so far > > but I can easily imagine that like outsourcing the memory reclaim to > > the userspace someone might want to do the same thing with the numa > > balancing because $REASONS. Should that ever happen, I am pretty sure > > hooking into memory.reclaim is not really a great idea. > > Should this ever happen, it would seem fair that that be a separate > knob anyway, no? One knob to move the pipeline in one direction > (aging), one knob to move it the other way. [1] https://lore.kernel.org/linux-mm/20221206023406.3182800-1-almasrymina@google.com/
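The accounting rule from [1] can be stated compactly: a demotion counts toward the reclaim target only if it moves a page from inside the requested nodemask to outside it. A minimal sketch with an invented helper name, not the code in [1]:

```c
/*
 * Sketch of the accounting rule described above: a demotion is
 * counted as reclaim only if it moved the page from a node inside
 * the request nodemask to a node outside it.
 */
static bool demotion_counts_as_reclaim(int src_nid, int dst_nid,
				       const nodemask_t *request)
{
	/* Freed memory on a requested node and moved it out of the set. */
	return node_isset(src_nid, *request) &&
	       !node_isset(dst_nid, *request);
}
```

With that rule, demotion within the mask is pure aging, while demotion out of the mask looks like reclaim from the caller's point of view.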
Johannes Weiner <hannes@cmpxchg.org> writes: > On Tue, Dec 13, 2022 at 09:33:24AM +0100, Michal Hocko wrote: >> I do recognize your need to control the demotion but I argue that it is >> a bad idea to rely on an implicit behavior of the memory reclaim and an >> interface which is _documented_ to primarily _reclaim_ memory. > > I think memory.reclaim should demote as part of page aging. What I'd > like to avoid is *having* to manually control the aging component in > the interface (e.g. making memory.reclaim *only* reclaim, and > *requiring* a coordinated use of memory.demote to ensure progress.) > >> Really, consider that the current demotion implementation will change >> in the future and based on a newly added heuristic memory reclaim or >> compression would be preferred over migration to a different tier. This >> might completely break your current assumptions and break your usecase >> which relies on an implicit demotion behavior. Do you see that as a >> potential problem at all? What shall we do in that case? Special case >> memory.reclaim behavior? > > Shouldn't that be derived from the distance propertiers in the tier > configuration? > > I.e. if local compression is faster than demoting to a slower node, we > should maybe have a separate tier for that. Ignoring proactive reclaim > or demotion commands for a second: on that node, global memory > pressure should always compress first, while the oldest pages from the > compression cache should demote to the other node(s) - until they > eventually get swapped out. > > However fine-grained we make proactive reclaim control over these > stages, it should at least be possible for the user to request the > default behavior that global pressure follows, without jumping through > hoops or requiring the coordinated use of multiple knobs. So IMO there > is an argument for having a singular knob that requests comprehensive > aging and reclaiming across the configured hierarchy. > > As far as explicit control over the individual stages goes - no idea > if you would call the compression stage demotion or reclaim. The > distinction still does not make much of sense to me, since reclaim is > just another form of demotion. Sure, page faults have a different > access latency than dax to slower memory. But you could also have 3 > tiers of memory where the difference between tier 1 and 2 is much > smaller than the difference between 2 and 3, and you might want to > apply different demotion rates between them as well. > > The other argument is that demotion does not free cgroup memory, > whereas reclaim does. But with multiple memory tiers of vastly > different performance, isn't there also an argument for granting > cgroups different shares of each memory? So that a higher priority > group has access to a bigger share of the fastest memory, and lower > prio cgroups are relegated to lower tiers. If we split those pools, > then "demotion" will actually free memory in a cgroup. > > This is why I liked adding a nodes= argument to memory.reclaim the > best. It doesn't encode a distinction that may not last for long. > > The problem comes from how to interpret the input argument and the > return value, right? Could we solve this by requiring the passed > nodes= to all be of the same memory tier? Then there is no confusion > around what is requested and what the return value means. Yes. The definition is clear if the nodes= are all from the same memory tier.
> And if no nodes are passed, it means reclaim (from the lowest memory > tier) X pages and demote as needed, then return the reclaimed pages. It appears that the definition isn't very clear here. How many pages should be demoted? Is the target number the value echoed to memory.reclaim, or requested_number - pages_in_lowest_tier? Should we demote in as many tiers as possible or in as few tiers as possible? One possibility is to take advantage of top tier memory as much as possible. That is, try to reclaim pages in lower tiers only. >> Now to your specific usecase. If there is a need to do a memory >> distribution balancing then fine but this should be a well defined >> interface. E.g. is there a need to not only control demotion but >> promotions as well? I haven't heard anybody requesting that so far >> but I can easily imagine that like outsourcing the memory reclaim to >> the userspace someone might want to do the same thing with the numa >> balancing because $REASONS. Should that ever happen, I am pretty sure >> hooking into memory.reclaim is not really a great idea. > > Should this ever happen, it would seem fair that that be a separate > knob anyway, no? One knob to move the pipeline in one direction > (aging), one knob to move it the other way. Agree. Best Regards, Huang, Ying
Mina Almasry <almasrymina@google.com> writes: > On Tue, Dec 13, 2022 at 7:58 AM Johannes Weiner <hannes@cmpxchg.org> wrote: >> >> On Tue, Dec 13, 2022 at 09:33:24AM +0100, Michal Hocko wrote: >> > I do recognize your need to control the demotion but I argue that it is >> > a bad idea to rely on an implicit behavior of the memory reclaim and an >> > interface which is _documented_ to primarily _reclaim_ memory. >> >> I think memory.reclaim should demote as part of page aging. What I'd >> like to avoid is *having* to manually control the aging component in >> the interface (e.g. making memory.reclaim *only* reclaim, and >> *requiring* a coordinated use of memory.demote to ensure progress.) >> >> > Really, consider that the current demotion implementation will change >> > in the future and based on a newly added heuristic memory reclaim or >> > compression would be preferred over migration to a different tier. This >> > might completely break your current assumptions and break your usecase >> > which relies on an implicit demotion behavior. Do you see that as a >> > potential problem at all? What shall we do in that case? Special case >> > memory.reclaim behavior? >> >> Shouldn't that be derived from the distance propertiers in the tier >> configuration? >> >> I.e. if local compression is faster than demoting to a slower node, we >> should maybe have a separate tier for that. Ignoring proactive reclaim >> or demotion commands for a second: on that node, global memory >> pressure should always compress first, while the oldest pages from the >> compression cache should demote to the other node(s) - until they >> eventually get swapped out. >> >> However fine-grained we make proactive reclaim control over these >> stages, it should at least be possible for the user to request the >> default behavior that global pressure follows, without jumping through >> hoops or requiring the coordinated use of multiple knobs. So IMO there >> is an argument for having a singular knob that requests comprehensive >> aging and reclaiming across the configured hierarchy. >> >> As far as explicit control over the individual stages goes - no idea >> if you would call the compression stage demotion or reclaim. The >> distinction still does not make much of sense to me, since reclaim is >> just another form of demotion. Sure, page faults have a different >> access latency than dax to slower memory. But you could also have 3 >> tiers of memory where the difference between tier 1 and 2 is much >> smaller than the difference between 2 and 3, and you might want to >> apply different demotion rates between them as well. >> >> The other argument is that demotion does not free cgroup memory, >> whereas reclaim does. But with multiple memory tiers of vastly >> different performance, isn't there also an argument for granting >> cgroups different shares of each memory? So that a higher priority >> group has access to a bigger share of the fastest memory, and lower >> prio cgroups are relegated to lower tiers. If we split those pools, >> then "demotion" will actually free memory in a cgroup. >> > > I would also like to say I implemented something in line with that in [1]. > > In this patch, pages demoted from inside the nodemask to outside the > nodemask count as 'reclaimed'. This, in my mind, is a very generic > solution to the 'should demoted pages count as reclaim?' problem, and > will work in all scenarios as long as the nodemask passed to > shrink_folio_list() is set correctly by the call stack. 
It's still not clear how many pages should be demoted among the nodes inside the nodemask. One possibility is to keep as many higher tier pages as possible. Best Regards, Huang, Ying
On Tue 13-12-22 11:29:45, Mina Almasry wrote: > On Tue, Dec 13, 2022 at 6:03 AM Michal Hocko <mhocko@suse.com> wrote: > > > > On Tue 13-12-22 14:30:40, Johannes Weiner wrote: > > > On Tue, Dec 13, 2022 at 02:30:57PM +0800, Huang, Ying wrote: > > [...] > > > > After these discussion, I think the solution maybe use different > > > > interfaces for "proactive demote" and "proactive reclaim". That is, > > > > reconsider "memory.demote". In this way, we will always uncharge the > > > > cgroup for "memory.reclaim". This avoid the possible confusion there. > > > > And, because demotion is considered aging, we don't need to disable > > > > demotion for "memory.reclaim", just don't count it. > > > > > > Hm, so in summary: > > > > > > 1) memory.reclaim would demote and reclaim like today, but it would > > > change to only count reclaimed pages against the goal. > > > > > > 2) memory.demote would only demote. > > > > > If the above 2 points are agreeable then yes, this sounds good to me > and does address our use case. > > > > a) What if the demotion targets are full? Would it reclaim or fail? > > > > > Wei will chime in if he disagrees, but I think we _require_ that it > fails, not falls back to reclaim. The interface is asking for > demotion, and is called memory.demote. For such an interface to fall > back to reclaim would be very confusing to userspace and may trigger > reclaim on a high priority job that we want to shield from proactive > reclaim. But what should happen if the immediate demotion target is full but lower tiers are still usable? Should the first one demote before allowing demotion from the top tier? > > > 3) Would memory.reclaim and memory.demote still need nodemasks? > > memory.demote will need a nodemask, for sure. Today the nodemask would > be useful if there is a specific node in the top tier that is > overloaded and we want to reduce the pressure by demoting. In the > future there will be N tiers and the nodemask says which tier to > demote from. OK, so what is the exact semantic of the node mask? Does it control where to demote from, where to demote to, or both? > I don't think memory.reclaim would need a nodemask anymore? At least I > no longer see the use for it for us. > > > > Would > > > they return -EINVAL if a) memory.reclaim gets passed only toptier > > > nodes or b) memory.demote gets passed any lasttier nodes? > > > > Honestly it would be great if memory.reclaim can force reclaim from a > top tier nodes. It breaks the aginig pipeline, yes, but if the user is > specifically asking for that because they decided in their usecase > it's a good idea then the kernel should comply IMO. Not a strict > requirement for us. Wei will chime in if he disagrees. That would require a nodemask to say which nodes to reclaim, no? The default behavior should be in line with what standard memory reclaim does. If demotion is a part of that process then it should be a part of memory.reclaim as well. If we want to have finer control then a nodemask is really a must, and then the nodemask should constrain both aging and reclaim. > memory.demote returning -EINVAL for lasttier nodes makes sense to me. > > > I would also add > > 4) Do we want to allow to control the demotion path (e.g. which node to > > demote from and to) and how to achieve that? > > We care deeply about specifying which node to demote _from_. That > would be some node that is approaching pressure and we're looking for > proactive saving from. So far I haven't seen any reason to control > which nodes to demote _to_. 
The kernel deciding that based on the > aging pipeline and the node distances sounds good to me. Obviously > someone else may find that useful. Please keep in mind that the interface should be really prepared for future extensions so try to abstract from your immediate usecases. > > 5) Is the demotion api restricted to multi-tier systems or any numa > > configuration allowed as well? > > > > demotion will of course not work on single tiered systems. The > interface may return some failure on such systems or not be available > at all. Is there any strong reason for that? We do not have any interface to control NUMA balancing from userspace. Why cannot we use the interface for that purpose?
On Tue 13-12-22 16:58:50, Johannes Weiner wrote: > On Tue, Dec 13, 2022 at 09:33:24AM +0100, Michal Hocko wrote: > > I do recognize your need to control the demotion but I argue that it is > > a bad idea to rely on an implicit behavior of the memory reclaim and an > > interface which is _documented_ to primarily _reclaim_ memory. > > I think memory.reclaim should demote as part of page aging. What I'd > like to avoid is *having* to manually control the aging component in > the interface (e.g. making memory.reclaim *only* reclaim, and > *requiring* a coordinated use of memory.demote to ensure progress.) Yes, I do agree with that. Demotion is a part of the aging. I meant to say that the result of the operation should be reclaimed charges but that doesn't mean that demotion is not a part of that process. I am mostly concerned about the demote-only behavior that Mina is targeting and wants to expose through the memory.reclaim interface. > > Really, consider that the current demotion implementation will change > > in the future and based on a newly added heuristic memory reclaim or > > compression would be preferred over migration to a different tier. This > > might completely break your current assumptions and break your usecase > > which relies on an implicit demotion behavior. Do you see that as a > > potential problem at all? What shall we do in that case? Special case > > memory.reclaim behavior? > > Shouldn't that be derived from the distance propertiers in the tier > configuration? > > I.e. if local compression is faster than demoting to a slower node, we > should maybe have a separate tier for that. Ignoring proactive reclaim > or demotion commands for a second: on that node, global memory > pressure should always compress first, while the oldest pages from the > compression cache should demote to the other node(s) - until they > eventually get swapped out. > > However fine-grained we make proactive reclaim control over these > stages, it should at least be possible for the user to request the > default behavior that global pressure follows, without jumping through > hoops or requiring the coordinated use of multiple knobs. So IMO there > is an argument for having a singular knob that requests comprehensive > aging and reclaiming across the configured hierarchy. > > As far as explicit control over the individual stages goes - no idea > if you would call the compression stage demotion or reclaim. The > distinction still does not make much of sense to me, since reclaim is > just another form of demotion. From the external visibility POV the major difference between the two is that the reclaim decreases the overall charged memory. And there are pro-active reclaim usecases which rely on that. Demotion is mostly memory placement rebalancing. Sure, it is still visible in per-node stats and has implications for performance, but that is a different story. > Sure, page faults have a different > access latency than dax to slower memory. But you could also have 3 > tiers of memory where the difference between tier 1 and 2 is much > smaller than the difference between 2 and 3, and you might want to > apply different demotion rates between them as well. > > The other argument is that demotion does not free cgroup memory, > whereas reclaim does. But with multiple memory tiers of vastly > different performance, isn't there also an argument for granting > cgroups different shares of each memory? Yes. We have already had requests for per node limits in the past. 
And I do expect this will show up as a problem here as well, but with a reasonable memory.reclaim and potentially memory.demote interface the balancing and policy making can be outsourced to userspace. > So that a higher priority > group has access to a bigger share of the fastest memory, and lower > prio cgroups are relegated to lower tiers. If we split those pools, > then "demotion" will actually free memory in a cgroup. > > This is why I liked adding a nodes= argument to memory.reclaim the > best. It doesn't encode a distinction that may not last for long. > > The problem comes from how to interpret the input argument and the > return value, right? Could we solve this by requiring the passed > nodes= to all be of the same memory tier? Then there is no confusion > around what is requested and what the return value means. Just to make sure I am on the same page. This means that if a node mask is specified then it always implies demotion without any control over how the demotion is done, right? > And if no nodes are passed, it means reclaim (from the lowest memory > tier) X pages and demote as needed, then return the reclaimed pages. IMO this is a rather constrained semantic which would completely rule out some potentially interesting usecases, e.g. fine-grained control over the demotion path or enforced reclaim for node balancing. Also, if we ever want a promote interface then it would fit better with a demote counterpart. > > Now to your specific usecase. If there is a need to do a memory > > distribution balancing then fine but this should be a well defined > > interface. E.g. is there a need to not only control demotion but > > promotions as well? I haven't heard anybody requesting that so far > > but I can easily imagine that like outsourcing the memory reclaim to > > the userspace someone might want to do the same thing with the numa > > balancing because $REASONS. Should that ever happen, I am pretty sure > > hooking into memory.reclaim is not really a great idea. > > Should this ever happen, it would seem fair that that be a separate > knob anyway, no? One knob to move the pipeline in one direction > (aging), one knob to move it the other way. Yes, this is what I am inclined toward as well.
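If the passed nodes= were required to all be of one memory tier, as Johannes suggests above, the write handler could reject mixed masks up front. A sketch, again leaning on a hypothetical node_tier() accessor since the memory-tier code does not currently export one:

	/*
	 * Illustrative check: accept the nodemask only if every node in
	 * it sits on the same memory tier, so the reclaim vs. demote
	 * semantics of the request are unambiguous.
	 */
	static int validate_single_tier(const nodemask_t *mask)
	{
		int nid, tier = -1;

		for_each_node_mask(nid, *mask) {
			if (tier < 0)
				tier = node_tier(nid);	/* hypothetical accessor */
			else if (node_tier(nid) != tier)
				return -EINVAL;
		}
		return 0;
	}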
Michal Hocko <mhocko@suse.com> writes: > On Tue 13-12-22 11:29:45, Mina Almasry wrote: >> On Tue, Dec 13, 2022 at 6:03 AM Michal Hocko <mhocko@suse.com> wrote: >> > >> > On Tue 13-12-22 14:30:40, Johannes Weiner wrote: >> > > On Tue, Dec 13, 2022 at 02:30:57PM +0800, Huang, Ying wrote: >> > [...] >> > > > After these discussion, I think the solution maybe use different >> > > > interfaces for "proactive demote" and "proactive reclaim". That is, >> > > > reconsider "memory.demote". In this way, we will always uncharge the >> > > > cgroup for "memory.reclaim". This avoid the possible confusion there. >> > > > And, because demotion is considered aging, we don't need to disable >> > > > demotion for "memory.reclaim", just don't count it. >> > > >> > > Hm, so in summary: >> > > >> > > 1) memory.reclaim would demote and reclaim like today, but it would >> > > change to only count reclaimed pages against the goal. >> > > >> > > 2) memory.demote would only demote. >> > > >> >> If the above 2 points are agreeable then yes, this sounds good to me >> and does address our use case. >> >> > > a) What if the demotion targets are full? Would it reclaim or fail? >> > > >> >> Wei will chime in if he disagrees, but I think we _require_ that it >> fails, not falls back to reclaim. The interface is asking for >> demotion, and is called memory.demote. For such an interface to fall >> back to reclaim would be very confusing to userspace and may trigger >> reclaim on a high priority job that we want to shield from proactive >> reclaim. > > But what should happen if the immediate demotion target is full but > lower tiers are still usable. Should the first one demote before > allowing to demote from the top tier? > >> > > 3) Would memory.reclaim and memory.demote still need nodemasks? >> >> memory.demote will need a nodemask, for sure. Today the nodemask would >> be useful if there is a specific node in the top tier that is >> overloaded and we want to reduce the pressure by demoting. In the >> future there will be N tiers and the nodemask says which tier to >> demote from. > > OK, so what is the exact semantic of the node mask. Does it control > where to demote from or to or both? > >> I don't think memory.reclaim would need a nodemask anymore? At least I >> no longer see the use for it for us. >> >> > > Would >> > > they return -EINVAL if a) memory.reclaim gets passed only toptier >> > > nodes or b) memory.demote gets passed any lasttier nodes? >> > >> >> Honestly it would be great if memory.reclaim can force reclaim from a >> top tier nodes. It breaks the aginig pipeline, yes, but if the user is >> specifically asking for that because they decided in their usecase >> it's a good idea then the kernel should comply IMO. Not a strict >> requirement for us. Wei will chime in if he disagrees. > > That would require a nodemask to say which nodes to reclaim, no? The > default behavior should be in line with what standard memory reclaim > does. If the demotion is a part of that process so should be > memory.reclaim part of it. If we want to have a finer control then a > nodemask is really a must and then the nodemaks should constrain both > agining and reclaim. > >> memory.demote returning -EINVAL for lasttier nodes makes sense to me. >> >> > I would also add >> > 4) Do we want to allow to control the demotion path (e.g. which node to >> > demote from and to) and how to achieve that? >> >> We care deeply about specifying which node to demote _from_. 
That >> would be some node that is approaching pressure and we're looking for >> proactive saving from. So far I haven't seen any reason to control >> which nodes to demote _to_. The kernel deciding that based on the >> aging pipeline and the node distances sounds good to me. Obviously >> someone else may find that useful. > > Please keep in mind that the interface should be really prepared for > future extensions so try to abstract from your immediate usecases. I see two requirements here. One is to control the demotion source, that is, which nodes to free memory from. The other is to control the demotion path. I think that we can use two different parameters for them, for example, "from=<demotion source nodes>" and "to=<demotion target nodes>". In most cases we don't need to control the demotion path, because in the current implementation the nodes in the lower tiers in the same socket (local nodes) will be preferred. I think that this is the desired behavior in most cases. >> > 5) Is the demotion api restricted to multi-tier systems or any numa >> > configuration allowed as well? >> > >> >> demotion will of course not work on single tiered systems. The >> interface may return some failure on such systems or not be available >> at all. > > Is there any strong reason for that? We do not have any interface to > control NUMA balancing from userspace. Why cannot we use the interface > for that purpose? Do you mean to demote the cold pages from the specified source nodes to the specified target nodes in different sockets? We don't do that, to avoid loops in the demotion path. If we prevent the target nodes from demoting cold pages to the source nodes at the same time, it seems doable. Best Regards, Huang, Ying
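A sketch of how such a from=/to= pair could be parsed, mirroring the match_table pattern that the nodes= patch at the end of this thread uses (the token names and the helper are made up here for illustration):

	enum {
		MEMORY_DEMOTE_FROM,
		MEMORY_DEMOTE_TO,
		MEMORY_DEMOTE_ERR,
	};

	static const match_table_t demote_tokens = {
		{ MEMORY_DEMOTE_FROM, "from=%s" },
		{ MEMORY_DEMOTE_TO, "to=%s" },
		{ MEMORY_DEMOTE_ERR, NULL },
	};

	/* parse e.g. "from=0,1 to=2,3" into source and target nodemasks */
	static int parse_demote_args(char *buf, nodemask_t *from, nodemask_t *to)
	{
		substring_t args[MAX_OPT_ARGS];
		char value[256];
		char *start;

		while ((start = strsep(&buf, " ")) != NULL) {
			if (!strlen(start))
				continue;

			switch (match_token(start, demote_tokens, args)) {
			case MEMORY_DEMOTE_FROM:
				match_strlcpy(value, args, sizeof(value));
				if (nodelist_parse(value, *from) < 0)
					return -EINVAL;
				break;
			case MEMORY_DEMOTE_TO:
				match_strlcpy(value, args, sizeof(value));
				if (nodelist_parse(value, *to) < 0)
					return -EINVAL;
				break;
			default:
				return -EINVAL;
			}
		}
		return 0;
	}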
On Thu 15-12-22 13:50:14, Huang, Ying wrote: > Michal Hocko <mhocko@suse.com> writes: > > > On Tue 13-12-22 11:29:45, Mina Almasry wrote: > >> On Tue, Dec 13, 2022 at 6:03 AM Michal Hocko <mhocko@suse.com> wrote: > >> > > >> > On Tue 13-12-22 14:30:40, Johannes Weiner wrote: > >> > > On Tue, Dec 13, 2022 at 02:30:57PM +0800, Huang, Ying wrote: > >> > [...] > >> > > > After these discussion, I think the solution maybe use different > >> > > > interfaces for "proactive demote" and "proactive reclaim". That is, > >> > > > reconsider "memory.demote". In this way, we will always uncharge the > >> > > > cgroup for "memory.reclaim". This avoid the possible confusion there. > >> > > > And, because demotion is considered aging, we don't need to disable > >> > > > demotion for "memory.reclaim", just don't count it. > >> > > > >> > > Hm, so in summary: > >> > > > >> > > 1) memory.reclaim would demote and reclaim like today, but it would > >> > > change to only count reclaimed pages against the goal. > >> > > > >> > > 2) memory.demote would only demote. > >> > > > >> > >> If the above 2 points are agreeable then yes, this sounds good to me > >> and does address our use case. > >> > >> > > a) What if the demotion targets are full? Would it reclaim or fail? > >> > > > >> > >> Wei will chime in if he disagrees, but I think we _require_ that it > >> fails, not falls back to reclaim. The interface is asking for > >> demotion, and is called memory.demote. For such an interface to fall > >> back to reclaim would be very confusing to userspace and may trigger > >> reclaim on a high priority job that we want to shield from proactive > >> reclaim. > > > > But what should happen if the immediate demotion target is full but > > lower tiers are still usable. Should the first one demote before > > allowing to demote from the top tier? > > > >> > > 3) Would memory.reclaim and memory.demote still need nodemasks? > >> > >> memory.demote will need a nodemask, for sure. Today the nodemask would > >> be useful if there is a specific node in the top tier that is > >> overloaded and we want to reduce the pressure by demoting. In the > >> future there will be N tiers and the nodemask says which tier to > >> demote from. > > > > OK, so what is the exact semantic of the node mask. Does it control > > where to demote from or to or both? > > > >> I don't think memory.reclaim would need a nodemask anymore? At least I > >> no longer see the use for it for us. > >> > >> > > Would > >> > > they return -EINVAL if a) memory.reclaim gets passed only toptier > >> > > nodes or b) memory.demote gets passed any lasttier nodes? > >> > > >> > >> Honestly it would be great if memory.reclaim can force reclaim from a > >> top tier nodes. It breaks the aginig pipeline, yes, but if the user is > >> specifically asking for that because they decided in their usecase > >> it's a good idea then the kernel should comply IMO. Not a strict > >> requirement for us. Wei will chime in if he disagrees. > > > > That would require a nodemask to say which nodes to reclaim, no? The > > default behavior should be in line with what standard memory reclaim > > does. If the demotion is a part of that process so should be > > memory.reclaim part of it. If we want to have a finer control then a > > nodemask is really a must and then the nodemaks should constrain both > > agining and reclaim. > > > >> memory.demote returning -EINVAL for lasttier nodes makes sense to me. > >> > >> > I would also add > >> > 4) Do we want to allow to control the demotion path (e.g. 
which node to > >> > demote from and to) and how to achieve that? > >> > >> We care deeply about specifying which node to demote _from_. That > >> would be some node that is approaching pressure and we're looking for > >> proactive saving from. So far I haven't seen any reason to control > >> which nodes to demote _to_. The kernel deciding that based on the > >> aging pipeline and the node distances sounds good to me. Obviously > >> someone else may find that useful. > > > > Please keep in mind that the interface should be really prepared for > > future extensions so try to abstract from your immediate usecases. > > I see two requirements here, one is to control the demotion source, that > is, which nodes to free memory. The other is to control the demotion > path. I think that we can use two different parameters for them, for > example, "from=<demotion source nodes>" and "to=<demotion target > nodes>". In most cases we don't need to control the demotion path. > Because in current implementation, the nodes in the lower tiers in the > same socket (local nodes) will be preferred. I think that this is > the desired behavior in most cases. Even if the demotion path is not really required at the moment, we should keep future potential extensions in mind. E.g. when a userspace-based balancing is to be implemented because the default behavior cannot capture userspace policies (one example would be enforcing a prioritization of containers when some container's demoted pages would need to be demoted further to free up space for a different workload). > >> > 5) Is the demotion api restricted to multi-tier systems or any numa > >> > configuration allowed as well? > >> > > >> > >> demotion will of course not work on single tiered systems. The > >> interface may return some failure on such systems or not be available > >> at all. > > > > Is there any strong reason for that? We do not have any interface to > > control NUMA balancing from userspace. Why cannot we use the interface > > for that purpose? > > Do you mean to demote the cold pages from the specified source nodes to > the specified target nodes in different sockets? We don't do that to > avoid loop in the demotion path. If we prevent the target nodes from > demoting cold pages to the source nodes at the same time, it seems > doable. Loops could be avoided by properly specifying from and to nodes if this is going to be a fine-grained interface to control demotion.
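With explicit from and to nodemasks, guarding against such loops is cheap. A small sketch of what a hypothetical memory.demote write handler could do before starting any migration, assuming the from/to parameters discussed above have already been parsed into two nodemask_t variables:

	/*
	 * Refuse requests whose source and target sets overlap, since
	 * they could demote pages back and forth between the same nodes.
	 */
	if (nodes_intersects(from, to))
		return -EINVAL;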
On Wed, Dec 14, 2022 at 2:23 AM Michal Hocko <mhocko@suse.com> wrote: > > On Tue 13-12-22 11:29:45, Mina Almasry wrote: > > On Tue, Dec 13, 2022 at 6:03 AM Michal Hocko <mhocko@suse.com> wrote: > > > > > > On Tue 13-12-22 14:30:40, Johannes Weiner wrote: > > > > On Tue, Dec 13, 2022 at 02:30:57PM +0800, Huang, Ying wrote: > > > [...] > > > > > After these discussion, I think the solution maybe use different > > > > > interfaces for "proactive demote" and "proactive reclaim". That is, > > > > > reconsider "memory.demote". In this way, we will always uncharge the > > > > > cgroup for "memory.reclaim". This avoid the possible confusion there. > > > > > And, because demotion is considered aging, we don't need to disable > > > > > demotion for "memory.reclaim", just don't count it. > > > > > > > > Hm, so in summary: > > > > > > > > 1) memory.reclaim would demote and reclaim like today, but it would > > > > change to only count reclaimed pages against the goal. > > > > > > > > 2) memory.demote would only demote. > > > > > > > > If the above 2 points are agreeable then yes, this sounds good to me > > and does address our use case. > > > > > > a) What if the demotion targets are full? Would it reclaim or fail? > > > > > > > > Wei will chime in if he disagrees, but I think we _require_ that it > > fails, not falls back to reclaim. The interface is asking for > > demotion, and is called memory.demote. For such an interface to fall > > back to reclaim would be very confusing to userspace and may trigger > > reclaim on a high priority job that we want to shield from proactive > > reclaim. > > But what should happen if the immediate demotion target is full but > lower tiers are still usable. Should the first one demote before > allowing to demote from the top tier? In that case, the demotion will fall back to the lower tiers. See node_get_allowed_targets() and establish_demotion_targets().. > > > > 3) Would memory.reclaim and memory.demote still need nodemasks? > > > > memory.demote will need a nodemask, for sure. Today the nodemask would > > be useful if there is a specific node in the top tier that is > > overloaded and we want to reduce the pressure by demoting. In the > > future there will be N tiers and the nodemask says which tier to > > demote from. > > OK, so what is the exact semantic of the node mask. Does it control > where to demote from or to or both? The nodemask argument proposed here is to control where to demote from. We can follow the existing kernel demotion order to select where to demote to. If the need to control the demotion destination arises, another argument can be added. > > I don't think memory.reclaim would need a nodemask anymore? At least I > > no longer see the use for it for us. > > > > > > Would > > > > they return -EINVAL if a) memory.reclaim gets passed only toptier > > > > nodes or b) memory.demote gets passed any lasttier nodes? > > > > > > > Honestly it would be great if memory.reclaim can force reclaim from a > > top tier nodes. It breaks the aginig pipeline, yes, but if the user is > > specifically asking for that because they decided in their usecase > > it's a good idea then the kernel should comply IMO. Not a strict > > requirement for us. Wei will chime in if he disagrees. > > That would require a nodemask to say which nodes to reclaim, no? The > default behavior should be in line with what standard memory reclaim > does. If the demotion is a part of that process so should be > memory.reclaim part of it. 
If we want to have a finer control then a > nodemask is really a must and then the nodemaks should constrain both > agining and reclaim. Given that the original meaning of memory.reclaim is to free up memory, I agree that when a nodemask is provided, the kernel should be allowed to do both aging/demotion and reclaim. Whether to allow reclaim from top-tier nodes is a kernel implementation choice. The userspace should not depend on that. Also, because the expectation of memory.reclaim is to free up the specified amount of bytes, I think if a page is demoted, but both its source and target nodes are still in the given nodemask, such a demoted page should not be counted towards the requested bytes of memory.reclaim. In the case that no nodemask is given (i.e. to free up memory from all nodes), the demoted pages should never be counted in the return value of try_to_free_mem_cgroup_pages(). Meanwhile, I'd argue that even though we want to unify demotion and reclaim, there are still significant differences between them. Demotion moves pages between two memory tiers, while reclaim can move pages to a much slower tier, e.g. disk-based files or swap. Both the page movement latencies and the reaccess latencies can be significantly different for demotion/reclaim. So it is useful for the userspace to be able to request demotion without reclaim. A separate interface, e.g. memory.demote, seems like a good choice for that. > > memory.demote returning -EINVAL for lasttier nodes makes sense to me. > > > > > I would also add > > > 4) Do we want to allow to control the demotion path (e.g. which node to > > > demote from and to) and how to achieve that? > > > > We care deeply about specifying which node to demote _from_. That > > would be some node that is approaching pressure and we're looking for > > proactive saving from. So far I haven't seen any reason to control > > which nodes to demote _to_. The kernel deciding that based on the > > aging pipeline and the node distances sounds good to me. Obviously > > someone else may find that useful. > > Please keep in mind that the interface should be really prepared for > future extensions so try to abstract from your immediate usecases. > > > > 5) Is the demotion api restricted to multi-tier systems or any numa > > > configuration allowed as well? > > > > > > > demotion will of course not work on single tiered systems. The > > interface may return some failure on such systems or not be available > > at all. > > Is there any strong reason for that? We do not have any interface to > control NUMA balancing from userspace. Why cannot we use the interface > for that purpose? A demotion interface such as memory.demote will trigger the demotion code path in the kernel, which depends on multiple memory tiers. I think what you are getting is a more general page migration interface for memcg, which will need both source and target nodes as arguments. I think this can be a great idea. It should be able to support our demotion use cases as well. > -- > Michal Hocko > SUSE Labs
Michal Hocko <mhocko@suse.com> writes: > On Thu 15-12-22 13:50:14, Huang, Ying wrote: >> Michal Hocko <mhocko@suse.com> writes: >> >> > On Tue 13-12-22 11:29:45, Mina Almasry wrote: >> >> On Tue, Dec 13, 2022 at 6:03 AM Michal Hocko <mhocko@suse.com> wrote: >> >> > >> >> > On Tue 13-12-22 14:30:40, Johannes Weiner wrote: >> >> > > On Tue, Dec 13, 2022 at 02:30:57PM +0800, Huang, Ying wrote: >> >> > [...] >> >> > > > After these discussion, I think the solution maybe use different >> >> > > > interfaces for "proactive demote" and "proactive reclaim". That is, >> >> > > > reconsider "memory.demote". In this way, we will always uncharge the >> >> > > > cgroup for "memory.reclaim". This avoid the possible confusion there. >> >> > > > And, because demotion is considered aging, we don't need to disable >> >> > > > demotion for "memory.reclaim", just don't count it. >> >> > > >> >> > > Hm, so in summary: >> >> > > >> >> > > 1) memory.reclaim would demote and reclaim like today, but it would >> >> > > change to only count reclaimed pages against the goal. >> >> > > >> >> > > 2) memory.demote would only demote. >> >> > > >> >> >> >> If the above 2 points are agreeable then yes, this sounds good to me >> >> and does address our use case. >> >> >> >> > > a) What if the demotion targets are full? Would it reclaim or fail? >> >> > > >> >> >> >> Wei will chime in if he disagrees, but I think we _require_ that it >> >> fails, not falls back to reclaim. The interface is asking for >> >> demotion, and is called memory.demote. For such an interface to fall >> >> back to reclaim would be very confusing to userspace and may trigger >> >> reclaim on a high priority job that we want to shield from proactive >> >> reclaim. >> > >> > But what should happen if the immediate demotion target is full but >> > lower tiers are still usable. Should the first one demote before >> > allowing to demote from the top tier? >> > >> >> > > 3) Would memory.reclaim and memory.demote still need nodemasks? >> >> >> >> memory.demote will need a nodemask, for sure. Today the nodemask would >> >> be useful if there is a specific node in the top tier that is >> >> overloaded and we want to reduce the pressure by demoting. In the >> >> future there will be N tiers and the nodemask says which tier to >> >> demote from. >> > >> > OK, so what is the exact semantic of the node mask. Does it control >> > where to demote from or to or both? >> > >> >> I don't think memory.reclaim would need a nodemask anymore? At least I >> >> no longer see the use for it for us. >> >> >> >> > > Would >> >> > > they return -EINVAL if a) memory.reclaim gets passed only toptier >> >> > > nodes or b) memory.demote gets passed any lasttier nodes? >> >> > >> >> >> >> Honestly it would be great if memory.reclaim can force reclaim from a >> >> top tier nodes. It breaks the aginig pipeline, yes, but if the user is >> >> specifically asking for that because they decided in their usecase >> >> it's a good idea then the kernel should comply IMO. Not a strict >> >> requirement for us. Wei will chime in if he disagrees. >> > >> > That would require a nodemask to say which nodes to reclaim, no? The >> > default behavior should be in line with what standard memory reclaim >> > does. If the demotion is a part of that process so should be >> > memory.reclaim part of it. If we want to have a finer control then a >> > nodemask is really a must and then the nodemaks should constrain both >> > agining and reclaim. 
>> > >> >> memory.demote returning -EINVAL for lasttier nodes makes sense to me. >> >> >> >> > I would also add >> >> > 4) Do we want to allow to control the demotion path (e.g. which node to >> >> > demote from and to) and how to achieve that? >> >> >> >> We care deeply about specifying which node to demote _from_. That >> >> would be some node that is approaching pressure and we're looking for >> >> proactive saving from. So far I haven't seen any reason to control >> >> which nodes to demote _to_. The kernel deciding that based on the >> >> aging pipeline and the node distances sounds good to me. Obviously >> >> someone else may find that useful. >> > >> > Please keep in mind that the interface should be really prepared for >> > future extensions so try to abstract from your immediate usecases. >> >> I see two requirements here, one is to control the demotion source, that >> is, which nodes to free memory. The other is to control the demotion >> path. I think that we can use two different parameters for them, for >> example, "from=<demotion source nodes>" and "to=<demotion target >> nodes>". In most cases we don't need to control the demotion path. >> Because in current implementation, the nodes in the lower tiers in the >> same socket (local nodes) will be preferred. I think that this is >> the desired behavior in most cases. > > Even if the demotion path is not really required at the moment we should > keep in mind future potential extensions. E.g. when a userspace based > balancing is to be implemented because the default behavior cannot > capture userspace policies (one example would be enforcing a > prioritization of containers when some container's demoted pages would > need to be demoted further to free up a space for a different > workload). Yes. We should consider the potential requirements. >> >> > 5) Is the demotion api restricted to multi-tier systems or any numa >> >> > configuration allowed as well? >> >> > >> >> >> >> demotion will of course not work on single tiered systems. The >> >> interface may return some failure on such systems or not be available >> >> at all. >> > >> > Is there any strong reason for that? We do not have any interface to >> > control NUMA balancing from userspace. Why cannot we use the interface >> > for that purpose? >> >> Do you mean to demote the cold pages from the specified source nodes to >> the specified target nodes in different sockets? We don't do that to >> avoid loop in the demotion path. If we prevent the target nodes from >> demoting cold pages to the source nodes at the same time, it seems >> doable. > > Loops could be avoid by properly specifying from and to nodes if this is > going to be a fine grained interface to control demotion. Yes. Best Regards, Huang, Ying
On Thu 15-12-22 09:58:12, Wei Xu wrote: > On Wed, Dec 14, 2022 at 2:23 AM Michal Hocko <mhocko@suse.com> wrote: > > > > On Tue 13-12-22 11:29:45, Mina Almasry wrote: > > > On Tue, Dec 13, 2022 at 6:03 AM Michal Hocko <mhocko@suse.com> wrote: > > > > > > > > On Tue 13-12-22 14:30:40, Johannes Weiner wrote: > > > > > On Tue, Dec 13, 2022 at 02:30:57PM +0800, Huang, Ying wrote: > > > > [...] > > > > > > After these discussion, I think the solution maybe use different > > > > > > interfaces for "proactive demote" and "proactive reclaim". That is, > > > > > > reconsider "memory.demote". In this way, we will always uncharge the > > > > > > cgroup for "memory.reclaim". This avoid the possible confusion there. > > > > > > And, because demotion is considered aging, we don't need to disable > > > > > > demotion for "memory.reclaim", just don't count it. > > > > > > > > > > Hm, so in summary: > > > > > > > > > > 1) memory.reclaim would demote and reclaim like today, but it would > > > > > change to only count reclaimed pages against the goal. > > > > > > > > > > 2) memory.demote would only demote. > > > > > > > > > > > If the above 2 points are agreeable then yes, this sounds good to me > > > and does address our use case. > > > > > > > > a) What if the demotion targets are full? Would it reclaim or fail? > > > > > > > > > > > Wei will chime in if he disagrees, but I think we _require_ that it > > > fails, not falls back to reclaim. The interface is asking for > > > demotion, and is called memory.demote. For such an interface to fall > > > back to reclaim would be very confusing to userspace and may trigger > > > reclaim on a high priority job that we want to shield from proactive > > > reclaim. > > > > But what should happen if the immediate demotion target is full but > > lower tiers are still usable. Should the first one demote before > > allowing to demote from the top tier? > > In that case, the demotion will fall back to the lower tiers. See > node_get_allowed_targets() and establish_demotion_targets().. I am not talking about an implicit behavior that we do not want to cast into the interface. If we want to allow fine-grained control over demotion then the implementation shouldn't rely on the current behavior. [...] > > Is there any strong reason for that? We do not have any interface to > > control NUMA balancing from userspace. Why cannot we use the interface > > for that purpose? > > A demotion interface such as memory.demote will trigger the demotion > code path in the kernel, which depends on multiple memory tiers. Demotion is just a fancy name for directed migration. There is no real dependency on the HW or the technology. > I think what you are getting is a more general page migration > interface for memcg, which will need both source and target nodes as > arguments. I think this can be a great idea. It should be able to > support our demotion use cases as well. Yes.
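For reference, the "directed migration" framing matches how demotion is already built. Below is a sketch loosely modeled on the kernel's demote path in mm/vmscan.c, where the only demotion-specific pieces are the target node and the MR_DEMOTION reason; treat the exact GFP flags and the function name as illustrative:

	/*
	 * Illustrative node-to-node demotion primitive: migrate the
	 * folios on @folios to @target_nid and report how many moves
	 * succeeded. Allocation on the target fails fast and quietly
	 * rather than triggering reclaim there.
	 */
	static unsigned int demote_folios_to_node(struct list_head *folios,
						  int target_nid)
	{
		struct migration_target_control mtc = {
			.nid = target_nid,
			.gfp_mask = (GFP_HIGHUSER_MOVABLE & ~__GFP_RECLAIM) |
				    __GFP_NOWARN | __GFP_NOMEMALLOC | GFP_NOWAIT,
		};
		unsigned int nr_succeeded;

		migrate_pages(folios, alloc_migration_target, NULL,
			      (unsigned long)&mtc, MIGRATE_ASYNC, MR_DEMOTION,
			      &nr_succeeded);
		return nr_succeeded;
	}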
diff --git a/Documentation/admin-guide/cgroup-v2.rst b/Documentation/admin-guide/cgroup-v2.rst
index 74cec76be9f2..c8ae7c897f14 100644
--- a/Documentation/admin-guide/cgroup-v2.rst
+++ b/Documentation/admin-guide/cgroup-v2.rst
@@ -1245,17 +1245,13 @@ PAGE_SIZE multiple when read back.
 	This is a simple interface to trigger memory reclaim in the
 	target cgroup.
 
-	This file accepts a single key, the number of bytes to reclaim.
-	No nested keys are currently supported.
+	This file accepts a string which contains the number of bytes to
+	reclaim.
 
 	Example::
 
 	  echo "1G" > memory.reclaim
 
-	The interface can be later extended with nested keys to
-	configure the reclaim behavior. For example, specify the
-	type of memory to reclaim from (anon, file, ..).
-
 	Please note that the kernel can over or under reclaim from
 	the target cgroup. If less bytes are reclaimed than the
 	specified amount, -EAGAIN is returned.
@@ -1267,6 +1263,13 @@ PAGE_SIZE multiple when read back.
 	This means that the networking layer will not adapt based on
 	reclaim induced by memory.reclaim.
 
+	This file also allows the user to specify the nodes to reclaim from,
+	via the 'nodes=' key, for example::
+
+	  echo "1G nodes=0,1" > memory.reclaim
+
+	The above instructs the kernel to reclaim memory from nodes 0,1.
+
   memory.peak
 	A read-only single value file which exists on non-root
 	cgroups.
diff --git a/include/linux/swap.h b/include/linux/swap.h
index 0ceed49516ad..2787b84eaf12 100644
--- a/include/linux/swap.h
+++ b/include/linux/swap.h
@@ -418,7 +418,8 @@ extern unsigned long try_to_free_pages(struct zonelist *zonelist, int order,
 extern unsigned long try_to_free_mem_cgroup_pages(struct mem_cgroup *memcg,
 						  unsigned long nr_pages,
 						  gfp_t gfp_mask,
-						  unsigned int reclaim_options);
+						  unsigned int reclaim_options,
+						  nodemask_t *nodemask);
 extern unsigned long mem_cgroup_shrink_node(struct mem_cgroup *mem,
 						gfp_t gfp_mask, bool noswap,
 						pg_data_t *pgdat,
diff --git a/mm/memcontrol.c b/mm/memcontrol.c
index 23750cec0036..0f02f47a87e4 100644
--- a/mm/memcontrol.c
+++ b/mm/memcontrol.c
@@ -63,6 +63,7 @@
 #include <linux/resume_user_mode.h>
 #include <linux/psi.h>
 #include <linux/seq_buf.h>
+#include <linux/parser.h>
 #include "internal.h"
 #include <net/sock.h>
 #include <net/ip.h>
@@ -2392,7 +2393,8 @@ static unsigned long reclaim_high(struct mem_cgroup *memcg,
 		psi_memstall_enter(&pflags);
 		nr_reclaimed += try_to_free_mem_cgroup_pages(memcg, nr_pages,
 							     gfp_mask,
-							     MEMCG_RECLAIM_MAY_SWAP);
+							     MEMCG_RECLAIM_MAY_SWAP,
+							     NULL);
 		psi_memstall_leave(&pflags);
 	} while ((memcg = parent_mem_cgroup(memcg)) &&
 		 !mem_cgroup_is_root(memcg));
@@ -2683,7 +2685,8 @@ static int try_charge_memcg(struct mem_cgroup *memcg, gfp_t gfp_mask,
 
 	psi_memstall_enter(&pflags);
 	nr_reclaimed = try_to_free_mem_cgroup_pages(mem_over_limit, nr_pages,
-						    gfp_mask, reclaim_options);
+						    gfp_mask, reclaim_options,
+						    NULL);
 	psi_memstall_leave(&pflags);
 
 	if (mem_cgroup_margin(mem_over_limit) >= nr_pages)
@@ -3503,7 +3506,8 @@ static int mem_cgroup_resize_max(struct mem_cgroup *memcg,
 		}
 
 		if (!try_to_free_mem_cgroup_pages(memcg, 1, GFP_KERNEL,
-					memsw ? 0 : MEMCG_RECLAIM_MAY_SWAP)) {
+					memsw ? 0 : MEMCG_RECLAIM_MAY_SWAP,
+					NULL)) {
 			ret = -EBUSY;
 			break;
 		}
@@ -3614,7 +3618,8 @@ static int mem_cgroup_force_empty(struct mem_cgroup *memcg)
 			return -EINTR;
 
 		if (!try_to_free_mem_cgroup_pages(memcg, 1, GFP_KERNEL,
-						  MEMCG_RECLAIM_MAY_SWAP))
+						  MEMCG_RECLAIM_MAY_SWAP,
+						  NULL))
 			nr_retries--;
 	}
 
@@ -6407,7 +6412,8 @@ static ssize_t memory_high_write(struct kernfs_open_file *of,
 		}
 
 		reclaimed = try_to_free_mem_cgroup_pages(memcg, nr_pages - high,
-					GFP_KERNEL, MEMCG_RECLAIM_MAY_SWAP);
+					GFP_KERNEL, MEMCG_RECLAIM_MAY_SWAP,
+					NULL);
 
 		if (!reclaimed && !nr_retries--)
 			break;
@@ -6456,7 +6462,8 @@ static ssize_t memory_max_write(struct kernfs_open_file *of,
 
 		if (nr_reclaims) {
 			if (!try_to_free_mem_cgroup_pages(memcg, nr_pages - max,
-					GFP_KERNEL, MEMCG_RECLAIM_MAY_SWAP))
+					GFP_KERNEL, MEMCG_RECLAIM_MAY_SWAP,
+					NULL))
 				nr_reclaims--;
 			continue;
 		}
@@ -6579,21 +6586,54 @@ static ssize_t memory_oom_group_write(struct kernfs_open_file *of,
 	return nbytes;
 }
 
+enum {
+	MEMORY_RECLAIM_NODES = 0,
+	MEMORY_RECLAIM_NULL,
+};
+
+static const match_table_t if_tokens = {
+	{ MEMORY_RECLAIM_NODES, "nodes=%s" },
+	{ MEMORY_RECLAIM_NULL, NULL },
+};
+
 static ssize_t memory_reclaim(struct kernfs_open_file *of, char *buf,
 			      size_t nbytes, loff_t off)
 {
 	struct mem_cgroup *memcg = mem_cgroup_from_css(of_css(of));
 	unsigned int nr_retries = MAX_RECLAIM_RETRIES;
 	unsigned long nr_to_reclaim, nr_reclaimed = 0;
-	unsigned int reclaim_options;
-	int err;
+	unsigned int reclaim_options = MEMCG_RECLAIM_MAY_SWAP |
+				       MEMCG_RECLAIM_PROACTIVE;
+	char *old_buf, *start;
+	substring_t args[MAX_OPT_ARGS];
+	int token;
+	char value[256];
+	nodemask_t nodemask = NODE_MASK_ALL;
 
 	buf = strstrip(buf);
-	err = page_counter_memparse(buf, "", &nr_to_reclaim);
-	if (err)
-		return err;
 
-	reclaim_options	= MEMCG_RECLAIM_MAY_SWAP | MEMCG_RECLAIM_PROACTIVE;
+	old_buf = buf;
+	nr_to_reclaim = memparse(buf, &buf) / PAGE_SIZE;
+	if (buf == old_buf)
+		return -EINVAL;
+
+	buf = strstrip(buf);
+
+	while ((start = strsep(&buf, " ")) != NULL) {
+		if (!strlen(start))
+			continue;
+		token = match_token(start, if_tokens, args);
+		match_strlcpy(value, args, sizeof(value));
+		switch (token) {
+		case MEMORY_RECLAIM_NODES:
+			if (nodelist_parse(value, nodemask) < 0)
+				return -EINVAL;
+			break;
+		default:
+			return -EINVAL;
+		}
+	}
+
 	while (nr_reclaimed < nr_to_reclaim) {
 		unsigned long reclaimed;
 
@@ -6610,7 +6650,8 @@ static ssize_t memory_reclaim(struct kernfs_open_file *of, char *buf,
 
 		reclaimed = try_to_free_mem_cgroup_pages(memcg,
 					nr_to_reclaim - nr_reclaimed,
-					GFP_KERNEL, reclaim_options);
+					GFP_KERNEL, reclaim_options,
+					&nodemask);
 
 		if (!reclaimed && !nr_retries--)
 			return -EAGAIN;
diff --git a/mm/vmscan.c b/mm/vmscan.c
index 7b8e8e43806b..62b0c9b46bd2 100644
--- a/mm/vmscan.c
+++ b/mm/vmscan.c
@@ -6735,7 +6735,8 @@ unsigned long mem_cgroup_shrink_node(struct mem_cgroup *memcg,
 unsigned long try_to_free_mem_cgroup_pages(struct mem_cgroup *memcg,
 					   unsigned long nr_pages,
 					   gfp_t gfp_mask,
-					   unsigned int reclaim_options)
+					   unsigned int reclaim_options,
+					   nodemask_t *nodemask)
 {
 	unsigned long nr_reclaimed;
 	unsigned int noreclaim_flag;
@@ -6750,6 +6751,7 @@ unsigned long try_to_free_mem_cgroup_pages(struct mem_cgroup *memcg,
 		.may_unmap = 1,
 		.may_swap = !!(reclaim_options & MEMCG_RECLAIM_MAY_SWAP),
 		.proactive = !!(reclaim_options & MEMCG_RECLAIM_PROACTIVE),
+		.nodemask = nodemask,
 	};
 	/*
 	 * Traverse the ZONELIST_FALLBACK zonelist of the current node to put