Message ID: 20221122203850.2765015-1-almasrymina@google.com (mailing list archive)
State: New
Series: [RFC,V1] mm: Disable demotion from proactive reclaim
Hello Mina, On Tue, Nov 22, 2022 at 12:38:45PM -0800, Mina Almasry wrote: > Since commit 3f1509c57b1b ("Revert "mm/vmscan: never demote for memcg > reclaim""), the proactive reclaim interface memory.reclaim does both > reclaim and demotion. This is likely fine for us for latency critical > jobs where we would want to disable proactive reclaim entirely, and is > also fine for latency tolerant jobs where we would like to both > proactively reclaim and demote. > > However, for some latency tiers in the middle we would like to demote but > not reclaim. This is because reclaim and demotion incur different latency > costs to the jobs in the cgroup. Demoted memory would still be addressable > by the userspace at a higher latency, but reclaimed memory would need to > incur a pagefault. > > To address this, I propose having reclaim-only and demotion-only > mechanisms in the kernel. There are a couple possible > interfaces to carry this out I considered: > > 1. Disable demotion in the memory.reclaim interface and add a new > demotion interface (memory.demote). > 2. Extend memory.reclaim with a "demote=<int>" flag to configure the demotion > behavior in the kernel like so: > - demote=0 would disable demotion from this call. > - demote=1 would allow the kernel to demote if it desires. > - demote=2 would only demote if possible but not attempt any > other form of reclaim. Unfortunately, our proactive reclaim stack currently relies on memory.reclaim doing both. It may not stay like that, but I'm a bit wary of changing user-visible semantics post-facto. In patch 2, you're adding a node interface to memory.demote. Can you add this to memory.reclaim instead? This would allow you to control demotion and reclaim independently as you please: if you call it on a node with demotion targets, it will demote; if you call it on a node without one, it'll reclaim. And current users will remain unaffected.
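For illustration, a rough sketch of how a userspace agent might drive the suggested "nodes=" argument. The cgroup path, the node numbering (node 0 as top-tier DRAM with a demotion target, node 1 as a terminal CXL node), and the "nodes=" argument itself are assumptions, not merged kernel behavior at this point in the thread:

/*
 * Illustrative only: the cgroup path, node numbering, and the "nodes="
 * argument are assumptions, not existing kernel behavior.
 */
#include <fcntl.h>
#include <stdio.h>
#include <string.h>
#include <unistd.h>

int main(void)
{
	const char demote_req[]  = "512M nodes=0";	/* top tier: pages demote to CXL */
	const char reclaim_req[] = "512M nodes=1";	/* terminal tier: pages are reclaimed */
	int fd = open("/sys/fs/cgroup/workload/memory.reclaim", O_WRONLY);

	if (fd < 0)
		return 1;
	if (write(fd, demote_req, strlen(demote_req)) < 0)
		perror("demote request");
	if (write(fd, reclaim_req, strlen(reclaim_req)) < 0)
		perror("reclaim request");
	close(fd);
	return 0;
}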
On Wed, Nov 23, 2022 at 10:00 AM Johannes Weiner <hannes@cmpxchg.org> wrote: > > Hello Mina, > > On Tue, Nov 22, 2022 at 12:38:45PM -0800, Mina Almasry wrote: > > Since commit 3f1509c57b1b ("Revert "mm/vmscan: never demote for memcg > > reclaim""), the proactive reclaim interface memory.reclaim does both > > reclaim and demotion. This is likely fine for us for latency critical > > jobs where we would want to disable proactive reclaim entirely, and is > > also fine for latency tolerant jobs where we would like to both > > proactively reclaim and demote. > > > > However, for some latency tiers in the middle we would like to demote but > > not reclaim. This is because reclaim and demotion incur different latency > > costs to the jobs in the cgroup. Demoted memory would still be addressable > > by the userspace at a higher latency, but reclaimed memory would need to > > incur a pagefault. > > > > To address this, I propose having reclaim-only and demotion-only > > mechanisms in the kernel. There are a couple possible > > interfaces to carry this out I considered: > > > > 1. Disable demotion in the memory.reclaim interface and add a new > > demotion interface (memory.demote). > > 2. Extend memory.reclaim with a "demote=<int>" flag to configure the demotion > > behavior in the kernel like so: > > - demote=0 would disable demotion from this call. > > - demote=1 would allow the kernel to demote if it desires. > > - demote=2 would only demote if possible but not attempt any > > other form of reclaim. > > Unfortunately, our proactive reclaim stack currently relies on > memory.reclaim doing both. It may not stay like that, but I'm a bit > wary of changing user-visible semantics post-facto. > > In patch 2, you're adding a node interface to memory.demote. Can you > add this to memory.reclaim instead? This would allow you to control > demotion and reclaim independently as you please: if you call it on a > node with demotion targets, it will demote; if you call it on a node > without one, it'll reclaim. And current users will remain unaffected. Hello Johannes, thanks for taking a look! I can certainly add the "nodes=" arg to memory.reclaim and you're right, that would help in bridging the gap. However, if I understand the underlying code correctly, with only the nodes= arg the kernel will indeed attempt demotion first, but the kernel will also merrily fall back to reclaiming if it can't demote the full amount. I had hoped to have the flexibility to protect latency sensitive jobs from reclaim entirely while attempting to do demotion. There are probably ways to get around that in the userspace. I presume the userspace can check if there is available memory on the node's demotion targets, and if so, the kernel should demote-only. But I feel that wouldn't be reliable as the demotion logic may change across kernel versions. The userspace may think the kernel would demote but instead demotion failed due to whatever heuristic introduced into the new kernel version. The above is just one angle of the issue. Another angle (which Yosry would care most about I think) is that at Google we call memory.reclaim mainly when memory.current is too close to memory.max and we expect the memory usage of the cgroup to drop as a result of a success memory.reclaim call. I suspect once we take in commit 3f1509c57b1b ("Revert "mm/vmscan: never demote for memcg reclaim""), we would run into that regression, but I defer to Yosry here, he may have a solution for that in mind already. 
For these reasons, what do you think about adding both the "nodes=" and "demote=" args from my patch series to memory.reclaim? With these args, the default memory.reclaim behavior would suit Meta as-is (no change), but we would be able to configure it to address our use cases as well. As always, thanks for taking the time to review!
On Wed, Nov 23, 2022 at 1:21 PM Mina Almasry <almasrymina@google.com> wrote: > > On Wed, Nov 23, 2022 at 10:00 AM Johannes Weiner <hannes@cmpxchg.org> wrote: > > > > Hello Mina, > > > > On Tue, Nov 22, 2022 at 12:38:45PM -0800, Mina Almasry wrote: > > > Since commit 3f1509c57b1b ("Revert "mm/vmscan: never demote for memcg > > > reclaim""), the proactive reclaim interface memory.reclaim does both > > > reclaim and demotion. This is likely fine for us for latency critical > > > jobs where we would want to disable proactive reclaim entirely, and is > > > also fine for latency tolerant jobs where we would like to both > > > proactively reclaim and demote. > > > > > > However, for some latency tiers in the middle we would like to demote but > > > not reclaim. This is because reclaim and demotion incur different latency > > > costs to the jobs in the cgroup. Demoted memory would still be addressable > > > by the userspace at a higher latency, but reclaimed memory would need to > > > incur a pagefault. > > > > > > To address this, I propose having reclaim-only and demotion-only > > > mechanisms in the kernel. There are a couple possible > > > interfaces to carry this out I considered: > > > > > > 1. Disable demotion in the memory.reclaim interface and add a new > > > demotion interface (memory.demote). > > > 2. Extend memory.reclaim with a "demote=<int>" flag to configure the demotion > > > behavior in the kernel like so: > > > - demote=0 would disable demotion from this call. > > > - demote=1 would allow the kernel to demote if it desires. > > > - demote=2 would only demote if possible but not attempt any > > > other form of reclaim. > > > > Unfortunately, our proactive reclaim stack currently relies on > > memory.reclaim doing both. It may not stay like that, but I'm a bit > > wary of changing user-visible semantics post-facto. > > > > In patch 2, you're adding a node interface to memory.demote. Can you > > add this to memory.reclaim instead? This would allow you to control > > demotion and reclaim independently as you please: if you call it on a > > node with demotion targets, it will demote; if you call it on a node > > without one, it'll reclaim. And current users will remain unaffected. > > Hello Johannes, thanks for taking a look! > > I can certainly add the "nodes=" arg to memory.reclaim and you're > right, that would help in bridging the gap. However, if I understand > the underlying code correctly, with only the nodes= arg the kernel > will indeed attempt demotion first, but the kernel will also merrily > fall back to reclaiming if it can't demote the full amount. I had > hoped to have the flexibility to protect latency sensitive jobs from > reclaim entirely while attempting to do demotion. > > There are probably ways to get around that in the userspace. I presume > the userspace can check if there is available memory on the node's > demotion targets, and if so, the kernel should demote-only. But I feel > that wouldn't be reliable as the demotion logic may change across > kernel versions. The userspace may think the kernel would demote but > instead demotion failed due to whatever heuristic introduced into the > new kernel version. > > The above is just one angle of the issue. Another angle (which Yosry > would care most about I think) is that at Google we call > memory.reclaim mainly when memory.current is too close to memory.max > and we expect the memory usage of the cgroup to drop as a result of a > success memory.reclaim call. 
I suspect once we take in commit > 3f1509c57b1b ("Revert "mm/vmscan: never demote for memcg reclaim""), > we would run into that regression, but I defer to Yosry here, he may > have a solution for that in mind already. We don't exactly rely on memory.current, but we do have a separate proactive reclaim policy today from demotion, and we do expect memory.reclaim to reclaim memory and not demote it. So it is important that we can control reclaim vs. demotion separately. Having memory.reclaim do demotions by default is not ideal for our current setup, so at least having a demote= argument to control it (no demotions, may demote, only demote) is needed. > > For these reasons, what do you think about adding both the "nodes=" > and "demote=" args from my patch series to memory.reclaim? With these > args the default memory.reclaim behavior would suit meta as-is (no > change) but we would be able to configure it to address our use cases > as well. > > As always, thanks for taking the time to review!
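To make the proposed semantics concrete, a minimal sketch of how the three demote= values could map onto an internal mode; the enum and its names are hypothetical and do not exist in the kernel:

/* Hypothetical mapping of the proposed demote= values; not kernel code. */
enum memory_reclaim_demote_mode {
	DEMOTE_NONE  = 0,	/* demote=0: reclaim only, never demote */
	DEMOTE_ALLOW = 1,	/* demote=1: kernel may demote (today's default) */
	DEMOTE_ONLY  = 2,	/* demote=2: demote if possible, never reclaim */
};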
On Wed, Nov 23, 2022 at 01:20:57PM -0800, Mina Almasry wrote: > On Wed, Nov 23, 2022 at 10:00 AM Johannes Weiner <hannes@cmpxchg.org> wrote: > > > > Hello Mina, > > > > On Tue, Nov 22, 2022 at 12:38:45PM -0800, Mina Almasry wrote: > > > Since commit 3f1509c57b1b ("Revert "mm/vmscan: never demote for memcg > > > reclaim""), the proactive reclaim interface memory.reclaim does both > > > reclaim and demotion. This is likely fine for us for latency critical > > > jobs where we would want to disable proactive reclaim entirely, and is > > > also fine for latency tolerant jobs where we would like to both > > > proactively reclaim and demote. > > > > > > However, for some latency tiers in the middle we would like to demote but > > > not reclaim. This is because reclaim and demotion incur different latency > > > costs to the jobs in the cgroup. Demoted memory would still be addressable > > > by the userspace at a higher latency, but reclaimed memory would need to > > > incur a pagefault. > > > > > > To address this, I propose having reclaim-only and demotion-only > > > mechanisms in the kernel. There are a couple possible > > > interfaces to carry this out I considered: > > > > > > 1. Disable demotion in the memory.reclaim interface and add a new > > > demotion interface (memory.demote). > > > 2. Extend memory.reclaim with a "demote=<int>" flag to configure the demotion > > > behavior in the kernel like so: > > > - demote=0 would disable demotion from this call. > > > - demote=1 would allow the kernel to demote if it desires. > > > - demote=2 would only demote if possible but not attempt any > > > other form of reclaim. > > > > Unfortunately, our proactive reclaim stack currently relies on > > memory.reclaim doing both. It may not stay like that, but I'm a bit > > wary of changing user-visible semantics post-facto. > > > > In patch 2, you're adding a node interface to memory.demote. Can you > > add this to memory.reclaim instead? This would allow you to control > > demotion and reclaim independently as you please: if you call it on a > > node with demotion targets, it will demote; if you call it on a node > > without one, it'll reclaim. And current users will remain unaffected. > > Hello Johannes, thanks for taking a look! > > I can certainly add the "nodes=" arg to memory.reclaim and you're > right, that would help in bridging the gap. However, if I understand > the underlying code correctly, with only the nodes= arg the kernel > will indeed attempt demotion first, but the kernel will also merrily > fall back to reclaiming if it can't demote the full amount. I had > hoped to have the flexibility to protect latency sensitive jobs from > reclaim entirely while attempting to do demotion. The fallback to reclaim actually strikes me as wrong. Think of reclaim as 'demoting' the pages to the storage tier. If we have a RAM -> CXL -> storage hierarchy, we should demote from RAM to CXL and from CXL to storage. If we reclaim a page from RAM, it means we 'demote' it directly from RAM to storage, bypassing potentially a huge amount of pages colder than it in CXL. That doesn't seem right. If demotion fails, IMO it shouldn't satisfy the reclaim request by breaking the layering. Rather it should deflect that pressure to the lower layers to make room. This makes sure we maintain an aging pipeline that honors the memory tier hierarchy. So I'm hesitant to design cgroup controls around the current behavior. > The above is just one angle of the issue. 
Another angle (which Yosry > would care most about I think) is that at Google we call > memory.reclaim mainly when memory.current is too close to memory.max > and we expect the memory usage of the cgroup to drop as a result of a > success memory.reclaim call. I suspect once we take in commit > 3f1509c57b1b ("Revert "mm/vmscan: never demote for memcg reclaim""), > we would run into that regression, but I defer to Yosry here, he may > have a solution for that in mind already. IMO it should both demote and reclaim. Similar to how memory.reclaim on a non-tiered memory system would both deactivate active pages and reclaim inactive pages.
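For illustration, the aging pipeline described above could be sketched roughly as below. Every helper name here (tier_is_terminal(), demote_coldest_pages(), reclaim_coldest_pages()) is made up for this sketch and does not exist in the kernel:

/*
 * Rough sketch only; all helpers are hypothetical. Pressure on a tier
 * is satisfied by demoting its coldest pages to the next tier; only
 * the terminal tier "demotes" to storage, i.e. reclaims. A failed
 * demotion deflects the pressure downward instead of skipping tiers.
 */
static unsigned long shrink_tier(int tier, unsigned long nr_to_free)
{
	unsigned long freed;

	if (tier_is_terminal(tier))
		return reclaim_coldest_pages(tier, nr_to_free);

	freed = demote_coldest_pages(tier, tier + 1, nr_to_free);
	if (freed < nr_to_free) {
		/* No room below: push the pressure down and retry,
		 * rather than reclaiming over the lower tier's head. */
		shrink_tier(tier + 1, nr_to_free - freed);
		freed += demote_coldest_pages(tier, tier + 1,
					      nr_to_free - freed);
	}
	return freed;
}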
On Wed, Nov 23, 2022 at 01:35:13PM -0800, Yosry Ahmed wrote: > On Wed, Nov 23, 2022 at 1:21 PM Mina Almasry <almasrymina@google.com> wrote: > > > > On Wed, Nov 23, 2022 at 10:00 AM Johannes Weiner <hannes@cmpxchg.org> wrote: > > > > > > Hello Mina, > > > > > > On Tue, Nov 22, 2022 at 12:38:45PM -0800, Mina Almasry wrote: > > > > Since commit 3f1509c57b1b ("Revert "mm/vmscan: never demote for memcg > > > > reclaim""), the proactive reclaim interface memory.reclaim does both > > > > reclaim and demotion. This is likely fine for us for latency critical > > > > jobs where we would want to disable proactive reclaim entirely, and is > > > > also fine for latency tolerant jobs where we would like to both > > > > proactively reclaim and demote. > > > > > > > > However, for some latency tiers in the middle we would like to demote but > > > > not reclaim. This is because reclaim and demotion incur different latency > > > > costs to the jobs in the cgroup. Demoted memory would still be addressable > > > > by the userspace at a higher latency, but reclaimed memory would need to > > > > incur a pagefault. > > > > > > > > To address this, I propose having reclaim-only and demotion-only > > > > mechanisms in the kernel. There are a couple possible > > > > interfaces to carry this out I considered: > > > > > > > > 1. Disable demotion in the memory.reclaim interface and add a new > > > > demotion interface (memory.demote). > > > > 2. Extend memory.reclaim with a "demote=<int>" flag to configure the demotion > > > > behavior in the kernel like so: > > > > - demote=0 would disable demotion from this call. > > > > - demote=1 would allow the kernel to demote if it desires. > > > > - demote=2 would only demote if possible but not attempt any > > > > other form of reclaim. > > > > > > Unfortunately, our proactive reclaim stack currently relies on > > > memory.reclaim doing both. It may not stay like that, but I'm a bit > > > wary of changing user-visible semantics post-facto. > > > > > > In patch 2, you're adding a node interface to memory.demote. Can you > > > add this to memory.reclaim instead? This would allow you to control > > > demotion and reclaim independently as you please: if you call it on a > > > node with demotion targets, it will demote; if you call it on a node > > > without one, it'll reclaim. And current users will remain unaffected. > > > > Hello Johannes, thanks for taking a look! > > > > I can certainly add the "nodes=" arg to memory.reclaim and you're > > right, that would help in bridging the gap. However, if I understand > > the underlying code correctly, with only the nodes= arg the kernel > > will indeed attempt demotion first, but the kernel will also merrily > > fall back to reclaiming if it can't demote the full amount. I had > > hoped to have the flexibility to protect latency sensitive jobs from > > reclaim entirely while attempting to do demotion. > > > > There are probably ways to get around that in the userspace. I presume > > the userspace can check if there is available memory on the node's > > demotion targets, and if so, the kernel should demote-only. But I feel > > that wouldn't be reliable as the demotion logic may change across > > kernel versions. The userspace may think the kernel would demote but > > instead demotion failed due to whatever heuristic introduced into the > > new kernel version. > > > > The above is just one angle of the issue. 
Another angle (which Yosry > > would care most about I think) is that at Google we call > > memory.reclaim mainly when memory.current is too close to memory.max > > and we expect the memory usage of the cgroup to drop as a result of a > > success memory.reclaim call. I suspect once we take in commit > > 3f1509c57b1b ("Revert "mm/vmscan: never demote for memcg reclaim""), > > we would run into that regression, but I defer to Yosry here, he may > > have a solution for that in mind already. > > We don't exactly rely on memory.current, but we do have a separate > proactive reclaim policy today from demotion, and we do expect > memory.reclaim to reclaim memory and not demote it. So it is important > that we can control reclaim vs. demotion separately. Having > memory.reclaim do demotions by default is not ideal for our current > setup, so at least having a demote= argument to control it (no > demotions, may demote, only demote) is needed. With a nodemask you should be able to only reclaim by specifying terminal memory tiers that do that, and leave out higher tiers that demote. That said, it would actually be nice if reclaim policy wouldn't have to differ from demotion policy longer term. Ultimately it comes down to mapping age to memory tier, right? Such that hot pages are in RAM, warm pages are in CXL, cold pages are in storage. If you apply equal pressure on all tiers, it's access frequency that should determine which RAM pages to demote, and which CXL pages to reclaim. If RAM pages are hot and refuse demotion, and CXL pages are cold in comparison, CXL should clear out. If RAM pages are warm, they should get demoted to CXL but not reclaimed further from there (and rotate instead). Do we know what's preventing this from happening today? What's the reason you want to control them independently?
On Wed, Nov 23, 2022 at 1:57 PM Johannes Weiner <hannes@cmpxchg.org> wrote: > > On Wed, Nov 23, 2022 at 01:20:57PM -0800, Mina Almasry wrote: > > On Wed, Nov 23, 2022 at 10:00 AM Johannes Weiner <hannes@cmpxchg.org> wrote: > > > > > > Hello Mina, > > > > > > On Tue, Nov 22, 2022 at 12:38:45PM -0800, Mina Almasry wrote: > > > > Since commit 3f1509c57b1b ("Revert "mm/vmscan: never demote for memcg > > > > reclaim""), the proactive reclaim interface memory.reclaim does both > > > > reclaim and demotion. This is likely fine for us for latency critical > > > > jobs where we would want to disable proactive reclaim entirely, and is > > > > also fine for latency tolerant jobs where we would like to both > > > > proactively reclaim and demote. > > > > > > > > However, for some latency tiers in the middle we would like to demote but > > > > not reclaim. This is because reclaim and demotion incur different latency > > > > costs to the jobs in the cgroup. Demoted memory would still be addressable > > > > by the userspace at a higher latency, but reclaimed memory would need to > > > > incur a pagefault. > > > > > > > > To address this, I propose having reclaim-only and demotion-only > > > > mechanisms in the kernel. There are a couple possible > > > > interfaces to carry this out I considered: > > > > > > > > 1. Disable demotion in the memory.reclaim interface and add a new > > > > demotion interface (memory.demote). > > > > 2. Extend memory.reclaim with a "demote=<int>" flag to configure the demotion > > > > behavior in the kernel like so: > > > > - demote=0 would disable demotion from this call. > > > > - demote=1 would allow the kernel to demote if it desires. > > > > - demote=2 would only demote if possible but not attempt any > > > > other form of reclaim. > > > > > > Unfortunately, our proactive reclaim stack currently relies on > > > memory.reclaim doing both. It may not stay like that, but I'm a bit > > > wary of changing user-visible semantics post-facto. > > > > > > In patch 2, you're adding a node interface to memory.demote. Can you > > > add this to memory.reclaim instead? This would allow you to control > > > demotion and reclaim independently as you please: if you call it on a > > > node with demotion targets, it will demote; if you call it on a node > > > without one, it'll reclaim. And current users will remain unaffected. > > > > Hello Johannes, thanks for taking a look! > > > > I can certainly add the "nodes=" arg to memory.reclaim and you're > > right, that would help in bridging the gap. However, if I understand > > the underlying code correctly, with only the nodes= arg the kernel > > will indeed attempt demotion first, but the kernel will also merrily > > fall back to reclaiming if it can't demote the full amount. I had > > hoped to have the flexibility to protect latency sensitive jobs from > > reclaim entirely while attempting to do demotion. > > The fallback to reclaim actually strikes me as wrong. > > Think of reclaim as 'demoting' the pages to the storage tier. If we > have a RAM -> CXL -> storage hierarchy, we should demote from RAM to > CXL and from CXL to storage. If we reclaim a page from RAM, it means > we 'demote' it directly from RAM to storage, bypassing potentially a > huge amount of pages colder than it in CXL. That doesn't seem right. > Ah, I see. When you put it like that it makes a lot of sense. Reclaim would be just another type of demotion as you put it, i.e. demoting from the lowest memory tier to storage. 
I assume in your model demoting from the lowest memory tier to storage includes all of swapping, writing back of dirty pages, and discarding clean file pages. All these pages (anon, clean file pages, or dirty file pages) should first be demoted down the memory tiers until finally they get 'demoted' to storage, i.e. reclaimed. > If demotion fails, IMO it shouldn't satisfy the reclaim request by > breaking the layering. Rather it should deflect that pressure to the > lower layers to make room. This makes sure we maintain an aging > pipeline that honors the memory tier hierarchy. > Also got it. I believe the pseudo code would be roughly like a bubble sort algorithm of sorts, where the coldest pages are demoted to the next memory tier, and finally reclaimed from the final memory tier:

demoted_pages = 0;
retry:
	for (memory_tier = lowest_tier; memory_tier >= 0; memory_tier--) {
		if (memory_tier == lowest_tier)
			demoted_pages += demote_to_storage();
		else
			demoted_pages += demote_to_the_next_memory_tier(memory_tier);
	}
	if (demoted_pages < pages_to_demote)
		goto retry;

> So I'm hesitant to design cgroup controls around the current behavior. > Thanks for taking the time to explain this. I think if it sounds good to folks I'll add the "nodes=" arg as you described now. Reworking the reclaim algorithm for memory tiering would be a bigger change in need of its own patchset. I think the nodes= arg by itself would help bridge the gap quite a bit. I surmise for Google we can:

1. Force reclaim by: echo "<size> nodes=<lowest memory tier nodes>" > memory.reclaim
2. Almost force demotion by: echo "<size> nodes=<highest memory tier nodes>" > memory.reclaim

In case #2 the kernel may still reclaim it if it can't demote the full amount, but that is as you put it more of a bug that should be fixed. However, even in a world where the reclaim code works as you described, I wonder if we still need some kind of demote= arg. The issue is that demoting from the lowest memory tier to storage incurs more of a latency impact to the application than demoting between the other memory tiers, and that's because the other memory tiers are directly addressable, while pages demoted to storage incur a pagefault. Not sure if that's a big concern at the moment. > > The above is just one angle of the issue. Another angle (which Yosry > > would care most about I think) is that at Google we call > > memory.reclaim mainly when memory.current is too close to memory.max > > and we expect the memory usage of the cgroup to drop as a result of a > > success memory.reclaim call. I suspect once we take in commit > > 3f1509c57b1b ("Revert "mm/vmscan: never demote for memcg reclaim""), > > we would run into that regression, but I defer to Yosry here, he may > > have a solution for that in mind already. > > IMO it should both demote and reclaim. Similar to how memory.reclaim > on a non-tiered memory system would both deactivate active pages and > reclaim inactive pages.
On Wed, Nov 23, 2022 at 2:30 PM Johannes Weiner <hannes@cmpxchg.org> wrote: > > On Wed, Nov 23, 2022 at 01:35:13PM -0800, Yosry Ahmed wrote: > > On Wed, Nov 23, 2022 at 1:21 PM Mina Almasry <almasrymina@google.com> wrote: > > > > > > On Wed, Nov 23, 2022 at 10:00 AM Johannes Weiner <hannes@cmpxchg.org> wrote: > > > > > > > > Hello Mina, > > > > > > > > On Tue, Nov 22, 2022 at 12:38:45PM -0800, Mina Almasry wrote: > > > > > Since commit 3f1509c57b1b ("Revert "mm/vmscan: never demote for memcg > > > > > reclaim""), the proactive reclaim interface memory.reclaim does both > > > > > reclaim and demotion. This is likely fine for us for latency critical > > > > > jobs where we would want to disable proactive reclaim entirely, and is > > > > > also fine for latency tolerant jobs where we would like to both > > > > > proactively reclaim and demote. > > > > > > > > > > However, for some latency tiers in the middle we would like to demote but > > > > > not reclaim. This is because reclaim and demotion incur different latency > > > > > costs to the jobs in the cgroup. Demoted memory would still be addressable > > > > > by the userspace at a higher latency, but reclaimed memory would need to > > > > > incur a pagefault. > > > > > > > > > > To address this, I propose having reclaim-only and demotion-only > > > > > mechanisms in the kernel. There are a couple possible > > > > > interfaces to carry this out I considered: > > > > > > > > > > 1. Disable demotion in the memory.reclaim interface and add a new > > > > > demotion interface (memory.demote). > > > > > 2. Extend memory.reclaim with a "demote=<int>" flag to configure the demotion > > > > > behavior in the kernel like so: > > > > > - demote=0 would disable demotion from this call. > > > > > - demote=1 would allow the kernel to demote if it desires. > > > > > - demote=2 would only demote if possible but not attempt any > > > > > other form of reclaim. > > > > > > > > Unfortunately, our proactive reclaim stack currently relies on > > > > memory.reclaim doing both. It may not stay like that, but I'm a bit > > > > wary of changing user-visible semantics post-facto. > > > > > > > > In patch 2, you're adding a node interface to memory.demote. Can you > > > > add this to memory.reclaim instead? This would allow you to control > > > > demotion and reclaim independently as you please: if you call it on a > > > > node with demotion targets, it will demote; if you call it on a node > > > > without one, it'll reclaim. And current users will remain unaffected. > > > > > > Hello Johannes, thanks for taking a look! > > > > > > I can certainly add the "nodes=" arg to memory.reclaim and you're > > > right, that would help in bridging the gap. However, if I understand > > > the underlying code correctly, with only the nodes= arg the kernel > > > will indeed attempt demotion first, but the kernel will also merrily > > > fall back to reclaiming if it can't demote the full amount. I had > > > hoped to have the flexibility to protect latency sensitive jobs from > > > reclaim entirely while attempting to do demotion. > > > > > > There are probably ways to get around that in the userspace. I presume > > > the userspace can check if there is available memory on the node's > > > demotion targets, and if so, the kernel should demote-only. But I feel > > > that wouldn't be reliable as the demotion logic may change across > > > kernel versions. 
The userspace may think the kernel would demote but > > > instead demotion failed due to whatever heuristic introduced into the > > > new kernel version. > > > > > > The above is just one angle of the issue. Another angle (which Yosry > > > would care most about I think) is that at Google we call > > > memory.reclaim mainly when memory.current is too close to memory.max > > > and we expect the memory usage of the cgroup to drop as a result of a > > > success memory.reclaim call. I suspect once we take in commit > > > 3f1509c57b1b ("Revert "mm/vmscan: never demote for memcg reclaim""), > > > we would run into that regression, but I defer to Yosry here, he may > > > have a solution for that in mind already. > > > > We don't exactly rely on memory.current, but we do have a separate > > proactive reclaim policy today from demotion, and we do expect > > memory.reclaim to reclaim memory and not demote it. So it is important > > that we can control reclaim vs. demotion separately. Having > > memory.reclaim do demotions by default is not ideal for our current > > setup, so at least having a demote= argument to control it (no > > demotions, may demote, only demote) is needed. > > With a nodemask you should be able to only reclaim by specifying > terminal memory tiers that do that, and leave out higher tiers that > demote. > > That said, it would actually be nice if reclaim policy wouldn't have > to differ from demotion policy longer term. Ultimately it comes down > to mapping age to memory tier, right? Such that hot pages are in RAM, > warm pages are in CXL, cold pages are in storage. If you apply equal > presure on all tiers, it's access frequency that should determine > which RAM pages to demote, and which CXL pages to reclaim. If RAM > pages are hot and refuse demotion, and CXL pages are cold in > comparison, CXL should clear out. If RAM pages are warm, they should > get demoted to CXL but not reclaimed further from there (and rotate > instead). > > Do we know what's preventing this from happening today? What's the > reason you want to control them independently? The motivation was giving user space more flexibility to design their policies. However, as you point out, the current behavior of falling back to reclaiming when we cannot demote is not ideal, and maybe we should not design policies around it. We can always revisit this if a use case arises where a clear distinction needs to be drawn between reclaiming and demotion policies.
Hi, Johannes, Johannes Weiner <hannes@cmpxchg.org> writes: [...] > > The fallback to reclaim actually strikes me as wrong. > > Think of reclaim as 'demoting' the pages to the storage tier. If we > have a RAM -> CXL -> storage hierarchy, we should demote from RAM to > CXL and from CXL to storage. If we reclaim a page from RAM, it means > we 'demote' it directly from RAM to storage, bypassing potentially a > huge amount of pages colder than it in CXL. That doesn't seem right. > > If demotion fails, IMO it shouldn't satisfy the reclaim request by > breaking the layering. Rather it should deflect that pressure to the > lower layers to make room. This makes sure we maintain an aging > pipeline that honors the memory tier hierarchy. Yes. I think that we should avoid to fall back to reclaim as much as possible too. Now, when we allocate memory for demotion (alloc_demote_page()), __GFP_KSWAPD_RECLAIM is used. So, we will trigger kswapd reclaim on lower tier node to free some memory to avoid fall back to reclaim on current (higher tier) node. This may be not good enough, for example, the following patch from Hasan may help via waking up kswapd earlier. https://lore.kernel.org/linux-mm/b45b9bf7cd3e21bca61d82dcd1eb692cd32c122c.1637778851.git.hasanalmaruf@fb.com/ Do you know what is the next step plan for this patch? Should we do even more? From another point of view, I still think that we can use falling back to reclaim as the last resort to avoid OOM in some special situations, for example, most pages in the lowest tier node are mlock() or too hot to be reclaimed. > So I'm hesitant to design cgroup controls around the current behavior. > Best Regards, Huang, Ying
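For reference, the demotion allocation being discussed looks roughly like the following in mm/vmscan.c around the v6.1 timeframe (paraphrased, not an exact quote; flags and context may differ between versions): direct reclaim on the target is masked off, but GFP_NOWAIT still carries __GFP_KSWAPD_RECLAIM, so the lower tier's kswapd is woken.

/* Paraphrased sketch of the demotion target allocation (~v6.1 mm/vmscan.c). */
struct migration_target_control mtc = {
	/*
	 * Allocate from the demotion target, but fail quickly and
	 * quietly: no direct reclaim there, though GFP_NOWAIT
	 * (== __GFP_KSWAPD_RECLAIM) still wakes the target's kswapd.
	 */
	.gfp_mask = (GFP_HIGHUSER_MOVABLE & ~__GFP_RECLAIM) | __GFP_NOWARN |
		    __GFP_NOMEMALLOC | GFP_NOWAIT,
	.nid = target_nid,
	.nmask = &allowed_mask,
};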
On Wed, Nov 23, 2022 at 9:52 PM Huang, Ying <ying.huang@intel.com> wrote: > > Hi, Johannes, > > Johannes Weiner <hannes@cmpxchg.org> writes: > [...] > > > > The fallback to reclaim actually strikes me as wrong. > > > > Think of reclaim as 'demoting' the pages to the storage tier. If we > > have a RAM -> CXL -> storage hierarchy, we should demote from RAM to > > CXL and from CXL to storage. If we reclaim a page from RAM, it means > > we 'demote' it directly from RAM to storage, bypassing potentially a > > huge amount of pages colder than it in CXL. That doesn't seem right. > > > > If demotion fails, IMO it shouldn't satisfy the reclaim request by > > breaking the layering. Rather it should deflect that pressure to the > > lower layers to make room. This makes sure we maintain an aging > > pipeline that honors the memory tier hierarchy. > > Yes. I think that we should avoid to fall back to reclaim as much as > possible too. Now, when we allocate memory for demotion > (alloc_demote_page()), __GFP_KSWAPD_RECLAIM is used. So, we will trigger > kswapd reclaim on lower tier node to free some memory to avoid fall back > to reclaim on current (higher tier) node. This may be not good enough, > for example, the following patch from Hasan may help via waking up > kswapd earlier. For the ideal case, I do agree with Johannes to demote the page tier by tier rather than reclaiming them from the higher tiers. But I also agree with your premature OOM concern. > > https://lore.kernel.org/linux-mm/b45b9bf7cd3e21bca61d82dcd1eb692cd32c122c.1637778851.git.hasanalmaruf@fb.com/ > > Do you know what is the next step plan for this patch? > > Should we do even more? In my initial implementation I implemented a simple throttle logic when the demotion is not going to succeed if the demotion target has not enough free memory (just check the watermark) to make migration succeed without doing any reclamation. Shall we resurrect that? Waking kswapd sooner is fine to me, but it may be not enough, for example, the kswapd may not keep up so remature OOM may happen on higher tiers or reclaim may still happen. I think throttling the reclaimer/demoter until kswapd makes progress could avoid both. And since the lower tiers memory typically is quite larger than the higher tiers, so the throttle should happen very rarely IMHO. > > From another point of view, I still think that we can use falling back > to reclaim as the last resort to avoid OOM in some special situations, > for example, most pages in the lowest tier node are mlock() or too hot > to be reclaimed. > > > So I'm hesitant to design cgroup controls around the current behavior. > > > > Best Regards, > Huang, Ying >
Yang Shi <shy828301@gmail.com> writes: > On Wed, Nov 23, 2022 at 9:52 PM Huang, Ying <ying.huang@intel.com> wrote: >> >> Hi, Johannes, >> >> Johannes Weiner <hannes@cmpxchg.org> writes: >> [...] >> > >> > The fallback to reclaim actually strikes me as wrong. >> > >> > Think of reclaim as 'demoting' the pages to the storage tier. If we >> > have a RAM -> CXL -> storage hierarchy, we should demote from RAM to >> > CXL and from CXL to storage. If we reclaim a page from RAM, it means >> > we 'demote' it directly from RAM to storage, bypassing potentially a >> > huge amount of pages colder than it in CXL. That doesn't seem right. >> > >> > If demotion fails, IMO it shouldn't satisfy the reclaim request by >> > breaking the layering. Rather it should deflect that pressure to the >> > lower layers to make room. This makes sure we maintain an aging >> > pipeline that honors the memory tier hierarchy. >> >> Yes. I think that we should avoid to fall back to reclaim as much as >> possible too. Now, when we allocate memory for demotion >> (alloc_demote_page()), __GFP_KSWAPD_RECLAIM is used. So, we will trigger >> kswapd reclaim on lower tier node to free some memory to avoid fall back >> to reclaim on current (higher tier) node. This may be not good enough, >> for example, the following patch from Hasan may help via waking up >> kswapd earlier. > > For the ideal case, I do agree with Johannes to demote the page tier > by tier rather than reclaiming them from the higher tiers. But I also > agree with your premature OOM concern. > >> >> https://lore.kernel.org/linux-mm/b45b9bf7cd3e21bca61d82dcd1eb692cd32c122c.1637778851.git.hasanalmaruf@fb.com/ >> >> Do you know what is the next step plan for this patch? >> >> Should we do even more? > > In my initial implementation I implemented a simple throttle logic > when the demotion is not going to succeed if the demotion target has > not enough free memory (just check the watermark) to make migration > succeed without doing any reclamation. Shall we resurrect that? Can you share the link to your throttle patch? Or paste it here? > Waking kswapd sooner is fine to me, but it may be not enough, for > example, the kswapd may not keep up so remature OOM may happen on > higher tiers or reclaim may still happen. I think throttling the > reclaimer/demoter until kswapd makes progress could avoid both. And > since the lower tiers memory typically is quite larger than the higher > tiers, so the throttle should happen very rarely IMHO. > >> >> From another point of view, I still think that we can use falling back >> to reclaim as the last resort to avoid OOM in some special situations, >> for example, most pages in the lowest tier node are mlock() or too hot >> to be reclaimed. >> >> > So I'm hesitant to design cgroup controls around the current behavior. Best Regards, Huang, Ying
On Mon, Nov 28, 2022 at 4:54 PM Huang, Ying <ying.huang@intel.com> wrote: > > Yang Shi <shy828301@gmail.com> writes: > > > On Wed, Nov 23, 2022 at 9:52 PM Huang, Ying <ying.huang@intel.com> wrote: > >> > >> Hi, Johannes, > >> > >> Johannes Weiner <hannes@cmpxchg.org> writes: > >> [...] > >> > > >> > The fallback to reclaim actually strikes me as wrong. > >> > > >> > Think of reclaim as 'demoting' the pages to the storage tier. If we > >> > have a RAM -> CXL -> storage hierarchy, we should demote from RAM to > >> > CXL and from CXL to storage. If we reclaim a page from RAM, it means > >> > we 'demote' it directly from RAM to storage, bypassing potentially a > >> > huge amount of pages colder than it in CXL. That doesn't seem right. > >> > > >> > If demotion fails, IMO it shouldn't satisfy the reclaim request by > >> > breaking the layering. Rather it should deflect that pressure to the > >> > lower layers to make room. This makes sure we maintain an aging > >> > pipeline that honors the memory tier hierarchy. > >> > >> Yes. I think that we should avoid to fall back to reclaim as much as > >> possible too. Now, when we allocate memory for demotion > >> (alloc_demote_page()), __GFP_KSWAPD_RECLAIM is used. So, we will trigger > >> kswapd reclaim on lower tier node to free some memory to avoid fall back > >> to reclaim on current (higher tier) node. This may be not good enough, > >> for example, the following patch from Hasan may help via waking up > >> kswapd earlier. > > > > For the ideal case, I do agree with Johannes to demote the page tier > > by tier rather than reclaiming them from the higher tiers. But I also > > agree with your premature OOM concern. > > > >> > >> https://lore.kernel.org/linux-mm/b45b9bf7cd3e21bca61d82dcd1eb692cd32c122c.1637778851.git.hasanalmaruf@fb.com/ > >> > >> Do you know what is the next step plan for this patch? > >> > >> Should we do even more? > > > > In my initial implementation I implemented a simple throttle logic > > when the demotion is not going to succeed if the demotion target has > > not enough free memory (just check the watermark) to make migration > > succeed without doing any reclamation. Shall we resurrect that? > > Can you share the link to your throttle patch? Or paste it here? I just found this on the mailing list. https://lore.kernel.org/linux-mm/1560468577-101178-8-git-send-email-yang.shi@linux.alibaba.com/ But it didn't have the throttling logic, I may not submit that version to the mailing list since we decided to drop this and merge mine and Dave's. Anyway it is not hard to add the throttling logic, we already have a few throttling cases in vmscan, for example, "mm/vmscan: throttle reclaim until some writeback completes if congested". > > > Waking kswapd sooner is fine to me, but it may be not enough, for > > example, the kswapd may not keep up so remature OOM may happen on > > higher tiers or reclaim may still happen. I think throttling the > > reclaimer/demoter until kswapd makes progress could avoid both. And > > since the lower tiers memory typically is quite larger than the higher > > tiers, so the throttle should happen very rarely IMHO. > > > >> > >> From another point of view, I still think that we can use falling back > >> to reclaim as the last resort to avoid OOM in some special situations, > >> for example, most pages in the lowest tier node are mlock() or too hot > >> to be reclaimed. > >> > >> > So I'm hesitant to design cgroup controls around the current behavior. > > Best Regards, > Huang, Ying
Hello Ying, On Thu, Nov 24, 2022 at 01:51:20PM +0800, Huang, Ying wrote: > Johannes Weiner <hannes@cmpxchg.org> writes: > > The fallback to reclaim actually strikes me as wrong. > > > > Think of reclaim as 'demoting' the pages to the storage tier. If we > > have a RAM -> CXL -> storage hierarchy, we should demote from RAM to > > CXL and from CXL to storage. If we reclaim a page from RAM, it means > > we 'demote' it directly from RAM to storage, bypassing potentially a > > huge amount of pages colder than it in CXL. That doesn't seem right. > > > > If demotion fails, IMO it shouldn't satisfy the reclaim request by > > breaking the layering. Rather it should deflect that pressure to the > > lower layers to make room. This makes sure we maintain an aging > > pipeline that honors the memory tier hierarchy. > > Yes. I think that we should avoid to fall back to reclaim as much as > possible too. Now, when we allocate memory for demotion > (alloc_demote_page()), __GFP_KSWAPD_RECLAIM is used. So, we will trigger > kswapd reclaim on lower tier node to free some memory to avoid fall back > to reclaim on current (higher tier) node. This may be not good enough, > for example, the following patch from Hasan may help via waking up > kswapd earlier. > > https://lore.kernel.org/linux-mm/b45b9bf7cd3e21bca61d82dcd1eb692cd32c122c.1637778851.git.hasanalmaruf@fb.com/ > > Do you know what is the next step plan for this patch? > > Should we do even more? > > From another point of view, I still think that we can use falling back > to reclaim as the last resort to avoid OOM in some special situations, > for example, most pages in the lowest tier node are mlock() or too hot > to be reclaimed. If they're hotter than reclaim candidates on the toptier, shouldn't they get promoted instead and make room that way? We may have to tweak the watermark logic a bit to facilitate that (allow promotions where regular allocations already fail?). But this sort of resorting would be preferable to age inversions. The mlock scenario sounds possible. In that case, it wouldn't be an aging inversion, since there is nothing colder on the CXL node. Maybe a bypass check should explicitly consult the demotion target watermarks against its evictable pages (similar to the file_is_tiny check in prepare_scan_count)? Because in any other scenario, if there is a bug in the promo/demo coordination, I think we'd rather have the OOM than deal with age inversions causing intermittent performance issues that are incredibly hard to track down.
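A very rough sketch of the bypass check suggested above; the helper name and the exact accounting are made up here, the point is only that the fallback would consult what the demotion target could plausibly free:

/*
 * Hypothetical helper; not kernel code. Only allow reclaim to bypass
 * the demotion tier when the target provably cannot make room, e.g.
 * because it has (almost) no evictable pages left.
 */
static bool demotion_target_can_make_room(int target_nid,
					  unsigned long nr_pages)
{
	pg_data_t *pgdat = NODE_DATA(target_nid);
	unsigned long evictable;

	evictable = node_page_state(pgdat, NR_ACTIVE_ANON) +
		    node_page_state(pgdat, NR_INACTIVE_ANON) +
		    node_page_state(pgdat, NR_ACTIVE_FILE) +
		    node_page_state(pgdat, NR_INACTIVE_FILE);

	/* If the target could evict enough, deflect pressure to it
	 * instead of reclaiming from this tier. */
	return evictable >= nr_pages;
}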
On Wed, Nov 23, 2022 at 9:52 PM Huang, Ying <ying.huang@intel.com> wrote: > > Hi, Johannes, > > Johannes Weiner <hannes@cmpxchg.org> writes: > [...] > > > > The fallback to reclaim actually strikes me as wrong. > > > > Think of reclaim as 'demoting' the pages to the storage tier. If we > > have a RAM -> CXL -> storage hierarchy, we should demote from RAM to > > CXL and from CXL to storage. If we reclaim a page from RAM, it means > > we 'demote' it directly from RAM to storage, bypassing potentially a > > huge amount of pages colder than it in CXL. That doesn't seem right. > > > > If demotion fails, IMO it shouldn't satisfy the reclaim request by > > breaking the layering. Rather it should deflect that pressure to the > > lower layers to make room. This makes sure we maintain an aging > > pipeline that honors the memory tier hierarchy. > > Yes. I think that we should avoid to fall back to reclaim as much as > possible too. Now, when we allocate memory for demotion > (alloc_demote_page()), __GFP_KSWAPD_RECLAIM is used. So, we will trigger I may be missing something but as far I can tell reclaim is disabled for allocations from lower tier memory: https://elixir.bootlin.com/linux/v6.1-rc7/source/mm/vmscan.c#L1583 I think this is maybe a good thing when doing proactive demotion. In this case we probably don't want to try to reclaim from lower tier nodes and instead fail the proactive demotion. However I can see this being desirable when the top tier nodes are under real memory pressure to deflect that pressure to the lower tier nodes. > kswapd reclaim on lower tier node to free some memory to avoid fall back > to reclaim on current (higher tier) node. This may be not good enough, > for example, the following patch from Hasan may help via waking up > kswapd earlier. > > https://lore.kernel.org/linux-mm/b45b9bf7cd3e21bca61d82dcd1eb692cd32c122c.1637778851.git.hasanalmaruf@fb.com/ > > Do you know what is the next step plan for this patch? > > Should we do even more? > > From another point of view, I still think that we can use falling back > to reclaim as the last resort to avoid OOM in some special situations, > for example, most pages in the lowest tier node are mlock() or too hot > to be reclaimed. > > > So I'm hesitant to design cgroup controls around the current behavior. I sent RFC v2 patch: https://lore.kernel.org/linux-mm/20221130020328.1009347-1-almasrymina@google.com/T/#u Please take a look when convenient. Thanks! > > > > Best Regards, > Huang, Ying >
Johannes Weiner <hannes@cmpxchg.org> writes: > Hello Ying, > > On Thu, Nov 24, 2022 at 01:51:20PM +0800, Huang, Ying wrote: >> Johannes Weiner <hannes@cmpxchg.org> writes: >> > The fallback to reclaim actually strikes me as wrong. >> > >> > Think of reclaim as 'demoting' the pages to the storage tier. If we >> > have a RAM -> CXL -> storage hierarchy, we should demote from RAM to >> > CXL and from CXL to storage. If we reclaim a page from RAM, it means >> > we 'demote' it directly from RAM to storage, bypassing potentially a >> > huge amount of pages colder than it in CXL. That doesn't seem right. >> > >> > If demotion fails, IMO it shouldn't satisfy the reclaim request by >> > breaking the layering. Rather it should deflect that pressure to the >> > lower layers to make room. This makes sure we maintain an aging >> > pipeline that honors the memory tier hierarchy. >> >> Yes. I think that we should avoid to fall back to reclaim as much as >> possible too. Now, when we allocate memory for demotion >> (alloc_demote_page()), __GFP_KSWAPD_RECLAIM is used. So, we will trigger >> kswapd reclaim on lower tier node to free some memory to avoid fall back >> to reclaim on current (higher tier) node. This may be not good enough, >> for example, the following patch from Hasan may help via waking up >> kswapd earlier. >> >> https://lore.kernel.org/linux-mm/b45b9bf7cd3e21bca61d82dcd1eb692cd32c122c.1637778851.git.hasanalmaruf@fb.com/ >> >> Do you know what is the next step plan for this patch? >> >> Should we do even more? >> >> From another point of view, I still think that we can use falling back >> to reclaim as the last resort to avoid OOM in some special situations, >> for example, most pages in the lowest tier node are mlock() or too hot >> to be reclaimed. > > If they're hotter than reclaim candidates on the toptier, shouldn't > they get promoted instead and make room that way? We may have to tweak > the watermark logic a bit to facilitate that (allow promotions where > regular allocations already fail?). But this sort of resorting would > be preferable to age inversions. Now it's legal to enable demotion and disable promotion. Yes, this is wrong configuration in general. But should we trigger OOM for these users? And now promotion only works for default NUMA policy (and MPOL_BIND to both promotion source and target nodes with MPOL_F_NUMA_BALANCING). If we use some other NUMA policy, the pages cannot be promoted too. > The mlock scenario sounds possible. In that case, it wouldn't be an > aging inversion, since there is nothing colder on the CXL node. > > Maybe a bypass check should explicitly consult the demotion target > watermarks against its evictable pages (similar to the file_is_tiny > check in prepare_scan_count)? Yes. This sounds doable. > Because in any other scenario, if there is a bug in the promo/demo > coordination, I think we'd rather have the OOM than deal with age > inversions causing intermittent performance issues that are incredibly > hard to track down. Previously, I thought that people will always prefer performance regression than OOM. Apparently, I am wrong. Anyway, I think that we need to reduce the possibility of OOM or falling back to reclaim as much as possible firstly. Do you agree? One possibility, can we fall back to reclaim only if the sc->priority is small enough (even 0)? Best Regards, Huang, Ying
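As a sketch of that last suggestion (hypothetical; no such helper exists), the fallback could be keyed off scan priority so that reclaim only bypasses the lower tier once pressure has escalated to the last-resort priority:

/* Hypothetical gate for the demotion->reclaim fallback; not kernel code. */
static bool may_fall_back_to_reclaim(struct scan_control *sc)
{
	/* Only bypass the lower tiers once everything else has failed. */
	return sc->priority == 0;
}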
Yang Shi <shy828301@gmail.com> writes: > On Mon, Nov 28, 2022 at 4:54 PM Huang, Ying <ying.huang@intel.com> wrote: >> >> Yang Shi <shy828301@gmail.com> writes: >> >> > On Wed, Nov 23, 2022 at 9:52 PM Huang, Ying <ying.huang@intel.com> wrote: >> >> >> >> Hi, Johannes, >> >> >> >> Johannes Weiner <hannes@cmpxchg.org> writes: >> >> [...] >> >> > >> >> > The fallback to reclaim actually strikes me as wrong. >> >> > >> >> > Think of reclaim as 'demoting' the pages to the storage tier. If we >> >> > have a RAM -> CXL -> storage hierarchy, we should demote from RAM to >> >> > CXL and from CXL to storage. If we reclaim a page from RAM, it means >> >> > we 'demote' it directly from RAM to storage, bypassing potentially a >> >> > huge amount of pages colder than it in CXL. That doesn't seem right. >> >> > >> >> > If demotion fails, IMO it shouldn't satisfy the reclaim request by >> >> > breaking the layering. Rather it should deflect that pressure to the >> >> > lower layers to make room. This makes sure we maintain an aging >> >> > pipeline that honors the memory tier hierarchy. >> >> >> >> Yes. I think that we should avoid to fall back to reclaim as much as >> >> possible too. Now, when we allocate memory for demotion >> >> (alloc_demote_page()), __GFP_KSWAPD_RECLAIM is used. So, we will trigger >> >> kswapd reclaim on lower tier node to free some memory to avoid fall back >> >> to reclaim on current (higher tier) node. This may be not good enough, >> >> for example, the following patch from Hasan may help via waking up >> >> kswapd earlier. >> > >> > For the ideal case, I do agree with Johannes to demote the page tier >> > by tier rather than reclaiming them from the higher tiers. But I also >> > agree with your premature OOM concern. >> > >> >> >> >> https://lore.kernel.org/linux-mm/b45b9bf7cd3e21bca61d82dcd1eb692cd32c122c.1637778851.git.hasanalmaruf@fb.com/ >> >> >> >> Do you know what is the next step plan for this patch? >> >> >> >> Should we do even more? >> > >> > In my initial implementation I implemented a simple throttle logic >> > when the demotion is not going to succeed if the demotion target has >> > not enough free memory (just check the watermark) to make migration >> > succeed without doing any reclamation. Shall we resurrect that? >> >> Can you share the link to your throttle patch? Or paste it here? > > I just found this on the mailing list. > https://lore.kernel.org/linux-mm/1560468577-101178-8-git-send-email-yang.shi@linux.alibaba.com/ Per my understanding, this patch will avoid demoting if there's no free space on demotion target? If so, I think that we should trigger kswapd reclaiming on demotion target before that. And we can simply avoid to fall back to reclaim firstly, then avoid to scan as an improvement as that in your patch above. Best Regards, Huang, Ying > But it didn't have the throttling logic, I may not submit that version > to the mailing list since we decided to drop this and merge mine and > Dave's. > > Anyway it is not hard to add the throttling logic, we already have a > few throttling cases in vmscan, for example, "mm/vmscan: throttle > reclaim until some writeback completes if congested". >> >> > Waking kswapd sooner is fine to me, but it may be not enough, for >> > example, the kswapd may not keep up so remature OOM may happen on >> > higher tiers or reclaim may still happen. I think throttling the >> > reclaimer/demoter until kswapd makes progress could avoid both. 
And >> > since the lower tiers memory typically is quite larger than the higher >> > tiers, so the throttle should happen very rarely IMHO. >> > >> >> >> >> From another point of view, I still think that we can use falling back >> >> to reclaim as the last resort to avoid OOM in some special situations, >> >> for example, most pages in the lowest tier node are mlock() or too hot >> >> to be reclaimed. >> >> >> >> > So I'm hesitant to design cgroup controls around the current behavior. >> >> Best Regards, >> Huang, Ying
Mina Almasry <almasrymina@google.com> writes: > On Wed, Nov 23, 2022 at 9:52 PM Huang, Ying <ying.huang@intel.com> wrote: >> >> Hi, Johannes, >> >> Johannes Weiner <hannes@cmpxchg.org> writes: >> [...] >> > >> > The fallback to reclaim actually strikes me as wrong. >> > >> > Think of reclaim as 'demoting' the pages to the storage tier. If we >> > have a RAM -> CXL -> storage hierarchy, we should demote from RAM to >> > CXL and from CXL to storage. If we reclaim a page from RAM, it means >> > we 'demote' it directly from RAM to storage, bypassing potentially a >> > huge amount of pages colder than it in CXL. That doesn't seem right. >> > >> > If demotion fails, IMO it shouldn't satisfy the reclaim request by >> > breaking the layering. Rather it should deflect that pressure to the >> > lower layers to make room. This makes sure we maintain an aging >> > pipeline that honors the memory tier hierarchy. >> >> Yes. I think that we should avoid to fall back to reclaim as much as >> possible too. Now, when we allocate memory for demotion >> (alloc_demote_page()), __GFP_KSWAPD_RECLAIM is used. So, we will trigger > > I may be missing something but as far I can tell reclaim is disabled > for allocations from lower tier memory: > https://elixir.bootlin.com/linux/v6.1-rc7/source/mm/vmscan.c#L1583 #define GFP_NOWAIT (__GFP_KSWAPD_RECLAIM) We have GFP_NOWAIT set in gfp. > I think this is maybe a good thing when doing proactive demotion. In > this case we probably don't want to try to reclaim from lower tier > nodes and instead fail the proactive demotion. Do you have some real use cases for this? If so, we can tweak the logic. > However I can see this being desirable when the top tier nodes are > under real memory pressure to deflect that pressure to the lower tier > nodes. Yes. Best Regards, Huang, Ying >> kswapd reclaim on lower tier node to free some memory to avoid fall back >> to reclaim on current (higher tier) node. This may be not good enough, >> for example, the following patch from Hasan may help via waking up >> kswapd earlier. >> >> https://lore.kernel.org/linux-mm/b45b9bf7cd3e21bca61d82dcd1eb692cd32c122c.1637778851.git.hasanalmaruf@fb.com/ >> >> Do you know what is the next step plan for this patch? >> >> Should we do even more? >> >> From another point of view, I still think that we can use falling back >> to reclaim as the last resort to avoid OOM in some special situations, >> for example, most pages in the lowest tier node are mlock() or too hot >> to be reclaimed. >> >> > So I'm hesitant to design cgroup controls around the current behavior. > > I sent RFC v2 patch: > https://lore.kernel.org/linux-mm/20221130020328.1009347-1-almasrymina@google.com/T/#u > > Please take a look when convenient. Thanks! > >> > >> >> Best Regards, >> Huang, Ying >>
On Tue, Nov 29, 2022 at 9:40 PM Huang, Ying <ying.huang@intel.com> wrote: > > Mina Almasry <almasrymina@google.com> writes: > > > On Wed, Nov 23, 2022 at 9:52 PM Huang, Ying <ying.huang@intel.com> wrote: > >> > >> Hi, Johannes, > >> > >> Johannes Weiner <hannes@cmpxchg.org> writes: > >> [...] > >> > > >> > The fallback to reclaim actually strikes me as wrong. > >> > > >> > Think of reclaim as 'demoting' the pages to the storage tier. If we > >> > have a RAM -> CXL -> storage hierarchy, we should demote from RAM to > >> > CXL and from CXL to storage. If we reclaim a page from RAM, it means > >> > we 'demote' it directly from RAM to storage, bypassing potentially a > >> > huge amount of pages colder than it in CXL. That doesn't seem right. > >> > > >> > If demotion fails, IMO it shouldn't satisfy the reclaim request by > >> > breaking the layering. Rather it should deflect that pressure to the > >> > lower layers to make room. This makes sure we maintain an aging > >> > pipeline that honors the memory tier hierarchy. > >> > >> Yes. I think that we should avoid to fall back to reclaim as much as > >> possible too. Now, when we allocate memory for demotion > >> (alloc_demote_page()), __GFP_KSWAPD_RECLAIM is used. So, we will trigger > > > > I may be missing something but as far I can tell reclaim is disabled > > for allocations from lower tier memory: > > https://elixir.bootlin.com/linux/v6.1-rc7/source/mm/vmscan.c#L1583 > > #define GFP_NOWAIT (__GFP_KSWAPD_RECLAIM) > > We have GFP_NOWAIT set in gfp. > Ah, thanks. I missed that. > > I think this is maybe a good thing when doing proactive demotion. In > > this case we probably don't want to try to reclaim from lower tier > > nodes and instead fail the proactive demotion. > > Do you have some real use cases for this? If so, we can tweak the > logic. > Nothing real at the moment. I was thinking this may be something desirable to tune at some point. > > However I can see this being desirable when the top tier nodes are > > under real memory pressure to deflect that pressure to the lower tier > > nodes. > > Yes. > > Best Regards, > Huang, Ying > > >> kswapd reclaim on lower tier node to free some memory to avoid fall back > >> to reclaim on current (higher tier) node. This may be not good enough, > >> for example, the following patch from Hasan may help via waking up > >> kswapd earlier. > >> > >> https://lore.kernel.org/linux-mm/b45b9bf7cd3e21bca61d82dcd1eb692cd32c122c.1637778851.git.hasanalmaruf@fb.com/ > >> > >> Do you know what is the next step plan for this patch? > >> > >> Should we do even more? > >> > >> From another point of view, I still think that we can use falling back > >> to reclaim as the last resort to avoid OOM in some special situations, > >> for example, most pages in the lowest tier node are mlock() or too hot > >> to be reclaimed. > >> > >> > So I'm hesitant to design cgroup controls around the current behavior. > > > > I sent RFC v2 patch: > > https://lore.kernel.org/linux-mm/20221130020328.1009347-1-almasrymina@google.com/T/#u > > > > Please take a look when convenient. Thanks! > > > >> > > >> > >> Best Regards, > >> Huang, Ying > >> >
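For reference, the allocation constraint being discussed above looks roughly like the following. This is a simplified sketch rather than the exact mask used by alloc_demote_page(), and the helper name is made up, but it shows why a demotion allocation can wake kswapd on the lower tier (GFP_NOWAIT is __GFP_KSWAPD_RECLAIM) while never entering direct reclaim (__GFP_DIRECT_RECLAIM stays cleared):

/*
 * Sketch of the gfp constraints for allocating a demotion target page.
 * Clearing __GFP_RECLAIM and adding back GFP_NOWAIT means the allocation
 * may wake kswapd on the lower tier node, but it never blocks in direct
 * reclaim and fails quickly instead.
 */
static gfp_t demote_gfp_mask(void)
{
	return (GFP_HIGHUSER_MOVABLE & ~__GFP_RECLAIM) |
	       __GFP_NOWARN | __GFP_NOMEMALLOC | GFP_NOWAIT;
}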
On Tue, Nov 29, 2022 at 9:33 PM Huang, Ying <ying.huang@intel.com> wrote: > > Yang Shi <shy828301@gmail.com> writes: > > > On Mon, Nov 28, 2022 at 4:54 PM Huang, Ying <ying.huang@intel.com> wrote: > >> > >> Yang Shi <shy828301@gmail.com> writes: > >> > >> > On Wed, Nov 23, 2022 at 9:52 PM Huang, Ying <ying.huang@intel.com> wrote: > >> >> > >> >> Hi, Johannes, > >> >> > >> >> Johannes Weiner <hannes@cmpxchg.org> writes: > >> >> [...] > >> >> > > >> >> > The fallback to reclaim actually strikes me as wrong. > >> >> > > >> >> > Think of reclaim as 'demoting' the pages to the storage tier. If we > >> >> > have a RAM -> CXL -> storage hierarchy, we should demote from RAM to > >> >> > CXL and from CXL to storage. If we reclaim a page from RAM, it means > >> >> > we 'demote' it directly from RAM to storage, bypassing potentially a > >> >> > huge amount of pages colder than it in CXL. That doesn't seem right. > >> >> > > >> >> > If demotion fails, IMO it shouldn't satisfy the reclaim request by > >> >> > breaking the layering. Rather it should deflect that pressure to the > >> >> > lower layers to make room. This makes sure we maintain an aging > >> >> > pipeline that honors the memory tier hierarchy. > >> >> > >> >> Yes. I think that we should avoid to fall back to reclaim as much as > >> >> possible too. Now, when we allocate memory for demotion > >> >> (alloc_demote_page()), __GFP_KSWAPD_RECLAIM is used. So, we will trigger > >> >> kswapd reclaim on lower tier node to free some memory to avoid fall back > >> >> to reclaim on current (higher tier) node. This may be not good enough, > >> >> for example, the following patch from Hasan may help via waking up > >> >> kswapd earlier. > >> > > >> > For the ideal case, I do agree with Johannes to demote the page tier > >> > by tier rather than reclaiming them from the higher tiers. But I also > >> > agree with your premature OOM concern. > >> > > >> >> > >> >> https://lore.kernel.org/linux-mm/b45b9bf7cd3e21bca61d82dcd1eb692cd32c122c.1637778851.git.hasanalmaruf@fb.com/ > >> >> > >> >> Do you know what is the next step plan for this patch? > >> >> > >> >> Should we do even more? > >> > > >> > In my initial implementation I implemented a simple throttle logic > >> > when the demotion is not going to succeed if the demotion target has > >> > not enough free memory (just check the watermark) to make migration > >> > succeed without doing any reclamation. Shall we resurrect that? > >> > >> Can you share the link to your throttle patch? Or paste it here? > > > > I just found this on the mailing list. > > https://lore.kernel.org/linux-mm/1560468577-101178-8-git-send-email-yang.shi@linux.alibaba.com/ > > Per my understanding, this patch will avoid demoting if there's no free > space on demotion target? If so, I think that we should trigger kswapd > reclaiming on demotion target before that. And we can simply avoid to > fall back to reclaim firstly, then avoid to scan as an improvement as > that in your patch above. Yes, it should. The rough idea looks like: if (the demote target is contended) wake up kswapd reclaim_throttle(VMSCAN_THROTTLE_DEMOTION) retry demotion The kswapd is responsible for clearing the contention flag. > > Best Regards, > Huang, Ying > > > But it didn't have the throttling logic, I may not submit that version > > to the mailing list since we decided to drop this and merge mine and > > Dave's. 
> > > > Anyway it is not hard to add the throttling logic, we already have a > > few throttling cases in vmscan, for example, "mm/vmscan: throttle > > reclaim until some writeback completes if congested". > >> > >> > Waking kswapd sooner is fine to me, but it may be not enough, for > >> > example, the kswapd may not keep up so remature OOM may happen on > >> > higher tiers or reclaim may still happen. I think throttling the > >> > reclaimer/demoter until kswapd makes progress could avoid both. And > >> > since the lower tiers memory typically is quite larger than the higher > >> > tiers, so the throttle should happen very rarely IMHO. > >> > > >> >> > >> >> From another point of view, I still think that we can use falling back > >> >> to reclaim as the last resort to avoid OOM in some special situations, > >> >> for example, most pages in the lowest tier node are mlock() or too hot > >> >> to be reclaimed. > >> >> > >> >> > So I'm hesitant to design cgroup controls around the current behavior. > >> > >> Best Regards, > >> Huang, Ying
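For concreteness, a minimal sketch of the throttle-and-retry idea above. demotion_target_contended() and VMSCAN_THROTTLE_DEMOTION are hypothetical (reclaim_throttle() and the existing throttle reasons are real); the point is only the shape of "wake kswapd on the target, wait, retry" instead of falling back to reclaiming from the upper tier:

/* Sketch only, not a real patch: called where vmscan would otherwise
 * give up on demotion and reclaim from the upper tier instead. */
static unsigned int demote_or_throttle(struct list_head *demote_folios,
				       struct pglist_data *pgdat)
{
	int target_nid = next_demotion_node(pgdat->node_id);

	if (target_nid == NUMA_NO_NODE)
		return 0;

	while (demotion_target_contended(target_nid)) {	/* hypothetical */
		/* Ask kswapd on the lower tier to make room... */
		wakeup_kswapd(&NODE_DATA(target_nid)->node_zones[ZONE_NORMAL],
			      GFP_KERNEL, 0, ZONE_NORMAL);
		/* ...and wait for it instead of reclaiming up here. kswapd
		 * on the target would clear the contention flag. */
		reclaim_throttle(NODE_DATA(target_nid),
				 VMSCAN_THROTTLE_DEMOTION); /* hypothetical */
	}

	return demote_folio_list(demote_folios, pgdat);
}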
Yang Shi <shy828301@gmail.com> writes: > On Tue, Nov 29, 2022 at 9:33 PM Huang, Ying <ying.huang@intel.com> wrote: >> >> Yang Shi <shy828301@gmail.com> writes: >> >> > On Mon, Nov 28, 2022 at 4:54 PM Huang, Ying <ying.huang@intel.com> wrote: >> >> >> >> Yang Shi <shy828301@gmail.com> writes: >> >> >> >> > On Wed, Nov 23, 2022 at 9:52 PM Huang, Ying <ying.huang@intel.com> wrote: >> >> >> >> >> >> Hi, Johannes, >> >> >> >> >> >> Johannes Weiner <hannes@cmpxchg.org> writes: >> >> >> [...] >> >> >> > >> >> >> > The fallback to reclaim actually strikes me as wrong. >> >> >> > >> >> >> > Think of reclaim as 'demoting' the pages to the storage tier. If we >> >> >> > have a RAM -> CXL -> storage hierarchy, we should demote from RAM to >> >> >> > CXL and from CXL to storage. If we reclaim a page from RAM, it means >> >> >> > we 'demote' it directly from RAM to storage, bypassing potentially a >> >> >> > huge amount of pages colder than it in CXL. That doesn't seem right. >> >> >> > >> >> >> > If demotion fails, IMO it shouldn't satisfy the reclaim request by >> >> >> > breaking the layering. Rather it should deflect that pressure to the >> >> >> > lower layers to make room. This makes sure we maintain an aging >> >> >> > pipeline that honors the memory tier hierarchy. >> >> >> >> >> >> Yes. I think that we should avoid to fall back to reclaim as much as >> >> >> possible too. Now, when we allocate memory for demotion >> >> >> (alloc_demote_page()), __GFP_KSWAPD_RECLAIM is used. So, we will trigger >> >> >> kswapd reclaim on lower tier node to free some memory to avoid fall back >> >> >> to reclaim on current (higher tier) node. This may be not good enough, >> >> >> for example, the following patch from Hasan may help via waking up >> >> >> kswapd earlier. >> >> > >> >> > For the ideal case, I do agree with Johannes to demote the page tier >> >> > by tier rather than reclaiming them from the higher tiers. But I also >> >> > agree with your premature OOM concern. >> >> > >> >> >> >> >> >> https://lore.kernel.org/linux-mm/b45b9bf7cd3e21bca61d82dcd1eb692cd32c122c.1637778851.git.hasanalmaruf@fb.com/ >> >> >> >> >> >> Do you know what is the next step plan for this patch? >> >> >> >> >> >> Should we do even more? >> >> > >> >> > In my initial implementation I implemented a simple throttle logic >> >> > when the demotion is not going to succeed if the demotion target has >> >> > not enough free memory (just check the watermark) to make migration >> >> > succeed without doing any reclamation. Shall we resurrect that? >> >> >> >> Can you share the link to your throttle patch? Or paste it here? >> > >> > I just found this on the mailing list. >> > https://lore.kernel.org/linux-mm/1560468577-101178-8-git-send-email-yang.shi@linux.alibaba.com/ >> >> Per my understanding, this patch will avoid demoting if there's no free >> space on demotion target? If so, I think that we should trigger kswapd >> reclaiming on demotion target before that. And we can simply avoid to >> fall back to reclaim firstly, then avoid to scan as an improvement as >> that in your patch above. > > Yes, it should. The rough idea looks like: > > if (the demote target is contended) > wake up kswapd > reclaim_throttle(VMSCAN_THROTTLE_DEMOTION) > retry demotion > > The kswapd is responsible for clearing the contention flag. We may do this, at least for demotion in kswapd. But I think that this could be the second step optimization after we make correct choice between demotion/reclaim. 
What if the pages in demotion target is too hot to be reclaimed first? Should we reclaim in fast memory node to avoid OOM? Best Regards, Huang, Ying >> >> > But it didn't have the throttling logic, I may not submit that version >> > to the mailing list since we decided to drop this and merge mine and >> > Dave's. >> > >> > Anyway it is not hard to add the throttling logic, we already have a >> > few throttling cases in vmscan, for example, "mm/vmscan: throttle >> > reclaim until some writeback completes if congested". >> >> >> >> > Waking kswapd sooner is fine to me, but it may be not enough, for >> >> > example, the kswapd may not keep up so remature OOM may happen on >> >> > higher tiers or reclaim may still happen. I think throttling the >> >> > reclaimer/demoter until kswapd makes progress could avoid both. And >> >> > since the lower tiers memory typically is quite larger than the higher >> >> > tiers, so the throttle should happen very rarely IMHO. >> >> > >> >> >> >> >> >> From another point of view, I still think that we can use falling back >> >> >> to reclaim as the last resort to avoid OOM in some special situations, >> >> >> for example, most pages in the lowest tier node are mlock() or too hot >> >> >> to be reclaimed. >> >> >> >> >> >> > So I'm hesitant to design cgroup controls around the current behavior. >> >> >> >> Best Regards, >> >> Huang, Ying
On Tue, Nov 29, 2022 at 7:56 PM Huang, Ying <ying.huang@intel.com> wrote: > > Johannes Weiner <hannes@cmpxchg.org> writes: > > > Hello Ying, > > > > On Thu, Nov 24, 2022 at 01:51:20PM +0800, Huang, Ying wrote: > >> Johannes Weiner <hannes@cmpxchg.org> writes: > >> > The fallback to reclaim actually strikes me as wrong. > >> > > >> > Think of reclaim as 'demoting' the pages to the storage tier. If we > >> > have a RAM -> CXL -> storage hierarchy, we should demote from RAM to > >> > CXL and from CXL to storage. If we reclaim a page from RAM, it means > >> > we 'demote' it directly from RAM to storage, bypassing potentially a > >> > huge amount of pages colder than it in CXL. That doesn't seem right. > >> > > >> > If demotion fails, IMO it shouldn't satisfy the reclaim request by > >> > breaking the layering. Rather it should deflect that pressure to the > >> > lower layers to make room. This makes sure we maintain an aging > >> > pipeline that honors the memory tier hierarchy. > >> > >> Yes. I think that we should avoid to fall back to reclaim as much as > >> possible too. Now, when we allocate memory for demotion > >> (alloc_demote_page()), __GFP_KSWAPD_RECLAIM is used. So, we will trigger > >> kswapd reclaim on lower tier node to free some memory to avoid fall back > >> to reclaim on current (higher tier) node. This may be not good enough, > >> for example, the following patch from Hasan may help via waking up > >> kswapd earlier. > >> > >> https://lore.kernel.org/linux-mm/b45b9bf7cd3e21bca61d82dcd1eb692cd32c122c.1637778851.git.hasanalmaruf@fb.com/ > >> > >> Do you know what is the next step plan for this patch? > >> > >> Should we do even more? > >> > >> From another point of view, I still think that we can use falling back > >> to reclaim as the last resort to avoid OOM in some special situations, > >> for example, most pages in the lowest tier node are mlock() or too hot > >> to be reclaimed. > > > > If they're hotter than reclaim candidates on the toptier, shouldn't > > they get promoted instead and make room that way? We may have to tweak > > the watermark logic a bit to facilitate that (allow promotions where > > regular allocations already fail?). But this sort of resorting would > > be preferable to age inversions. > > Now it's legal to enable demotion and disable promotion. Yes, this is > wrong configuration in general. But should we trigger OOM for these > users? > > And now promotion only works for default NUMA policy (and MPOL_BIND to > both promotion source and target nodes with MPOL_F_NUMA_BALANCING). If > we use some other NUMA policy, the pages cannot be promoted too. > > > The mlock scenario sounds possible. In that case, it wouldn't be an > > aging inversion, since there is nothing colder on the CXL node. > > > > Maybe a bypass check should explicitly consult the demotion target > > watermarks against its evictable pages (similar to the file_is_tiny > > check in prepare_scan_count)? > > Yes. This sounds doable. > > > Because in any other scenario, if there is a bug in the promo/demo > > coordination, I think we'd rather have the OOM than deal with age > > inversions causing intermittent performance issues that are incredibly > > hard to track down. > > Previously, I thought that people will always prefer performance > regression than OOM. Apparently, I am wrong. > > Anyway, I think that we need to reduce the possibility of OOM or falling > back to reclaim as much as possible firstly. Do you agree? > I've been discussing this with a few folks here. 
I think FWIW general feeling here is that demoting from top tier nodes is preferred, except in extreme circumstances we would indeed like to run with a performance issue rather than OOM a customer VM. I wonder if there is another way to debug mis-tiered pages rather than trigger an oom to debug. One thing I think/hope we can trivially agree on is that proactive reclaim/demotion is _not_ an extreme circumstance. I would like me or someone from the team to follow up with a patch that disables fallback to reclaim on proactive reclaim/demotion (sc->proactive). > One possibility, can we fall back to reclaim only if the sc->priority is > small enough (even 0)? > This makes sense to me. > Best Regards, > Huang, Ying >
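To make that last proposal concrete, the change could look roughly like the following in shrink_folio_list() after the demotion pass. This is only a sketch of the idea, relying on the fact that memory.reclaim sets MEMCG_RECLAIM_PROACTIVE and hence sc->proactive; it is not the patch that was eventually posted:

	/* Inside shrink_folio_list(), after the demotion pass (sketch): */
	nr_reclaimed += demote_folio_list(&demote_folios, pgdat);

	/* Folios that could not be demoted are still on @demote_folios. */
	if (!list_empty(&demote_folios)) {
		/* Put them back on @folio_list either way... */
		list_splice_init(&demote_folios, folio_list);
		/*
		 * ...but only retry them as reclaim candidates under real
		 * memory pressure. Proactive reclaim is not an emergency,
		 * so don't reclaim out of tier order on its behalf.
		 */
		if (!sc->proactive) {
			do_demote_pass = false;
			goto retry;
		}
	}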
On Wed, Nov 30, 2022 at 5:52 PM Huang, Ying <ying.huang@intel.com> wrote: > > Yang Shi <shy828301@gmail.com> writes: > > > On Tue, Nov 29, 2022 at 9:33 PM Huang, Ying <ying.huang@intel.com> wrote: > >> > >> Yang Shi <shy828301@gmail.com> writes: > >> > >> > On Mon, Nov 28, 2022 at 4:54 PM Huang, Ying <ying.huang@intel.com> wrote: > >> >> > >> >> Yang Shi <shy828301@gmail.com> writes: > >> >> > >> >> > On Wed, Nov 23, 2022 at 9:52 PM Huang, Ying <ying.huang@intel.com> wrote: > >> >> >> > >> >> >> Hi, Johannes, > >> >> >> > >> >> >> Johannes Weiner <hannes@cmpxchg.org> writes: > >> >> >> [...] > >> >> >> > > >> >> >> > The fallback to reclaim actually strikes me as wrong. > >> >> >> > > >> >> >> > Think of reclaim as 'demoting' the pages to the storage tier. If we > >> >> >> > have a RAM -> CXL -> storage hierarchy, we should demote from RAM to > >> >> >> > CXL and from CXL to storage. If we reclaim a page from RAM, it means > >> >> >> > we 'demote' it directly from RAM to storage, bypassing potentially a > >> >> >> > huge amount of pages colder than it in CXL. That doesn't seem right. > >> >> >> > > >> >> >> > If demotion fails, IMO it shouldn't satisfy the reclaim request by > >> >> >> > breaking the layering. Rather it should deflect that pressure to the > >> >> >> > lower layers to make room. This makes sure we maintain an aging > >> >> >> > pipeline that honors the memory tier hierarchy. > >> >> >> > >> >> >> Yes. I think that we should avoid to fall back to reclaim as much as > >> >> >> possible too. Now, when we allocate memory for demotion > >> >> >> (alloc_demote_page()), __GFP_KSWAPD_RECLAIM is used. So, we will trigger > >> >> >> kswapd reclaim on lower tier node to free some memory to avoid fall back > >> >> >> to reclaim on current (higher tier) node. This may be not good enough, > >> >> >> for example, the following patch from Hasan may help via waking up > >> >> >> kswapd earlier. > >> >> > > >> >> > For the ideal case, I do agree with Johannes to demote the page tier > >> >> > by tier rather than reclaiming them from the higher tiers. But I also > >> >> > agree with your premature OOM concern. > >> >> > > >> >> >> > >> >> >> https://lore.kernel.org/linux-mm/b45b9bf7cd3e21bca61d82dcd1eb692cd32c122c.1637778851.git.hasanalmaruf@fb.com/ > >> >> >> > >> >> >> Do you know what is the next step plan for this patch? > >> >> >> > >> >> >> Should we do even more? > >> >> > > >> >> > In my initial implementation I implemented a simple throttle logic > >> >> > when the demotion is not going to succeed if the demotion target has > >> >> > not enough free memory (just check the watermark) to make migration > >> >> > succeed without doing any reclamation. Shall we resurrect that? > >> >> > >> >> Can you share the link to your throttle patch? Or paste it here? > >> > > >> > I just found this on the mailing list. > >> > https://lore.kernel.org/linux-mm/1560468577-101178-8-git-send-email-yang.shi@linux.alibaba.com/ > >> > >> Per my understanding, this patch will avoid demoting if there's no free > >> space on demotion target? If so, I think that we should trigger kswapd > >> reclaiming on demotion target before that. And we can simply avoid to > >> fall back to reclaim firstly, then avoid to scan as an improvement as > >> that in your patch above. > > > > Yes, it should. The rough idea looks like: > > > > if (the demote target is contended) > > wake up kswapd > > reclaim_throttle(VMSCAN_THROTTLE_DEMOTION) > > retry demotion > > > > The kswapd is responsible for clearing the contention flag. 
> > We may do this, at least for demotion in kswapd. But I think that this > could be the second step optimization after we make correct choice > between demotion/reclaim. What if the pages in demotion target is too > hot to be reclaimed first? Should we reclaim in fast memory node to > avoid OOM? IMHO we can't avoid reclaiming from the fast nodes entirely if we prioritize avoiding OOMs. But it should happen very very rarely with the throttling logic or other methods. BTW did you run any test to see how many times vmscan reclaims from fast nodes instead of demotion with the current implementation for some typical workloads? > > Best Regards, > Huang, Ying > > >> > >> > But it didn't have the throttling logic, I may not submit that version > >> > to the mailing list since we decided to drop this and merge mine and > >> > Dave's. > >> > > >> > Anyway it is not hard to add the throttling logic, we already have a > >> > few throttling cases in vmscan, for example, "mm/vmscan: throttle > >> > reclaim until some writeback completes if congested". > >> >> > >> >> > Waking kswapd sooner is fine to me, but it may be not enough, for > >> >> > example, the kswapd may not keep up so remature OOM may happen on > >> >> > higher tiers or reclaim may still happen. I think throttling the > >> >> > reclaimer/demoter until kswapd makes progress could avoid both. And > >> >> > since the lower tiers memory typically is quite larger than the higher > >> >> > tiers, so the throttle should happen very rarely IMHO. > >> >> > > >> >> >> > >> >> >> From another point of view, I still think that we can use falling back > >> >> >> to reclaim as the last resort to avoid OOM in some special situations, > >> >> >> for example, most pages in the lowest tier node are mlock() or too hot > >> >> >> to be reclaimed. > >> >> >> > >> >> >> > So I'm hesitant to design cgroup controls around the current behavior. > >> >> > >> >> Best Regards, > >> >> Huang, Ying
Yang Shi <shy828301@gmail.com> writes: > On Wed, Nov 30, 2022 at 5:52 PM Huang, Ying <ying.huang@intel.com> wrote: >> >> Yang Shi <shy828301@gmail.com> writes: >> >> > On Tue, Nov 29, 2022 at 9:33 PM Huang, Ying <ying.huang@intel.com> wrote: >> >> >> >> Yang Shi <shy828301@gmail.com> writes: >> >> >> >> > On Mon, Nov 28, 2022 at 4:54 PM Huang, Ying <ying.huang@intel.com> wrote: >> >> >> >> >> >> Yang Shi <shy828301@gmail.com> writes: >> >> >> >> >> >> > On Wed, Nov 23, 2022 at 9:52 PM Huang, Ying <ying.huang@intel.com> wrote: >> >> >> >> >> >> >> >> Hi, Johannes, >> >> >> >> >> >> >> >> Johannes Weiner <hannes@cmpxchg.org> writes: >> >> >> >> [...] >> >> >> >> > >> >> >> >> > The fallback to reclaim actually strikes me as wrong. >> >> >> >> > >> >> >> >> > Think of reclaim as 'demoting' the pages to the storage tier. If we >> >> >> >> > have a RAM -> CXL -> storage hierarchy, we should demote from RAM to >> >> >> >> > CXL and from CXL to storage. If we reclaim a page from RAM, it means >> >> >> >> > we 'demote' it directly from RAM to storage, bypassing potentially a >> >> >> >> > huge amount of pages colder than it in CXL. That doesn't seem right. >> >> >> >> > >> >> >> >> > If demotion fails, IMO it shouldn't satisfy the reclaim request by >> >> >> >> > breaking the layering. Rather it should deflect that pressure to the >> >> >> >> > lower layers to make room. This makes sure we maintain an aging >> >> >> >> > pipeline that honors the memory tier hierarchy. >> >> >> >> >> >> >> >> Yes. I think that we should avoid to fall back to reclaim as much as >> >> >> >> possible too. Now, when we allocate memory for demotion >> >> >> >> (alloc_demote_page()), __GFP_KSWAPD_RECLAIM is used. So, we will trigger >> >> >> >> kswapd reclaim on lower tier node to free some memory to avoid fall back >> >> >> >> to reclaim on current (higher tier) node. This may be not good enough, >> >> >> >> for example, the following patch from Hasan may help via waking up >> >> >> >> kswapd earlier. >> >> >> > >> >> >> > For the ideal case, I do agree with Johannes to demote the page tier >> >> >> > by tier rather than reclaiming them from the higher tiers. But I also >> >> >> > agree with your premature OOM concern. >> >> >> > >> >> >> >> >> >> >> >> https://lore.kernel.org/linux-mm/b45b9bf7cd3e21bca61d82dcd1eb692cd32c122c.1637778851.git.hasanalmaruf@fb.com/ >> >> >> >> >> >> >> >> Do you know what is the next step plan for this patch? >> >> >> >> >> >> >> >> Should we do even more? >> >> >> > >> >> >> > In my initial implementation I implemented a simple throttle logic >> >> >> > when the demotion is not going to succeed if the demotion target has >> >> >> > not enough free memory (just check the watermark) to make migration >> >> >> > succeed without doing any reclamation. Shall we resurrect that? >> >> >> >> >> >> Can you share the link to your throttle patch? Or paste it here? >> >> > >> >> > I just found this on the mailing list. >> >> > https://lore.kernel.org/linux-mm/1560468577-101178-8-git-send-email-yang.shi@linux.alibaba.com/ >> >> >> >> Per my understanding, this patch will avoid demoting if there's no free >> >> space on demotion target? If so, I think that we should trigger kswapd >> >> reclaiming on demotion target before that. And we can simply avoid to >> >> fall back to reclaim firstly, then avoid to scan as an improvement as >> >> that in your patch above. >> > >> > Yes, it should. 
The rough idea looks like: >> > >> > if (the demote target is contended) >> > wake up kswapd >> > reclaim_throttle(VMSCAN_THROTTLE_DEMOTION) >> > retry demotion >> > >> > The kswapd is responsible for clearing the contention flag. >> >> We may do this, at least for demotion in kswapd. But I think that this >> could be the second step optimization after we make correct choice >> between demotion/reclaim. What if the pages in demotion target is too >> hot to be reclaimed first? Should we reclaim in fast memory node to >> avoid OOM? > > IMHO we can't avoid reclaiming from the fast nodes entirely if we > prioritize avoiding OOMs. Yes. I think so too. > But it should happen very very rarely with the throttling logic or > other methods. Yes. I think that this is possible. > BTW did you run any test to see how many times vmscan reclaims from > fast nodes instead of demotion with the current implementation for > some typical workloads? No. I haven't done that. Best Regards, Huang, Ying >> >> >> >> >> > But it didn't have the throttling logic, I may not submit that version >> >> > to the mailing list since we decided to drop this and merge mine and >> >> > Dave's. >> >> > >> >> > Anyway it is not hard to add the throttling logic, we already have a >> >> > few throttling cases in vmscan, for example, "mm/vmscan: throttle >> >> > reclaim until some writeback completes if congested". >> >> >> >> >> >> > Waking kswapd sooner is fine to me, but it may be not enough, for >> >> >> > example, the kswapd may not keep up so remature OOM may happen on >> >> >> > higher tiers or reclaim may still happen. I think throttling the >> >> >> > reclaimer/demoter until kswapd makes progress could avoid both. And >> >> >> > since the lower tiers memory typically is quite larger than the higher >> >> >> > tiers, so the throttle should happen very rarely IMHO. >> >> >> > >> >> >> >> >> >> >> >> From another point of view, I still think that we can use falling back >> >> >> >> to reclaim as the last resort to avoid OOM in some special situations, >> >> >> >> for example, most pages in the lowest tier node are mlock() or too hot >> >> >> >> to be reclaimed. >> >> >> >> >> >> >> >> > So I'm hesitant to design cgroup controls around the current behavior. >> >> >> >> >> >> Best Regards, >> >> >> Huang, Ying
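On the measurement question: recent kernels already expose enough to observe this from userspace, since /proc/vmstat carries pgdemote_kswapd and pgdemote_direct alongside the pgsteal_* counters. A small throwaway helper (not from this thread) to snapshot the relevant counters:

/* Dump the demotion vs. reclaim counters from /proc/vmstat. */
#include <stdio.h>
#include <string.h>

int main(void)
{
	char name[64];
	unsigned long long val;
	FILE *f = fopen("/proc/vmstat", "r");

	if (!f) {
		perror("/proc/vmstat");
		return 1;
	}
	while (fscanf(f, "%63s %llu", name, &val) == 2) {
		if (!strncmp(name, "pgdemote_", 9) ||
		    !strncmp(name, "pgsteal_", 8))
			printf("%s %llu\n", name, val);
	}
	fclose(f);
	return 0;
}

Comparing pgdemote_* against pgsteal_* before and after a run only gives a rough, system-wide ratio; per-node numbers would need the vmstat files under /sys/devices/system/node/nodeN/.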
Mina Almasry <almasrymina@google.com> writes: > On Tue, Nov 29, 2022 at 7:56 PM Huang, Ying <ying.huang@intel.com> wrote: >> >> Johannes Weiner <hannes@cmpxchg.org> writes: >> >> > Hello Ying, >> > >> > On Thu, Nov 24, 2022 at 01:51:20PM +0800, Huang, Ying wrote: >> >> Johannes Weiner <hannes@cmpxchg.org> writes: >> >> > The fallback to reclaim actually strikes me as wrong. >> >> > >> >> > Think of reclaim as 'demoting' the pages to the storage tier. If we >> >> > have a RAM -> CXL -> storage hierarchy, we should demote from RAM to >> >> > CXL and from CXL to storage. If we reclaim a page from RAM, it means >> >> > we 'demote' it directly from RAM to storage, bypassing potentially a >> >> > huge amount of pages colder than it in CXL. That doesn't seem right. >> >> > >> >> > If demotion fails, IMO it shouldn't satisfy the reclaim request by >> >> > breaking the layering. Rather it should deflect that pressure to the >> >> > lower layers to make room. This makes sure we maintain an aging >> >> > pipeline that honors the memory tier hierarchy. >> >> >> >> Yes. I think that we should avoid to fall back to reclaim as much as >> >> possible too. Now, when we allocate memory for demotion >> >> (alloc_demote_page()), __GFP_KSWAPD_RECLAIM is used. So, we will trigger >> >> kswapd reclaim on lower tier node to free some memory to avoid fall back >> >> to reclaim on current (higher tier) node. This may be not good enough, >> >> for example, the following patch from Hasan may help via waking up >> >> kswapd earlier. >> >> >> >> https://lore.kernel.org/linux-mm/b45b9bf7cd3e21bca61d82dcd1eb692cd32c122c.1637778851.git.hasanalmaruf@fb.com/ >> >> >> >> Do you know what is the next step plan for this patch? >> >> >> >> Should we do even more? >> >> >> >> From another point of view, I still think that we can use falling back >> >> to reclaim as the last resort to avoid OOM in some special situations, >> >> for example, most pages in the lowest tier node are mlock() or too hot >> >> to be reclaimed. >> > >> > If they're hotter than reclaim candidates on the toptier, shouldn't >> > they get promoted instead and make room that way? We may have to tweak >> > the watermark logic a bit to facilitate that (allow promotions where >> > regular allocations already fail?). But this sort of resorting would >> > be preferable to age inversions. >> >> Now it's legal to enable demotion and disable promotion. Yes, this is >> wrong configuration in general. But should we trigger OOM for these >> users? >> >> And now promotion only works for default NUMA policy (and MPOL_BIND to >> both promotion source and target nodes with MPOL_F_NUMA_BALANCING). If >> we use some other NUMA policy, the pages cannot be promoted too. >> >> > The mlock scenario sounds possible. In that case, it wouldn't be an >> > aging inversion, since there is nothing colder on the CXL node. >> > >> > Maybe a bypass check should explicitly consult the demotion target >> > watermarks against its evictable pages (similar to the file_is_tiny >> > check in prepare_scan_count)? >> >> Yes. This sounds doable. >> >> > Because in any other scenario, if there is a bug in the promo/demo >> > coordination, I think we'd rather have the OOM than deal with age >> > inversions causing intermittent performance issues that are incredibly >> > hard to track down. >> >> Previously, I thought that people will always prefer performance >> regression than OOM. Apparently, I am wrong. 
>> >> Anyway, I think that we need to reduce the possibility of OOM or falling >> back to reclaim as much as possible firstly. Do you agree? >> > > I've been discussing this with a few folks here. I think FWIW general > feeling here is that demoting from top tier nodes is preferred, except > in extreme circumstances we would indeed like to run with a > performance issue rather than OOM a customer VM. I wonder if there is > another way to debug mis-tiered pages rather than trigger an oom to > debug. > > One thing I think/hope we can trivially agree on is that proactive > reclaim/demotion is _not_ an extreme circumstance. I would like me or > someone from the team to follow up with a patch that disables fallback > to reclaim on proactive reclaim/demotion (sc->proactive). Yes. This makes sense to me. Best Regards, Huang, Ying >> One possibility, can we fall back to reclaim only if the sc->priority is >> small enough (even 0)? >> > > This makes sense to me. > >> Best Regards, >> Huang, Ying >>
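The bypass check mentioned in the quoted discussion could be sketched as below. All names here are illustrative (there is no such helper upstream); the idea is just to weigh the demotion target's watermark against its free plus evictable pages, similar in spirit to the file_is_tiny check in prepare_scan_count(), and only allow the fallback to reclaim when the target genuinely cannot make room:

/* Sketch: would the demotion target be unable to absorb anything even
 * after reclaiming/demoting all of its evictable pages? */
static bool demotion_target_exhausted(int nid)
{
	int target_nid = next_demotion_node(nid);
	pg_data_t *target;
	unsigned long free, evictable;

	if (target_nid == NUMA_NO_NODE)
		return true;		/* nowhere to demote to at all */

	target = NODE_DATA(target_nid);
	free = sum_zone_node_page_state(target_nid, NR_FREE_PAGES);
	evictable = node_page_state(target, NR_ACTIVE_ANON) +
		    node_page_state(target, NR_INACTIVE_ANON) +
		    node_page_state(target, NR_ACTIVE_FILE) +
		    node_page_state(target, NR_INACTIVE_FILE);

	return free + evictable <
	       high_wmark_pages(&target->node_zones[ZONE_NORMAL]);
}

A caller in the fallback path would then skip the reclaim fallback unless demotion_target_exhausted() returns true, leaving upper-tier reclaim only for targets that are truly full of unevictable or too-hot memory.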
On Thu, Dec 1, 2022 at 6:02 PM Huang, Ying <ying.huang@intel.com> wrote: > > Mina Almasry <almasrymina@google.com> writes: > > > On Tue, Nov 29, 2022 at 7:56 PM Huang, Ying <ying.huang@intel.com> wrote: > >> > >> Johannes Weiner <hannes@cmpxchg.org> writes: > >> > >> > Hello Ying, > >> > > >> > On Thu, Nov 24, 2022 at 01:51:20PM +0800, Huang, Ying wrote: > >> >> Johannes Weiner <hannes@cmpxchg.org> writes: > >> >> > The fallback to reclaim actually strikes me as wrong. > >> >> > > >> >> > Think of reclaim as 'demoting' the pages to the storage tier. If we > >> >> > have a RAM -> CXL -> storage hierarchy, we should demote from RAM to > >> >> > CXL and from CXL to storage. If we reclaim a page from RAM, it means > >> >> > we 'demote' it directly from RAM to storage, bypassing potentially a > >> >> > huge amount of pages colder than it in CXL. That doesn't seem right. > >> >> > > >> >> > If demotion fails, IMO it shouldn't satisfy the reclaim request by > >> >> > breaking the layering. Rather it should deflect that pressure to the > >> >> > lower layers to make room. This makes sure we maintain an aging > >> >> > pipeline that honors the memory tier hierarchy. > >> >> > >> >> Yes. I think that we should avoid to fall back to reclaim as much as > >> >> possible too. Now, when we allocate memory for demotion > >> >> (alloc_demote_page()), __GFP_KSWAPD_RECLAIM is used. So, we will trigger > >> >> kswapd reclaim on lower tier node to free some memory to avoid fall back > >> >> to reclaim on current (higher tier) node. This may be not good enough, > >> >> for example, the following patch from Hasan may help via waking up > >> >> kswapd earlier. > >> >> > >> >> https://lore.kernel.org/linux-mm/b45b9bf7cd3e21bca61d82dcd1eb692cd32c122c.1637778851.git.hasanalmaruf@fb.com/ > >> >> > >> >> Do you know what is the next step plan for this patch? > >> >> > >> >> Should we do even more? > >> >> > >> >> From another point of view, I still think that we can use falling back > >> >> to reclaim as the last resort to avoid OOM in some special situations, > >> >> for example, most pages in the lowest tier node are mlock() or too hot > >> >> to be reclaimed. > >> > > >> > If they're hotter than reclaim candidates on the toptier, shouldn't > >> > they get promoted instead and make room that way? We may have to tweak > >> > the watermark logic a bit to facilitate that (allow promotions where > >> > regular allocations already fail?). But this sort of resorting would > >> > be preferable to age inversions. > >> > >> Now it's legal to enable demotion and disable promotion. Yes, this is > >> wrong configuration in general. But should we trigger OOM for these > >> users? > >> > >> And now promotion only works for default NUMA policy (and MPOL_BIND to > >> both promotion source and target nodes with MPOL_F_NUMA_BALANCING). If > >> we use some other NUMA policy, the pages cannot be promoted too. > >> > >> > The mlock scenario sounds possible. In that case, it wouldn't be an > >> > aging inversion, since there is nothing colder on the CXL node. > >> > > >> > Maybe a bypass check should explicitly consult the demotion target > >> > watermarks against its evictable pages (similar to the file_is_tiny > >> > check in prepare_scan_count)? > >> > >> Yes. This sounds doable. > >> > >> > Because in any other scenario, if there is a bug in the promo/demo > >> > coordination, I think we'd rather have the OOM than deal with age > >> > inversions causing intermittent performance issues that are incredibly > >> > hard to track down. 
> >> > >> Previously, I thought that people will always prefer performance > >> regression than OOM. Apparently, I am wrong. > >> > >> Anyway, I think that we need to reduce the possibility of OOM or falling > >> back to reclaim as much as possible firstly. Do you agree? > >> > > > > I've been discussing this with a few folks here. I think FWIW general > > feeling here is that demoting from top tier nodes is preferred, except > > in extreme circumstances we would indeed like to run with a > > performance issue rather than OOM a customer VM. I wonder if there is > > another way to debug mis-tiered pages rather than trigger an oom to > > debug. > > > > One thing I think/hope we can trivially agree on is that proactive > > reclaim/demotion is _not_ an extreme circumstance. I would like me or > > someone from the team to follow up with a patch that disables fallback > > to reclaim on proactive reclaim/demotion (sc->proactive). > > Yes. This makes sense to me. > Glad to hear it. Patch is already sent for review btw: https://lore.kernel.org/linux-mm/20221201233317.1394958-1-almasrymina@google.com/T/ > Best Regards, > Huang, Ying > > >> One possibility, can we fall back to reclaim only if the sc->priority is > >> small enough (even 0)? > >> > > > > This makes sense to me. > > > >> Best Regards, > >> Huang, Ying > >> >
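The priority-gated variant quoted above ("only if the sc->priority is small enough") is also easy to picture; a sketch, orthogonal to the sc->proactive check and again not a real patch:

	/* In the shrink_folio_list() fallback path (sketch). sc->priority
	 * counts down from DEF_PRIORITY toward 0 as reclaim grows more
	 * desperate; only then let reclaim bypass the tier order. */
	if (!list_empty(&demote_folios) && sc->priority == 0) {
		list_splice_init(&demote_folios, folio_list);
		do_demote_pass = false;
		goto retry;
	}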
diff --git a/include/linux/swap.h b/include/linux/swap.h
index fec6647a289a..f768171c2dc2 100644
--- a/include/linux/swap.h
+++ b/include/linux/swap.h
@@ -416,6 +416,12 @@ extern unsigned long try_to_free_pages(struct zonelist *zonelist, int order,
 #define MEMCG_RECLAIM_MAY_SWAP (1 << 1)
 #define MEMCG_RECLAIM_PROACTIVE (1 << 2)
+#define MEMCG_RECLAIM_MAY_DEMOTE (1 << 3)
+#define MEMCG_RECLAIM_ONLY_DEMOTE (1 << 4)
+#define MEMCG_RECLAIM_DEFAULT \
+	(MEMCG_RECLAIM_MAY_SWAP | MEMCG_RECLAIM_MAY_DEMOTE)
+#define MEMCG_RECLAIM_NO_SWAP (MEMCG_RECLAIM_DEFAULT & ~MEMCG_RECLAIM_MAY_SWAP)
+
 extern unsigned long try_to_free_mem_cgroup_pages(struct mem_cgroup *memcg,
 						  unsigned long nr_pages,
 						  gfp_t gfp_mask,
diff --git a/mm/memcontrol.c b/mm/memcontrol.c
index f412c903ee4f..fd4ff1c865a2 100644
--- a/mm/memcontrol.c
+++ b/mm/memcontrol.c
@@ -2392,7 +2392,7 @@ static unsigned long reclaim_high(struct mem_cgroup *memcg,
 		psi_memstall_enter(&pflags);
 		nr_reclaimed += try_to_free_mem_cgroup_pages(memcg, nr_pages,
 							gfp_mask,
-							MEMCG_RECLAIM_MAY_SWAP);
+							MEMCG_RECLAIM_DEFAULT);
 		psi_memstall_leave(&pflags);
 	} while ((memcg = parent_mem_cgroup(memcg)) &&
 		 !mem_cgroup_is_root(memcg));
@@ -2637,7 +2637,7 @@ static int try_charge_memcg(struct mem_cgroup *memcg, gfp_t gfp_mask,
 	struct page_counter *counter;
 	unsigned long nr_reclaimed;
 	bool passed_oom = false;
-	unsigned int reclaim_options = MEMCG_RECLAIM_MAY_SWAP;
+	unsigned int reclaim_options = MEMCG_RECLAIM_DEFAULT;
 	bool drained = false;
 	bool raised_max_event = false;
 	unsigned long pflags;
@@ -3503,7 +3503,8 @@ static int mem_cgroup_resize_max(struct mem_cgroup *memcg,
 		}
 
 		if (!try_to_free_mem_cgroup_pages(memcg, 1, GFP_KERNEL,
-					memsw ? 0 : MEMCG_RECLAIM_MAY_SWAP)) {
+					memsw ? MEMCG_RECLAIM_NO_SWAP :
+					MEMCG_RECLAIM_DEFAULT)) {
 			ret = -EBUSY;
 			break;
 		}
@@ -3614,7 +3615,7 @@ static int mem_cgroup_force_empty(struct mem_cgroup *memcg)
 			return -EINTR;
 
 		if (!try_to_free_mem_cgroup_pages(memcg, 1, GFP_KERNEL,
-						  MEMCG_RECLAIM_MAY_SWAP))
+						  MEMCG_RECLAIM_DEFAULT))
 			nr_retries--;
 	}
@@ -6407,7 +6408,7 @@ static ssize_t memory_high_write(struct kernfs_open_file *of,
 		}
 
 		reclaimed = try_to_free_mem_cgroup_pages(memcg, nr_pages - high,
-					GFP_KERNEL, MEMCG_RECLAIM_MAY_SWAP);
+					GFP_KERNEL, MEMCG_RECLAIM_DEFAULT);
 
 		if (!reclaimed && !nr_retries--)
 			break;
@@ -6456,7 +6457,7 @@ static ssize_t memory_max_write(struct kernfs_open_file *of,
 		if (nr_reclaims) {
 			if (!try_to_free_mem_cgroup_pages(memcg, nr_pages - max,
-					GFP_KERNEL, MEMCG_RECLAIM_MAY_SWAP))
+					GFP_KERNEL, MEMCG_RECLAIM_DEFAULT))
 				nr_reclaims--;
 			continue;
 		}
@@ -6593,7 +6594,8 @@ static ssize_t memory_reclaim(struct kernfs_open_file *of, char *buf,
 	if (err)
 		return err;
 
-	reclaim_options = MEMCG_RECLAIM_MAY_SWAP | MEMCG_RECLAIM_PROACTIVE;
+	reclaim_options = MEMCG_RECLAIM_DEFAULT | MEMCG_RECLAIM_PROACTIVE;
+	reclaim_options &= ~MEMCG_RECLAIM_MAY_DEMOTE;
 	while (nr_reclaimed < nr_to_reclaim) {
 		unsigned long reclaimed;
diff --git a/mm/vmscan.c b/mm/vmscan.c
index d8751e403599..dea05ad8ece5 100644
--- a/mm/vmscan.c
+++ b/mm/vmscan.c
@@ -132,8 +132,14 @@ struct scan_control {
 	/* The file folios on the current node are dangerously low */
 	unsigned int file_is_tiny:1;
 
-	/* Always discard instead of demoting to lower tier memory */
-	unsigned int no_demotion:1;
+	/*
+	 * Configure discard instead of demoting to lower tier memory:
+	 *
+	 * demotion = 0 -> no demotion
+	 * demotion = 1 -> may demote
+	 * demotion = 2 -> only demote
+	 */
+	unsigned int demotion;
 
 #ifdef CONFIG_LRU_GEN
 	/* help kswapd make better choices among multiple memcgs */
@@ -542,7 +548,7 @@ static bool can_demote(int nid, struct scan_control *sc)
 {
 	if (!numa_demotion_enabled)
 		return false;
-	if (sc && sc->no_demotion)
+	if (sc && !sc->demotion)
 		return false;
 	if (next_demotion_node(nid) == NUMA_NO_NODE)
 		return false;
@@ -2674,7 +2680,7 @@ static unsigned int reclaim_folio_list(struct list_head *folio_list,
 		.may_writepage = 1,
 		.may_unmap = 1,
 		.may_swap = 1,
-		.no_demotion = 1,
+		.demotion = 0,
 	};
 
 	nr_reclaimed = shrink_folio_list(folio_list, pgdat, &sc, &dummy_stat, false);
@@ -6726,6 +6732,13 @@ unsigned long try_to_free_mem_cgroup_pages(struct mem_cgroup *memcg,
 	 */
 	struct zonelist *zonelist = node_zonelist(numa_node_id(), sc.gfp_mask);
 
+	if (reclaim_options & MEMCG_RECLAIM_ONLY_DEMOTE)
+		sc.demotion = 2;
+	else if (reclaim_options & MEMCG_RECLAIM_MAY_DEMOTE)
+		sc.demotion = 1;
+	else
+		sc.demotion = 0;
+
 	set_task_reclaim_state(current, &sc.reclaim_state);
 	trace_mm_vmscan_memcg_reclaim_begin(0, sc.gfp_mask);
 	noreclaim_flag = memalloc_noreclaim_save();
Since commit 3f1509c57b1b ("Revert "mm/vmscan: never demote for memcg reclaim""), the proactive reclaim interface memory.reclaim does both reclaim and demotion. This is likely fine for us for latency critical jobs where we would want to disable proactive reclaim entirely, and is also fine for latency tolerant jobs where we would like to both proactively reclaim and demote.

However, for some latency tiers in the middle we would like to demote but not reclaim. This is because reclaim and demotion incur different latency costs to the jobs in the cgroup. Demoted memory would still be addressable by the userspace at a higher latency, but reclaimed memory would need to incur a pagefault.

To address this, I propose having reclaim-only and demotion-only mechanisms in the kernel. There are a couple possible interfaces to carry this out I considered:

1. Disable demotion in the memory.reclaim interface and add a new demotion
   interface (memory.demote).
2. Extend memory.reclaim with a "demote=<int>" flag to configure the
   demotion behavior in the kernel like so:
   - demote=0 would disable demotion from this call.
   - demote=1 would allow the kernel to demote if it desires.
   - demote=2 would only demote if possible but not attempt any other form
     of reclaim.

I've implemented option #1 in this RFC as a sample and would love feedback on the use case and approach.

Additionally, when triggering proactive demotion it is preferable to have a nodelist argument that allows the userspace to proactively demote from specific memory tiers according to its policy. The userspace may want to demote from specific nodes that are under pressure, and may want to demote from specific tiers that are under pressure. For example, in a 3-tier memory system the userspace may want to demote from the second tier without disturbing the hot memory in the top tier.

The current RFC series is missing updates to docs and selftests, but if the general approach and use case are acceptable I plan to add these with the PATCH v1 review.

The first patch in this series disables demotion from the proactive reclaim interface memory.reclaim. Follow-up patches in the series implement the memory.demote interface and its nodelist arg.

Signed-off-by: Mina Almasry <almasrymina@google.com>
---
 include/linux/swap.h |  6 ++++++
 mm/memcontrol.c      | 16 +++++++++-------
 mm/vmscan.c          | 21 +++++++++++++++++----
 3 files changed, 32 insertions(+), 11 deletions(-)
--
2.38.1.584.g0f3c55d4c2-goog
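For completeness, here is what driving the proposed interfaces from userspace might look like. Both memory.demote and its nodelist argument exist only as proposals in this series, so the file names, cgroup path, and syntax below are assumptions taken from the cover letter rather than an existing kernel ABI:

/* Hypothetical usage of the proposed interfaces. */
#include <fcntl.h>
#include <stdio.h>
#include <string.h>
#include <unistd.h>

static int cg_write(const char *path, const char *buf)
{
	int fd = open(path, O_WRONLY);
	ssize_t ret;

	if (fd < 0) {
		perror(path);
		return -1;
	}
	ret = write(fd, buf, strlen(buf));
	close(fd);
	return ret < 0 ? -1 : 0;
}

int main(void)
{
	/* Reclaim without demotion: memory.reclaim after patch 1. */
	cg_write("/sys/fs/cgroup/job/memory.reclaim", "512M");

	/* Demotion only, restricted to the second tier's nodes: the
	 * proposed memory.demote with its proposed nodes= argument. */
	cg_write("/sys/fs/cgroup/job/memory.demote", "512M nodes=2-3");

	return 0;
}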