Message ID: 20191016221148.F9CCD155@viggo.jf.intel.com
Series: Migrate Pages in lieu of discard
On Wed, Oct 16, 2019 at 3:49 PM Dave Hansen <dave.hansen@linux.intel.com> wrote:
>
> We're starting to see systems with more and more kinds of memory such as Intel's implementation of persistent memory.
>
> Let's say you have a system with some DRAM and some persistent memory. Today, once DRAM fills up, reclaim will start and some of the DRAM contents will be thrown out. Allocations will, at some point, start falling over to the slower persistent memory.
>
> That has two nasty properties. First, the newer allocations can end up in the slower persistent memory. Second, reclaimed data in DRAM are just discarded even if there are gobs of space in persistent memory that could be used.
>
> This set implements a solution to these problems. At the end of the reclaim process in shrink_page_list() just before the last page refcount is dropped, the page is migrated to persistent memory instead of being dropped.
>
> While I've talked about a DRAM/PMEM pairing, this approach would function in any environment where memory tiers exist.
>
> This is not perfect. It "strands" pages in slower memory and never brings them back to fast DRAM. Other things need to be built to promote hot pages back to DRAM.
>
> This is part of a larger patch set. If you want to apply these or play with them, I'd suggest using the tree from here. It includes autonuma-based hot page promotion back to DRAM:
>
> http://lkml.kernel.org/r/c3d6de4d-f7c3-b505-2e64-8ee5f70b2118@intel.com
>
> This is also all based on an upstream mechanism that allows persistent memory to be onlined and used as if it were volatile:
>
> http://lkml.kernel.org/r/20190124231441.37A4A305@viggo.jf.intel.com

The memory cgroup part of the story is missing here. Since PMEM is treated as slow DRAM, shouldn't its usage be accounted to the corresponding memcg's memory/memsw counters and the migration should not happen for memcg limit reclaim? Otherwise some jobs can hog the whole PMEM.

Also what happens when PMEM is full? Can the memory migrated to PMEM be reclaimed (or discarded)?

Shakeel
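To make the mechanism in the quoted cover letter concrete, here is a minimal sketch of what "migrate instead of discard" at the end of shrink_page_list() could look like. This is not the posted patch: alloc_demote_page() is an illustrative callback, next_demotion_node() is an assumed helper (see a possible definition later in the thread), and the GFP flags and MR_NUMA_MISPLACED migration reason are assumptions chosen for the sketch.

```c
#include <linux/migrate.h>
#include <linux/mmzone.h>
#include <linux/gfp.h>

/*
 * Hedged sketch, not the posted patch: pages that reach the end of
 * shrink_page_list() are moved to a slower node instead of being freed.
 * next_demotion_node() is an assumed helper returning the next tier for
 * a node, or NUMA_NO_NODE for a "terminal node".
 */
static struct page *alloc_demote_page(struct page *page, unsigned long node)
{
	/*
	 * Allocate strictly on the target node and do not enter reclaim
	 * there: if the slower tier is also full, demotion simply fails
	 * and the page is discarded as before.
	 */
	gfp_t gfp = (GFP_HIGHUSER_MOVABLE | __GFP_THISNODE | __GFP_NOWARN |
		     __GFP_NOMEMALLOC) & ~__GFP_RECLAIM;

	return alloc_pages_node((int)node, gfp, 0);
}

static void demote_page_list(struct list_head *demote_pages,
			     struct pglist_data *pgdat)
{
	int target_nid = next_demotion_node(pgdat->node_id);

	if (list_empty(demote_pages) || target_nid == NUMA_NO_NODE)
		return;

	/*
	 * Demotion is "just" migration toward the slower node; pages that
	 * cannot be migrated stay on the list and fall back to the normal
	 * discard path in shrink_page_list().
	 */
	migrate_pages(demote_pages, alloc_demote_page, NULL,
		      (unsigned long)target_nid, MIGRATE_ASYNC,
		      MR_NUMA_MISPLACED);
}
```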
On 10/16/19 8:45 PM, Shakeel Butt wrote:
> On Wed, Oct 16, 2019 at 3:49 PM Dave Hansen <dave.hansen@linux.intel.com> wrote:
>> This set implements a solution to these problems. At the end of the reclaim process in shrink_page_list() just before the last page refcount is dropped, the page is migrated to persistent memory instead of being dropped.
..
> The memory cgroup part of the story is missing here. Since PMEM is treated as slow DRAM, shouldn't its usage be accounted to the corresponding memcg's memory/memsw counters and the migration should not happen for memcg limit reclaim? Otherwise some jobs can hog the whole PMEM.

My expectation (and I haven't confirmed this) is that any memory use is accounted to the owning cgroup, whether it is DRAM or PMEM. memcg limit reclaim and global reclaim both end up doing migrations and neither should have a net effect on the counters.

There is certainly a problem here because DRAM is a more valuable resource vs. PMEM, and memcg accounts for them as if they were equally valuable. I really want to see memcg account for this cost discrepancy at some point, but I'm not quite sure what form it would take. Any feedback from you heavy memcg users out there would be much appreciated.

> Also what happens when PMEM is full? Can the memory migrated to PMEM be reclaimed (or discarded)?

Yep. The "migration path" can be as long as you want, but once the data hits a "terminal node" it will stop getting migrated and normal discard at the end of reclaim happens.
On Thu, Oct 17, 2019 at 7:14 AM Dave Hansen <dave.hansen@linux.intel.com> wrote:
>
> We're starting to see systems with more and more kinds of memory such as Intel's implementation of persistent memory.
>
> Let's say you have a system with some DRAM and some persistent memory. Today, once DRAM fills up, reclaim will start and some of the DRAM contents will be thrown out. Allocations will, at some point, start falling over to the slower persistent memory.
>
> That has two nasty properties. First, the newer allocations can end up in the slower persistent memory. Second, reclaimed data in DRAM are just discarded even if there are gobs of space in persistent memory that could be used.
>
> This set implements a solution to these problems. At the end of the reclaim process in shrink_page_list() just before the last page refcount is dropped, the page is migrated to persistent memory instead of being dropped.
>
> While I've talked about a DRAM/PMEM pairing, this approach would function in any environment where memory tiers exist.
>
> This is not perfect. It "strands" pages in slower memory and never brings them back to fast DRAM. Other things need to be built to promote hot pages back to DRAM.
>
> This is part of a larger patch set. If you want to apply these or play with them, I'd suggest using the tree from here. It includes autonuma-based hot page promotion back to DRAM:
>
> http://lkml.kernel.org/r/c3d6de4d-f7c3-b505-2e64-8ee5f70b2118@intel.com
>
> This is also all based on an upstream mechanism that allows persistent memory to be onlined and used as if it were volatile:
>
> http://lkml.kernel.org/r/20190124231441.37A4A305@viggo.jf.intel.com

We prototyped something very similar to this patch series in the past.

One problem that came up is that if you get into direct reclaim, because persistent memory can have pretty low write throughput, you can end up stalling users for a pretty long time while migrating pages.

To mitigate that, we tried changing background reclaim to start migrating much earlier (but not otherwise reclaiming); however, it drastically increased the code complexity and still had the chance of not being able to catch up with pressure.

Because of that, we moved to a solution based on the proactive reclaim of idle pages, that was presented at LSFMM earlier this year: https://lwn.net/Articles/787611/.

-- Suleiman
On 10/17/19 9:01 AM, Suleiman Souhlal wrote:
> One problem that came up is that if you get into direct reclaim, because persistent memory can have pretty low write throughput, you can end up stalling users for a pretty long time while migrating pages.

Basically, you're saying that memory load spikes turn into latency spikes?

FWIW, we have been benchmarking this sucker with benchmarks that claim to care about latency. In general, compared to DRAM, we do see worse latency, but nothing catastrophic yet. I'd be interested if you have any workloads that act as reasonable proxies for your latency requirements.

> Because of that, we moved to a solution based on the proactive reclaim of idle pages, that was presented at LSFMM earlier this year: https://lwn.net/Articles/787611/.

I saw the presentation. The feedback in the room as I remember it was that proactive reclaim essentially replaced the existing reclaim mechanism, to which the audience was not receptive. Have folks' opinions changed on that, or are you looking for other solutions?
On Thu, Oct 17, 2019 at 9:32 AM Dave Hansen <dave.hansen@intel.com> wrote:
>
> On 10/17/19 9:01 AM, Suleiman Souhlal wrote:
> > One problem that came up is that if you get into direct reclaim, because persistent memory can have pretty low write throughput, you can end up stalling users for a pretty long time while migrating pages.
>
> Basically, you're saying that memory load spikes turn into latency spikes?
>
> FWIW, we have been benchmarking this sucker with benchmarks that claim to care about latency. In general, compared to DRAM, we do see worse latency, but nothing catastrophic yet. I'd be interested if you have any workloads that act as reasonable proxies for your latency requirements.
>
> > Because of that, we moved to a solution based on the proactive reclaim of idle pages, that was presented at LSFMM earlier this year: https://lwn.net/Articles/787611/.
>
> I saw the presentation. The feedback in the room as I remember it was that proactive reclaim essentially replaced the existing reclaim mechanism, to which the audience was not receptive. Have folks' opinions changed on that, or are you looking for other solutions?

I am currently working on a solution which shares the mechanisms between regular and proactive reclaim. The interested users/admins can set up proactive reclaim; otherwise the regular reclaim will kick in on low memory. I will have something in one/two months and will post the patches.

Shakeel
On Thu, Oct 17, 2019 at 7:26 AM Dave Hansen <dave.hansen@intel.com> wrote:
>
> On 10/16/19 8:45 PM, Shakeel Butt wrote:
> > On Wed, Oct 16, 2019 at 3:49 PM Dave Hansen <dave.hansen@linux.intel.com> wrote:
> >> This set implements a solution to these problems. At the end of the reclaim process in shrink_page_list() just before the last page refcount is dropped, the page is migrated to persistent memory instead of being dropped.
> ..
> > The memory cgroup part of the story is missing here. Since PMEM is treated as slow DRAM, shouldn't its usage be accounted to the corresponding memcg's memory/memsw counters and the migration should not happen for memcg limit reclaim? Otherwise some jobs can hog the whole PMEM.
>
> My expectation (and I haven't confirmed this) is that any memory use is accounted to the owning cgroup, whether it is DRAM or PMEM. memcg limit reclaim and global reclaim both end up doing migrations and neither should have a net effect on the counters.

Hmm, I didn't see the memcg charge migration in the code on demotion. So, in the code [patch 3] the counters are being decremented as DRAM is freed but not incremented for PMEM.

> There is certainly a problem here because DRAM is a more valuable resource vs. PMEM, and memcg accounts for them as if they were equally valuable. I really want to see memcg account for this cost discrepancy at some point, but I'm not quite sure what form it would take. Any feedback from you heavy memcg users out there would be much appreciated.

There are two apparent use-cases for PMEM, i.e. explicit (apps moving their pages to PMEM to reduce cost) and implicit (admin moves cold pages to PMEM transparently to the apps). In the implicit case, I see both DRAM and PMEM as the same resource from the perspective of memcg limits (i.e. same memcg counter, something like cgroup v1's memsw). For the explicit case, maybe separate counters make sense, like cgroup v2's memory and swap.

> > Also what happens when PMEM is full? Can the memory migrated to PMEM be reclaimed (or discarded)?
>
> Yep. The "migration path" can be as long as you want, but once the data hits a "terminal node" it will stop getting migrated and normal discard at the end of reclaim happens.

I might have missed it, but I didn't see the migrated pages inserted back onto the LRUs. If they are not on an LRU, the reclaimer will never see them.
On Thu, Oct 17, 2019 at 7:26 AM Dave Hansen <dave.hansen@intel.com> wrote:
>
> On 10/16/19 8:45 PM, Shakeel Butt wrote:
> > On Wed, Oct 16, 2019 at 3:49 PM Dave Hansen <dave.hansen@linux.intel.com> wrote:
> >> This set implements a solution to these problems. At the end of the reclaim process in shrink_page_list() just before the last page refcount is dropped, the page is migrated to persistent memory instead of being dropped.
> ..
> > The memory cgroup part of the story is missing here. Since PMEM is treated as slow DRAM, shouldn't its usage be accounted to the corresponding memcg's memory/memsw counters and the migration should not happen for memcg limit reclaim? Otherwise some jobs can hog the whole PMEM.
>
> My expectation (and I haven't confirmed this) is that any memory use is accounted to the owning cgroup, whether it is DRAM or PMEM. memcg limit reclaim and global reclaim both end up doing migrations and neither should have a net effect on the counters.

Yes, your expectation is correct. As long as PMEM is a NUMA node, it is treated as regular memory by memcg. But, I don't think memcg limit reclaim should do migration since limit reclaim is used to reduce memory usage, but migration doesn't reduce usage, it just moves memory from one node to the other.

In my implementation, I just skip migration for memcg limit reclaim, please see: https://lore.kernel.org/linux-mm/1560468577-101178-7-git-send-email-yang.shi@linux.alibaba.com/

> There is certainly a problem here because DRAM is a more valuable resource vs. PMEM, and memcg accounts for them as if they were equally valuable. I really want to see memcg account for this cost discrepancy at some point, but I'm not quite sure what form it would take. Any feedback from you heavy memcg users out there would be much appreciated.

We did have some demands to control the ratio between DRAM and PMEM as I mentioned at LSF/MM. Mel Gorman did suggest making memcg account for DRAM and PMEM respectively, or something similar.

> > Also what happens when PMEM is full? Can the memory migrated to PMEM be reclaimed (or discarded)?
>
> Yep. The "migration path" can be as long as you want, but once the data hits a "terminal node" it will stop getting migrated and normal discard at the end of reclaim happens.

I recall I had a hallway conversation with Keith about this at LSF/MM. We all agree there should not be a cycle. But, IMHO, I don't think exporting the migration path to userspace (or letting users define the migration path) and having multiple migration stops are good ideas in general.
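The distinction Yang Shi draws here reduces to a single check in the reclaim path: demote on global (node) reclaim, but not when reclaim is driven by a memcg limit, since moving a page to PMEM does not lower the cgroup's usage. A hedged sketch of that check follows; it assumes the vmscan-internal struct scan_control with its usual target_mem_cgroup field, the helper name is made up, and it mirrors neither patch set exactly.

```c
/*
 * Hedged sketch of "skip migration for memcg limit reclaim".  Limit
 * reclaim has to reduce the cgroup's charged usage; demotion only moves
 * the charge-carrying page from DRAM to PMEM, so it is pointless there.
 * scan_control is the vmscan-internal structure; the helper name is
 * illustrative only.
 */
static bool reclaim_may_demote(struct scan_control *sc,
			       struct pglist_data *pgdat)
{
	/* memcg limit reclaim: migration would not lower usage, skip it. */
	if (sc->target_mem_cgroup)
		return false;

	/* Terminal node: there is nothing slower to demote to. */
	if (next_demotion_node(pgdat->node_id) == NUMA_NO_NODE)
		return false;

	return true;
}
```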
On 10/17/19 9:58 AM, Shakeel Butt wrote:
>> My expectation (and I haven't confirmed this) is that any memory use is accounted to the owning cgroup, whether it is DRAM or PMEM. memcg limit reclaim and global reclaim both end up doing migrations and neither should have a net effect on the counters.
>
> Hmm, I didn't see the memcg charge migration in the code on demotion. So, in the code [patch 3] the counters are being decremented as DRAM is freed but not incremented for PMEM.

I had assumed that the migration code was doing this for me. I'll go make sure either way.
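For readers following the accounting question: mem_cgroup_migrate() is the existing memcg API for handing a charge from an old page to its replacement, and the generic migration copy path is expected to call it; whether the demotion path actually reaches it is exactly what Dave says he will verify. A hedged sketch of the invariant being discussed, with an illustrative helper name:

```c
#include <linux/memcontrol.h>
#include <linux/mm.h>

/*
 * Hedged sketch of the invariant under discussion: after a successful
 * demotion the memcg charge should have moved with the data, so the
 * owning cgroup's usage is neither dropped nor double-counted (DRAM and
 * PMEM land in the same counter).  Helper name is illustrative.
 */
static void demote_transfer_charge(struct page *old, struct page *new)
{
	/*
	 * mem_cgroup_migrate() makes @new accounted to @old's cgroup, the
	 * same call the generic migration copy path is expected to make.
	 * If a demotion path bypassed it, the symptom would be what
	 * Shakeel describes: DRAM usage decremented on free, no matching
	 * increment for the PMEM copy.
	 */
	mem_cgroup_migrate(old, new);
}
```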
On 10/17/19 10:20 AM, Yang Shi wrote:
> On Thu, Oct 17, 2019 at 7:26 AM Dave Hansen <dave.hansen@intel.com> wrote:
>> My expectation (and I haven't confirmed this) is that any memory use is accounted to the owning cgroup, whether it is DRAM or PMEM. memcg limit reclaim and global reclaim both end up doing migrations and neither should have a net effect on the counters.
>
> Yes, your expectation is correct. As long as PMEM is a NUMA node, it is treated as regular memory by memcg. But, I don't think memcg limit reclaim should do migration since limit reclaim is used to reduce memory usage, but migration doesn't reduce usage, it just moves memory from one node to the other.
>
> In my implementation, I just skip migration for memcg limit reclaim, please see: https://lore.kernel.org/linux-mm/1560468577-101178-7-git-send-email-yang.shi@linux.alibaba.com/

Ahh, got it. That does make sense. I might have to steal your implementation.
On Thu, Oct 17, 2019 at 10:20 AM Yang Shi <shy828301@gmail.com> wrote:
>
> On Thu, Oct 17, 2019 at 7:26 AM Dave Hansen <dave.hansen@intel.com> wrote:
> >
> > On 10/16/19 8:45 PM, Shakeel Butt wrote:
> > > On Wed, Oct 16, 2019 at 3:49 PM Dave Hansen <dave.hansen@linux.intel.com> wrote:
> > >> This set implements a solution to these problems. At the end of the reclaim process in shrink_page_list() just before the last page refcount is dropped, the page is migrated to persistent memory instead of being dropped.
> > ..
> > > The memory cgroup part of the story is missing here. Since PMEM is treated as slow DRAM, shouldn't its usage be accounted to the corresponding memcg's memory/memsw counters and the migration should not happen for memcg limit reclaim? Otherwise some jobs can hog the whole PMEM.
> >
> > My expectation (and I haven't confirmed this) is that any memory use is accounted to the owning cgroup, whether it is DRAM or PMEM. memcg limit reclaim and global reclaim both end up doing migrations and neither should have a net effect on the counters.
>
> Yes, your expectation is correct. As long as PMEM is a NUMA node, it is treated as regular memory by memcg. But, I don't think memcg limit reclaim should do migration since limit reclaim is used to reduce memory usage, but migration doesn't reduce usage, it just moves memory from one node to the other.
>
> In my implementation, I just skip migration for memcg limit reclaim, please see: https://lore.kernel.org/linux-mm/1560468577-101178-7-git-send-email-yang.shi@linux.alibaba.com/
>
> > There is certainly a problem here because DRAM is a more valuable resource vs. PMEM, and memcg accounts for them as if they were equally valuable. I really want to see memcg account for this cost discrepancy at some point, but I'm not quite sure what form it would take. Any feedback from you heavy memcg users out there would be much appreciated.
>
> We did have some demands to control the ratio between DRAM and PMEM as I mentioned at LSF/MM. Mel Gorman did suggest making memcg account for DRAM and PMEM respectively, or something similar.

Can you please describe how you plan to use this ratio? Are applications supposed to use this ratio or will the admins be adjusting it? Also, should it be dynamically updated based on the workload, i.e. as the working set or hot pages grow we want more DRAM and as cold pages grow we want more PMEM? Basically I am trying to see if we have something like smart auto-numa balancing to fulfill your use-case.

> > > Also what happens when PMEM is full? Can the memory migrated to PMEM be reclaimed (or discarded)?
> >
> > Yep. The "migration path" can be as long as you want, but once the data hits a "terminal node" it will stop getting migrated and normal discard at the end of reclaim happens.
>
> I recall I had a hallway conversation with Keith about this at LSF/MM. We all agree there should not be a cycle. But, IMHO, I don't think exporting the migration path to userspace (or letting users define the migration path) and having multiple migration stops are good ideas in general.
On Wed 16-10-19 15:11:48, Dave Hansen wrote:
> We're starting to see systems with more and more kinds of memory such as Intel's implementation of persistent memory.
>
> Let's say you have a system with some DRAM and some persistent memory. Today, once DRAM fills up, reclaim will start and some of the DRAM contents will be thrown out. Allocations will, at some point, start falling over to the slower persistent memory.
>
> That has two nasty properties. First, the newer allocations can end up in the slower persistent memory. Second, reclaimed data in DRAM are just discarded even if there are gobs of space in persistent memory that could be used.
>
> This set implements a solution to these problems. At the end of the reclaim process in shrink_page_list() just before the last page refcount is dropped, the page is migrated to persistent memory instead of being dropped.
>
> While I've talked about a DRAM/PMEM pairing, this approach would function in any environment where memory tiers exist.
>
> This is not perfect. It "strands" pages in slower memory and never brings them back to fast DRAM. Other things need to be built to promote hot pages back to DRAM.
>
> This is part of a larger patch set. If you want to apply these or play with them, I'd suggest using the tree from here. It includes autonuma-based hot page promotion back to DRAM:
>
> http://lkml.kernel.org/r/c3d6de4d-f7c3-b505-2e64-8ee5f70b2118@intel.com
>
> This is also all based on an upstream mechanism that allows persistent memory to be onlined and used as if it were volatile:
>
> http://lkml.kernel.org/r/20190124231441.37A4A305@viggo.jf.intel.com

How does this compare to http://lkml.kernel.org/r/1560468577-101178-1-git-send-email-yang.shi@linux.alibaba.com?
On Fri, Oct 18, 2019 at 1:32 AM Dave Hansen <dave.hansen@intel.com> wrote:
>
> On 10/17/19 9:01 AM, Suleiman Souhlal wrote:
> > One problem that came up is that if you get into direct reclaim, because persistent memory can have pretty low write throughput, you can end up stalling users for a pretty long time while migrating pages.
>
> Basically, you're saying that memory load spikes turn into latency spikes?

Yes, exactly.

> FWIW, we have been benchmarking this sucker with benchmarks that claim to care about latency. In general, compared to DRAM, we do see worse latency, but nothing catastrophic yet. I'd be interested if you have any workloads that act as reasonable proxies for your latency requirements.

Sorry, I don't know of any specific workloads I can share. :-( Maybe Jonathan or Shakeel have something more.

I realize it's not very useful without giving specific examples, but even disregarding persistent memory, we've had latency issues with direct reclaim when using zswap. It's been such a problem that we're conducting experiments with not doing zswap compression in direct reclaim (but still doing it proactively). The low write throughput of persistent memory would make this worse.

I think the case where we're most likely to run into this is when the machine is close to an OOM situation and we end up thrashing rather than OOM killing.

Somewhat related, I noticed that this patch series ratelimits migrations from persistent memory to DRAM, but it might also make sense to ratelimit migrations from DRAM to persistent memory. If all the write bandwidth is taken by migrations, there might not be any more available for applications accessing pages in persistent memory, resulting in higher latency.

Another issue we ran into, that I think might also apply to this patch series, is that because kernel memory can't be allocated on persistent memory, it's possible for all of DRAM to get filled by user memory and have kernel allocations fail even though there is still a lot of free persistent memory. This is easy to trigger, just start an application that is bigger than DRAM.

To mitigate that, we introduced a new watermark for DRAM zones above which user memory can't be allocated, to leave some space for kernel allocations.

-- Suleiman
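The extra-watermark mitigation in the last paragraph can be sketched roughly as below. This is an assumption-laden illustration: there is no upstream "user" watermark, the headroom value is a made-up tunable, and a real implementation would presumably hook into the zone_watermark_ok() machinery rather than a standalone helper.

```c
#include <linux/mmzone.h>
#include <linux/vmstat.h>
#include <linux/gfp.h>

/*
 * Hedged sketch of the mitigation Suleiman describes: keep headroom on
 * DRAM zones for kernel allocations, because kernel memory cannot live
 * on the PMEM node.  The "user watermark" is hypothetical, not an
 * existing kernel watermark.
 */
static bool dram_user_alloc_allowed(struct zone *zone, gfp_t gfp_mask,
				    unsigned long user_wmark_pages)
{
	unsigned long free = zone_page_state(zone, NR_FREE_PAGES);

	/* Kernel (non-movable) allocations may dip below the headroom. */
	if (!(gfp_mask & __GFP_MOVABLE))
		return true;

	/*
	 * User (movable) allocations must leave user_wmark_pages free in
	 * DRAM; beyond that they fall over to the PMEM node or wait for
	 * reclaim/demotion to make room.
	 */
	return free > user_wmark_pages;
}
```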
On 10/18/19 12:44 AM, Michal Hocko wrote:
> How does this compare to
> http://lkml.kernel.org/r/1560468577-101178-1-git-send-email-yang.shi@linux.alibaba.com

It's a _bit_ more tied to persistent memory and it appears a bit more tied to two tiers rather than something arbitrarily deep. They're pretty similar conceptually although there are quite a few differences.

For instance, what I posted has a static mapping for the migration path. If node A is in reclaim, we always try to allocate pages on node B. There are no restrictions on what those nodes can be. In Yang Shi's approach, there's a dynamic search for a target migration node on each migration that follows the normal alloc fallback path. This ends up making migration nodes special.

There are also some different choices that are pretty arbitrary. For instance, when you allocate a migration target page, should you cause memory pressure on the target?

To be honest, though, I don't see anything fatally flawed with it. It's probably a useful exercise to factor out the common bits from the two sets and see what we can agree on being absolutely necessary.
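The static mapping Dave describes ("if node A is in reclaim, we always try to allocate pages on node B") amounts to a small per-node lookup table, and is also the next_demotion_node() helper assumed in the earlier shrink_page_list() sketch. Again a hedged sketch rather than the posted code; the array name and setup hook are illustrative.

```c
#include <linux/nodemask.h>
#include <linux/numa.h>

/*
 * Hedged sketch of the static migration path: at most one demotion
 * target per node, established once (boot or memory hotplug) and merely
 * looked up at reclaim time.  Terminal nodes keep NUMA_NO_NODE and fall
 * back to normal discard.
 */
static int node_demotion[MAX_NUMNODES] = {
	[0 ... MAX_NUMNODES - 1] = NUMA_NO_NODE,
};

static int next_demotion_node(int node)
{
	/* "If node A is in reclaim, we always try to allocate on node B." */
	return node_demotion[node];
}

static void set_demotion_target(int from_node, int to_node)
{
	if (node_online(to_node))
		node_demotion[from_node] = to_node;
}
```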
On 10/18/19 1:11 AM, Suleiman Souhlal wrote:
> Another issue we ran into, that I think might also apply to this patch series, is that because kernel memory can't be allocated on persistent memory, it's possible for all of DRAM to get filled by user memory and have kernel allocations fail even though there is still a lot of free persistent memory. This is easy to trigger, just start an application that is bigger than DRAM.

Why doesn't this happen on everyone's laptops where DRAM is contended between userspace and kernel allocations? Does the OOM killer trigger fast enough to save us?

> To mitigate that, we introduced a new watermark for DRAM zones above which user memory can't be allocated, to leave some space for kernel allocations.

I'd be curious why the existing users of ZONE_MOVABLE don't have to do this? Are there just no users of ZONE_MOVABLE?
On Sat, Oct 19, 2019 at 12:10 AM Dave Hansen <dave.hansen@intel.com> wrote:
>
> On 10/18/19 1:11 AM, Suleiman Souhlal wrote:
> > Another issue we ran into, that I think might also apply to this patch series, is that because kernel memory can't be allocated on persistent memory, it's possible for all of DRAM to get filled by user memory and have kernel allocations fail even though there is still a lot of free persistent memory. This is easy to trigger, just start an application that is bigger than DRAM.
>
> Why doesn't this happen on everyone's laptops where DRAM is contended between userspace and kernel allocations? Does the OOM killer trigger fast enough to save us?

Well in this case, there is plenty of free persistent memory on the machine, but not any free DRAM to allocate kernel memory. In the situation I'm describing, we end up OOMing when we, in my opinion, shouldn't.

> > To mitigate that, we introduced a new watermark for DRAM zones above which user memory can't be allocated, to leave some space for kernel allocations.
>
> I'd be curious why the existing users of ZONE_MOVABLE don't have to do this? Are there just no users of ZONE_MOVABLE?

That's an excellent question for which I don't currently have an answer. I haven't had the chance to test your patch series, and it's possible that it doesn't suffer from the issue.

-- Suleiman
On Fri, Oct 18, 2019 at 7:54 AM Dave Hansen <dave.hansen@intel.com> wrote:
>
> On 10/18/19 12:44 AM, Michal Hocko wrote:
> > How does this compare to
> > http://lkml.kernel.org/r/1560468577-101178-1-git-send-email-yang.shi@linux.alibaba.com
>
> It's a _bit_ more tied to persistent memory and it appears a bit more tied to two tiers rather than something arbitrarily deep. They're pretty similar conceptually although there are quite a few differences.

My patches do assume two tiers for now but it is not hard to extend to multiple tiers. Since it is an RFC, I didn't make it that complicated.

However, IMHO I really don't think supporting multiple tiers by making the migration path configurable to admins or users is a good choice. Memory migration caused by compaction or reclaim (not via syscall) should be transparent to the users; it is kernel-internal activity. It shouldn't be exposed to the end users.

I'd prefer the firmware or OS build the migration path, personally.

> For instance, what I posted has a static mapping for the migration path. If node A is in reclaim, we always try to allocate pages on node B. There are no restrictions on what those nodes can be. In Yang Shi's approach, there's a dynamic search for a target migration node on each migration that follows the normal alloc fallback path. This ends up making migration nodes special.

The reason that I didn't pursue static mapping is that the node might be offlined or onlined, so you have to keep the mapping right every time the node state is changed. Dynamic search just returns the closest migration target node no matter what the topology is. It should not be time consuming.

Actually, my patches don't require the migration target node to be PMEM; it could be any memory tier lower than DRAM, but it just happens that PMEM is the only available medium. My patch's commit log explains this point. Again, I'd really prefer that the firmware, or the HMAT/ACPI driver, build the migration path in the kernel.

In addition, DRAM nodes are definitely excluded as migration targets since I don't think doing such migration between DRAM nodes is a good idea in general.

> There are also some different choices that are pretty arbitrary. For instance, when you allocate a migration target page, should you cause memory pressure on the target?

Yes, those are definitely arbitrary. We do need to sort out a lot of details in the future by figuring out how real-life workloads work.

> To be honest, though, I don't see anything fatally flawed with it. It's probably a useful exercise to factor out the common bits from the two sets and see what we can agree on being absolutely necessary.

Sure, that definitely would help us move forward.
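For contrast with the static table sketched earlier, here is a hedged sketch of the dynamic lookup Yang Shi describes: return the closest online memory node that is not itself a DRAM/CPU node, whatever the topology. The "has memory but no CPUs" test is only a rough stand-in for a PMEM-like node; as Yang Shi suggests, a real implementation would rely on firmware-described properties (e.g. the HMAT), and the helper name is made up.

```c
#include <linux/nodemask.h>
#include <linux/topology.h>
#include <linux/kernel.h>

/*
 * Hedged sketch of the dynamic approach: pick the nearest node that can
 * act as a demotion target.  "Slower tier" is approximated here as "has
 * memory but no CPUs", which happens to match PMEM-only nodes; real code
 * would use firmware-described memory attributes instead.
 */
static int find_demotion_target(int from_node)
{
	int nid, best = NUMA_NO_NODE;
	int best_dist = INT_MAX;

	for_each_node_state(nid, N_MEMORY) {
		/* Skip the source node and anything with local CPUs. */
		if (nid == from_node || node_state(nid, N_CPU))
			continue;

		if (node_distance(from_node, nid) < best_dist) {
			best_dist = node_distance(from_node, nid);
			best = nid;
		}
	}

	return best;
}
```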
On Thu, Oct 17, 2019 at 3:58 PM Shakeel Butt <shakeelb@google.com> wrote:
>
> On Thu, Oct 17, 2019 at 10:20 AM Yang Shi <shy828301@gmail.com> wrote:
> >
> > On Thu, Oct 17, 2019 at 7:26 AM Dave Hansen <dave.hansen@intel.com> wrote:
> > >
> > > On 10/16/19 8:45 PM, Shakeel Butt wrote:
> > > > On Wed, Oct 16, 2019 at 3:49 PM Dave Hansen <dave.hansen@linux.intel.com> wrote:
> > > >> This set implements a solution to these problems. At the end of the reclaim process in shrink_page_list() just before the last page refcount is dropped, the page is migrated to persistent memory instead of being dropped.
> > > ..
> > > > The memory cgroup part of the story is missing here. Since PMEM is treated as slow DRAM, shouldn't its usage be accounted to the corresponding memcg's memory/memsw counters and the migration should not happen for memcg limit reclaim? Otherwise some jobs can hog the whole PMEM.
> > >
> > > My expectation (and I haven't confirmed this) is that any memory use is accounted to the owning cgroup, whether it is DRAM or PMEM. memcg limit reclaim and global reclaim both end up doing migrations and neither should have a net effect on the counters.
> >
> > Yes, your expectation is correct. As long as PMEM is a NUMA node, it is treated as regular memory by memcg. But, I don't think memcg limit reclaim should do migration since limit reclaim is used to reduce memory usage, but migration doesn't reduce usage, it just moves memory from one node to the other.
> >
> > In my implementation, I just skip migration for memcg limit reclaim, please see: https://lore.kernel.org/linux-mm/1560468577-101178-7-git-send-email-yang.shi@linux.alibaba.com/
> >
> > > There is certainly a problem here because DRAM is a more valuable resource vs. PMEM, and memcg accounts for them as if they were equally valuable. I really want to see memcg account for this cost discrepancy at some point, but I'm not quite sure what form it would take. Any feedback from you heavy memcg users out there would be much appreciated.
> >
> > We did have some demands to control the ratio between DRAM and PMEM as I mentioned at LSF/MM. Mel Gorman did suggest making memcg account for DRAM and PMEM respectively, or something similar.
>
> Can you please describe how you plan to use this ratio? Are applications supposed to use this ratio or will the admins be adjusting it? Also, should it be dynamically updated based on the workload, i.e. as the working set or hot pages grow we want more DRAM and as cold pages grow we want more PMEM? Basically I am trying to see if we have something like smart auto-numa balancing to fulfill your use-case.

We thought it should be controlled by admins and transparent to the end users. The ratio is fixed, but the memory could be moved between DRAM and PMEM dynamically as long as it doesn't exceed the ratio, so that we could keep warmer data in DRAM and colder data in PMEM. I talked about this at LSF/MM; please check this out: https://lwn.net/Articles/787418/

> > > > Also what happens when PMEM is full? Can the memory migrated to PMEM be reclaimed (or discarded)?
> > >
> > > Yep. The "migration path" can be as long as you want, but once the data hits a "terminal node" it will stop getting migrated and normal discard at the end of reclaim happens.
> >
> > I recall I had a hallway conversation with Keith about this at LSF/MM. We all agree there should not be a cycle. But, IMHO, I don't think exporting the migration path to userspace (or letting users define the migration path) and having multiple migration stops are good ideas in general.
On Fri, Oct 18, 2019 at 2:40 PM Yang Shi <shy828301@gmail.com> wrote:
>
> On Fri, Oct 18, 2019 at 7:54 AM Dave Hansen <dave.hansen@intel.com> wrote:
> >
> > On 10/18/19 12:44 AM, Michal Hocko wrote:
> > > How does this compare to
> > > http://lkml.kernel.org/r/1560468577-101178-1-git-send-email-yang.shi@linux.alibaba.com
> >
> > It's a _bit_ more tied to persistent memory and it appears a bit more tied to two tiers rather than something arbitrarily deep. They're pretty similar conceptually although there are quite a few differences.
>
> My patches do assume two tiers for now but it is not hard to extend to multiple tiers. Since it is an RFC, I didn't make it that complicated.
>
> However, IMHO I really don't think supporting multiple tiers by making the migration path configurable to admins or users is a good choice.

It's an optional override, not a user requirement.

> Memory migration caused by compaction or reclaim (not via syscall) should be transparent to the users; it is kernel-internal activity. It shouldn't be exposed to the end users.
>
> I'd prefer the firmware or OS build the migration path, personally.

The OS can't; it can only trust platform firmware to tell it the memory properties. The BIOS likely gets the tables right most of the time, and the OS can assume they are correct, but when things inevitably go wrong a user override is needed. That override is more usable as an explicit migration path rather than requiring users to manually craft and inject custom ACPI tables. I otherwise do not see the substance behind this objection to a migration path override.
On Fri 18-10-19 07:54:20, Dave Hansen wrote:
> On 10/18/19 12:44 AM, Michal Hocko wrote:
> > How does this compare to
> > http://lkml.kernel.org/r/1560468577-101178-1-git-send-email-yang.shi@linux.alibaba.com
>
> It's a _bit_ more tied to persistent memory and it appears a bit more tied to two tiers rather than something arbitrarily deep. They're pretty similar conceptually although there are quite a few differences.
>
> For instance, what I posted has a static mapping for the migration path. If node A is in reclaim, we always try to allocate pages on node B. There are no restrictions on what those nodes can be. In Yang Shi's approach, there's a dynamic search for a target migration node on each migration that follows the normal alloc fallback path. This ends up making migration nodes special.

As we discussed at LSFMM this year, and there seemed to be a good consensus on that, the resulting implementation should be as pmem-neutral as possible. After all, a node migration mode sounds like a reasonable feature even without pmem. So I would be more inclined to the normal alloc fallback path rather than a very specific and static migration fallback path. If that turns out to be impractical then, sure, let's come up with something more specific, but I think there is quite a long route there because we do not really have much experience with this so far.

> There are also some different choices that are pretty arbitrary. For instance, when you allocate a migration target page, should you cause memory pressure on the target?

Those are details to really sort out and they require some experimentation, too.

> To be honest, though, I don't see anything fatally flawed with it. It's probably a useful exercise to factor out the common bits from the two sets and see what we can agree on being absolutely necessary.

Makes sense. What would that be? Is there a real consensus on having the new node_reclaim mode be the configuration mechanism? Do we want to support generic NUMA without any PMEM in place as well, for starters?

Thanks!