[0/4,RFC] Migrate Pages in lieu of discard

Message ID 20191016221148.F9CCD155@viggo.jf.intel.com
Series Migrate Pages in lieu of discard

Message

Dave Hansen Oct. 16, 2019, 10:11 p.m. UTC
We're starting to see systems with more and more kinds of memory such
as Intel's implementation of persistent memory.

Let's say you have a system with some DRAM and some persistent memory.
Today, once DRAM fills up, reclaim will start and some of the DRAM
contents will be thrown out.  Allocations will, at some point, start
falling over to the slower persistent memory.

That has two nasty properties.  First, the newer allocations can end
up in the slower persistent memory.  Second, reclaimed data in DRAM
are just discarded even if there are gobs of space in persistent
memory that could be used.

This set implements a solution to these problems.  At the end of the
reclaim process in shrink_page_list() just before the last page
refcount is dropped, the page is migrated to persistent memory instead
of being dropped.
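
Conceptually the hook is tiny.  A rough, untested sketch of what happens
to the pages reclaim has decided to drop (not the actual patch; the
helper names here are made up):

	#include <linux/migrate.h>
	#include <linux/gfp.h>
	#include <linux/mm.h>

	/* Hypothetical new_page_t callback: allocate the destination page
	 * on the target (PMEM) node. */
	static struct page *alloc_demote_page(struct page *page, unsigned long nid)
	{
		return alloc_pages_node((int)nid, GFP_HIGHUSER_MOVABLE |
					__GFP_THISNODE | __GFP_NOWARN, 0);
	}

	/* Sketch only, untested: move the data to the slower node instead
	 * of discarding it at the end of shrink_page_list(). */
	static void demote_page_list(struct list_head *demote_pages, int target_nid)
	{
		if (list_empty(demote_pages) || target_nid == NUMA_NO_NODE)
			return;

		migrate_pages(demote_pages, alloc_demote_page, NULL,
			      (unsigned long)target_nid, MIGRATE_ASYNC,
			      MR_NUMA_MISPLACED /* reason code reused for the sketch */);

		/* Pages that failed to migrate go back and take the normal path. */
		putback_movable_pages(demote_pages);
	}

The real series obviously has to handle isolation, reference counts and
the failure paths, but that is the shape of it.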

While I've talked about a DRAM/PMEM pairing, this approach would
function in any environment where memory tiers exist.

This is not perfect.  It "strands" pages in slower memory and never
brings them back to fast DRAM.  Other things need to be built to
promote hot pages back to DRAM.

This is part of a larger patch set.  If you want to apply these or
play with them, I'd suggest using the tree from here.  It includes
autonuma-based hot page promotion back to DRAM:

	http://lkml.kernel.org/r/c3d6de4d-f7c3-b505-2e64-8ee5f70b2118@intel.com

This is also all based on an upstream mechanism that allows
persistent memory to be onlined and used as if it were volatile:

	http://lkml.kernel.org/r/20190124231441.37A4A305@viggo.jf.intel.com

Comments

Shakeel Butt Oct. 17, 2019, 3:45 a.m. UTC | #1
On Wed, Oct 16, 2019 at 3:49 PM Dave Hansen <dave.hansen@linux.intel.com> wrote:
>
> We're starting to see systems with more and more kinds of memory such
> as Intel's implementation of persistent memory.
>
> Let's say you have a system with some DRAM and some persistent memory.
> Today, once DRAM fills up, reclaim will start and some of the DRAM
> contents will be thrown out.  Allocations will, at some point, start
> falling over to the slower persistent memory.
>
> That has two nasty properties.  First, the newer allocations can end
> up in the slower persistent memory.  Second, reclaimed data in DRAM
> are just discarded even if there are gobs of space in persistent
> memory that could be used.
>
> This set implements a solution to these problems.  At the end of the
> reclaim process in shrink_page_list() just before the last page
> refcount is dropped, the page is migrated to persistent memory instead
> of being dropped.
>
> While I've talked about a DRAM/PMEM pairing, this approach would
> function in any environment where memory tiers exist.
>
> This is not perfect.  It "strands" pages in slower memory and never
> brings them back to fast DRAM.  Other things need to be built to
> promote hot pages back to DRAM.
>
> This is part of a larger patch set.  If you want to apply these or
> play with them, I'd suggest using the tree from here.  It includes
> autonuma-based hot page promotion back to DRAM:
>
>         http://lkml.kernel.org/r/c3d6de4d-f7c3-b505-2e64-8ee5f70b2118@intel.com
>
> This is also all based on an upstream mechanism that allows
> persistent memory to be onlined and used as if it were volatile:
>
>         http://lkml.kernel.org/r/20190124231441.37A4A305@viggo.jf.intel.com

The memory cgroup part of the story is missing here. Since PMEM is
treated as slow DRAM, shouldn't its usage be accounted to the
corresponding memcg's memory/memsw counters and the migration should
not happen for memcg limit reclaim? Otherwise some jobs can hog the
whole PMEM.

Also what happens when PMEM is full? Can the memory migrated to PMEM
be reclaimed (or discarded)?

Shakeel
Dave Hansen Oct. 17, 2019, 2:26 p.m. UTC | #2
On 10/16/19 8:45 PM, Shakeel Butt wrote:
> On Wed, Oct 16, 2019 at 3:49 PM Dave Hansen <dave.hansen@linux.intel.com> wrote:
>> This set implements a solution to these problems.  At the end of the
>> reclaim process in shrink_page_list() just before the last page
>> refcount is dropped, the page is migrated to persistent memory instead
>> of being dropped.
..
> The memory cgroup part of the story is missing here. Since PMEM is
> treated as slow DRAM, shouldn't its usage be accounted to the
> corresponding memcg's memory/memsw counters and the migration should
> not happen for memcg limit reclaim? Otherwise some jobs can hog the
> whole PMEM.

My expectation (and I haven't confirmed this) is that any memory use
is accounted to the owning cgroup, whether it is DRAM or PMEM.  memcg
limit reclaim and global reclaim both end up doing migrations and
neither should have a net effect on the counters.

There is certainly a problem here because DRAM is a more valuable
resource vs. PMEM, and memcg accounts for them as if they were equally
valuable.  I really want to see memcg account for this cost discrepancy
at some point, but I'm not quite sure what form it would take.  Any
feedback from you heavy memcg users out there would be much appreciated.

> Also what happens when PMEM is full? Can the memory migrated to PMEM
> be reclaimed (or discarded)?

Yep.  The "migration path" can be as long as you want, but once the data
hits a "terminal node" it will stop getting migrated and normal discard
at the end of reclaim happens.
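
In code terms the per-page decision in reclaim boils down to something
like this (sketch only; the lookup helper name is assumed, and it is
assumed to return NUMA_NO_NODE for a terminal node):

	/*
	 * Sketch only: the per-page decision inside the reclaim loop.
	 * next_demotion_node() is assumed to return NUMA_NO_NODE for a
	 * "terminal" node that has nowhere left to demote to.
	 */
	int target_nid = next_demotion_node(page_to_nid(page));

	if (target_nid != NUMA_NO_NODE) {
		/* There is a lower tier: migrate instead of discarding. */
		list_add(&page->lru, &demote_pages);
		continue;
	}
	/* Terminal node (e.g. PMEM): normal discard at the end of reclaim. */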
Suleiman Souhlal Oct. 17, 2019, 4:01 p.m. UTC | #3
On Thu, Oct 17, 2019 at 7:14 AM Dave Hansen <dave.hansen@linux.intel.com> wrote:
>
> We're starting to see systems with more and more kinds of memory such
> as Intel's implementation of persistent memory.
>
> Let's say you have a system with some DRAM and some persistent memory.
> Today, once DRAM fills up, reclaim will start and some of the DRAM
> contents will be thrown out.  Allocations will, at some point, start
> falling over to the slower persistent memory.
>
> That has two nasty properties.  First, the newer allocations can end
> up in the slower persistent memory.  Second, reclaimed data in DRAM
> are just discarded even if there are gobs of space in persistent
> memory that could be used.
>
> This set implements a solution to these problems.  At the end of the
> reclaim process in shrink_page_list() just before the last page
> refcount is dropped, the page is migrated to persistent memory instead
> of being dropped.
>
> While I've talked about a DRAM/PMEM pairing, this approach would
> function in any environment where memory tiers exist.
>
> This is not perfect.  It "strands" pages in slower memory and never
> brings them back to fast DRAM.  Other things need to be built to
> promote hot pages back to DRAM.
>
> This is part of a larger patch set.  If you want to apply these or
> play with them, I'd suggest using the tree from here.  It includes
> autonuma-based hot page promotion back to DRAM:
>
>         http://lkml.kernel.org/r/c3d6de4d-f7c3-b505-2e64-8ee5f70b2118@intel.com
>
> This is also all based on an upstream mechanism that allows
> persistent memory to be onlined and used as if it were volatile:
>
>         http://lkml.kernel.org/r/20190124231441.37A4A305@viggo.jf.intel.com
>

We prototyped something very similar to this patch series in the past.

One problem that came up is that if you get into direct reclaim,
because persistent memory can have pretty low write throughput, you
can end up stalling users for a pretty long time while migrating
pages.

To mitigate that, we tried changing background reclaim to start
migrating much earlier (but not otherwise reclaiming); however, it
drastically increased the code complexity and could still fail to keep
up with memory pressure.

Because of that, we moved to a solution based on the proactive reclaim
of idle pages, that was presented at LSFMM earlier this year:
https://lwn.net/Articles/787611/ .

-- Suleiman
Dave Hansen Oct. 17, 2019, 4:32 p.m. UTC | #4
On 10/17/19 9:01 AM, Suleiman Souhlal wrote:
> One problem that came up is that if you get into direct reclaim,
> because persistent memory can have pretty low write throughput, you
> can end up stalling users for a pretty long time while migrating
> pages.

Basically, you're saying that memory load spikes turn into latency spikes?

FWIW, we have been benchmarking this sucker with benchmarks that claim
to care about latency.  In general, compared to DRAM, we do see worse
latency, but nothing catastrophic yet.  I'd be interested if you have
any workloads that act as reasonable proxies for your latency requirements.

> Because of that, we moved to a solution based on the proactive reclaim
> of idle pages, that was presented at LSFMM earlier this year:
> https://lwn.net/Articles/787611/ .

I saw the presentation.  The feedback in the room as I remember it was
that proactive reclaim essentially replaced the existing reclaim
mechanism, to which the audience was not receptive.  Have folks opinions
changed on that, or are you looking for other solutions?
Shakeel Butt Oct. 17, 2019, 4:39 p.m. UTC | #5
On Thu, Oct 17, 2019 at 9:32 AM Dave Hansen <dave.hansen@intel.com> wrote:
>
> On 10/17/19 9:01 AM, Suleiman Souhlal wrote:
> > One problem that came up is that if you get into direct reclaim,
> > because persistent memory can have pretty low write throughput, you
> > can end up stalling users for a pretty long time while migrating
> > pages.
>
> Basically, you're saying that memory load spikes turn into latency spikes?
>
> FWIW, we have been benchmarking this sucker with benchmarks that claim
> to care about latency.  In general, compared to DRAM, we do see worse
> latency, but nothing catastrophic yet.  I'd be interested if you have
> any workloads that act as reasonable proxies for your latency requirements.
>
> > Because of that, we moved to a solution based on the proactive reclaim
> > of idle pages, that was presented at LSFMM earlier this year:
> > https://lwn.net/Articles/787611/ .
>
> I saw the presentation.  The feedback in the room as I remember it was
> that proactive reclaim essentially replaced the existing reclaim
> mechanism, to which the audience was not receptive.  Have folks opinions
> changed on that, or are you looking for other solutions?
>

I am currently working on a solution which shares the mechanisms
between regular and proactive reclaim. Interested users/admins can set
up proactive reclaim; otherwise, regular reclaim will kick in under low
memory. I will have something in one or two months and will post the
patches.

Shakeel
Shakeel Butt Oct. 17, 2019, 4:58 p.m. UTC | #6
On Thu, Oct 17, 2019 at 7:26 AM Dave Hansen <dave.hansen@intel.com> wrote:
>
> On 10/16/19 8:45 PM, Shakeel Butt wrote:
> > On Wed, Oct 16, 2019 at 3:49 PM Dave Hansen <dave.hansen@linux.intel.com> wrote:
> >> This set implements a solution to these problems.  At the end of the
> >> reclaim process in shrink_page_list() just before the last page
> >> refcount is dropped, the page is migrated to persistent memory instead
> >> of being dropped.
> ..
> > The memory cgroup part of the story is missing here. Since PMEM is
> > treated as slow DRAM, shouldn't its usage be accounted to the
> > corresponding memcg's memory/memsw counters and the migration should
> > not happen for memcg limit reclaim? Otherwise some jobs can hog the
> > whole PMEM.
>
> My expectation (and I haven't confirmed this) is that the any memory use
> is accounted to the owning cgroup, whether it is DRAM or PMEM.  memcg
> limit reclaim and global reclaim both end up doing migrations and
> neither should have a net effect on the counters.
>

Hmm I didn't see the memcg charge migration in the code on demotion.
So, in the code [patch 3] the counters are being decremented as DRAM
is freed but not incremented for PMEM.

> There is certainly a problem here because DRAM is a more valuable
> resource vs. PMEM, and memcg accounts for them as if they were equally
> valuable.  I really want to see memcg account for this cost discrepancy
> at some point, but I'm not quite sure what form it would take.  Any
> feedback from you heavy memcg users out there would be much appreciated.
>

There are two apparent use-cases for PMEM: explicit (apps moving their
pages to PMEM to reduce cost) and implicit (the admin moves cold pages
to PMEM transparently to the apps). In the implicit case, I see both
DRAM and PMEM as the same resource from the perspective of memcg
limits, i.e. the same memcg counter, something like cgroup v1's memsw.
For the explicit case, maybe separate counters make sense, like cgroup
v2's memory and swap.

> > Also what happens when PMEM is full? Can the memory migrated to PMEM
> > be reclaimed (or discarded)?
>
> Yep.  The "migration path" can be as long as you want, but once the data
> hits a "terminal node" it will stop getting migrated and normal discard
> at the end of reclaim happens.

I might have missed it but I didn't see the migrated pages inserted
back to LRUs. If they are not in LRU, the reclaimer will never see
them.
Yang Shi Oct. 17, 2019, 5:20 p.m. UTC | #7
On Thu, Oct 17, 2019 at 7:26 AM Dave Hansen <dave.hansen@intel.com> wrote:
>
> On 10/16/19 8:45 PM, Shakeel Butt wrote:
> > On Wed, Oct 16, 2019 at 3:49 PM Dave Hansen <dave.hansen@linux.intel.com> wrote:
> >> This set implements a solution to these problems.  At the end of the
> >> reclaim process in shrink_page_list() just before the last page
> >> refcount is dropped, the page is migrated to persistent memory instead
> >> of being dropped.
> ..
> > The memory cgroup part of the story is missing here. Since PMEM is
> > treated as slow DRAM, shouldn't its usage be accounted to the
> > corresponding memcg's memory/memsw counters and the migration should
> > not happen for memcg limit reclaim? Otherwise some jobs can hog the
> > whole PMEM.
>
> My expectation (and I haven't confirmed this) is that the any memory use
> is accounted to the owning cgroup, whether it is DRAM or PMEM.  memcg
> limit reclaim and global reclaim both end up doing migrations and
> neither should have a net effect on the counters.

Yes, your expectation is correct. As long as PMEM is a NUMA node, it
is treated as regular memory by memcg. But I don't think memcg limit
reclaim should do migration: limit reclaim is meant to reduce memory
usage, and migration doesn't reduce usage, it just moves memory from
one node to another.

In my implementation, I just skip migration for memcg limit reclaim,
please see: https://lore.kernel.org/linux-mm/1560468577-101178-7-git-send-email-yang.shi@linux.alibaba.com/
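
Paraphrasing the check (this is a sketch, not the exact hunk from that
patch):

	/*
	 * Paraphrased: only demote for global (node) reclaim.  Limit
	 * reclaim has to actually reduce the cgroup's usage, and
	 * demotion does not.
	 */
	static bool demotion_allowed(struct scan_control *sc)
	{
		return !sc->target_mem_cgroup;
	}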

>
> There is certainly a problem here because DRAM is a more valuable
> resource vs. PMEM, and memcg accounts for them as if they were equally
> valuable.  I really want to see memcg account for this cost discrepancy
> at some point, but I'm not quite sure what form it would take.  Any
> feedback from you heavy memcg users out there would be much appreciated.

We did have some demands to control the ratio between DRAM and PMEM, as
I mentioned at LSF/MM. Mel Gorman did suggest making memcg account for
DRAM and PMEM separately, or something similar.

>
> > Also what happens when PMEM is full? Can the memory migrated to PMEM
> > be reclaimed (or discarded)?
>
> Yep.  The "migration path" can be as long as you want, but once the data
> hits a "terminal node" it will stop getting migrated and normal discard
> at the end of reclaim happens.

I recall I had a hallway conversation with Keith about this at
LSF/MM. We all agree there should not be a cycle. But, IMHO, I don't
think exporting the migration path to userspace (or letting users
define the migration path) and having multiple migration stops are good
ideas in general.

>
Dave Hansen Oct. 17, 2019, 8:51 p.m. UTC | #8
On 10/17/19 9:58 AM, Shakeel Butt wrote:
>> My expectation (and I haven't confirmed this) is that the any memory use
>> is accounted to the owning cgroup, whether it is DRAM or PMEM.  memcg
>> limit reclaim and global reclaim both end up doing migrations and
>> neither should have a net effect on the counters.
>>
> Hmm I didn't see the memcg charge migration in the code on demotion.
> So, in the code [patch 3] the counters are being decremented as DRAM
> is freed but not incremented for PMEM.

I had assumed that the migration code was doing this for me.  I'll go
make sure either way.
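
If it turns out the generic migration path does not already carry the
charge over, the demotion site would need to do it explicitly; a sketch,
assuming mem_cgroup_migrate() is usable at that point:

	/* Sketch, untested: move the memcg charge from the DRAM page to
	 * the newly allocated PMEM page so the old node is uncharged, the
	 * new one charged, and the cgroup's total usage stays unchanged. */
	static void demote_transfer_charge(struct page *oldpage, struct page *newpage)
	{
		mem_cgroup_migrate(oldpage, newpage);
	}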
Dave Hansen Oct. 17, 2019, 9:05 p.m. UTC | #9
On 10/17/19 10:20 AM, Yang Shi wrote:
> On Thu, Oct 17, 2019 at 7:26 AM Dave Hansen <dave.hansen@intel.com> wrote:
>> My expectation (and I haven't confirmed this) is that the any memory use
>> is accounted to the owning cgroup, whether it is DRAM or PMEM.  memcg
>> limit reclaim and global reclaim both end up doing migrations and
>> neither should have a net effect on the counters.
> 
> Yes, your expectation is correct. As long as PMEM is a NUMA node, it
> is treated as regular memory by memcg. But, I don't think memcg limit
> reclaim should do migration since limit reclaim is used to reduce
> memory usage, but migration doesn't reduce usage, it just moves memory
> from one node to the other.
> 
> In my implementation, I just skip migration for memcg limit reclaim,
> please see: https://lore.kernel.org/linux-mm/1560468577-101178-7-git-send-email-yang.shi@linux.alibaba.com/

Ahh, got it.  That does make sense.  I might have to steal your
implementation.
Shakeel Butt Oct. 17, 2019, 10:58 p.m. UTC | #10
On Thu, Oct 17, 2019 at 10:20 AM Yang Shi <shy828301@gmail.com> wrote:
>
> On Thu, Oct 17, 2019 at 7:26 AM Dave Hansen <dave.hansen@intel.com> wrote:
> >
> > On 10/16/19 8:45 PM, Shakeel Butt wrote:
> > > On Wed, Oct 16, 2019 at 3:49 PM Dave Hansen <dave.hansen@linux.intel.com> wrote:
> > >> This set implements a solution to these problems.  At the end of the
> > >> reclaim process in shrink_page_list() just before the last page
> > >> refcount is dropped, the page is migrated to persistent memory instead
> > >> of being dropped.
> > ..
> > > The memory cgroup part of the story is missing here. Since PMEM is
> > > treated as slow DRAM, shouldn't its usage be accounted to the
> > > corresponding memcg's memory/memsw counters and the migration should
> > > not happen for memcg limit reclaim? Otherwise some jobs can hog the
> > > whole PMEM.
> >
> > My expectation (and I haven't confirmed this) is that the any memory use
> > is accounted to the owning cgroup, whether it is DRAM or PMEM.  memcg
> > limit reclaim and global reclaim both end up doing migrations and
> > neither should have a net effect on the counters.
>
> Yes, your expectation is correct. As long as PMEM is a NUMA node, it
> is treated as regular memory by memcg. But, I don't think memcg limit
> reclaim should do migration since limit reclaim is used to reduce
> memory usage, but migration doesn't reduce usage, it just moves memory
> from one node to the other.
>
> In my implementation, I just skip migration for memcg limit reclaim,
> please see: https://lore.kernel.org/linux-mm/1560468577-101178-7-git-send-email-yang.shi@linux.alibaba.com/
>
> >
> > There is certainly a problem here because DRAM is a more valuable
> > resource vs. PMEM, and memcg accounts for them as if they were equally
> > valuable.  I really want to see memcg account for this cost discrepancy
> > at some point, but I'm not quite sure what form it would take.  Any
> > feedback from you heavy memcg users out there would be much appreciated.
>
> We did have some demands to control the ratio between DRAM and PMEM as
> I mentioned in LSF/MM. Mel Gorman did suggest make memcg account DRAM
> and PMEM respectively or something similar.
>

Can you please describe how you plan to use this ratio? Are
applications supposed to use this ratio, or will the admins be
adjusting it? Also, should it be dynamically updated based on the
workload, i.e. as the working set or hot pages grow we want more DRAM,
and as cold pages grow we want more PMEM? Basically I am trying to
see if we have something like smart auto-numa balancing to fulfill
your use-case.

> >
> > > Also what happens when PMEM is full? Can the memory migrated to PMEM
> > > be reclaimed (or discarded)?
> >
> > Yep.  The "migration path" can be as long as you want, but once the data
> > hits a "terminal node" it will stop getting migrated and normal discard
> > at the end of reclaim happens.
>
> I recalled I had a hallway conversation with Keith about this in
> LSF/MM. We all agree there should be not a cycle. But, IMHO, I don't
> think exporting migration path to userspace (or letting user to define
> migration path) and having multiple migration stops are good ideas in
> general.
>
> >
Michal Hocko Oct. 18, 2019, 7:44 a.m. UTC | #11
On Wed 16-10-19 15:11:48, Dave Hansen wrote:
> We're starting to see systems with more and more kinds of memory such
> as Intel's implementation of persistent memory.
> 
> Let's say you have a system with some DRAM and some persistent memory.
> Today, once DRAM fills up, reclaim will start and some of the DRAM
> contents will be thrown out.  Allocations will, at some point, start
> falling over to the slower persistent memory.
> 
> That has two nasty properties.  First, the newer allocations can end
> up in the slower persistent memory.  Second, reclaimed data in DRAM
> are just discarded even if there are gobs of space in persistent
> memory that could be used.
> 
> This set implements a solution to these problems.  At the end of the
> reclaim process in shrink_page_list() just before the last page
> refcount is dropped, the page is migrated to persistent memory instead
> of being dropped.
> 
> While I've talked about a DRAM/PMEM pairing, this approach would
> function in any environment where memory tiers exist.
> 
> This is not perfect.  It "strands" pages in slower memory and never
> brings them back to fast DRAM.  Other things need to be built to
> promote hot pages back to DRAM.
> 
> This is part of a larger patch set.  If you want to apply these or
> play with them, I'd suggest using the tree from here.  It includes
> autonuma-based hot page promotion back to DRAM:
> 
> 	http://lkml.kernel.org/r/c3d6de4d-f7c3-b505-2e64-8ee5f70b2118@intel.com
> 
> This is also all based on an upstream mechanism that allows
> persistent memory to be onlined and used as if it were volatile:
> 
> 	http://lkml.kernel.org/r/20190124231441.37A4A305@viggo.jf.intel.com

How does this compare to
http://lkml.kernel.org/r/1560468577-101178-1-git-send-email-yang.shi@linux.alibaba.com?
Suleiman Souhlal Oct. 18, 2019, 8:11 a.m. UTC | #12
On Fri, Oct 18, 2019 at 1:32 AM Dave Hansen <dave.hansen@intel.com> wrote:
>
> On 10/17/19 9:01 AM, Suleiman Souhlal wrote:
> > One problem that came up is that if you get into direct reclaim,
> > because persistent memory can have pretty low write throughput, you
> > can end up stalling users for a pretty long time while migrating
> > pages.
>
> Basically, you're saying that memory load spikes turn into latency spikes?

Yes, exactly.

> FWIW, we have been benchmarking this sucker with benchmarks that claim
> to care about latency.  In general, compared to DRAM, we do see worse
> latency, but nothing catastrophic yet.  I'd be interested if you have
> any workloads that act as reasonable proxies for your latency requirements.

Sorry, I don't know of any specific workloads I can share. :-(
Maybe Jonathan or Shakeel have something more.

I realize it's not very useful without giving specific examples, but
even disregarding persistent memory, we've had latency issues with
direct reclaim when using zswap. It's been such a problem that we're
conducting experiments with not doing zswap compression in direct
reclaim (but still doing it proactively).
The low write throughput of persistent memory would make this worse.

I think the case where we're most likely to run into this is when the
machine is close to OOM situation and we end up thrashing rather than
OOM killing.

Somewhat related, I noticed that this patch series ratelimits
migrations from persistent memory to DRAM, but it might also make
sense to ratelimit migrations from DRAM to persistent memory. If all
the write bandwidth is taken by migrations, there might not be any
more available for applications accessing pages in persistent memory,
resulting in higher latency.
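
Even something crude on the demotion side might help; an untested
sketch (the interval and burst numbers are arbitrary):

	#include <linux/ratelimit.h>

	/* Untested sketch: cap DRAM -> PMEM demotion batches so application
	 * writes to PMEM still get some of the (low) write bandwidth.
	 * 1000 batches per second is an arbitrary number. */
	static DEFINE_RATELIMIT_STATE(demote_rs, HZ, 1000);

	static bool demote_ratelimit_ok(void)
	{
		return __ratelimit(&demote_rs);
	}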


Another issue we ran into, that I think might also apply to this patch
series, is that because kernel memory can't be allocated on persistent
memory, it's possible for all of DRAM to get filled by user memory and
have kernel allocations fail even though there is still a lot of free
persistent memory. This is easy to trigger, just start an application
that is bigger than DRAM.
To mitigate that, we introduced a new watermark for DRAM zones above
which user memory can't be allocated, to leave some space for kernel
allocations.
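
Very roughly, the check we added looks like this (sketch from memory;
user_reserve_pages is our own field, not upstream, and __GFP_MOVABLE is
used as a crude proxy for "user allocation"):

	#include <linux/mmzone.h>
	#include <linux/gfp.h>

	static bool user_alloc_allowed(struct zone *zone, gfp_t gfp_mask)
	{
		/* Kernel allocations may always use DRAM. */
		if (!(gfp_mask & __GFP_MOVABLE))
			return true;

		/* User allocations must leave user_reserve_pages of headroom
		 * above the high watermark for kernel use.  (This field is
		 * our own addition, not upstream.) */
		return zone_page_state(zone, NR_FREE_PAGES) >
		       high_wmark_pages(zone) + zone->user_reserve_pages;
	}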

-- Suleiman
Dave Hansen Oct. 18, 2019, 2:54 p.m. UTC | #13
On 10/18/19 12:44 AM, Michal Hocko wrote:
> How does this compare to
> http://lkml.kernel.org/r/1560468577-101178-1-git-send-email-yang.shi@linux.alibaba.com

It's a _bit_ more tied to persistent memory and it appears a bit more
tied to two tiers rather than something arbitrarily deep.  They're
pretty similar conceptually although there are quite a few differences.

For instance, what I posted has a static mapping for the migration path.
If node A is in reclaim, we always try to allocate pages on node B.
There are no restrictions on what those nodes can be.  In Yang Shi's
approach, there's a dynamic search for a target migration node on each
migration that follows the normal alloc fallback path.  This ends up
making migration nodes special.
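
Roughly, the static mapping is just a one-hop, per-node table built at
boot or hotplug time (sketch, not the exact code from the series):

	/* Sketch: a static, one-hop demotion map indexed by source node.
	 * NUMA_NO_NODE marks a terminal node (discard as usual). */
	static int node_demotion[MAX_NUMNODES] __read_mostly = {
		[0 ... MAX_NUMNODES - 1] = NUMA_NO_NODE
	};

	int next_demotion_node(int nid)
	{
		return node_demotion[nid];
	}

	/* Filled in at boot/hotplug, e.g. DRAM node 0 -> PMEM node 2. */
	void set_demotion_target(int from_nid, int to_nid)
	{
		node_demotion[from_nid] = to_nid;
	}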

There are also some different choices that are pretty arbitrary.  For
instance, when you allocate a migration target page, should you cause
memory pressure on the target?

To be honest, though, I don't see anything fatally flawed with it.  It's
probably a useful exercise to factor out the common bits from the two
sets and see what we can agree on being absolutely necessary.
Dave Hansen Oct. 18, 2019, 3:10 p.m. UTC | #14
On 10/18/19 1:11 AM, Suleiman Souhlal wrote:
> Another issue we ran into, that I think might also apply to this patch
> series, is that because kernel memory can't be allocated on persistent
> memory, it's possible for all of DRAM to get filled by user memory and
> have kernel allocations fail even though there is still a lot of free
> persistent memory. This is easy to trigger, just start an application
> that is bigger than DRAM.

Why doesn't this happen on everyone's laptops where DRAM is contended
between userspace and kernel allocations?  Does the OOM killer trigger
fast enough to save us?

> To mitigate that, we introduced a new watermark for DRAM zones above
> which user memory can't be allocated, to leave some space for kernel
> allocations.

I'd be curious why the existing users of ZONE_MOVABLE don't have to do
this?  Are there just no users of ZONE_MOVABLE?
Suleiman Souhlal Oct. 18, 2019, 3:39 p.m. UTC | #15
On Sat, Oct 19, 2019 at 12:10 AM Dave Hansen <dave.hansen@intel.com> wrote:
>
> On 10/18/19 1:11 AM, Suleiman Souhlal wrote:
> > Another issue we ran into, that I think might also apply to this patch
> > series, is that because kernel memory can't be allocated on persistent
> > memory, it's possible for all of DRAM to get filled by user memory and
> > have kernel allocations fail even though there is still a lot of free
> > persistent memory. This is easy to trigger, just start an application
> > that is bigger than DRAM.
>
> Why doesn't this happen on everyone's laptops where DRAM is contended
> between userspace and kernel allocations?  Does the OOM killer trigger
> fast enough to save us?

Well in this case, there is plenty of free persistent memory on the
machine, but not any free DRAM to allocate kernel memory.
In the situation I'm describing, we end up OOMing when we, in my
opinion, shouldn't.

> > To mitigate that, we introduced a new watermark for DRAM zones above
> > which user memory can't be allocated, to leave some space for kernel
> > allocations.
>
> I'd be curious why the existing users of ZONE_MOVABLE don't have to do
> this?  Are there just no users of ZONE_MOVABLE?

That's an excellent question for which I don't currently have an answer.

I haven't had the chance to test your patch series, and it's possible
that it doesn't suffer from the issue.

-- Suleiman
Yang Shi Oct. 18, 2019, 9:39 p.m. UTC | #16
On Fri, Oct 18, 2019 at 7:54 AM Dave Hansen <dave.hansen@intel.com> wrote:
>
> On 10/18/19 12:44 AM, Michal Hocko wrote:
> > How does this compare to
> > http://lkml.kernel.org/r/1560468577-101178-1-git-send-email-yang.shi@linux.alibaba.com
>
> It's a _bit_ more tied to persistent memory and it appears a bit more
> tied to two tiers rather something arbitrarily deep.  They're pretty
> similar conceptually although there are quite a few differences.

My patches do assume two tiers for now, but it is not hard to extend to
multiple tiers. Since it is an RFC, I didn't make it that complicated.

However, IMHO I really don't think supporting multiple tiers by making
the migration path configurable by admins or users is a good choice.
Memory migration caused by compaction or reclaim (not via syscall)
should be transparent to the users; it is kernel-internal activity and
shouldn't be exposed to the end users.

Personally, I prefer that the firmware or the OS build the migration path.

>
> For instance, what I posted has a static mapping for the migration path.
>  If node A is in reclaim, we always try to allocate pages on node B.
> There are no restrictions on what those nodes can be.  In Yang Shi's
> apporach, there's a dynamic search for a target migration node on each
> migration that follows the normal alloc fallback path.  This ends up
> making migration nodes special.

The reason that I didn't pursue static mapping is that the node might
be offlined or onlined, so you have to keep the mapping right every
time the node state is changed. Dynamic search just returns the
closest migration target node no matter what the topology is. It
should not be time consuming.
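
The search is basically a walk over the nodes with memory, picking the
nearest one that isn't DRAM (paraphrased sketch, not the exact patch;
is_dram_node() is a placeholder for however the platform marks the
slower tier):

	#include <linux/nodemask.h>
	#include <linux/topology.h>
	#include <linux/limits.h>

	static int find_demotion_target(int src_nid)
	{
		int nid, best = NUMA_NO_NODE, best_dist = INT_MAX;

		for_each_node_state(nid, N_MEMORY) {
			if (nid == src_nid || is_dram_node(nid))	/* placeholder */
				continue;
			if (node_distance(src_nid, nid) < best_dist) {
				best_dist = node_distance(src_nid, nid);
				best = nid;
			}
		}
		return best;	/* NUMA_NO_NODE if no lower tier exists */
	}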

Actually, my patches don't restrict the migration target node to PMEM;
it could be any memory tier slower than DRAM, it just happens that PMEM
is the only available media. My patch's commit log explains this point.
Again, I would really prefer that the firmware or the HMAT/ACPI driver
build the migration path in the kernel.

In addition, DRAM nodes are definitely excluded as migration targets,
since I don't think doing such migration between DRAM nodes is a good
idea in general.

>
> There are also some different choices that are pretty arbitrary.  For
> instance, when you allocation a migration target page, should you cause
> memory pressure on the target?

Yes, those are definitely arbitrary. We do need to sort out a lot of
details in the future by figuring out how real-life workloads behave.

>
> To be honest, though, I don't see anything fatally flawed with it.  It's
> probably a useful exercise to factor out the common bits from the two
> sets and see what we can agree on being absolutely necessary.

Sure, that definitely would help us move forward.

>
Yang Shi Oct. 18, 2019, 9:44 p.m. UTC | #17
On Thu, Oct 17, 2019 at 3:58 PM Shakeel Butt <shakeelb@google.com> wrote:
>
> On Thu, Oct 17, 2019 at 10:20 AM Yang Shi <shy828301@gmail.com> wrote:
> >
> > On Thu, Oct 17, 2019 at 7:26 AM Dave Hansen <dave.hansen@intel.com> wrote:
> > >
> > > On 10/16/19 8:45 PM, Shakeel Butt wrote:
> > > > On Wed, Oct 16, 2019 at 3:49 PM Dave Hansen <dave.hansen@linux.intel.com> wrote:
> > > >> This set implements a solution to these problems.  At the end of the
> > > >> reclaim process in shrink_page_list() just before the last page
> > > >> refcount is dropped, the page is migrated to persistent memory instead
> > > >> of being dropped.
> > > ..
> > > > The memory cgroup part of the story is missing here. Since PMEM is
> > > > treated as slow DRAM, shouldn't its usage be accounted to the
> > > > corresponding memcg's memory/memsw counters and the migration should
> > > > not happen for memcg limit reclaim? Otherwise some jobs can hog the
> > > > whole PMEM.
> > >
> > > My expectation (and I haven't confirmed this) is that the any memory use
> > > is accounted to the owning cgroup, whether it is DRAM or PMEM.  memcg
> > > limit reclaim and global reclaim both end up doing migrations and
> > > neither should have a net effect on the counters.
> >
> > Yes, your expectation is correct. As long as PMEM is a NUMA node, it
> > is treated as regular memory by memcg. But, I don't think memcg limit
> > reclaim should do migration since limit reclaim is used to reduce
> > memory usage, but migration doesn't reduce usage, it just moves memory
> > from one node to the other.
> >
> > In my implementation, I just skip migration for memcg limit reclaim,
> > please see: https://lore.kernel.org/linux-mm/1560468577-101178-7-git-send-email-yang.shi@linux.alibaba.com/
> >
> > >
> > > There is certainly a problem here because DRAM is a more valuable
> > > resource vs. PMEM, and memcg accounts for them as if they were equally
> > > valuable.  I really want to see memcg account for this cost discrepancy
> > > at some point, but I'm not quite sure what form it would take.  Any
> > > feedback from you heavy memcg users out there would be much appreciated.
> >
> > We did have some demands to control the ratio between DRAM and PMEM as
> > I mentioned in LSF/MM. Mel Gorman did suggest make memcg account DRAM
> > and PMEM respectively or something similar.
> >
>
> Can you please describe how you plan to use this ratio? Are
> applications supposed to use this ratio or the admins will be
> adjusting this ratio? Also should it dynamically updated based on the
> workload i.e. as the working set or hot pages grows we want more DRAM
> and as cold pages grows we want more PMEM? Basically I am trying to
> see if we have something like smart auto-numa balancing to fulfill
> your use-case.

We thought it should be controlled by admins and transparent to the
end users. The ratio is fixed, but the memory could be moved between
DRAM and PMEM dynamically as long as it doesn't exceed the ratio so
that we could keep warmer data in DRAM and colder data in PMEM.

I talked this about in LSF/MM, please check this out:
https://lwn.net/Articles/787418/

>
> > >
> > > > Also what happens when PMEM is full? Can the memory migrated to PMEM
> > > > be reclaimed (or discarded)?
> > >
> > > Yep.  The "migration path" can be as long as you want, but once the data
> > > hits a "terminal node" it will stop getting migrated and normal discard
> > > at the end of reclaim happens.
> >
> > I recalled I had a hallway conversation with Keith about this in
> > LSF/MM. We all agree there should be not a cycle. But, IMHO, I don't
> > think exporting migration path to userspace (or letting user to define
> > migration path) and having multiple migration stops are good ideas in
> > general.
> >
> > >
Dan Williams Oct. 18, 2019, 9:55 p.m. UTC | #18
On Fri, Oct 18, 2019 at 2:40 PM Yang Shi <shy828301@gmail.com> wrote:
>
> On Fri, Oct 18, 2019 at 7:54 AM Dave Hansen <dave.hansen@intel.com> wrote:
> >
> > On 10/18/19 12:44 AM, Michal Hocko wrote:
> > > How does this compare to
> > > http://lkml.kernel.org/r/1560468577-101178-1-git-send-email-yang.shi@linux.alibaba.com
> >
> > It's a _bit_ more tied to persistent memory and it appears a bit more
> > tied to two tiers rather something arbitrarily deep.  They're pretty
> > similar conceptually although there are quite a few differences.
>
> My patches do assume two tiers for now but it is not hard to extend to
> multiple tiers. Since it is a RFC so I didn't make it that
> complicated.
>
> However, IMHO I really don't think supporting multiple tiers by making
> the migration path configurable to admins or users is a good choice.

It's an optional override, not a user requirement.

> Memory migration caused by compaction or reclaim (not via syscall)
> should be transparent to the users, it is the kernel internal
> activity. It shouldn't be exposed to the end users.
>
> I prefer firmware or OS build the migration path personally.

The OS can't, it can only trust platform firmware to tell it the
memory properties.

The BIOS likely gets the tables right most of the time, and the OS can
assume they are correct, but when things inevitably go wrong a user
override is needed. That override is more usable as an explicit
migration path rather than requiring users to manually craft and
inject custom ACPI tables. I otherwise do not see the substance behind
this objection to a migration path override.
Michal Hocko Oct. 22, 2019, 1:49 p.m. UTC | #19
On Fri 18-10-19 07:54:20, Dave Hansen wrote:
> On 10/18/19 12:44 AM, Michal Hocko wrote:
> > How does this compare to
> > http://lkml.kernel.org/r/1560468577-101178-1-git-send-email-yang.shi@linux.alibaba.com
> 
> It's a _bit_ more tied to persistent memory and it appears a bit more
> tied to two tiers rather something arbitrarily deep.  They're pretty
> similar conceptually although there are quite a few differences.
> 
> For instance, what I posted has a static mapping for the migration path.
>  If node A is in reclaim, we always try to allocate pages on node B.
> There are no restrictions on what those nodes can be.  In Yang Shi's
> apporach, there's a dynamic search for a target migration node on each
> migration that follows the normal alloc fallback path.  This ends up
> making migration nodes special.

As we have discussed at LSFMM this year, and there seemed to be a good
consensus on that, the resulting implementation should be as pmem
neutral as possible. After all, node migration mode sounds like a
reasonable feature even without pmem. So I would be more inclined to the
normal alloc fallback path rather than a very specific and static
migration fallback path. If that turns out impractical then sure, let's
come up with something more specific, but I think there is quite a long
route there because we do not really have much experience with this so
far.

> There are also some different choices that are pretty arbitrary.  For
> instance, when you allocation a migration target page, should you cause
> memory pressure on the target?

Those are details to really sort out, and they require some
experimentation too.

> To be honest, though, I don't see anything fatally flawed with it.  It's
> probably a useful exercise to factor out the common bits from the two
> sets and see what we can agree on being absolutely necessary.

Makes sense. What would that be? Is there a real consensus on having the
new node_reclaim mode be the configuration mechanism? Do we want to
support generic NUMA without any PMEM in place as well, for starters?

Thanks!