Message ID | 154145268025.30046.11742652345962594283.stgit@ahduyck-desk1.jf.intel.com |
---|---|
Series | Deferred page init improvements |
On 18-11-05 13:19:25, Alexander Duyck wrote: > This patchset is essentially a refactor of the page initialization logic > that is meant to provide for better code reuse while providing a > significant improvement in deferred page initialization performance. > > In my testing on an x86_64 system with 384GB of RAM and 3TB of persistent > memory per node I have seen the following. In the case of regular memory > initialization the deferred init time was decreased from 3.75s to 1.06s on > average. For the persistent memory the initialization time dropped from > 24.17s to 19.12s on average. This amounts to a 253% improvement for the > deferred memory initialization performance, and a 26% improvement in the > persistent memory initialization performance. Hi Alex, Please try to run your persistent memory init experiment with Daniel's patches: https://lore.kernel.org/lkml/20181105165558.11698-1-daniel.m.jordan@oracle.com/ The performance should improve by much more than 26%. Overall, your work looks good, but it needs to be considered how easy it will be to merge with ktask. I will try to complete the review today. Thank you, Pasha
On Fri, 2018-11-09 at 16:15 -0500, Pavel Tatashin wrote: > On 18-11-05 13:19:25, Alexander Duyck wrote: > > This patchset is essentially a refactor of the page initialization logic > > that is meant to provide for better code reuse while providing a > > significant improvement in deferred page initialization performance. > > > > In my testing on an x86_64 system with 384GB of RAM and 3TB of persistent > > memory per node I have seen the following. In the case of regular memory > > initialization the deferred init time was decreased from 3.75s to 1.06s on > > average. For the persistent memory the initialization time dropped from > > 24.17s to 19.12s on average. This amounts to a 253% improvement for the > > deferred memory initialization performance, and a 26% improvement in the > > persistent memory initialization performance. > > Hi Alex, > > Please try to run your persistent memory init experiment with Daniel's > patches: > > https://lore.kernel.org/lkml/20181105165558.11698-1-daniel.m.jordan@oracle.com/ I've taken a quick look at it. It seems like a bit of a brute force way to try and speed things up. I would be worried about it potentially introducing performance issues if the number of CPUs thrown at it ends up exceeding the maximum throughput of the memory. The data provided with patch 11 seems to point to issues such as that. In the case of the E7-8895 example cited it is increasing the number of CPUs used for memory initialization from 8 to 72, a 9x increase in the number of CPUs, but it is yielding only a 3.88x speedup. > The performance should improve by much more than 26%. The 26% improvement, or speedup of 1.26x using the ktask approach, was for persistent memory, not deferred memory init. The ktask patch doesn't do anything for persistent memory since it takes the hot-plug path and isn't handled via the deferred memory init. I had increased deferred memory init to about 3.53x the original speed (3.75s to 1.06s) on the system which I was testing. I do agree the two patches should be able to synergistically boost each other though as this patch set was meant to make the init much more cache friendly so as a result it should scale better as you add additional cores. I know I had done some playing around with fake numa to split up a single node into 8 logical nodes and I had seen a similar speedup of about 3.85x with my test memory initializing in about 275ms. > Overall, your works looks good, but it needs to be considered how easy it will be > to merge with ktask. I will try to complete the review today. > > Thank you, > Pasha Looking over the patches they are still in the RFC stage and the data is in need of updates since it is referencing 4.15-rc kernels as its baseline. If anything I really think the ktask patch 11 would be easier to rebase around my patch set than the other way around. Also, this series is in Andrew's mmots as of a few days ago, so I think it will be in the next mmotm that comes out. The integration with the ktask code should be pretty straightforward. If anything I think my code would probably make it easier since it gets rid of the need to do all this in two passes. The only new limitation it would add is that you would probably want to split up the work along either max order or section-aligned boundaries. What it would essentially do is make things so that each of the ktask threads would probably look more like deferred_grow_zone, which after my patch set is actually a fairly simple function. Thanks. - Alex
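To make the section-aligned splitting Alex describes concrete, here is a minimal standalone C sketch (a toy model, not the ktask or kernel code; SECTION_PAGES and the chunking policy are assumptions made purely for illustration): each worker gets whole multiples of a section, so no two workers ever touch the same MAX_ORDER block.

```c
#include <stdio.h>

/* Assumed value: pages per memory section (128MB of 4KiB pages on x86_64). */
#define SECTION_PAGES	32768UL

/* Split [start_pfn, end_pfn) into per-thread chunks that are whole multiples
 * of a section, assuming the zone start itself is section aligned. */
static void split_on_sections(unsigned long start_pfn, unsigned long end_pfn,
			      unsigned int nr_threads)
{
	unsigned long sections = (end_pfn - start_pfn + SECTION_PAGES - 1) / SECTION_PAGES;
	unsigned long per_thread = (sections + nr_threads - 1) / nr_threads;
	unsigned int t;

	for (t = 0; t < nr_threads; t++) {
		unsigned long s = start_pfn + t * per_thread * SECTION_PAGES;
		unsigned long e = s + per_thread * SECTION_PAGES;

		if (s >= end_pfn)
			break;
		if (e > end_pfn)
			e = end_pfn;
		printf("worker %u: pfn [%lu, %lu)\n", t, s, e);
	}
}

int main(void)
{
	/* e.g. a 4GB zone (1048576 4KiB pages) split across 4 workers */
	split_on_sections(0, 1048576, 4);
	return 0;
}
```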
On 18-11-09 15:14:35, Alexander Duyck wrote: > On Fri, 2018-11-09 at 16:15 -0500, Pavel Tatashin wrote: > > On 18-11-05 13:19:25, Alexander Duyck wrote: > > > This patchset is essentially a refactor of the page initialization logic > > > that is meant to provide for better code reuse while providing a > > > significant improvement in deferred page initialization performance. > > > > > > In my testing on an x86_64 system with 384GB of RAM and 3TB of persistent > > > memory per node I have seen the following. In the case of regular memory > > > initialization the deferred init time was decreased from 3.75s to 1.06s on > > > average. For the persistent memory the initialization time dropped from > > > 24.17s to 19.12s on average. This amounts to a 253% improvement for the > > > deferred memory initialization performance, and a 26% improvement in the > > > persistent memory initialization performance. > > > > Hi Alex, > > > > Please try to run your persistent memory init experiment with Daniel's > > patches: > > > > https://lore.kernel.org/lkml/20181105165558.11698-1-daniel.m.jordan@oracle.com/ > > I've taken a quick look at it. It seems like a bit of a brute force way > to try and speed things up. I would be worried about it potentially There is a limit to the max number of threads that ktasks start. The memory throughput is *much* higher than what one CPU can max out in a node, so there is no reason to leave the other CPUs sitting idle during boot when they can help to initialize. > introducing performance issues if the number of CPUs thrown at it end > up exceeding the maximum throughput of the memory. > > The data provided with patch 11 seems to point to issues such as that. > In the case of the E7-8895 example cited it is increasing the numbers > of CPUs used from memory initialization from 8 to 72, a 9x increase in > the number of CPUs but it is yeilding only a 3.88x speedup. Yes, but in both cases we are far from maxing out the memory throughput. The 3.88x is indeed low, and I do not know what slows it down. Daniel, Could you please check why multi-threading efficiency is so low here? I bet there is some atomic operation that introduces contention within a node. It should be possible to resolve. > > > The performance should improve by much more than 26%. > > The 26% improvement, or speedup of 1.26x using the ktask approach, was > for persistent memory, not deferred memory init. The ktask patch > doesn't do anything for persistent memory since it is takes the hot- > plug path and isn't handled via the deferred memory init. Ah, I thought in your experiment persistent memory takes the deferred init path. So, what exactly in your patches makes this 1.26x speedup? > > I had increased deferred memory init to about 3.53x the original speed > (3.75s to 1.06s) on the system which I was testing. I do agree the two > patches should be able to synergistically boost each other though as > this patch set was meant to make the init much more cache friendly so > as a result it should scale better as you add additional cores. I know > I had done some playing around with fake numa to split up a single node > into 8 logical nodes and I had seen a similar speedup of about 3.85x > with my test memory initializing in about 275ms. > > > Overall, your works looks good, but it needs to be considered how easy it will be > > to merge with ktask. I will try to complete the review today.
> > > > Thank you, > > Pasha > > Looking over the patches they are still in the RFC stage and the data > is in need of updates since it is referencing 4.15-rc kernels as its > baseline. If anything I really think the ktask patch 11 would be easier > to rebase around my patch set then the other way around. Also, this > series is in Andrew's mmots as of a few days ago, so I think it will be > in the next mmotm that comes out. I do not disagree, I think these two patch series should complement each other. But, if your changes make it impossible for ktask, I would strongly argue against it, as the potential improvements with ktasks are much higher. But, so far I do not see anything, so I think they can work together. I am still reviewing your work. > > The integration with the ktask code should be pretty straight forward. > If anything I think my code would probably make it easier since it gets > rid of the need to do all this in two passes. The only new limitation > it would add is that you would probably want to split up the work along > either max order or section aligned boundaries. What it would Which is totally OK, it should make ktasks scale even better. > essentially do is make things so that each of the ktask threads would > probably look more like deferred_grow_zone which after my patch set is > actually a fairly simple function. > > Thanks. Thank you, Pasha
On Fri, 2018-11-09 at 19:00 -0500, Pavel Tatashin wrote: > On 18-11-09 15:14:35, Alexander Duyck wrote: > > On Fri, 2018-11-09 at 16:15 -0500, Pavel Tatashin wrote: > > > On 18-11-05 13:19:25, Alexander Duyck wrote: > > > > This patchset is essentially a refactor of the page initialization logic > > > > that is meant to provide for better code reuse while providing a > > > > significant improvement in deferred page initialization performance. > > > > > > > > In my testing on an x86_64 system with 384GB of RAM and 3TB of persistent > > > > memory per node I have seen the following. In the case of regular memory > > > > initialization the deferred init time was decreased from 3.75s to 1.06s on > > > > average. For the persistent memory the initialization time dropped from > > > > 24.17s to 19.12s on average. This amounts to a 253% improvement for the > > > > deferred memory initialization performance, and a 26% improvement in the > > > > persistent memory initialization performance. > > > > > > Hi Alex, > > > > > > Please try to run your persistent memory init experiment with Daniel's > > > patches: > > > > > > https://lore.kernel.org/lkml/20181105165558.11698-1-daniel.m.jordan@oracle.com/ > > > > I've taken a quick look at it. It seems like a bit of a brute force way > > to try and speed things up. I would be worried about it potentially > > There is a limit to max number of threads that ktasks start. The memory > throughput is *much* higher than what one CPU can maxout in a node, so > there is no reason to leave the other CPUs sit idle during boot when > they can help to initialize. Right, but right now that limit can still be pretty big when it is something like 25% of all the CPUs on a 288 CPU system. One issue is the way the code ends up essentially blowing out the cache over and over again. Doing things in two passes made it really expensive as you took one cache miss to initialize it, and another to free it. I think getting rid of that is one of the biggest gains with my patch set. > > introducing performance issues if the number of CPUs thrown at it end > > up exceeding the maximum throughput of the memory. > > > > The data provided with patch 11 seems to point to issues such as that. > > In the case of the E7-8895 example cited it is increasing the numbers > > of CPUs used from memory initialization from 8 to 72, a 9x increase in > > the number of CPUs but it is yeilding only a 3.88x speedup. > > Yes, but in both cases we are far from maxing out the memory throughput. > The 3.88x is indeed low, and I do not know what slows it down. > > Daniel, > > Could you please check why multi-threading efficiency is so low here? > > I bet, there is some atomic operation introduces a contention within a > node. It should be possible to resolve. Depending on what kernel this was based on it probably was something like SetPageReserved triggering pipeline stalls, or maybe it is freeing the pages themselves since I would imagine that has a pretty big overhead and may end up serializing some things. > > > > > The performance should improve by much more than 26%. > > > > The 26% improvement, or speedup of 1.26x using the ktask approach, was > > for persistent memory, not deferred memory init. The ktask patch > > doesn't do anything for persistent memory since it is takes the hot- > > plug path and isn't handled via the deferred memory init. > > Ah, I thought in your experiment persistent memory takes deferred init > path. So, what exactly in your patches make this 1.26x speedup?
Basically it is the folding of the reserved bit into the setting of the rest of the values written into the page flags. With that change we are able to tighten the memory initialization loop to the point where it almost looks like a memset case as it will put together all of the memory values outside of the loop and then just loop through assigning the values for each page. > > > > I had increased deferred memory init to about 3.53x the original speed > > (3.75s to 1.06s) on the system which I was testing. I do agree the two > > patches should be able to synergistically boost each other though as > > this patch set was meant to make the init much more cache friendly so > > as a result it should scale better as you add additional cores. I know > > I had done some playing around with fake numa to split up a single node > > into 8 logical nodes and I had seen a similar speedup of about 3.85x > > with my test memory initializing in about 275ms. It is also possible that the 3.8x is an upper limit for something on the system in terms of scaling. I just realized that the 3.85x I saw with an 8-way split over a 4-socket system isn't too different from the 3.88x that was recorded for a 9-way split over an 8-socket system. > > > Overall, your works looks good, but it needs to be considered how easy it will be > > > to merge with ktask. I will try to complete the review today. > > > > > > Thank you, > > > Pasha > > > > Looking over the patches they are still in the RFC stage and the data > > is in need of updates since it is referencing 4.15-rc kernels as its > > baseline. If anything I really think the ktask patch 11 would be easier > > to rebase around my patch set then the other way around. Also, this > > series is in Andrew's mmots as of a few days ago, so I think it will be > > in the next mmotm that comes out. > > I do not disagree, I think these two patch series should complement each > other. But, if your changes make it impossible for ktask, I would strongly argue > against it, as the potential improvements with ktasks are much higher. > But, so far I do not see anything, so I think they can work together. I > am still reviewing your work. I am assuming patch 11 from the ktask set is the only one that really touches the deferred memory init. I'm thinking that the actual patch should be smaller if it is rebased off of this patch set. I didn't see anything that should conflict with my patch set, and as I said the only concern would be making certain to split the zone correctly to prevent the free thread from hopping across the MAX_ORDER boundaries. > > > > The integration with the ktask code should be pretty straight forward. > > If anything I think my code would probably make it easier since it gets > > rid of the need to do all this in two passes. The only new limitation > > it would add is that you would probably want to split up the work along > > either max order or section aligned boundaries. What it would > > Which is totally OK, it should make ktasks scale even better. Right. > > essentially do is make things so that each of the ktask threads would > > probably look more like deferred_grow_zone which after my patch set is > > actually a fairly simple function. > > > > Thanks. > > Thank you, > Pasha
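A minimal standalone C sketch of the loop shape Alex describes above (a toy model only; the struct layout, flag bits and shifts are invented for illustration and are not the kernel's struct page): everything that is constant across the range, including the reserved bit, is folded into one template value outside the loop, so the body reduces to a single store per page.

```c
#include <stdio.h>
#include <stddef.h>
#include <stdint.h>

/* Toy stand-ins for struct page and its flag bits -- illustration only. */
struct toy_page {
	uint64_t flags;
	uint64_t pad[7];			/* model a 64-byte struct page */
};

#define TOY_PG_RESERVED		(1ULL << 0)	/* assumed bit position */
#define TOY_NODE_SHIFT		48
#define TOY_ZONE_SHIFT		56

/* Initialize nr_pages entries from one precomputed flags template. */
static void toy_init_range(struct toy_page *map, size_t nr_pages,
			   unsigned int zone, unsigned int node, int reserved)
{
	/* Everything identical across the range (zone, node, reserved bit)
	 * is folded into a single value outside the loop. */
	uint64_t tmpl = ((uint64_t)zone << TOY_ZONE_SHIFT) |
			((uint64_t)node << TOY_NODE_SHIFT) |
			(reserved ? TOY_PG_RESERVED : 0);
	size_t i;

	/* The hot loop is one store per page -- no per-page branches,
	 * which is what makes it behave almost like a memset. */
	for (i = 0; i < nr_pages; i++)
		map[i].flags = tmpl;
}

int main(void)
{
	static struct toy_page map[16];

	toy_init_range(map, 16, 2, 0, 1);
	printf("page 0 flags: %#llx\n", (unsigned long long)map[0].flags);
	return 0;
}
```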
On 18-11-09 16:46:02, Alexander Duyck wrote: > On Fri, 2018-11-09 at 19:00 -0500, Pavel Tatashin wrote: > > On 18-11-09 15:14:35, Alexander Duyck wrote: > > > On Fri, 2018-11-09 at 16:15 -0500, Pavel Tatashin wrote: > > > > On 18-11-05 13:19:25, Alexander Duyck wrote: > > > > > This patchset is essentially a refactor of the page initialization logic > > > > > that is meant to provide for better code reuse while providing a > > > > > significant improvement in deferred page initialization performance. > > > > > > > > > > In my testing on an x86_64 system with 384GB of RAM and 3TB of persistent > > > > > memory per node I have seen the following. In the case of regular memory > > > > > initialization the deferred init time was decreased from 3.75s to 1.06s on > > > > > average. For the persistent memory the initialization time dropped from > > > > > 24.17s to 19.12s on average. This amounts to a 253% improvement for the > > > > > deferred memory initialization performance, and a 26% improvement in the > > > > > persistent memory initialization performance. > > > > > > > > Hi Alex, > > > > > > > > Please try to run your persistent memory init experiment with Daniel's > > > > patches: > > > > > > > > https://lore.kernel.org/lkml/20181105165558.11698-1-daniel.m.jordan@oracle.com/ > > > > > > I've taken a quick look at it. It seems like a bit of a brute force way > > > to try and speed things up. I would be worried about it potentially > > > > There is a limit to max number of threads that ktasks start. The memory > > throughput is *much* higher than what one CPU can maxout in a node, so > > there is no reason to leave the other CPUs sit idle during boot when > > they can help to initialize. > > Right, but right now that limit can still be pretty big when it is > something like 25% of all the CPUs on a 288 CPU system. It is still OK. About 9 threads per node. That machine has 1T of memory, which means 8 nodes need to initialize 2G of memory each. With 46G/s throughput it should take 0.043s, which is roughly 10 times faster than the 0.325s Daniel sees, so there is still room to saturate the memory throughput. Now, if the multi-threading efficiency is good, it should take 1.261s / 9 threads = 0.14s. > > One issue is the way the code was ends up essentially blowing out the > cache over and over again. Doing things in two passes made it really > expensive as you took one cache miss to initialize it, and another to > free it. I think getting rid of that is one of the biggest gains with > my patch set. I am not arguing against your patches; all I am saying is that ktasks improve the init time by an order of magnitude on machines with large amounts of memory. Pasha
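For readers checking the numbers: assuming the usual 64-byte struct page describing each 4 KiB base page (an assumption added here, not stated in the mail), the estimate works out roughly as

$$
\frac{1\,\mathrm{TB}}{64} = 16\,\mathrm{GB}\ \text{of struct pages}, \qquad
\frac{16\,\mathrm{GB}}{8\ \text{nodes}} = 2\,\mathrm{GB}\ \text{per node}, \qquad
\frac{2\,\mathrm{GB}}{46\,\mathrm{GB/s}} \approx 0.043\,\mathrm{s},
$$

and with ideal 9-way scaling, $1.261\,\mathrm{s} / 9 \approx 0.14\,\mathrm{s}$.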
On Fri, Nov 09, 2018 at 07:00:06PM -0500, Pavel Tatashin wrote: > On 18-11-09 15:14:35, Alexander Duyck wrote: > > On Fri, 2018-11-09 at 16:15 -0500, Pavel Tatashin wrote: > > > On 18-11-05 13:19:25, Alexander Duyck wrote: > > > > This patchset is essentially a refactor of the page initialization logic > > > > that is meant to provide for better code reuse while providing a > > > > significant improvement in deferred page initialization performance. > > > > > > > > In my testing on an x86_64 system with 384GB of RAM and 3TB of persistent > > > > memory per node I have seen the following. In the case of regular memory > > > > initialization the deferred init time was decreased from 3.75s to 1.06s on > > > > average. For the persistent memory the initialization time dropped from > > > > 24.17s to 19.12s on average. This amounts to a 253% improvement for the > > > > deferred memory initialization performance, and a 26% improvement in the > > > > persistent memory initialization performance. > > > > > > Hi Alex, > > > > > > Please try to run your persistent memory init experiment with Daniel's > > > patches: > > > > > > https://lore.kernel.org/lkml/20181105165558.11698-1-daniel.m.jordan@oracle.com/ > > > > I've taken a quick look at it. It seems like a bit of a brute force way > > to try and speed things up. I would be worried about it potentially > > There is a limit to max number of threads that ktasks start. The memory > throughput is *much* higher than what one CPU can maxout in a node, so > there is no reason to leave the other CPUs sit idle during boot when > they can help to initialize. > > > introducing performance issues if the number of CPUs thrown at it end > > up exceeding the maximum throughput of the memory. > > > > The data provided with patch 11 seems to point to issues such as that. > > In the case of the E7-8895 example cited it is increasing the numbers > > of CPUs used from memory initialization from 8 to 72, a 9x increase in > > the number of CPUs but it is yeilding only a 3.88x speedup. > > Yes, but in both cases we are far from maxing out the memory throughput. > The 3.88x is indeed low, and I do not know what slows it down. > > Daniel, > > Could you please check why multi-threading efficiency is so low here? I'll hop on the machine after Plumbers. > I bet, there is some atomic operation introduces a contention within a > node. It should be possible to resolve. We'll see, in any case I'm curious to see what the multithreading does with Alex's patches, especially since we won't do two passes through the memory anymore. Not seeing anything in Alex's work right off that would preclude multithreading.
On Fri, Nov 9, 2018 at 5:17 PM Pavel Tatashin <pasha.tatashin@soleen.com> wrote: > > On 18-11-09 16:46:02, Alexander Duyck wrote: > > On Fri, 2018-11-09 at 19:00 -0500, Pavel Tatashin wrote: > > > On 18-11-09 15:14:35, Alexander Duyck wrote: > > > > On Fri, 2018-11-09 at 16:15 -0500, Pavel Tatashin wrote: > > > > > On 18-11-05 13:19:25, Alexander Duyck wrote: > > > > > > This patchset is essentially a refactor of the page initialization logic > > > > > > that is meant to provide for better code reuse while providing a > > > > > > significant improvement in deferred page initialization performance. > > > > > > > > > > > > In my testing on an x86_64 system with 384GB of RAM and 3TB of persistent > > > > > > memory per node I have seen the following. In the case of regular memory > > > > > > initialization the deferred init time was decreased from 3.75s to 1.06s on > > > > > > average. For the persistent memory the initialization time dropped from > > > > > > 24.17s to 19.12s on average. This amounts to a 253% improvement for the > > > > > > deferred memory initialization performance, and a 26% improvement in the > > > > > > persistent memory initialization performance. > > > > > > > > > > Hi Alex, > > > > > > > > > > Please try to run your persistent memory init experiment with Daniel's > > > > > patches: > > > > > > > > > > https://lore.kernel.org/lkml/20181105165558.11698-1-daniel.m.jordan@oracle.com/ > > > > > > > > I've taken a quick look at it. It seems like a bit of a brute force way > > > > to try and speed things up. I would be worried about it potentially > > > > > > There is a limit to max number of threads that ktasks start. The memory > > > throughput is *much* higher than what one CPU can maxout in a node, so > > > there is no reason to leave the other CPUs sit idle during boot when > > > they can help to initialize. > > > > Right, but right now that limit can still be pretty big when it is > > something like 25% of all the CPUs on a 288 CPU system. > > It is still OK. About 9 threads per node. > > That machine has 1T of memory, which means 8 nodes need to initialize 2G > of memory each. With 46G/s throughout it should take 0.043s Which is 10 > times higher than what Daniel sees with 0.325s, so there is still room > to saturate the memory throughput. > > Now, if the multi-threadding efficiency is good, it should take > 1.261s / 9 threads = 0.14s The two patch sets combined would hopefully do better than that, although I don't know what the clock speed is on the CPUs in use. The system I have been testing with has 1.5TB spread over 4 nodes. So we have effectively 3 times as much to initialize and with the "numa=fake=8U" I was seeing it only take .275s to initialize the nodes. > > > > One issue is the way the code was ends up essentially blowing out the > > cache over and over again. Doing things in two passes made it really > > expensive as you took one cache miss to initialize it, and another to > > free it. I think getting rid of that is one of the biggest gains with > > my patch set. > > I am not arguing that your patches make sense, all I am saying that > ktasks improve time order of magnitude better on machines with large > amount of memory. The point I was trying to make is that it doesn't. You say it is an order of magnitude better but it is essentially 3.5x vs 3.8x, and to achieve the 3.8x you are using a ton of system resources. My approach is meant to do more with less, while this approach will throw a quarter of the system at page initialization.
An added advantage to my approach is that it speeds things up regardless of the number of cores used, whereas the scaling approach requires that there be more cores available to use. So, for example, on some of the new AMD Zen stuff I am not sure the benefit would be all that great, since if I am not mistaken each tile is only 8 processors, so at most you are only doubling the processing power applied to the initialization. In such a case it is likely that my approach would fare much better than this approach since I don't require additional cores to achieve the same results. Anyway, there are tradeoffs we have to take into account. I will go over the changes you suggested after Plumbers. I just need to figure out if I am doing incremental changes, or if Andrew wants me to just resubmit the whole set. I can probably deal with these changes either way since most of them are pretty small. Thanks. - Alex
On 18-11-12 11:10:35, Alexander Duyck wrote: > > The point I was trying to make is that it doesn't. You say it is an > order of magnitude better but it is essentially 3.5x vs 3.8x and to > achieve the 3.8x you are using a ton of system resources. My approach > is meant to do more with less, while this approach will throw a > quarter of the system at page initialization. The 3.8x is a bug that is going to be fixed before ktasks are accepted. The final results will be close to time/nthreads. Using more resources to initialize pages is fine, because other CPUs are idling during this time in boot. Let's wait for what Daniel finds out after Linux Plumbers. And we can continue this discussion in the ktask thread. > > An added advantage to my approach is that it speeds up things > regardless of the number of cores used, whereas the scaling approach Yes, I agree, I like your approach. It is clean, simplifies the code, and improves performance. I have tested it on both ARM and x86, and verified the performance improvements. So: Tested-by: Pavel Tatashin <pasha.tatashin@soleen.com> > requires that there be more cores available to use. So for example on > some of the new AMD Zen stuff I am not sure the benefit would be all > that great since if I am not mistaken each tile is only 8 processors > so at most you are only doubling the processing power applied to the > initialization. In such a case it is likely that my approach would > fare much better then this approach since I don't require additional > cores to achieve the same results. > > Anyway there are tradeoffs we have to take into account. > > I will go over the changes you suggested after Plumbers. I just need > to figure out if I am doing incremental changes, or if Andrew wants me > to just resubmit the whole set. I can probably deal with these changes > either way since most of them are pretty small. Send the full series again; Andrew is very good at taking only the incremental changes once a new version is posted of something that is already in the mm tree. Thank you, Pasha
On Mon 05-11-18 13:19:25, Alexander Duyck wrote: > This patchset is essentially a refactor of the page initialization logic > that is meant to provide for better code reuse while providing a > significant improvement in deferred page initialization performance. > > In my testing on an x86_64 system with 384GB of RAM and 3TB of persistent > memory per node I have seen the following. In the case of regular memory > initialization the deferred init time was decreased from 3.75s to 1.06s on > average. For the persistent memory the initialization time dropped from > 24.17s to 19.12s on average. This amounts to a 253% improvement for the > deferred memory initialization performance, and a 26% improvement in the > persistent memory initialization performance. > > I have called out the improvement observed with each patch. I have only glanced through the code (there is a lot of code to look at here). And I do not like the code duplication and the way you make hotplug special. There shouldn't be any real reason for that IMHO (e.g. why do we init pfn-at-a-time in early init while we do pageblock-at-a-time for hotplug). I might be wrong here and the code reuse might be really hard to achieve though. I am also not impressed by the new iterators because this API is quite complex already. But this is mostly a detail. The thing I do not like is that you keep micro-optimizing the PageReserved part while there shouldn't be anything fundamental about it. We should just remove it rather than make the code more complex. I feel more and more guilty for adding it there, actually.
On 18-11-14 16:07:42, Michal Hocko wrote: > On Mon 05-11-18 13:19:25, Alexander Duyck wrote: > > This patchset is essentially a refactor of the page initialization logic > > that is meant to provide for better code reuse while providing a > > significant improvement in deferred page initialization performance. > > > > In my testing on an x86_64 system with 384GB of RAM and 3TB of persistent > > memory per node I have seen the following. In the case of regular memory > > initialization the deferred init time was decreased from 3.75s to 1.06s on > > average. For the persistent memory the initialization time dropped from > > 24.17s to 19.12s on average. This amounts to a 253% improvement for the > > deferred memory initialization performance, and a 26% improvement in the > > persistent memory initialization performance. > > > > I have called out the improvement observed with each patch. > > I have only glanced through the code (there is a lot of the code to look > at here). And I do not like the code duplication and the way how you > make the hotplug special. There shouldn't be any real reason for that > IMHO (e.g. why do we init pfn-at-a-time in early init while we do > pageblock-at-a-time for hotplug). I might be wrong here and the code > reuse might be really hard to achieve though. I do not like having __init_single_page() done differently for hotplug. I think, if that is fixed, there is almost no code duplication left; the rest looks alright to me. > > I am also not impressed by new iterators because this api is quite > complex already. But this is mostly a detail. I have reviewed all the patches in this series, and at first was also worried about the new iterators. But, after diving deeper, they actually make sense, and the new memblock iterators look alright to me. The only iterator that I would like to see improved is for_each_deferred_pfn_valid_range(), because it is very hard to understand. This series is an impressive performance improvement, and I have confirmed it on both arm and x86. It should also be compatible with ktasks. > > Thing I do not like is that you keep microptimizing PageReserved part > while there shouldn't be anything fundamental about it. We should just Agree. > remove it rather than make the code more complex. I fell more and more > guilty to add there actually.
On Wed 14-11-18 19:12:53, Pavel Tatashin wrote: > On 18-11-14 16:07:42, Michal Hocko wrote: > > On Mon 05-11-18 13:19:25, Alexander Duyck wrote: > > > This patchset is essentially a refactor of the page initialization logic > > > that is meant to provide for better code reuse while providing a > > > significant improvement in deferred page initialization performance. > > > > > > In my testing on an x86_64 system with 384GB of RAM and 3TB of persistent > > > memory per node I have seen the following. In the case of regular memory > > > initialization the deferred init time was decreased from 3.75s to 1.06s on > > > average. For the persistent memory the initialization time dropped from > > > 24.17s to 19.12s on average. This amounts to a 253% improvement for the > > > deferred memory initialization performance, and a 26% improvement in the > > > persistent memory initialization performance. > > > > > > I have called out the improvement observed with each patch. > > > > I have only glanced through the code (there is a lot of the code to look > > at here). And I do not like the code duplication and the way how you > > make the hotplug special. There shouldn't be any real reason for that > > IMHO (e.g. why do we init pfn-at-a-time in early init while we do > > pageblock-at-a-time for hotplug). I might be wrong here and the code > > reuse might be really hard to achieve though. > > I do not like having __init_single_page() to be done differently for > hotplug. I think, if that is fixed, there is almost no more code > duplication, the rest looks alright to me. There is still that zone device special casing which I have brought up previously, but I reckon this is out of the scope of this series. > > I am also not impressed by new iterators because this api is quite > > complex already. But this is mostly a detail. > > I have reviewed all the patches in this series, and at first was also > worried about the new iterators. But, after diving deeper, they actually > make sense, and new memblock iterators look alright to me. The only > iterator, that I would like to see improved is > for_each_deferred_pfn_valid_range(), because it is very hard to > understand it. > > This series is an impressive performance improvement, and I have > confirmed it on both arm, and x86. It is also should be compatible with > ktasks. I see the performance argument. I also do see the maintenance burden argument. The recent bootmem removal has shown how big and hard to follow the whole API is, and this should be considered. I am not saying the current series is a no-go though. Maybe the changelogs should be clearer in spelling out the advantages of doing a per-pageblock initialization that brings a lot of new code in. As I've said, I have basically glanced through the cumulative diff and changelogs to get an impression so it is not entirely clear to me. Especially when the per-block init does per-pfn init within the block anyway. That being said, I believe the changelogs should be much more specific and I hope to see this become much more modest in the added code. If that is not possible, then fair enough, but I would like to see it tried. And please let's not build on top of cargo cult; let's get rid of pointless stuff (e.g. PageReserved) rather than work around it.
On 11/14/2018 7:07 AM, Michal Hocko wrote: > On Mon 05-11-18 13:19:25, Alexander Duyck wrote: >> This patchset is essentially a refactor of the page initialization logic >> that is meant to provide for better code reuse while providing a >> significant improvement in deferred page initialization performance. >> >> In my testing on an x86_64 system with 384GB of RAM and 3TB of persistent >> memory per node I have seen the following. In the case of regular memory >> initialization the deferred init time was decreased from 3.75s to 1.06s on >> average. For the persistent memory the initialization time dropped from >> 24.17s to 19.12s on average. This amounts to a 253% improvement for the >> deferred memory initialization performance, and a 26% improvement in the >> persistent memory initialization performance. >> >> I have called out the improvement observed with each patch. > > I have only glanced through the code (there is a lot of the code to look > at here). And I do not like the code duplication and the way how you > make the hotplug special. There shouldn't be any real reason for that > IMHO (e.g. why do we init pfn-at-a-time in early init while we do > pageblock-at-a-time for hotplug). I might be wrong here and the code > reuse might be really hard to achieve though. Actually, it isn't so much that hotplug is special. The issue is more that the non-hotplug case is special in that you have to perform a number of extra checks for things that just aren't necessary for the hotplug case. If anything I would probably need a new iterator that would be able to take into account all the checks for the non-hotplug case and then provide ranges of PFNs to initialize. > I am also not impressed by new iterators because this api is quite > complex already. But this is mostly a detail. Yeah, the iterators were mostly an attempt at hiding some of the complexity. Being able to break a loop down to just an iterator providing the start of the range and the number of elements to initialize is pretty easy to visualize, or at least I thought so. > Thing I do not like is that you keep microptimizing PageReserved part > while there shouldn't be anything fundamental about it. We should just > remove it rather than make the code more complex. I fell more and more > guilty to add there actually. I plan to remove it, but don't think I can get to it in this patch set. I was planning to submit one more iteration of this patch set early next week, and then start focusing more on the removal of the PageReserved bit for hotplug. I figure it is probably going to be a full patch set unto itself, and as you pointed out at the start of this email there is already enough code to review without adding that.
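As a rough illustration of the iterator shape Alex describes (a toy model written for illustration, not the actual memblock iterator or its signature): the hole-skipping lives inside the helper, and the caller's loop only ever sees a start pfn and a count.

```c
#include <stdio.h>

/* Made-up "memory map": valid pfn ranges with holes between them. */
struct toy_range { unsigned long start, end; };

static const struct toy_range toy_map[] = {
	{ 0, 1024 }, { 4096, 8192 }, { 16384, 16896 },
};

/* All the hole-skipping lives here; the caller only ever sees a start pfn
 * and a count.  Returns 0 when the map is exhausted. */
static int toy_next_range(unsigned int *idx, unsigned long *start,
			  unsigned long *nr)
{
	if (*idx >= sizeof(toy_map) / sizeof(toy_map[0]))
		return 0;
	*start = toy_map[*idx].start;
	*nr = toy_map[*idx].end - toy_map[*idx].start;
	(*idx)++;
	return 1;
}

int main(void)
{
	unsigned int idx = 0;
	unsigned long start, nr;

	/* The caller's loop stays trivial: "init nr pages starting at start". */
	while (toy_next_range(&idx, &start, &nr))
		printf("init %lu pages starting at pfn %lu\n", nr, start);
	return 0;
}
```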
On Wed, Nov 14, 2018 at 04:50:23PM -0800, Alexander Duyck wrote: > > > On 11/14/2018 7:07 AM, Michal Hocko wrote: > >On Mon 05-11-18 13:19:25, Alexander Duyck wrote: > >>This patchset is essentially a refactor of the page initialization logic > >>that is meant to provide for better code reuse while providing a > >>significant improvement in deferred page initialization performance. > >> > >>In my testing on an x86_64 system with 384GB of RAM and 3TB of persistent > >>memory per node I have seen the following. In the case of regular memory > >>initialization the deferred init time was decreased from 3.75s to 1.06s on > >>average. For the persistent memory the initialization time dropped from > >>24.17s to 19.12s on average. This amounts to a 253% improvement for the > >>deferred memory initialization performance, and a 26% improvement in the > >>persistent memory initialization performance. > >> > >>I have called out the improvement observed with each patch. > > > >I have only glanced through the code (there is a lot of the code to look > >at here). And I do not like the code duplication and the way how you > >make the hotplug special. There shouldn't be any real reason for that > >IMHO (e.g. why do we init pfn-at-a-time in early init while we do > >pageblock-at-a-time for hotplug). I might be wrong here and the code > >reuse might be really hard to achieve though. > > Actually it isn't so much that hotplug is special. The issue is more that > the non-hotplug case is special in that you have to perform a number of > extra checks for things that just aren't necessary for the hotplug case. > > If anything I would probably need a new iterator that would be able to take > into account all the checks for the non-hotplug case and then provide ranges > of PFNs to initialize. > > >I am also not impressed by new iterators because this api is quite > >complex already. But this is mostly a detail. > > Yeah, the iterators were mostly an attempt at hiding some of the complexity. > Being able to break a loop down to just an iterator provding the start of > the range and the number of elements to initialize is pretty easy to > visualize, or at least I thought so. Just recently we had a discussion about overlapping for_each_mem_range() and for_each_mem_pfn_range(), but unfortunately it appears that no mailing list was cc'ed by the original patch author :( In short, there was a spelling fix in one of them and Michal pointed out that their functionality overlaps. I have no objection for for_each_free_mem_pfn_range_in_zone() and __next_mem_pfn_range_in_zone(), but probably we should consider unifying the older iterators before we introduce a new one? > >Thing I do not like is that you keep microptimizing PageReserved part > >while there shouldn't be anything fundamental about it. We should just > >remove it rather than make the code more complex. I fell more and more > >guilty to add there actually. > > I plan to remove it, but don't think I can get to it in this patch set. > > I was planning to submit one more iteration of this patch set early next > week, and then start focusing more on the removal of the PageReserved bit > for hotplug. I figure it is probably going to be a full patch set onto > itself and as you pointed out at the start of this email there is already > enough code to review without adding that. > >
On Wed 14-11-18 16:50:23, Alexander Duyck wrote: > > > On 11/14/2018 7:07 AM, Michal Hocko wrote: > > On Mon 05-11-18 13:19:25, Alexander Duyck wrote: > > > This patchset is essentially a refactor of the page initialization logic > > > that is meant to provide for better code reuse while providing a > > > significant improvement in deferred page initialization performance. > > > > > > In my testing on an x86_64 system with 384GB of RAM and 3TB of persistent > > > memory per node I have seen the following. In the case of regular memory > > > initialization the deferred init time was decreased from 3.75s to 1.06s on > > > average. For the persistent memory the initialization time dropped from > > > 24.17s to 19.12s on average. This amounts to a 253% improvement for the > > > deferred memory initialization performance, and a 26% improvement in the > > > persistent memory initialization performance. > > > > > > I have called out the improvement observed with each patch. > > > > I have only glanced through the code (there is a lot of the code to look > > at here). And I do not like the code duplication and the way how you > > make the hotplug special. There shouldn't be any real reason for that > > IMHO (e.g. why do we init pfn-at-a-time in early init while we do > > pageblock-at-a-time for hotplug). I might be wrong here and the code > > reuse might be really hard to achieve though. > > Actually it isn't so much that hotplug is special. The issue is more that > the non-hotplug case is special in that you have to perform a number of > extra checks for things that just aren't necessary for the hotplug case. Can we hide those behind a helper (potentially with a jump label if necessary) and still share a large part of the code? Also this code is quite old and maybe we are overzealous with the early checks. Do we really need them? Why should early boot memory be any different from hotplug? The only exception I can see should really be the deferred initialization check. > If anything I would probably need a new iterator that would be able to take > into account all the checks for the non-hotplug case and then provide ranges > of PFNs to initialize. > > > I am also not impressed by new iterators because this api is quite > > complex already. But this is mostly a detail. > > Yeah, the iterators were mostly an attempt at hiding some of the complexity. > Being able to break a loop down to just an iterator provding the start of > the range and the number of elements to initialize is pretty easy to > visualize, or at least I thought so. I am not against hiding the complexity. I am mostly concerned that we have too many of those iterators. Maybe we can reuse existing ones in some way. If that is not really possible or it would make an even bigger mess, then fair enough, go with new ones. > > Thing I do not like is that you keep microptimizing PageReserved part > > while there shouldn't be anything fundamental about it. We should just > > remove it rather than make the code more complex. I fell more and more > > guilty to add there actually. > > I plan to remove it, but don't think I can get to it in this patch set. What I am trying to argue is that we should simply drop the __SetPageReserved as an independent patch prior to this whole series. As I've mentioned earlier, I have added this just to be sure and part of that was that __add_section has set the reserved bit. This is no longer the case since d0dc12e86b31 ("mm/memory_hotplug: optimize memory hotplug").
Nobody should really depend on that because struct pages are in an undefined state after __add_pages and they should get fully initialized after move_pfn_range_to_zone. If you really insist on setting the reserved bit then it really has to happen much sooner than it is right now. So I do not really see any point in doing so. Sure, there are some pfn walkers that really need to do pfn_to_online_page because pfn_valid is not sufficient, but that is largely independent of any optimization work in this area. I am sorry if I haven't been clear about that before. Does it make more sense to you now? P.S. There is always that tempting thing to follow the existing code and tweak it for a new purpose. This approach, however, adds more and more complex code on top of something that might be wrong or stale already. I have seen that in MM code countless times and I have contributed to that myself. I am sorry to push back on this so hard but this code is a mess and any changes to make it more optimal should really make sure the foundations are solid first. Been there, done that; not huge fun, but that is the price for having a basically unmaintained piece of code that random use cases stop by and do what they need to without ever following up later.
On 11/15/2018 12:10 AM, Michal Hocko wrote: > On Wed 14-11-18 16:50:23, Alexander Duyck wrote: >> >> >> On 11/14/2018 7:07 AM, Michal Hocko wrote: >>> On Mon 05-11-18 13:19:25, Alexander Duyck wrote: >>>> This patchset is essentially a refactor of the page initialization logic >>>> that is meant to provide for better code reuse while providing a >>>> significant improvement in deferred page initialization performance. >>>> >>>> In my testing on an x86_64 system with 384GB of RAM and 3TB of persistent >>>> memory per node I have seen the following. In the case of regular memory >>>> initialization the deferred init time was decreased from 3.75s to 1.06s on >>>> average. For the persistent memory the initialization time dropped from >>>> 24.17s to 19.12s on average. This amounts to a 253% improvement for the >>>> deferred memory initialization performance, and a 26% improvement in the >>>> persistent memory initialization performance. >>>> >>>> I have called out the improvement observed with each patch. >>> >>> I have only glanced through the code (there is a lot of the code to look >>> at here). And I do not like the code duplication and the way how you >>> make the hotplug special. There shouldn't be any real reason for that >>> IMHO (e.g. why do we init pfn-at-a-time in early init while we do >>> pageblock-at-a-time for hotplug). I might be wrong here and the code >>> reuse might be really hard to achieve though. >> >> Actually it isn't so much that hotplug is special. The issue is more that >> the non-hotplug case is special in that you have to perform a number of >> extra checks for things that just aren't necessary for the hotplug case. > > Can we hide those behind a helper (potentially with a jump label if > necessary) and still share a large part of the code? Also this code is > quite old and maybe we are overzealous with the early checks. Do we > really need them. Why should be early boot memory any different from the > hotplug. The only exception I can see should really be deferred > initialization check. > >> If anything I would probably need a new iterator that would be able to take >> into account all the checks for the non-hotplug case and then provide ranges >> of PFNs to initialize. >> >>> I am also not impressed by new iterators because this api is quite >>> complex already. But this is mostly a detail. >> >> Yeah, the iterators were mostly an attempt at hiding some of the complexity. >> Being able to break a loop down to just an iterator provding the start of >> the range and the number of elements to initialize is pretty easy to >> visualize, or at least I thought so. > > I am not against hiding the complexity. I am mostly concerned that we > have too many of those iterators. Maybe we can reuse existing ones in > some way. If that is not really possible or it would make even more mess > then fair enough and go with new ones. > >>> Thing I do not like is that you keep microptimizing PageReserved part >>> while there shouldn't be anything fundamental about it. We should just >>> remove it rather than make the code more complex. I fell more and more >>> guilty to add there actually. >> >> I plan to remove it, but don't think I can get to it in this patch set. > > What I am trying to argue is that we should simply drop the > __SetPageReserved as an independent patch prior to this whole series. > As I've mentioned earlier, I have added this just to be sure and part of > that was that __add_section has set the reserved bit. 
This is no longer > the case since d0dc12e86b31 ("mm/memory_hotplug: optimize memory > hotplug"). > > Nobody should really depend on that because struct pages are in > undefined state after __add_pages and they should get fully initialized > after move_pfn_range_to_zone. > > If you really insist on setting the reserved bit then it really has to > happen much sooner than it is right now. So I do not really see any > point in doing so. Sure there are some pfn walkers that really need to > do pfn_to_online_page because pfn_valid is not sufficient but that is > largely independent on any optimization work in this area. > > I am sorry if I haven't been clear about that before. Does it make more > sense to you now? I get what you are saying, I just don't agree with the ordering. I have had these patches in the works for a couple of months now. You are essentially telling me to defer these in favor of taking care of the reserved bit first. The only spot where I think we disagree is that I would prefer to get these in first, and then focus on the reserved bit. I'm not saying we shouldn't do the work, but I would rather take care of the immediate issue, and then "clean house" as it were. I've done that sort of thing in the past where I start deferring patches and then by the end of things I have a 60 patch set I am trying to push because one fix gets ahead of another and another. My big concern is that dropping the reserved bit is going to be much more error prone than the work I have done in this patch set since it is much more likely that somebody somewhere has come to falsely rely on it being set. If you have any tests you usually run for memory hotplug it would be useful if you could point me in that direction. Then I can move forward with the upcoming patch set with a bit more confidence. > P.S. > There is always that tempting thing to follow the existing code and > tweak it for a new purpose. This approach, however, adds more and more > complex code on top of something that might be wrong or stale already. > I have seen that in MM code countless times and I have contributed to > that myself. I am sorry to push back on this so hard but this code is > a mess and any changes to make it more optimal should really make sure > the foundations are solid before. Been there done that, not a huge fun > but that is the price for having basically unmaintained piece of code > that random usecases stop by and do what they need without ever > following up later. That is what I am doing. That is why I found and dropped the check for the NUMA node in the deferred init. I am pushing back on code where it makes sense to do so and determining what actually is adding value. My concern is more that I am worried that trying to make things "perfect" might be getting in the way of "good". Kernel development has always been an incremental process. My preference would be to lock down what we have before I go diving in and cleaning up other bits of the memory init.
On Thu 15-11-18 08:02:33, Alexander Duyck wrote: > On 11/15/2018 12:10 AM, Michal Hocko wrote: > > On Wed 14-11-18 16:50:23, Alexander Duyck wrote: [...] > > > I plan to remove it, but don't think I can get to it in this patch set. > > > > What I am trying to argue is that we should simply drop the > > __SetPageReserved as an independent patch prior to this whole series. > > As I've mentioned earlier, I have added this just to be sure and part of > > that was that __add_section has set the reserved bit. This is no longer > > the case since d0dc12e86b31 ("mm/memory_hotplug: optimize memory > > hotplug"). > > > > Nobody should really depend on that because struct pages are in > > undefined state after __add_pages and they should get fully initialized > > after move_pfn_range_to_zone. > > > > If you really insist on setting the reserved bit then it really has to > > happen much sooner than it is right now. So I do not really see any > > point in doing so. Sure there are some pfn walkers that really need to > > do pfn_to_online_page because pfn_valid is not sufficient but that is > > largely independent on any optimization work in this area. > > > > I am sorry if I haven't been clear about that before. Does it make more > > sense to you now? > > I get what you are saying I just don't agree with the ordering. I have had > these patches in the works for a couple months now. You are essentially > telling me to defer these in favor of taking care of the reserved bit first. The general development approach is to fix correctness and/or clean up stale code first and optimize on top. I really do not see why this should be any different. Especially when the page reserved bit is a part of your optimization work. There is some review feedback to address from this version so you can add a patch to remove the reserved bit to the series - feel free to reuse my explanation as the justification. I really do not think you have to change other call sites because they would be broken in the same way as when the bit is set (late). > The only spot where I think we disagree is that I would prefer to get these > in first, and then focus on the reserved bit. I'm not saying we shouldn't do > the work, but I would rather take care of the immediate issue, and then > "clean house" as it were. I've done that sort of thing in the past where I > start deferring patches and then by the end of things I have a 60 patch set > I am trying to push because one fix gets ahead of another and another. This is nothing exceptional and it happened to me as well. > My big concern is that dropping the reserved bit is going to be much more > error prone than the work I have done in this patch set since it is much > more likely that somebody somewhere has made a false reliance on it being > set. If you have any tests you usually run for memory hotplug it would be > useful if you could point me in that direction. Then I can move forward with > the upcoming patch set with a bit more confidence. There is nothing except for exercising hotplug and doing random stuff to it. We simply do not know about all the pfn walkers so we have to count on a reasonable model to justify changes. I hope I have clarified why the reserved bit doesn't serve the purpose it was added for. I understand that you want to push your optimization work ASAP but I _do_ care about long-term maintainability much more than having a performance gain _right_ now. That being said, I am not nacking your series so if others really think that I am asking too much I can live with that.
I was overruled the last time when the first optimization pile went in. I still hope we can get the result cleaned up as promised, though.
On Wed, Nov 14, 2018 at 05:55:12PM -0800, Mike Rapoport wrote: > On Wed, Nov 14, 2018 at 04:50:23PM -0800, Alexander Duyck wrote: > > > > > > On 11/14/2018 7:07 AM, Michal Hocko wrote: > > >On Mon 05-11-18 13:19:25, Alexander Duyck wrote: > > >>This patchset is essentially a refactor of the page initialization logic > > >>that is meant to provide for better code reuse while providing a > > >>significant improvement in deferred page initialization performance. > > >> > > >>In my testing on an x86_64 system with 384GB of RAM and 3TB of persistent > > >>memory per node I have seen the following. In the case of regular memory > > >>initialization the deferred init time was decreased from 3.75s to 1.06s on > > >>average. For the persistent memory the initialization time dropped from > > >>24.17s to 19.12s on average. This amounts to a 253% improvement for the > > >>deferred memory initialization performance, and a 26% improvement in the > > >>persistent memory initialization performance. > > >> > > >>I have called out the improvement observed with each patch. > > > > > >I have only glanced through the code (there is a lot of the code to look > > >at here). And I do not like the code duplication and the way how you > > >make the hotplug special. There shouldn't be any real reason for that > > >IMHO (e.g. why do we init pfn-at-a-time in early init while we do > > >pageblock-at-a-time for hotplug). I might be wrong here and the code > > >reuse might be really hard to achieve though. > > > > Actually it isn't so much that hotplug is special. The issue is more that > > the non-hotplug case is special in that you have to perform a number of > > extra checks for things that just aren't necessary for the hotplug case. > > > > If anything I would probably need a new iterator that would be able to take > > into account all the checks for the non-hotplug case and then provide ranges > > of PFNs to initialize. > > > > >I am also not impressed by new iterators because this api is quite > > >complex already. But this is mostly a detail. > > > > Yeah, the iterators were mostly an attempt at hiding some of the complexity. > > Being able to break a loop down to just an iterator provding the start of > > the range and the number of elements to initialize is pretty easy to > > visualize, or at least I thought so. > > Just recently we had a discussion about overlapping for_each_mem_range() > and for_each_mem_pfn_range(), but unfortunately it appears that no mailing > list was cc'ed by the original patch author :( > In short, there was a spelling fix in one of them and Michal pointed out > that their functionality overlaps. > > I have no objection for for_each_free_mem_pfn_range_in_zone() and > __next_mem_pfn_range_in_zone(), but probably we should consider unifying > the older iterators before we introduce a new one? Another thing I realized only now is that for_each_free_mem_pfn_range_in_zone() can be used only relatively late in the memblock life-span because zones are initialized far later than setup_arch() in many cases. At the very least this should be documented. > > >Thing I do not like is that you keep microptimizing PageReserved part > > >while there shouldn't be anything fundamental about it. We should just > > >remove it rather than make the code more complex. I fell more and more > > >guilty to add there actually. > > > > I plan to remove it, but don't think I can get to it in this patch set. 
> > > > I was planning to submit one more iteration of this patch set early next > > week, and then start focusing more on the removal of the PageReserved bit > > for hotplug. I figure it is probably going to be a full patch set onto > > itself and as you pointed out at the start of this email there is already > > enough code to review without adding that. > > > > > > -- > Sincerely yours, > Mike.