Message ID | 154145268025.30046.11742652345962594283.stgit@ahduyck-desk1.jf.intel.com |
---|---|
Series | Deferred page init improvements |
On 18-11-05 13:19:25, Alexander Duyck wrote: > This patchset is essentially a refactor of the page initialization logic > that is meant to provide for better code reuse while providing a > significant improvement in deferred page initialization performance. > > In my testing on an x86_64 system with 384GB of RAM and 3TB of persistent > memory per node I have seen the following. In the case of regular memory > initialization the deferred init time was decreased from 3.75s to 1.06s on > average. For the persistent memory the initialization time dropped from > 24.17s to 19.12s on average. This amounts to a 253% improvement for the > deferred memory initialization performance, and a 26% improvement in the > persistent memory initialization performance. Hi Alex, Please try to run your persistent memory init experiment with Daniel's patches: https://lore.kernel.org/lkml/20181105165558.11698-1-daniel.m.jordan@oracle.com/ The performance should improve by much more than 26%. Overall, your work looks good, but it needs to be considered how easy it will be to merge with ktask. I will try to complete the review today. Thank you, Pasha
On Fri, 2018-11-09 at 16:15 -0500, Pavel Tatashin wrote: > On 18-11-05 13:19:25, Alexander Duyck wrote: > > This patchset is essentially a refactor of the page initialization logic > > that is meant to provide for better code reuse while providing a > > significant improvement in deferred page initialization performance. > > > > In my testing on an x86_64 system with 384GB of RAM and 3TB of persistent > > memory per node I have seen the following. In the case of regular memory > > initialization the deferred init time was decreased from 3.75s to 1.06s on > > average. For the persistent memory the initialization time dropped from > > 24.17s to 19.12s on average. This amounts to a 253% improvement for the > > deferred memory initialization performance, and a 26% improvement in the > > persistent memory initialization performance. > > Hi Alex, > > Please try to run your persistent memory init experiment with Daniel's > patches: > > https://lore.kernel.org/lkml/20181105165558.11698-1-daniel.m.jordan@oracle.com/ I've taken a quick look at it. It seems like a bit of a brute force way to try and speed things up. I would be worried about it potentially introducing performance issues if the number of CPUs thrown at it ends up exceeding the maximum throughput of the memory. The data provided with patch 11 seems to point to issues such as that. In the case of the E7-8895 example cited it is increasing the number of CPUs used for memory initialization from 8 to 72, a 9x increase in the number of CPUs, but it is yielding only a 3.88x speedup. > The performance should improve by much more than 26%. The 26% improvement, or speedup of 1.26x using the ktask approach, was for persistent memory, not deferred memory init. The ktask patch doesn't do anything for persistent memory since it takes the hot-plug path and isn't handled via the deferred memory init. I had increased deferred memory init to about 3.53x the original speed (3.75s to 1.06s) on the system which I was testing. I do agree the two patches should be able to synergistically boost each other though as this patch set was meant to make the init much more cache friendly so as a result it should scale better as you add additional cores. I know I had done some playing around with fake numa to split up a single node into 8 logical nodes and I had seen a similar speedup of about 3.85x with my test memory initializing in about 275ms. > Overall, your works looks good, but it needs to be considered how easy it will be > to merge with ktask. I will try to complete the review today. > > Thank you, > Pasha Looking over the patches they are still in the RFC stage and the data is in need of updates since it is referencing 4.15-rc kernels as its baseline. If anything I really think the ktask patch 11 would be easier to rebase around my patch set than the other way around. Also, this series is in Andrew's mmots as of a few days ago, so I think it will be in the next mmotm that comes out. The integration with the ktask code should be pretty straightforward. If anything I think my code would probably make it easier since it gets rid of the need to do all this in two passes. The only new limitation it would add is that you would probably want to split up the work along either max order or section-aligned boundaries. What it would essentially do is make things so that each of the ktask threads would probably look more like deferred_grow_zone, which after my patch set is actually a fairly simple function. Thanks. - Alex
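To make the section-aligned splitting Alex describes concrete, here is a minimal standalone C sketch (a toy model, not the ktask or kernel code; SECTION_PAGES and the chunking policy are assumptions made purely for illustration): each worker gets whole multiples of a section, so no two workers ever touch the same MAX_ORDER block.

```c
#include <stdio.h>

/* Assumed value: pages per memory section (128MB of 4KiB pages on x86_64). */
#define SECTION_PAGES	32768UL

/* Split [start_pfn, end_pfn) into per-thread chunks that are whole multiples
 * of a section, assuming the zone start itself is section aligned. */
static void split_on_sections(unsigned long start_pfn, unsigned long end_pfn,
			      unsigned int nr_threads)
{
	unsigned long sections = (end_pfn - start_pfn + SECTION_PAGES - 1) / SECTION_PAGES;
	unsigned long per_thread = (sections + nr_threads - 1) / nr_threads;
	unsigned int t;

	for (t = 0; t < nr_threads; t++) {
		unsigned long s = start_pfn + t * per_thread * SECTION_PAGES;
		unsigned long e = s + per_thread * SECTION_PAGES;

		if (s >= end_pfn)
			break;
		if (e > end_pfn)
			e = end_pfn;
		printf("worker %u: pfn [%lu, %lu)\n", t, s, e);
	}
}

int main(void)
{
	/* e.g. a 4GB zone (1048576 4KiB pages) split across 4 workers */
	split_on_sections(0, 1048576, 4);
	return 0;
}
```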
On 18-11-09 15:14:35, Alexander Duyck wrote: > On Fri, 2018-11-09 at 16:15 -0500, Pavel Tatashin wrote: > > On 18-11-05 13:19:25, Alexander Duyck wrote: > > > This patchset is essentially a refactor of the page initialization logic > > > that is meant to provide for better code reuse while providing a > > > significant improvement in deferred page initialization performance. > > > > > > In my testing on an x86_64 system with 384GB of RAM and 3TB of persistent > > > memory per node I have seen the following. In the case of regular memory > > > initialization the deferred init time was decreased from 3.75s to 1.06s on > > > average. For the persistent memory the initialization time dropped from > > > 24.17s to 19.12s on average. This amounts to a 253% improvement for the > > > deferred memory initialization performance, and a 26% improvement in the > > > persistent memory initialization performance. > > > > Hi Alex, > > > > Please try to run your persistent memory init experiment with Daniel's > > patches: > > > > https://lore.kernel.org/lkml/20181105165558.11698-1-daniel.m.jordan@oracle.com/ > > I've taken a quick look at it. It seems like a bit of a brute force way > to try and speed things up. I would be worried about it potentially There is a limit to the max number of threads that ktasks start. The memory throughput is *much* higher than what one CPU can max out in a node, so there is no reason to leave the other CPUs sitting idle during boot when they can help to initialize. > introducing performance issues if the number of CPUs thrown at it end > up exceeding the maximum throughput of the memory. > > The data provided with patch 11 seems to point to issues such as that. > In the case of the E7-8895 example cited it is increasing the numbers > of CPUs used from memory initialization from 8 to 72, a 9x increase in > the number of CPUs but it is yeilding only a 3.88x speedup. Yes, but in both cases we are far from maxing out the memory throughput. The 3.88x is indeed low, and I do not know what slows it down. Daniel, Could you please check why multi-threading efficiency is so low here? I bet there is some atomic operation that introduces contention within a node. It should be possible to resolve. > > > The performance should improve by much more than 26%. > > The 26% improvement, or speedup of 1.26x using the ktask approach, was > for persistent memory, not deferred memory init. The ktask patch > doesn't do anything for persistent memory since it is takes the hot- > plug path and isn't handled via the deferred memory init. Ah, I thought in your experiment persistent memory takes the deferred init path. So, what exactly in your patches makes this 1.26x speedup? > > I had increased deferred memory init to about 3.53x the original speed > (3.75s to 1.06s) on the system which I was testing. I do agree the two > patches should be able to synergistically boost each other though as > this patch set was meant to make the init much more cache friendly so > as a result it should scale better as you add additional cores. I know > I had done some playing around with fake numa to split up a single node > into 8 logical nodes and I had seen a similar speedup of about 3.85x > with my test memory initializing in about 275ms. > > > Overall, your works looks good, but it needs to be considered how easy it will be > > to merge with ktask. I will try to complete the review today.
> > > > Thank you, > > Pasha > > Looking over the patches they are still in the RFC stage and the data > is in need of updates since it is referencing 4.15-rc kernels as its > baseline. If anything I really think the ktask patch 11 would be easier > to rebase around my patch set then the other way around. Also, this > series is in Andrew's mmots as of a few days ago, so I think it will be > in the next mmotm that comes out. I do not disagree, I think these two patch series should complement each other. But, if your changes make it impossible for ktask, I would strongly argue against it, as the potential improvements with ktasks are much higher. But, so far I do not see anything, so I think they can work together. I am still reviewing your work. > > The integration with the ktask code should be pretty straight forward. > If anything I think my code would probably make it easier since it gets > rid of the need to do all this in two passes. The only new limitation > it would add is that you would probably want to split up the work along > either max order or section aligned boundaries. What it would Which is totally OK, it should make ktasks scale even better. > essentially do is make things so that each of the ktask threads would > probably look more like deferred_grow_zone which after my patch set is > actually a fairly simple function. > > Thanks. Thank you, Pasha
On Fri, 2018-11-09 at 19:00 -0500, Pavel Tatashin wrote: > On 18-11-09 15:14:35, Alexander Duyck wrote: > > On Fri, 2018-11-09 at 16:15 -0500, Pavel Tatashin wrote: > > > On 18-11-05 13:19:25, Alexander Duyck wrote: > > > > This patchset is essentially a refactor of the page initialization logic > > > > that is meant to provide for better code reuse while providing a > > > > significant improvement in deferred page initialization performance. > > > > > > > > In my testing on an x86_64 system with 384GB of RAM and 3TB of persistent > > > > memory per node I have seen the following. In the case of regular memory > > > > initialization the deferred init time was decreased from 3.75s to 1.06s on > > > > average. For the persistent memory the initialization time dropped from > > > > 24.17s to 19.12s on average. This amounts to a 253% improvement for the > > > > deferred memory initialization performance, and a 26% improvement in the > > > > persistent memory initialization performance. > > > > > > Hi Alex, > > > > > > Please try to run your persistent memory init experiment with Daniel's > > > patches: > > > > > > https://lore.kernel.org/lkml/20181105165558.11698-1-daniel.m.jordan@oracle.com/ > > > > I've taken a quick look at it. It seems like a bit of a brute force way > > to try and speed things up. I would be worried about it potentially > > There is a limit to max number of threads that ktasks start. The memory > throughput is *much* higher than what one CPU can maxout in a node, so > there is no reason to leave the other CPUs sit idle during boot when > they can help to initialize. Right, but right now that limit can still be pretty big when it is something like 25% of all the CPUs on a 288 CPU system. One issue is the way the code ends up essentially blowing out the cache over and over again. Doing things in two passes made it really expensive as you took one cache miss to initialize it, and another to free it. I think getting rid of that is one of the biggest gains with my patch set. > > introducing performance issues if the number of CPUs thrown at it end > > up exceeding the maximum throughput of the memory. > > > > The data provided with patch 11 seems to point to issues such as that. > > In the case of the E7-8895 example cited it is increasing the numbers > > of CPUs used from memory initialization from 8 to 72, a 9x increase in > > the number of CPUs but it is yeilding only a 3.88x speedup. > > Yes, but in both cases we are far from maxing out the memory throughput. > The 3.88x is indeed low, and I do not know what slows it down. > > Daniel, > > Could you please check why multi-threading efficiency is so low here? > > I bet, there is some atomic operation introduces a contention within a > node. It should be possible to resolve. Depending on what kernel this was based on it probably was something like SetPageReserved triggering pipeline stalls, or maybe it is freeing the pages themselves since I would imagine that has a pretty big overhead and may end up serializing some things. > > > > > The performance should improve by much more than 26%. > > > > The 26% improvement, or speedup of 1.26x using the ktask approach, was > > for persistent memory, not deferred memory init. The ktask patch > > doesn't do anything for persistent memory since it is takes the hot- > > plug path and isn't handled via the deferred memory init. > > Ah, I thought in your experiment persistent memory takes deferred init > path. So, what exactly in your patches make this 1.26x speedup?
Basically it is the folding of the reserved bit into the setting of the rest of the values written into the page flags. With that change we are able to tighten the memory initialization loop to the point where it almost looks like a memset case as it will put together all of the memory values outside of the loop and then just loop through assigning the values for each page. > > > > I had increased deferred memory init to about 3.53x the original speed > > (3.75s to 1.06s) on the system which I was testing. I do agree the two > > patches should be able to synergistically boost each other though as > > this patch set was meant to make the init much more cache friendly so > > as a result it should scale better as you add additional cores. I know > > I had done some playing around with fake numa to split up a single node > > into 8 logical nodes and I had seen a similar speedup of about 3.85x > > with my test memory initializing in about 275ms. It is also possible that the 3.8x is an upper limit for something on the system in terms of scaling. I just realized that the 3.85x I saw with an 8-way split over a 4-socket system isn't too different from the 3.88x that was recorded for a 9-way split over an 8-socket system. > > > Overall, your works looks good, but it needs to be considered how easy it will be > > > to merge with ktask. I will try to complete the review today. > > > > > > Thank you, > > > Pasha > > > > Looking over the patches they are still in the RFC stage and the data > > is in need of updates since it is referencing 4.15-rc kernels as its > > baseline. If anything I really think the ktask patch 11 would be easier > > to rebase around my patch set then the other way around. Also, this > > series is in Andrew's mmots as of a few days ago, so I think it will be > > in the next mmotm that comes out. > > I do not disagree, I think these two patch series should complement each > other. But, if your changes make it impossible for ktask, I would strongly argue > against it, as the potential improvements with ktasks are much higher. > But, so far I do not see anything, so I think they can work together. I > am still reviewing your work. I am assuming patch 11 from the ktask set is the only one that really touches the deferred memory init. I'm thinking that the actual patch should be smaller if it is rebased off of this patch set. I didn't see anything that should conflict with my patch set, and as I said the only concern would be making certain to split the zone correctly to prevent the free thread from hopping across the MAX_ORDER boundaries. > > > > The integration with the ktask code should be pretty straight forward. > > If anything I think my code would probably make it easier since it gets > > rid of the need to do all this in two passes. The only new limitation > > it would add is that you would probably want to split up the work along > > either max order or section aligned boundaries. What it would > > Which is totally OK, it should make ktasks scale even better. Right. > > essentially do is make things so that each of the ktask threads would > > probably look more like deferred_grow_zone which after my patch set is > > actually a fairly simple function. > > > > Thanks. > > Thank you, > Pasha
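A minimal standalone C sketch of the loop shape Alex describes above (a toy model only; the struct layout, flag bits and shifts are invented for illustration and are not the kernel's struct page): everything that is constant across the range, including the reserved bit, is folded into one template value outside the loop, so the body reduces to a single store per page.

```c
#include <stdio.h>
#include <stddef.h>
#include <stdint.h>

/* Toy stand-ins for struct page and its flag bits -- illustration only. */
struct toy_page {
	uint64_t flags;
	uint64_t pad[7];			/* model a 64-byte struct page */
};

#define TOY_PG_RESERVED		(1ULL << 0)	/* assumed bit position */
#define TOY_NODE_SHIFT		48
#define TOY_ZONE_SHIFT		56

/* Initialize nr_pages entries from one precomputed flags template. */
static void toy_init_range(struct toy_page *map, size_t nr_pages,
			   unsigned int zone, unsigned int node, int reserved)
{
	/* Everything identical across the range (zone, node, reserved bit)
	 * is folded into a single value outside the loop. */
	uint64_t tmpl = ((uint64_t)zone << TOY_ZONE_SHIFT) |
			((uint64_t)node << TOY_NODE_SHIFT) |
			(reserved ? TOY_PG_RESERVED : 0);
	size_t i;

	/* The hot loop is one store per page -- no per-page branches,
	 * which is what makes it behave almost like a memset. */
	for (i = 0; i < nr_pages; i++)
		map[i].flags = tmpl;
}

int main(void)
{
	static struct toy_page map[16];

	toy_init_range(map, 16, 2, 0, 1);
	printf("page 0 flags: %#llx\n", (unsigned long long)map[0].flags);
	return 0;
}
```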
On 18-11-09 16:46:02, Alexander Duyck wrote: > On Fri, 2018-11-09 at 19:00 -0500, Pavel Tatashin wrote: > > On 18-11-09 15:14:35, Alexander Duyck wrote: > > > On Fri, 2018-11-09 at 16:15 -0500, Pavel Tatashin wrote: > > > > On 18-11-05 13:19:25, Alexander Duyck wrote: > > > > > This patchset is essentially a refactor of the page initialization logic > > > > > that is meant to provide for better code reuse while providing a > > > > > significant improvement in deferred page initialization performance. > > > > > > > > > > In my testing on an x86_64 system with 384GB of RAM and 3TB of persistent > > > > > memory per node I have seen the following. In the case of regular memory > > > > > initialization the deferred init time was decreased from 3.75s to 1.06s on > > > > > average. For the persistent memory the initialization time dropped from > > > > > 24.17s to 19.12s on average. This amounts to a 253% improvement for the > > > > > deferred memory initialization performance, and a 26% improvement in the > > > > > persistent memory initialization performance. > > > > > > > > Hi Alex, > > > > > > > > Please try to run your persistent memory init experiment with Daniel's > > > > patches: > > > > > > > > https://lore.kernel.org/lkml/20181105165558.11698-1-daniel.m.jordan@oracle.com/ > > > > > > I've taken a quick look at it. It seems like a bit of a brute force way > > > to try and speed things up. I would be worried about it potentially > > > > There is a limit to max number of threads that ktasks start. The memory > > throughput is *much* higher than what one CPU can maxout in a node, so > > there is no reason to leave the other CPUs sit idle during boot when > > they can help to initialize. > > Right, but right now that limit can still be pretty big when it is > something like 25% of all the CPUs on a 288 CPU system. It is still OK. About 9 threads per node. That machine has 1T of memory, which means 8 nodes need to initialize 2G of memory each. With 46G/s throughput it should take 0.043s, which is roughly 10 times faster than the 0.325s Daniel sees, so there is still room to saturate the memory throughput. Now, if the multi-threading efficiency is good, it should take 1.261s / 9 threads = 0.14s. > > One issue is the way the code was ends up essentially blowing out the > cache over and over again. Doing things in two passes made it really > expensive as you took one cache miss to initialize it, and another to > free it. I think getting rid of that is one of the biggest gains with > my patch set. I am not arguing against your patches; all I am saying is that ktasks improve the init time by an order of magnitude on machines with large amounts of memory. Pasha
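For readers checking the numbers: assuming the usual 64-byte struct page describing each 4 KiB base page (an assumption added here, not stated in the mail), the estimate works out roughly as

$$
\frac{1\,\mathrm{TB}}{64} = 16\,\mathrm{GB}\ \text{of struct pages}, \qquad
\frac{16\,\mathrm{GB}}{8\ \text{nodes}} = 2\,\mathrm{GB}\ \text{per node}, \qquad
\frac{2\,\mathrm{GB}}{46\,\mathrm{GB/s}} \approx 0.043\,\mathrm{s},
$$

and with ideal 9-way scaling, $1.261\,\mathrm{s} / 9 \approx 0.14\,\mathrm{s}$.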
On Fri, Nov 09, 2018 at 07:00:06PM -0500, Pavel Tatashin wrote: > On 18-11-09 15:14:35, Alexander Duyck wrote: > > On Fri, 2018-11-09 at 16:15 -0500, Pavel Tatashin wrote: > > > On 18-11-05 13:19:25, Alexander Duyck wrote: > > > > This patchset is essentially a refactor of the page initialization logic > > > > that is meant to provide for better code reuse while providing a > > > > significant improvement in deferred page initialization performance. > > > > > > > > In my testing on an x86_64 system with 384GB of RAM and 3TB of persistent > > > > memory per node I have seen the following. In the case of regular memory > > > > initialization the deferred init time was decreased from 3.75s to 1.06s on > > > > average. For the persistent memory the initialization time dropped from > > > > 24.17s to 19.12s on average. This amounts to a 253% improvement for the > > > > deferred memory initialization performance, and a 26% improvement in the > > > > persistent memory initialization performance. > > > > > > Hi Alex, > > > > > > Please try to run your persistent memory init experiment with Daniel's > > > patches: > > > > > > https://lore.kernel.org/lkml/20181105165558.11698-1-daniel.m.jordan@oracle.com/ > > > > I've taken a quick look at it. It seems like a bit of a brute force way > > to try and speed things up. I would be worried about it potentially > > There is a limit to max number of threads that ktasks start. The memory > throughput is *much* higher than what one CPU can maxout in a node, so > there is no reason to leave the other CPUs sit idle during boot when > they can help to initialize. > > > introducing performance issues if the number of CPUs thrown at it end > > up exceeding the maximum throughput of the memory. > > > > The data provided with patch 11 seems to point to issues such as that. > > In the case of the E7-8895 example cited it is increasing the numbers > > of CPUs used from memory initialization from 8 to 72, a 9x increase in > > the number of CPUs but it is yeilding only a 3.88x speedup. > > Yes, but in both cases we are far from maxing out the memory throughput. > The 3.88x is indeed low, and I do not know what slows it down. > > Daniel, > > Could you please check why multi-threading efficiency is so low here? I'll hop on the machine after Plumbers. > I bet, there is some atomic operation introduces a contention within a > node. It should be possible to resolve. We'll see, in any case I'm curious to see what the multithreading does with Alex's patches, especially since we won't do two passes through the memory anymore. Not seeing anything in Alex's work right off that would preclude multithreading.
On Fri, Nov 9, 2018 at 5:17 PM Pavel Tatashin <pasha.tatashin@soleen.com> wrote: > > On 18-11-09 16:46:02, Alexander Duyck wrote: > > On Fri, 2018-11-09 at 19:00 -0500, Pavel Tatashin wrote: > > > On 18-11-09 15:14:35, Alexander Duyck wrote: > > > > On Fri, 2018-11-09 at 16:15 -0500, Pavel Tatashin wrote: > > > > > On 18-11-05 13:19:25, Alexander Duyck wrote: > > > > > > This patchset is essentially a refactor of the page initialization logic > > > > > > that is meant to provide for better code reuse while providing a > > > > > > significant improvement in deferred page initialization performance. > > > > > > > > > > > > In my testing on an x86_64 system with 384GB of RAM and 3TB of persistent > > > > > > memory per node I have seen the following. In the case of regular memory > > > > > > initialization the deferred init time was decreased from 3.75s to 1.06s on > > > > > > average. For the persistent memory the initialization time dropped from > > > > > > 24.17s to 19.12s on average. This amounts to a 253% improvement for the > > > > > > deferred memory initialization performance, and a 26% improvement in the > > > > > > persistent memory initialization performance. > > > > > > > > > > Hi Alex, > > > > > > > > > > Please try to run your persistent memory init experiment with Daniel's > > > > > patches: > > > > > > > > > > https://lore.kernel.org/lkml/20181105165558.11698-1-daniel.m.jordan@oracle.com/ > > > > > > > > I've taken a quick look at it. It seems like a bit of a brute force way > > > > to try and speed things up. I would be worried about it potentially > > > > > > There is a limit to max number of threads that ktasks start. The memory > > > throughput is *much* higher than what one CPU can maxout in a node, so > > > there is no reason to leave the other CPUs sit idle during boot when > > > they can help to initialize. > > > > Right, but right now that limit can still be pretty big when it is > > something like 25% of all the CPUs on a 288 CPU system. > > It is still OK. About 9 threads per node. > > That machine has 1T of memory, which means 8 nodes need to initialize 2G > of memory each. With 46G/s throughout it should take 0.043s Which is 10 > times higher than what Daniel sees with 0.325s, so there is still room > to saturate the memory throughput. > > Now, if the multi-threadding efficiency is good, it should take > 1.261s / 9 threads = 0.14s The two patch sets combined would hopefully do better than that, although I don't know what the clock speed is on the CPUs in use. The system I have been testing with has 1.5TB spread over 4 nodes. So we have effectively 3 times as much to initialize and with the "numa=fake=8U" I was seeing it only take .275s to initialize the nodes. > > > > One issue is the way the code was ends up essentially blowing out the > > cache over and over again. Doing things in two passes made it really > > expensive as you took one cache miss to initialize it, and another to > > free it. I think getting rid of that is one of the biggest gains with > > my patch set. > > I am not arguing that your patches make sense, all I am saying that > ktasks improve time order of magnitude better on machines with large > amount of memory. The point I was trying to make is that it doesn't. You say it is an order of magnitude better but it is essentially 3.5x vs 3.8x, and to achieve the 3.8x you are using a ton of system resources. My approach is meant to do more with less, while this approach will throw a quarter of the system at page initialization.
An added advantage to my approach is that it speeds things up regardless of the number of cores used, whereas the scaling approach requires that there be more cores available to use. So, for example, on some of the new AMD Zen stuff I am not sure the benefit would be all that great, since if I am not mistaken each tile is only 8 processors, so at most you are only doubling the processing power applied to the initialization. In such a case it is likely that my approach would fare much better than this approach since I don't require additional cores to achieve the same results. Anyway, there are tradeoffs we have to take into account. I will go over the changes you suggested after Plumbers. I just need to figure out if I am doing incremental changes, or if Andrew wants me to just resubmit the whole set. I can probably deal with these changes either way since most of them are pretty small. Thanks. - Alex
On 18-11-12 11:10:35, Alexander Duyck wrote: > > The point I was trying to make is that it doesn't. You say it is an > order of magnitude better but it is essentially 3.5x vs 3.8x and to > achieve the 3.8x you are using a ton of system resources. My approach > is meant to do more with less, while this approach will throw a > quarter of the system at page initialization. The 3.8x is a bug that is going to be fixed before ktasks are accepted. The final results will be close to time/nthreads. Using more resources to initialize pages is fine, because other CPUs are idling during this time in boot. Let's wait for what Daniel finds out after Linux Plumbers. And we can continue this discussion in the ktask thread. > > An added advantage to my approach is that it speeds up things > regardless of the number of cores used, whereas the scaling approach Yes, I agree, I like your approach. It is clean, simplifies the code, and improves performance. I have tested it on both ARM and x86, and verified the performance improvements. So: Tested-by: Pavel Tatashin <pasha.tatashin@soleen.com> > requires that there be more cores available to use. So for example on > some of the new AMD Zen stuff I am not sure the benefit would be all > that great since if I am not mistaken each tile is only 8 processors > so at most you are only doubling the processing power applied to the > initialization. In such a case it is likely that my approach would > fare much better then this approach since I don't require additional > cores to achieve the same results. > > Anyway there are tradeoffs we have to take into account. > > I will go over the changes you suggested after Plumbers. I just need > to figure out if I am doing incremental changes, or if Andrew wants me > to just resubmit the whole set. I can probably deal with these changes > either way since most of them are pretty small. Send the full series again; Andrew is very good at taking only the incremental changes once a new version is posted of something that is already in the mm tree. Thank you, Pasha
On Mon 05-11-18 13:19:25, Alexander Duyck wrote: > This patchset is essentially a refactor of the page initialization logic > that is meant to provide for better code reuse while providing a > significant improvement in deferred page initialization performance. > > In my testing on an x86_64 system with 384GB of RAM and 3TB of persistent > memory per node I have seen the following. In the case of regular memory > initialization the deferred init time was decreased from 3.75s to 1.06s on > average. For the persistent memory the initialization time dropped from > 24.17s to 19.12s on average. This amounts to a 253% improvement for the > deferred memory initialization performance, and a 26% improvement in the > persistent memory initialization performance. > > I have called out the improvement observed with each patch. I have only glanced through the code (there is a lot of code to look at here). And I do not like the code duplication and the way you make hotplug special. There shouldn't be any real reason for that IMHO (e.g. why do we init pfn-at-a-time in early init while we do pageblock-at-a-time for hotplug). I might be wrong here and the code reuse might be really hard to achieve though. I am also not impressed by the new iterators because this API is quite complex already. But this is mostly a detail. The thing I do not like is that you keep micro-optimizing the PageReserved part while there shouldn't be anything fundamental about it. We should just remove it rather than make the code more complex. I feel more and more guilty for adding it there, actually.
On 18-11-14 16:07:42, Michal Hocko wrote: > On Mon 05-11-18 13:19:25, Alexander Duyck wrote: > > This patchset is essentially a refactor of the page initialization logic > > that is meant to provide for better code reuse while providing a > > significant improvement in deferred page initialization performance. > > > > In my testing on an x86_64 system with 384GB of RAM and 3TB of persistent > > memory per node I have seen the following. In the case of regular memory > > initialization the deferred init time was decreased from 3.75s to 1.06s on > > average. For the persistent memory the initialization time dropped from > > 24.17s to 19.12s on average. This amounts to a 253% improvement for the > > deferred memory initialization performance, and a 26% improvement in the > > persistent memory initialization performance. > > > > I have called out the improvement observed with each patch. > > I have only glanced through the code (there is a lot of the code to look > at here). And I do not like the code duplication and the way how you > make the hotplug special. There shouldn't be any real reason for that > IMHO (e.g. why do we init pfn-at-a-time in early init while we do > pageblock-at-a-time for hotplug). I might be wrong here and the code > reuse might be really hard to achieve though. I do not like having __init_single_page() done differently for hotplug. I think, if that is fixed, there is almost no code duplication left; the rest looks alright to me. > > I am also not impressed by new iterators because this api is quite > complex already. But this is mostly a detail. I have reviewed all the patches in this series, and at first was also worried about the new iterators. But, after diving deeper, they actually make sense, and the new memblock iterators look alright to me. The only iterator that I would like to see improved is for_each_deferred_pfn_valid_range(), because it is very hard to understand. This series is an impressive performance improvement, and I have confirmed it on both arm and x86. It should also be compatible with ktasks. > > Thing I do not like is that you keep microptimizing PageReserved part > while there shouldn't be anything fundamental about it. We should just Agree. > remove it rather than make the code more complex. I fell more and more > guilty to add there actually.
On Wed 14-11-18 19:12:53, Pavel Tatashin wrote: > On 18-11-14 16:07:42, Michal Hocko wrote: > > On Mon 05-11-18 13:19:25, Alexander Duyck wrote: > > > This patchset is essentially a refactor of the page initialization logic > > > that is meant to provide for better code reuse while providing a > > > significant improvement in deferred page initialization performance. > > > > > > In my testing on an x86_64 system with 384GB of RAM and 3TB of persistent > > > memory per node I have seen the following. In the case of regular memory > > > initialization the deferred init time was decreased from 3.75s to 1.06s on > > > average. For the persistent memory the initialization time dropped from > > > 24.17s to 19.12s on average. This amounts to a 253% improvement for the > > > deferred memory initialization performance, and a 26% improvement in the > > > persistent memory initialization performance. > > > > > > I have called out the improvement observed with each patch. > > > > I have only glanced through the code (there is a lot of the code to look > > at here). And I do not like the code duplication and the way how you > > make the hotplug special. There shouldn't be any real reason for that > > IMHO (e.g. why do we init pfn-at-a-time in early init while we do > > pageblock-at-a-time for hotplug). I might be wrong here and the code > > reuse might be really hard to achieve though. > > I do not like having __init_single_page() to be done differently for > hotplug. I think, if that is fixed, there is almost no more code > duplication, the rest looks alright to me. There is still that zone device special casing which I have brought up previously, but I reckon this is out of the scope of this series. > > I am also not impressed by new iterators because this api is quite > > complex already. But this is mostly a detail. > > I have reviewed all the patches in this series, and at first was also > worried about the new iterators. But, after diving deeper, they actually > make sense, and new memblock iterators look alright to me. The only > iterator, that I would like to see improved is > for_each_deferred_pfn_valid_range(), because it is very hard to > understand it. > > This series is an impressive performance improvement, and I have > confirmed it on both arm, and x86. It is also should be compatible with > ktasks. I see the performance argument. I also do see the maintenance burden argument. The recent bootmem removal has shown how big and hard to follow the whole API is, and this should be considered. I am not saying the current series is a no-go though. Maybe the changelogs should be clearer in spelling out the advantages of doing a per-pageblock initialization that brings a lot of new code in. As I've said, I have basically glanced through the cumulative diff and changelogs to get an impression so it is not entirely clear to me. Especially when the per-block init does per-pfn init within the block anyway. That being said, I believe the changelogs should be much more specific and I hope to see this become much more modest in the added code. If that is not possible, then fair enough, but I would like to see it tried. And please let's not build on top of cargo cult; let's get rid of pointless stuff (e.g. PageReserved) rather than work around it.
On 11/14/2018 7:07 AM, Michal Hocko wrote: > On Mon 05-11-18 13:19:25, Alexander Duyck wrote: >> This patchset is essentially a refactor of the page initialization logic >> that is meant to provide for better code reuse while providing a >> significant improvement in deferred page initialization performance. >> >> In my testing on an x86_64 system with 384GB of RAM and 3TB of persistent >> memory per node I have seen the following. In the case of regular memory >> initialization the deferred init time was decreased from 3.75s to 1.06s on >> average. For the persistent memory the initialization time dropped from >> 24.17s to 19.12s on average. This amounts to a 253% improvement for the >> deferred memory initialization performance, and a 26% improvement in the >> persistent memory initialization performance. >> >> I have called out the improvement observed with each patch. > > I have only glanced through the code (there is a lot of the code to look > at here). And I do not like the code duplication and the way how you > make the hotplug special. There shouldn't be any real reason for that > IMHO (e.g. why do we init pfn-at-a-time in early init while we do > pageblock-at-a-time for hotplug). I might be wrong here and the code > reuse might be really hard to achieve though. Actually, it isn't so much that hotplug is special. The issue is more that the non-hotplug case is special in that you have to perform a number of extra checks for things that just aren't necessary for the hotplug case. If anything I would probably need a new iterator that would be able to take into account all the checks for the non-hotplug case and then provide ranges of PFNs to initialize. > I am also not impressed by new iterators because this api is quite > complex already. But this is mostly a detail. Yeah, the iterators were mostly an attempt at hiding some of the complexity. Being able to break a loop down to just an iterator providing the start of the range and the number of elements to initialize is pretty easy to visualize, or at least I thought so. > Thing I do not like is that you keep microptimizing PageReserved part > while there shouldn't be anything fundamental about it. We should just > remove it rather than make the code more complex. I fell more and more > guilty to add there actually. I plan to remove it, but don't think I can get to it in this patch set. I was planning to submit one more iteration of this patch set early next week, and then start focusing more on the removal of the PageReserved bit for hotplug. I figure it is probably going to be a full patch set unto itself, and as you pointed out at the start of this email there is already enough code to review without adding that.
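As a rough illustration of the iterator shape Alex describes (a toy model written for illustration, not the actual memblock iterator or its signature): the hole-skipping lives inside the helper, and the caller's loop only ever sees a start pfn and a count.

```c
#include <stdio.h>

/* Made-up "memory map": valid pfn ranges with holes between them. */
struct toy_range { unsigned long start, end; };

static const struct toy_range toy_map[] = {
	{ 0, 1024 }, { 4096, 8192 }, { 16384, 16896 },
};

/* All the hole-skipping lives here; the caller only ever sees a start pfn
 * and a count.  Returns 0 when the map is exhausted. */
static int toy_next_range(unsigned int *idx, unsigned long *start,
			  unsigned long *nr)
{
	if (*idx >= sizeof(toy_map) / sizeof(toy_map[0]))
		return 0;
	*start = toy_map[*idx].start;
	*nr = toy_map[*idx].end - toy_map[*idx].start;
	(*idx)++;
	return 1;
}

int main(void)
{
	unsigned int idx = 0;
	unsigned long start, nr;

	/* The caller's loop stays trivial: "init nr pages starting at start". */
	while (toy_next_range(&idx, &start, &nr))
		printf("init %lu pages starting at pfn %lu\n", nr, start);
	return 0;
}
```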
On Wed, Nov 14, 2018 at 04:50:23PM -0800, Alexander Duyck wrote: > > > On 11/14/2018 7:07 AM, Michal Hocko wrote: > >On Mon 05-11-18 13:19:25, Alexander Duyck wrote: > >>This patchset is essentially a refactor of the page initialization logic > >>that is meant to provide for better code reuse while providing a > >>significant improvement in deferred page initialization performance. > >> > >>In my testing on an x86_64 system with 384GB of RAM and 3TB of persistent > >>memory per node I have seen the following. In the case of regular memory > >>initialization the deferred init time was decreased from 3.75s to 1.06s on > >>average. For the persistent memory the initialization time dropped from > >>24.17s to 19.12s on average. This amounts to a 253% improvement for the > >>deferred memory initialization performance, and a 26% improvement in the > >>persistent memory initialization performance. > >> > >>I have called out the improvement observed with each patch. > > > >I have only glanced through the code (there is a lot of the code to look > >at here). And I do not like the code duplication and the way how you > >make the hotplug special. There shouldn't be any real reason for that > >IMHO (e.g. why do we init pfn-at-a-time in early init while we do > >pageblock-at-a-time for hotplug). I might be wrong here and the code > >reuse might be really hard to achieve though. > > Actually it isn't so much that hotplug is special. The issue is more that > the non-hotplug case is special in that you have to perform a number of > extra checks for things that just aren't necessary for the hotplug case. > > If anything I would probably need a new iterator that would be able to take > into account all the checks for the non-hotplug case and then provide ranges > of PFNs to initialize. > > >I am also not impressed by new iterators because this api is quite > >complex already. But this is mostly a detail. > > Yeah, the iterators were mostly an attempt at hiding some of the complexity. > Being able to break a loop down to just an iterator provding the start of > the range and the number of elements to initialize is pretty easy to > visualize, or at least I thought so. Just recently we had a discussion about overlapping for_each_mem_range() and for_each_mem_pfn_range(), but unfortunately it appears that no mailing list was cc'ed by the original patch author :( In short, there was a spelling fix in one of them and Michal pointed out that their functionality overlaps. I have no objection for for_each_free_mem_pfn_range_in_zone() and __next_mem_pfn_range_in_zone(), but probably we should consider unifying the older iterators before we introduce a new one? > >Thing I do not like is that you keep microptimizing PageReserved part > >while there shouldn't be anything fundamental about it. We should just > >remove it rather than make the code more complex. I fell more and more > >guilty to add there actually. > > I plan to remove it, but don't think I can get to it in this patch set. > > I was planning to submit one more iteration of this patch set early next > week, and then start focusing more on the removal of the PageReserved bit > for hotplug. I figure it is probably going to be a full patch set onto > itself and as you pointed out at the start of this email there is already > enough code to review without adding that. > >
On Wed 14-11-18 16:50:23, Alexander Duyck wrote: > > > On 11/14/2018 7:07 AM, Michal Hocko wrote: > > On Mon 05-11-18 13:19:25, Alexander Duyck wrote: > > > This patchset is essentially a refactor of the page initialization logic > > > that is meant to provide for better code reuse while providing a > > > significant improvement in deferred page initialization performance. > > > > > > In my testing on an x86_64 system with 384GB of RAM and 3TB of persistent > > > memory per node I have seen the following. In the case of regular memory > > > initialization the deferred init time was decreased from 3.75s to 1.06s on > > > average. For the persistent memory the initialization time dropped from > > > 24.17s to 19.12s on average. This amounts to a 253% improvement for the > > > deferred memory initialization performance, and a 26% improvement in the > > > persistent memory initialization performance. > > > > > > I have called out the improvement observed with each patch. > > > > I have only glanced through the code (there is a lot of the code to look > > at here). And I do not like the code duplication and the way how you > > make the hotplug special. There shouldn't be any real reason for that > > IMHO (e.g. why do we init pfn-at-a-time in early init while we do > > pageblock-at-a-time for hotplug). I might be wrong here and the code > > reuse might be really hard to achieve though. > > Actually it isn't so much that hotplug is special. The issue is more that > the non-hotplug case is special in that you have to perform a number of > extra checks for things that just aren't necessary for the hotplug case. Can we hide those behind a helper (potentially with a jump label if necessary) and still share a large part of the code? Also this code is quite old and maybe we are overzealous with the early checks. Do we really need them? Why should early boot memory be any different from hotplug? The only exception I can see should really be the deferred initialization check. > If anything I would probably need a new iterator that would be able to take > into account all the checks for the non-hotplug case and then provide ranges > of PFNs to initialize. > > > I am also not impressed by new iterators because this api is quite > > complex already. But this is mostly a detail. > > Yeah, the iterators were mostly an attempt at hiding some of the complexity. > Being able to break a loop down to just an iterator provding the start of > the range and the number of elements to initialize is pretty easy to > visualize, or at least I thought so. I am not against hiding the complexity. I am mostly concerned that we have too many of those iterators. Maybe we can reuse existing ones in some way. If that is not really possible or it would make an even bigger mess, then fair enough, go with new ones. > > Thing I do not like is that you keep microptimizing PageReserved part > > while there shouldn't be anything fundamental about it. We should just > > remove it rather than make the code more complex. I fell more and more > > guilty to add there actually. > > I plan to remove it, but don't think I can get to it in this patch set. What I am trying to argue is that we should simply drop the __SetPageReserved as an independent patch prior to this whole series. As I've mentioned earlier, I have added this just to be sure and part of that was that __add_section has set the reserved bit. This is no longer the case since d0dc12e86b31 ("mm/memory_hotplug: optimize memory hotplug").
Nobody should really depend on that because struct pages are in an undefined state after __add_pages and they should get fully initialized after move_pfn_range_to_zone. If you really insist on setting the reserved bit then it really has to happen much sooner than it is right now. So I do not really see any point in doing so. Sure, there are some pfn walkers that really need to do pfn_to_online_page because pfn_valid is not sufficient, but that is largely independent of any optimization work in this area. I am sorry if I haven't been clear about that before. Does it make more sense to you now? P.S. There is always that tempting thing to follow the existing code and tweak it for a new purpose. This approach, however, adds more and more complex code on top of something that might be wrong or stale already. I have seen that in MM code countless times and I have contributed to that myself. I am sorry to push back on this so hard but this code is a mess and any changes to make it more optimal should really make sure the foundations are solid first. Been there, done that; not huge fun, but that is the price for having a basically unmaintained piece of code that random use cases stop by and do what they need to without ever following up later.
On 11/15/2018 12:10 AM, Michal Hocko wrote: > On Wed 14-11-18 16:50:23, Alexander Duyck wrote: >> >> >> On 11/14/2018 7:07 AM, Michal Hocko wrote: >>> On Mon 05-11-18 13:19:25, Alexander Duyck wrote: >>>> This patchset is essentially a refactor of the page initialization logic >>>> that is meant to provide for better code reuse while providing a >>>> significant improvement in deferred page initialization performance. >>>> >>>> In my testing on an x86_64 system with 384GB of RAM and 3TB of persistent >>>> memory per node I have seen the following. In the case of regular memory >>>> initialization the deferred init time was decreased from 3.75s to 1.06s on >>>> average. For the persistent memory the initialization time dropped from >>>> 24.17s to 19.12s on average. This amounts to a 253% improvement for the >>>> deferred memory initialization performance, and a 26% improvement in the >>>> persistent memory initialization performance. >>>> >>>> I have called out the improvement observed with each patch. >>> >>> I have only glanced through the code (there is a lot of the code to look >>> at here). And I do not like the code duplication and the way how you >>> make the hotplug special. There shouldn't be any real reason for that >>> IMHO (e.g. why do we init pfn-at-a-time in early init while we do >>> pageblock-at-a-time for hotplug). I might be wrong here and the code >>> reuse might be really hard to achieve though. >> >> Actually it isn't so much that hotplug is special. The issue is more that >> the non-hotplug case is special in that you have to perform a number of >> extra checks for things that just aren't necessary for the hotplug case. > > Can we hide those behind a helper (potentially with a jump label if > necessary) and still share a large part of the code? Also this code is > quite old and maybe we are overzealous with the early checks. Do we > really need them. Why should be early boot memory any different from the > hotplug. The only exception I can see should really be deferred > initialization check. > >> If anything I would probably need a new iterator that would be able to take >> into account all the checks for the non-hotplug case and then provide ranges >> of PFNs to initialize. >> >>> I am also not impressed by new iterators because this api is quite >>> complex already. But this is mostly a detail. >> >> Yeah, the iterators were mostly an attempt at hiding some of the complexity. >> Being able to break a loop down to just an iterator provding the start of >> the range and the number of elements to initialize is pretty easy to >> visualize, or at least I thought so. > > I am not against hiding the complexity. I am mostly concerned that we > have too many of those iterators. Maybe we can reuse existing ones in > some way. If that is not really possible or it would make even more mess > then fair enough and go with new ones. > >>> Thing I do not like is that you keep microptimizing PageReserved part >>> while there shouldn't be anything fundamental about it. We should just >>> remove it rather than make the code more complex. I fell more and more >>> guilty to add there actually. >> >> I plan to remove it, but don't think I can get to it in this patch set. > > What I am trying to argue is that we should simply drop the > __SetPageReserved as an independent patch prior to this whole series. > As I've mentioned earlier, I have added this just to be sure and part of > that was that __add_section has set the reserved bit. 
This is no longer > the case since d0dc12e86b31 ("mm/memory_hotplug: optimize memory > hotplug"). > > Nobody should really depend on that because struct pages are in > undefined state after __add_pages and they should get fully initialized > after move_pfn_range_to_zone. > > If you really insist on setting the reserved bit then it really has to > happen much sooner than it is right now. So I do not really see any > point in doing so. Sure there are some pfn walkers that really need to > do pfn_to_online_page because pfn_valid is not sufficient but that is > largely independent on any optimization work in this area. > > I am sorry if I haven't been clear about that before. Does it make more > sense to you now? I get what you are saying, I just don't agree with the ordering. I have had these patches in the works for a couple of months now. You are essentially telling me to defer these in favor of taking care of the reserved bit first. The only spot where I think we disagree is that I would prefer to get these in first, and then focus on the reserved bit. I'm not saying we shouldn't do the work, but I would rather take care of the immediate issue, and then "clean house" as it were. I've done that sort of thing in the past where I start deferring patches and then by the end of things I have a 60 patch set I am trying to push because one fix gets ahead of another and another. My big concern is that dropping the reserved bit is going to be much more error prone than the work I have done in this patch set since it is much more likely that somebody somewhere has come to falsely rely on it being set. If you have any tests you usually run for memory hotplug it would be useful if you could point me in that direction. Then I can move forward with the upcoming patch set with a bit more confidence. > P.S. > There is always that tempting thing to follow the existing code and > tweak it for a new purpose. This approach, however, adds more and more > complex code on top of something that might be wrong or stale already. > I have seen that in MM code countless times and I have contributed to > that myself. I am sorry to push back on this so hard but this code is > a mess and any changes to make it more optimal should really make sure > the foundations are solid before. Been there done that, not a huge fun > but that is the price for having basically unmaintained piece of code > that random usecases stop by and do what they need without ever > following up later. That is what I am doing. That is why I found and dropped the check for the NUMA node in the deferred init. I am pushing back on code where it makes sense to do so and determining what actually is adding value. My concern is more that I am worried that trying to make things "perfect" might be getting in the way of "good". Kernel development has always been an incremental process. My preference would be to lock down what we have before I go diving in and cleaning up other bits of the memory init.
On Thu 15-11-18 08:02:33, Alexander Duyck wrote: > On 11/15/2018 12:10 AM, Michal Hocko wrote: > > On Wed 14-11-18 16:50:23, Alexander Duyck wrote: [...] > > > I plan to remove it, but don't think I can get to it in this patch set. > > > > What I am trying to argue is that we should simply drop the > > __SetPageReserved as an independent patch prior to this whole series. > > As I've mentioned earlier, I have added this just to be sure and part of > > that was that __add_section has set the reserved bit. This is no longer > > the case since d0dc12e86b31 ("mm/memory_hotplug: optimize memory > > hotplug"). > > > > Nobody should really depend on that because struct pages are in > > undefined state after __add_pages and they should get fully initialized > > after move_pfn_range_to_zone. > > > > If you really insist on setting the reserved bit then it really has to > > happen much sooner than it is right now. So I do not really see any > > point in doing so. Sure there are some pfn walkers that really need to > > do pfn_to_online_page because pfn_valid is not sufficient but that is > > largely independent on any optimization work in this area. > > > > I am sorry if I haven't been clear about that before. Does it make more > > sense to you now? > > I get what you are saying I just don't agree with the ordering. I have had > these patches in the works for a couple months now. You are essentially > telling me to defer these in favor of taking care of the reserved bit first. The general development approach is to fix correctness and/or clean up stale code first and optimize on top. I really do not see why this should be any different. Especially when the page reserved bit is a part of your optimization work. There is some review feedback to address from this version so you can add a patch to remove the reserved bit to the series - feel free to reuse my explanation as the justification. I really do not think you have to change other call sites because they would be broken in the same way as when the bit is set (late). > The only spot where I think we disagree is that I would prefer to get these > in first, and then focus on the reserved bit. I'm not saying we shouldn't do > the work, but I would rather take care of the immediate issue, and then > "clean house" as it were. I've done that sort of thing in the past where I > start deferring patches and then by the end of things I have a 60 patch set > I am trying to push because one fix gets ahead of another and another. This is nothing exceptional and it happened to me as well. > My big concern is that dropping the reserved bit is going to be much more > error prone than the work I have done in this patch set since it is much > more likely that somebody somewhere has made a false reliance on it being > set. If you have any tests you usually run for memory hotplug it would be > useful if you could point me in that direction. Then I can move forward with > the upcoming patch set with a bit more confidence. There is nothing except for exercising hotplug and doing random stuff to it. We simply do not know about all the pfn walkers so we have to count on a reasonable model to justify changes. I hope I have clarified why the reserved bit doesn't serve the purpose it was added for. I understand that you want to push your optimization work ASAP but I _do_ care about long-term maintainability much more than having a performance gain _right_ now. That being said, I am not nacking your series so if others really think that I am asking too much I can live with that.
I was overruled the last time when the first optimization pile went in. I still hope we can get the result cleaned up as promised, though.
On Wed, Nov 14, 2018 at 05:55:12PM -0800, Mike Rapoport wrote: > On Wed, Nov 14, 2018 at 04:50:23PM -0800, Alexander Duyck wrote: > > > > > > On 11/14/2018 7:07 AM, Michal Hocko wrote: > > >On Mon 05-11-18 13:19:25, Alexander Duyck wrote: > > >>This patchset is essentially a refactor of the page initialization logic > > >>that is meant to provide for better code reuse while providing a > > >>significant improvement in deferred page initialization performance. > > >> > > >>In my testing on an x86_64 system with 384GB of RAM and 3TB of persistent > > >>memory per node I have seen the following. In the case of regular memory > > >>initialization the deferred init time was decreased from 3.75s to 1.06s on > > >>average. For the persistent memory the initialization time dropped from > > >>24.17s to 19.12s on average. This amounts to a 253% improvement for the > > >>deferred memory initialization performance, and a 26% improvement in the > > >>persistent memory initialization performance. > > >> > > >>I have called out the improvement observed with each patch. > > > > > >I have only glanced through the code (there is a lot of the code to look > > >at here). And I do not like the code duplication and the way how you > > >make the hotplug special. There shouldn't be any real reason for that > > >IMHO (e.g. why do we init pfn-at-a-time in early init while we do > > >pageblock-at-a-time for hotplug). I might be wrong here and the code > > >reuse might be really hard to achieve though. > > > > Actually it isn't so much that hotplug is special. The issue is more that > > the non-hotplug case is special in that you have to perform a number of > > extra checks for things that just aren't necessary for the hotplug case. > > > > If anything I would probably need a new iterator that would be able to take > > into account all the checks for the non-hotplug case and then provide ranges > > of PFNs to initialize. > > > > >I am also not impressed by new iterators because this api is quite > > >complex already. But this is mostly a detail. > > > > Yeah, the iterators were mostly an attempt at hiding some of the complexity. > > Being able to break a loop down to just an iterator provding the start of > > the range and the number of elements to initialize is pretty easy to > > visualize, or at least I thought so. > > Just recently we had a discussion about overlapping for_each_mem_range() > and for_each_mem_pfn_range(), but unfortunately it appears that no mailing > list was cc'ed by the original patch author :( > In short, there was a spelling fix in one of them and Michal pointed out > that their functionality overlaps. > > I have no objection for for_each_free_mem_pfn_range_in_zone() and > __next_mem_pfn_range_in_zone(), but probably we should consider unifying > the older iterators before we introduce a new one? Another thing I realized only now is that for_each_free_mem_pfn_range_in_zone() can be used only relatively late in the memblock life-span because zones are initialized far later than setup_arch() in many cases. At the very least this should be documented. > > >Thing I do not like is that you keep microptimizing PageReserved part > > >while there shouldn't be anything fundamental about it. We should just > > >remove it rather than make the code more complex. I fell more and more > > >guilty to add there actually. > > > > I plan to remove it, but don't think I can get to it in this patch set. 
> > > > I was planning to submit one more iteration of this patch set early next > > week, and then start focusing more on the removal of the PageReserved bit > > for hotplug. I figure it is probably going to be a full patch set onto > > itself and as you pointed out at the start of this email there is already > > enough code to review without adding that. > > > > > > -- > Sincerely yours, > Mike.