Message ID | 20230703135330.1865927-1-ryan.roberts@arm.com (mailing list archive) |
---|---|
Headers | show |
Series | variable-order, large folios for anonymous memory | expand |
On Mon, Jul 3, 2023 at 7:53 AM Ryan Roberts <ryan.roberts@arm.com> wrote: > > Hi All, > > This is v2 of a series to implement variable order, large folios for anonymous > memory. The objective of this is to improve performance by allocating larger > chunks of memory during anonymous page faults. See [1] for background. Thanks for the quick response! > I've significantly reworked and simplified the patch set based on comments from > Yu Zhao (thanks for all your feedback!). I've also renamed the feature to > VARIABLE_THP, on Yu's advice. > > The last patch is for arm64 to explicitly override the default > arch_wants_pte_order() and is intended as an example. If this series is accepted > I suggest taking the first 4 patches through the mm tree and the arm64 change > could be handled through the arm64 tree separately. Neither has any build > dependency on the other. > > The one area where I haven't followed Yu's advice is in the determination of the > size of folio to use. It was suggested that I have a single preferred large > order, and if it doesn't fit in the VMA (due to exceeding VMA bounds, or there > being existing overlapping populated PTEs, etc) then fallback immediately to > order-0. It turned out that this approach caused a performance regression in the > Speedometer benchmark. I suppose it's regression against the v1, not the unpatched kernel. > With my v1 patch, there were significant quantities of > memory which could not be placed in the 64K bucket and were instead being > allocated for the 32K and 16K buckets. With the proposed simplification, that > memory ended up using the 4K bucket, so page faults increased by 2.75x compared > to the v1 patch (although due to the 64K bucket, this number is still a bit > lower than the baseline). So instead, I continue to calculate a folio order that > is somewhere between the preferred order and 0. (See below for more details). I suppose the benchmark wasn't running under memory pressure, which is uncommon for client devices. It could be easier the other way around: using 32/16KB shows regression whereas order-0 shows better performance under memory pressure. I'm not sure we should use v1 as the baseline. Unpatched kernel sounds more reasonable at this point. If 32/16KB is proven to be better in most scenarios including under memory pressure, we can reintroduce that policy. I highly doubt this is the case: we tried 16KB base page size on client devices, and overall, the regressions outweighs the benefits. > The patches are based on top of v6.4 plus Matthew Wilcox's set_ptes() series > [2], which is a hard dependency. I have a branch at [3]. It's not clear to me why [2] is a hard dependency. It seems to me we are getting close and I was hoping we could get into mm-unstable soon without depending on other series...
On 7/4/2023 10:18 AM, Yu Zhao wrote: > On Mon, Jul 3, 2023 at 7:53 AM Ryan Roberts <ryan.roberts@arm.com> wrote: >> >> Hi All, >> >> This is v2 of a series to implement variable order, large folios for anonymous >> memory. The objective of this is to improve performance by allocating larger >> chunks of memory during anonymous page faults. See [1] for background. > > Thanks for the quick response! > >> I've significantly reworked and simplified the patch set based on comments from >> Yu Zhao (thanks for all your feedback!). I've also renamed the feature to >> VARIABLE_THP, on Yu's advice. >> >> The last patch is for arm64 to explicitly override the default >> arch_wants_pte_order() and is intended as an example. If this series is accepted >> I suggest taking the first 4 patches through the mm tree and the arm64 change >> could be handled through the arm64 tree separately. Neither has any build >> dependency on the other. >> >> The one area where I haven't followed Yu's advice is in the determination of the >> size of folio to use. It was suggested that I have a single preferred large >> order, and if it doesn't fit in the VMA (due to exceeding VMA bounds, or there >> being existing overlapping populated PTEs, etc) then fallback immediately to >> order-0. It turned out that this approach caused a performance regression in the >> Speedometer benchmark. > > I suppose it's regression against the v1, not the unpatched kernel. From the performance data Ryan shared, it's against unpatched kernel: Speedometer 2.0: | kernel | runs_per_min | |:-------------------------------|---------------:| | baseline-4k | 0.0% | | anonfolio-lkml-v1 | 0.7% | | anonfolio-lkml-v2-simple-order | -0.9% | | anonfolio-lkml-v2 | 0.5% | What if we use 32K or 16K instead of 64K as default anonymous folio size? I suspect this app may have 32K or 16K anon folio as sweet spot. Regards Yin, Fengwei > >> With my v1 patch, there were significant quantities of >> memory which could not be placed in the 64K bucket and were instead being >> allocated for the 32K and 16K buckets. With the proposed simplification, that >> memory ended up using the 4K bucket, so page faults increased by 2.75x compared >> to the v1 patch (although due to the 64K bucket, this number is still a bit >> lower than the baseline). So instead, I continue to calculate a folio order that >> is somewhere between the preferred order and 0. (See below for more details). > > I suppose the benchmark wasn't running under memory pressure, which is > uncommon for client devices. It could be easier the other way around: > using 32/16KB shows regression whereas order-0 shows better > performance under memory pressure. > > I'm not sure we should use v1 as the baseline. Unpatched kernel sounds > more reasonable at this point. If 32/16KB is proven to be better in > most scenarios including under memory pressure, we can reintroduce > that policy. I highly doubt this is the case: we tried 16KB base page > size on client devices, and overall, the regressions outweighs the > benefits. > >> The patches are based on top of v6.4 plus Matthew Wilcox's set_ptes() series >> [2], which is a hard dependency. I have a branch at [3]. > > It's not clear to me why [2] is a hard dependency. > > It seems to me we are getting close and I was hoping we could get into > mm-unstable soon without depending on other series... >
On Tue, Jul 4, 2023 at 12:22 AM Yin, Fengwei <fengwei.yin@intel.com> wrote: > > On 7/4/2023 10:18 AM, Yu Zhao wrote: > > On Mon, Jul 3, 2023 at 7:53 AM Ryan Roberts <ryan.roberts@arm.com> wrote: > >> > >> Hi All, > >> > >> This is v2 of a series to implement variable order, large folios for anonymous > >> memory. The objective of this is to improve performance by allocating larger > >> chunks of memory during anonymous page faults. See [1] for background. > > > > Thanks for the quick response! > > > >> I've significantly reworked and simplified the patch set based on comments from > >> Yu Zhao (thanks for all your feedback!). I've also renamed the feature to > >> VARIABLE_THP, on Yu's advice. > >> > >> The last patch is for arm64 to explicitly override the default > >> arch_wants_pte_order() and is intended as an example. If this series is accepted > >> I suggest taking the first 4 patches through the mm tree and the arm64 change > >> could be handled through the arm64 tree separately. Neither has any build > >> dependency on the other. > >> > >> The one area where I haven't followed Yu's advice is in the determination of the > >> size of folio to use. It was suggested that I have a single preferred large > >> order, and if it doesn't fit in the VMA (due to exceeding VMA bounds, or there > >> being existing overlapping populated PTEs, etc) then fallback immediately to > >> order-0. It turned out that this approach caused a performance regression in the > >> Speedometer benchmark. > > > > I suppose it's regression against the v1, not the unpatched kernel. > From the performance data Ryan shared, it's against unpatched kernel: > > Speedometer 2.0: > > | kernel | runs_per_min | > |:-------------------------------|---------------:| > | baseline-4k | 0.0% | > | anonfolio-lkml-v1 | 0.7% | > | anonfolio-lkml-v2-simple-order | -0.9% | > | anonfolio-lkml-v2 | 0.5% | I see. Thanks. A couple of questions: 1. Do we have a stddev? 2. Do we have a theory why it regressed? Assuming no bugs, I don't see how a real regression could happen -- falling back to order-0 isn't different from the original behavior. Ryan, could you `perf record` and `cat /proc/vmstat` and share them?
On 04/07/2023 08:11, Yu Zhao wrote: > On Tue, Jul 4, 2023 at 12:22 AM Yin, Fengwei <fengwei.yin@intel.com> wrote: >> >> On 7/4/2023 10:18 AM, Yu Zhao wrote: >>> On Mon, Jul 3, 2023 at 7:53 AM Ryan Roberts <ryan.roberts@arm.com> wrote: >>>> >>>> Hi All, >>>> >>>> This is v2 of a series to implement variable order, large folios for anonymous >>>> memory. The objective of this is to improve performance by allocating larger >>>> chunks of memory during anonymous page faults. See [1] for background. >>> >>> Thanks for the quick response! >>> >>>> I've significantly reworked and simplified the patch set based on comments from >>>> Yu Zhao (thanks for all your feedback!). I've also renamed the feature to >>>> VARIABLE_THP, on Yu's advice. >>>> >>>> The last patch is for arm64 to explicitly override the default >>>> arch_wants_pte_order() and is intended as an example. If this series is accepted >>>> I suggest taking the first 4 patches through the mm tree and the arm64 change >>>> could be handled through the arm64 tree separately. Neither has any build >>>> dependency on the other. >>>> >>>> The one area where I haven't followed Yu's advice is in the determination of the >>>> size of folio to use. It was suggested that I have a single preferred large >>>> order, and if it doesn't fit in the VMA (due to exceeding VMA bounds, or there >>>> being existing overlapping populated PTEs, etc) then fallback immediately to >>>> order-0. It turned out that this approach caused a performance regression in the >>>> Speedometer benchmark. >>> >>> I suppose it's regression against the v1, not the unpatched kernel. >> From the performance data Ryan shared, it's against unpatched kernel: >> >> Speedometer 2.0: >> >> | kernel | runs_per_min | >> |:-------------------------------|---------------:| >> | baseline-4k | 0.0% | >> | anonfolio-lkml-v1 | 0.7% | >> | anonfolio-lkml-v2-simple-order | -0.9% | >> | anonfolio-lkml-v2 | 0.5% | > > I see. Thanks. > > A couple of questions: > 1. Do we have a stddev? | kernel | mean_abs | std_abs | mean_rel | std_rel | |:------------------------- |-----------:|----------:|-----------:|----------:| | baseline-4k | 117.4 | 0.8 | 0.0% | 0.7% | | anonfolio-v1 | 118.2 | 1 | 0.7% | 0.9% | | anonfolio-v2-simple-order | 116.4 | 1.1 | -0.9% | 0.9% | | anonfolio-v2 | 118 | 1.2 | 0.5% | 1.0% | This is with 3 runs per reboot across 5 reboots, with first run after reboot trimmed (it's always a bit slower, I assume due to cold page cache). So 10 data points per kernel in total. I've rerun the test multiple times and see similar results each time. I've also run anonfolio-v2 with Kconfig FLEXIBLE_THP=disabled and in this case I see the same performance as baseline-4k. > 2. Do we have a theory why it regressed? I have a woolly hypothesis; I think Chromium is doing mmap/munmap in ways that mean when we fault, order-4 is often too big to fit in the VMA. So we fallback to order-0. I guess this is happening so often for this workload that the cost of doing the checks and fallback is outweighing the benefit of the memory that does end up with order-4 folios. I've sampled the memory in each bucket (once per second) while running and its roughly: 64K: 25% 32K: 15% 16K: 15% 4K: 45% 32K and 16K obviously fold into the 4K bucket with anonfolio-v2-simple-order. But potentially, I suspect there is lots of mmap/unmap for the smaller sizes and the 64K contents is more static - that's just a guess though. > Assuming no bugs, I don't see how a real regression could happen -- > falling back to order-0 isn't different from the original behavior. > Ryan, could you `perf record` and `cat /proc/vmstat` and share them? I can, but it will have to be a bit later in the week. I'll do some more test runs overnight so we have a larger number of runs - hopefully that might tell us that this is noise to a certain extent. I'd still like to hear a clear technical argument for why the bin-packing approach is not the correct one! Thanks, Ryan
On 7/4/23 23:36, Ryan Roberts wrote: > On 04/07/2023 08:11, Yu Zhao wrote: >> On Tue, Jul 4, 2023 at 12:22 AM Yin, Fengwei <fengwei.yin@intel.com> wrote: >>> >>> On 7/4/2023 10:18 AM, Yu Zhao wrote: >>>> On Mon, Jul 3, 2023 at 7:53 AM Ryan Roberts <ryan.roberts@arm.com> wrote: >>>>> >>>>> Hi All, >>>>> >>>>> This is v2 of a series to implement variable order, large folios for anonymous >>>>> memory. The objective of this is to improve performance by allocating larger >>>>> chunks of memory during anonymous page faults. See [1] for background. >>>> >>>> Thanks for the quick response! >>>> >>>>> I've significantly reworked and simplified the patch set based on comments from >>>>> Yu Zhao (thanks for all your feedback!). I've also renamed the feature to >>>>> VARIABLE_THP, on Yu's advice. >>>>> >>>>> The last patch is for arm64 to explicitly override the default >>>>> arch_wants_pte_order() and is intended as an example. If this series is accepted >>>>> I suggest taking the first 4 patches through the mm tree and the arm64 change >>>>> could be handled through the arm64 tree separately. Neither has any build >>>>> dependency on the other. >>>>> >>>>> The one area where I haven't followed Yu's advice is in the determination of the >>>>> size of folio to use. It was suggested that I have a single preferred large >>>>> order, and if it doesn't fit in the VMA (due to exceeding VMA bounds, or there >>>>> being existing overlapping populated PTEs, etc) then fallback immediately to >>>>> order-0. It turned out that this approach caused a performance regression in the >>>>> Speedometer benchmark. >>>> >>>> I suppose it's regression against the v1, not the unpatched kernel. >>> From the performance data Ryan shared, it's against unpatched kernel: >>> >>> Speedometer 2.0: >>> >>> | kernel | runs_per_min | >>> |:-------------------------------|---------------:| >>> | baseline-4k | 0.0% | >>> | anonfolio-lkml-v1 | 0.7% | >>> | anonfolio-lkml-v2-simple-order | -0.9% | >>> | anonfolio-lkml-v2 | 0.5% | >> >> I see. Thanks. >> >> A couple of questions: >> 1. Do we have a stddev? > > | kernel | mean_abs | std_abs | mean_rel | std_rel | > |:------------------------- |-----------:|----------:|-----------:|----------:| > | baseline-4k | 117.4 | 0.8 | 0.0% | 0.7% | > | anonfolio-v1 | 118.2 | 1 | 0.7% | 0.9% | > | anonfolio-v2-simple-order | 116.4 | 1.1 | -0.9% | 0.9% | > | anonfolio-v2 | 118 | 1.2 | 0.5% | 1.0% | > > This is with 3 runs per reboot across 5 reboots, with first run after reboot > trimmed (it's always a bit slower, I assume due to cold page cache). So 10 data > points per kernel in total. > > I've rerun the test multiple times and see similar results each time. > > I've also run anonfolio-v2 with Kconfig FLEXIBLE_THP=disabled and in this case I > see the same performance as baseline-4k. > > >> 2. Do we have a theory why it regressed? > > I have a woolly hypothesis; I think Chromium is doing mmap/munmap in ways that > mean when we fault, order-4 is often too big to fit in the VMA. So we fallback > to order-0. I guess this is happening so often for this workload that the cost > of doing the checks and fallback is outweighing the benefit of the memory that > does end up with order-4 folios. > > I've sampled the memory in each bucket (once per second) while running and its > roughly: > > 64K: 25% > 32K: 15% > 16K: 15% > 4K: 45% > > 32K and 16K obviously fold into the 4K bucket with anonfolio-v2-simple-order. > But potentially, I suspect there is lots of mmap/unmap for the smaller sizes and > the 64K contents is more static - that's just a guess though. So this is like out of vma range thing. > >> Assuming no bugs, I don't see how a real regression could happen -- >> falling back to order-0 isn't different from the original behavior. >> Ryan, could you `perf record` and `cat /proc/vmstat` and share them? > > I can, but it will have to be a bit later in the week. I'll do some more test > runs overnight so we have a larger number of runs - hopefully that might tell us > that this is noise to a certain extent. > > I'd still like to hear a clear technical argument for why the bin-packing > approach is not the correct one! My understanding to Yu's (Yu, correct me if I am wrong) comments is that we postpone this part of change and make basic anon large folio support in. Then discuss which approach we should take. Maybe people will agree retry is the choice, maybe other approach will be taken... For example, for this out of VMA range case, per VMA order should be considered. We don't need make decision that the retry should be taken now. Regards Yin, Fengwei > > Thanks, > Ryan > > >
On Tue, Jul 4, 2023 at 5:53 PM Yin Fengwei <fengwei.yin@intel.com> wrote: > > > > On 7/4/23 23:36, Ryan Roberts wrote: > > On 04/07/2023 08:11, Yu Zhao wrote: > >> On Tue, Jul 4, 2023 at 12:22 AM Yin, Fengwei <fengwei.yin@intel.com> wrote: > >>> > >>> On 7/4/2023 10:18 AM, Yu Zhao wrote: > >>>> On Mon, Jul 3, 2023 at 7:53 AM Ryan Roberts <ryan.roberts@arm.com> wrote: > >>>>> > >>>>> Hi All, > >>>>> > >>>>> This is v2 of a series to implement variable order, large folios for anonymous > >>>>> memory. The objective of this is to improve performance by allocating larger > >>>>> chunks of memory during anonymous page faults. See [1] for background. > >>>> > >>>> Thanks for the quick response! > >>>> > >>>>> I've significantly reworked and simplified the patch set based on comments from > >>>>> Yu Zhao (thanks for all your feedback!). I've also renamed the feature to > >>>>> VARIABLE_THP, on Yu's advice. > >>>>> > >>>>> The last patch is for arm64 to explicitly override the default > >>>>> arch_wants_pte_order() and is intended as an example. If this series is accepted > >>>>> I suggest taking the first 4 patches through the mm tree and the arm64 change > >>>>> could be handled through the arm64 tree separately. Neither has any build > >>>>> dependency on the other. > >>>>> > >>>>> The one area where I haven't followed Yu's advice is in the determination of the > >>>>> size of folio to use. It was suggested that I have a single preferred large > >>>>> order, and if it doesn't fit in the VMA (due to exceeding VMA bounds, or there > >>>>> being existing overlapping populated PTEs, etc) then fallback immediately to > >>>>> order-0. It turned out that this approach caused a performance regression in the > >>>>> Speedometer benchmark. > >>>> > >>>> I suppose it's regression against the v1, not the unpatched kernel. > >>> From the performance data Ryan shared, it's against unpatched kernel: > >>> > >>> Speedometer 2.0: > >>> > >>> | kernel | runs_per_min | > >>> |:-------------------------------|---------------:| > >>> | baseline-4k | 0.0% | > >>> | anonfolio-lkml-v1 | 0.7% | > >>> | anonfolio-lkml-v2-simple-order | -0.9% | > >>> | anonfolio-lkml-v2 | 0.5% | > >> > >> I see. Thanks. > >> > >> A couple of questions: > >> 1. Do we have a stddev? > > > > | kernel | mean_abs | std_abs | mean_rel | std_rel | > > |:------------------------- |-----------:|----------:|-----------:|----------:| > > | baseline-4k | 117.4 | 0.8 | 0.0% | 0.7% | > > | anonfolio-v1 | 118.2 | 1 | 0.7% | 0.9% | > > | anonfolio-v2-simple-order | 116.4 | 1.1 | -0.9% | 0.9% | > > | anonfolio-v2 | 118 | 1.2 | 0.5% | 1.0% | > > > > This is with 3 runs per reboot across 5 reboots, with first run after reboot > > trimmed (it's always a bit slower, I assume due to cold page cache). So 10 data > > points per kernel in total. > > > > I've rerun the test multiple times and see similar results each time. > > > > I've also run anonfolio-v2 with Kconfig FLEXIBLE_THP=disabled and in this case I > > see the same performance as baseline-4k. > > > > > >> 2. Do we have a theory why it regressed? > > > > I have a woolly hypothesis; I think Chromium is doing mmap/munmap in ways that > > mean when we fault, order-4 is often too big to fit in the VMA. So we fallback > > to order-0. I guess this is happening so often for this workload that the cost > > of doing the checks and fallback is outweighing the benefit of the memory that > > does end up with order-4 folios. > > > > I've sampled the memory in each bucket (once per second) while running and its > > roughly: > > > > 64K: 25% > > 32K: 15% > > 16K: 15% > > 4K: 45% > > > > 32K and 16K obviously fold into the 4K bucket with anonfolio-v2-simple-order. > > But potentially, I suspect there is lots of mmap/unmap for the smaller sizes and > > the 64K contents is more static - that's just a guess though. > So this is like out of vma range thing. > > > > >> Assuming no bugs, I don't see how a real regression could happen -- > >> falling back to order-0 isn't different from the original behavior. > >> Ryan, could you `perf record` and `cat /proc/vmstat` and share them? > > > > I can, but it will have to be a bit later in the week. I'll do some more test > > runs overnight so we have a larger number of runs - hopefully that might tell us > > that this is noise to a certain extent. > > > > I'd still like to hear a clear technical argument for why the bin-packing > > approach is not the correct one! > My understanding to Yu's (Yu, correct me if I am wrong) comments is that we > postpone this part of change and make basic anon large folio support in. Then > discuss which approach we should take. Maybe people will agree retry is the > choice, maybe other approach will be taken... > > For example, for this out of VMA range case, per VMA order should be considered. > We don't need make decision that the retry should be taken now. I've articulated the reasons in another email. Just summarize the most important point here: using more fallback orders makes a system reach equilibrium faster, at which point it can't allocate the order of arch_wants_pte_order() anymore. IOW, this best-fit policy can reduce the number of folios of the h/w prefered order for a system running long enough.
On 05/07/2023 01:21, Yu Zhao wrote: > On Tue, Jul 4, 2023 at 5:53 PM Yin Fengwei <fengwei.yin@intel.com> wrote: >> >> >> >> On 7/4/23 23:36, Ryan Roberts wrote: >>> On 04/07/2023 08:11, Yu Zhao wrote: >>>> On Tue, Jul 4, 2023 at 12:22 AM Yin, Fengwei <fengwei.yin@intel.com> wrote: >>>>> >>>>> On 7/4/2023 10:18 AM, Yu Zhao wrote: >>>>>> On Mon, Jul 3, 2023 at 7:53 AM Ryan Roberts <ryan.roberts@arm.com> wrote: >>>>>>> >>>>>>> Hi All, >>>>>>> >>>>>>> This is v2 of a series to implement variable order, large folios for anonymous >>>>>>> memory. The objective of this is to improve performance by allocating larger >>>>>>> chunks of memory during anonymous page faults. See [1] for background. >>>>>> >>>>>> Thanks for the quick response! >>>>>> >>>>>>> I've significantly reworked and simplified the patch set based on comments from >>>>>>> Yu Zhao (thanks for all your feedback!). I've also renamed the feature to >>>>>>> VARIABLE_THP, on Yu's advice. >>>>>>> >>>>>>> The last patch is for arm64 to explicitly override the default >>>>>>> arch_wants_pte_order() and is intended as an example. If this series is accepted >>>>>>> I suggest taking the first 4 patches through the mm tree and the arm64 change >>>>>>> could be handled through the arm64 tree separately. Neither has any build >>>>>>> dependency on the other. >>>>>>> >>>>>>> The one area where I haven't followed Yu's advice is in the determination of the >>>>>>> size of folio to use. It was suggested that I have a single preferred large >>>>>>> order, and if it doesn't fit in the VMA (due to exceeding VMA bounds, or there >>>>>>> being existing overlapping populated PTEs, etc) then fallback immediately to >>>>>>> order-0. It turned out that this approach caused a performance regression in the >>>>>>> Speedometer benchmark. >>>>>> >>>>>> I suppose it's regression against the v1, not the unpatched kernel. >>>>> From the performance data Ryan shared, it's against unpatched kernel: >>>>> >>>>> Speedometer 2.0: >>>>> >>>>> | kernel | runs_per_min | >>>>> |:-------------------------------|---------------:| >>>>> | baseline-4k | 0.0% | >>>>> | anonfolio-lkml-v1 | 0.7% | >>>>> | anonfolio-lkml-v2-simple-order | -0.9% | >>>>> | anonfolio-lkml-v2 | 0.5% | >>>> >>>> I see. Thanks. >>>> >>>> A couple of questions: >>>> 1. Do we have a stddev? >>> >>> | kernel | mean_abs | std_abs | mean_rel | std_rel | >>> |:------------------------- |-----------:|----------:|-----------:|----------:| >>> | baseline-4k | 117.4 | 0.8 | 0.0% | 0.7% | >>> | anonfolio-v1 | 118.2 | 1 | 0.7% | 0.9% | >>> | anonfolio-v2-simple-order | 116.4 | 1.1 | -0.9% | 0.9% | >>> | anonfolio-v2 | 118 | 1.2 | 0.5% | 1.0% | >>> >>> This is with 3 runs per reboot across 5 reboots, with first run after reboot >>> trimmed (it's always a bit slower, I assume due to cold page cache). So 10 data >>> points per kernel in total. >>> >>> I've rerun the test multiple times and see similar results each time. >>> >>> I've also run anonfolio-v2 with Kconfig FLEXIBLE_THP=disabled and in this case I >>> see the same performance as baseline-4k. >>> >>> >>>> 2. Do we have a theory why it regressed? >>> >>> I have a woolly hypothesis; I think Chromium is doing mmap/munmap in ways that >>> mean when we fault, order-4 is often too big to fit in the VMA. So we fallback >>> to order-0. I guess this is happening so often for this workload that the cost >>> of doing the checks and fallback is outweighing the benefit of the memory that >>> does end up with order-4 folios. >>> >>> I've sampled the memory in each bucket (once per second) while running and its >>> roughly: >>> >>> 64K: 25% >>> 32K: 15% >>> 16K: 15% >>> 4K: 45% >>> >>> 32K and 16K obviously fold into the 4K bucket with anonfolio-v2-simple-order. >>> But potentially, I suspect there is lots of mmap/unmap for the smaller sizes and >>> the 64K contents is more static - that's just a guess though. >> So this is like out of vma range thing. >> >>> >>>> Assuming no bugs, I don't see how a real regression could happen -- >>>> falling back to order-0 isn't different from the original behavior. >>>> Ryan, could you `perf record` and `cat /proc/vmstat` and share them? >>> >>> I can, but it will have to be a bit later in the week. I'll do some more test >>> runs overnight so we have a larger number of runs - hopefully that might tell us >>> that this is noise to a certain extent. >>> >>> I'd still like to hear a clear technical argument for why the bin-packing >>> approach is not the correct one! >> My understanding to Yu's (Yu, correct me if I am wrong) comments is that we >> postpone this part of change and make basic anon large folio support in. Then >> discuss which approach we should take. Maybe people will agree retry is the >> choice, maybe other approach will be taken... >> >> For example, for this out of VMA range case, per VMA order should be considered. >> We don't need make decision that the retry should be taken now. > > I've articulated the reasons in another email. Just summarize the most > important point here: > using more fallback orders makes a system reach equilibrium faster, at > which point it can't allocate the order of arch_wants_pte_order() > anymore. IOW, this best-fit policy can reduce the number of folios of > the h/w prefered order for a system running long enough. Thanks for taking the time to write all the arguments down. I understand what you are saying. If we are considering the whole system, then we also need to think about the page cache though, and that will allocate multiple orders, so you are still going to suffer fragmentation from that user. That said, I like the proposal patch posted where we have up to 3 orders that we try in order of preference; hw-preferred, PAGE_ALLOC_COSTLY_ORDER and 0. That feels like a good compromise that allows me to fulfil my objectives. I'm going to pull this together into a v3 patch set and aim to post towards the end of the week. Are you ok for me to add a Suggested-by: for you? (submitting-patches.rst says I need your explicit permission). On the regression front, I've done a much bigger test run and see the regression is still present (although the mean has shifted a little bit). I've also built a kernel based on anonfolio-lkml-v2 but where arch_wants_pte_order() returns order-3. The aim was to test your hypothesis that 64K allocation is slow. This kernel is performing even better, so I think that confirms your hypothesis: | kernel | runs_per_min | runs | sessions | |:-------------------------------|---------------:|-------:|-----------:| | baseline-4k | 0.0% | 75 | 15 | | anonfolio-lkml-v1 | 1.0% | 75 | 15 | | anonfolio-lkml-v2-simple-order | -0.4% | 75 | 15 | | anonfolio-lkml-v2 | 0.9% | 75 | 15 | | anonfolio-lkml-v2-32k | 1.4% | 10 | 5 | Thanks, Ryan
On Wed, Jul 5, 2023 at 4:16 AM Ryan Roberts <ryan.roberts@arm.com> wrote: > > On 05/07/2023 01:21, Yu Zhao wrote: > > On Tue, Jul 4, 2023 at 5:53 PM Yin Fengwei <fengwei.yin@intel.com> wrote: > >> > >> > >> > >> On 7/4/23 23:36, Ryan Roberts wrote: > >>> On 04/07/2023 08:11, Yu Zhao wrote: > >>>> On Tue, Jul 4, 2023 at 12:22 AM Yin, Fengwei <fengwei.yin@intel.com> wrote: > >>>>> > >>>>> On 7/4/2023 10:18 AM, Yu Zhao wrote: > >>>>>> On Mon, Jul 3, 2023 at 7:53 AM Ryan Roberts <ryan.roberts@arm.com> wrote: > >>>>>>> > >>>>>>> Hi All, > >>>>>>> > >>>>>>> This is v2 of a series to implement variable order, large folios for anonymous > >>>>>>> memory. The objective of this is to improve performance by allocating larger > >>>>>>> chunks of memory during anonymous page faults. See [1] for background. > >>>>>> > >>>>>> Thanks for the quick response! > >>>>>> > >>>>>>> I've significantly reworked and simplified the patch set based on comments from > >>>>>>> Yu Zhao (thanks for all your feedback!). I've also renamed the feature to > >>>>>>> VARIABLE_THP, on Yu's advice. > >>>>>>> > >>>>>>> The last patch is for arm64 to explicitly override the default > >>>>>>> arch_wants_pte_order() and is intended as an example. If this series is accepted > >>>>>>> I suggest taking the first 4 patches through the mm tree and the arm64 change > >>>>>>> could be handled through the arm64 tree separately. Neither has any build > >>>>>>> dependency on the other. > >>>>>>> > >>>>>>> The one area where I haven't followed Yu's advice is in the determination of the > >>>>>>> size of folio to use. It was suggested that I have a single preferred large > >>>>>>> order, and if it doesn't fit in the VMA (due to exceeding VMA bounds, or there > >>>>>>> being existing overlapping populated PTEs, etc) then fallback immediately to > >>>>>>> order-0. It turned out that this approach caused a performance regression in the > >>>>>>> Speedometer benchmark. > >>>>>> > >>>>>> I suppose it's regression against the v1, not the unpatched kernel. > >>>>> From the performance data Ryan shared, it's against unpatched kernel: > >>>>> > >>>>> Speedometer 2.0: > >>>>> > >>>>> | kernel | runs_per_min | > >>>>> |:-------------------------------|---------------:| > >>>>> | baseline-4k | 0.0% | > >>>>> | anonfolio-lkml-v1 | 0.7% | > >>>>> | anonfolio-lkml-v2-simple-order | -0.9% | > >>>>> | anonfolio-lkml-v2 | 0.5% | > >>>> > >>>> I see. Thanks. > >>>> > >>>> A couple of questions: > >>>> 1. Do we have a stddev? > >>> > >>> | kernel | mean_abs | std_abs | mean_rel | std_rel | > >>> |:------------------------- |-----------:|----------:|-----------:|----------:| > >>> | baseline-4k | 117.4 | 0.8 | 0.0% | 0.7% | > >>> | anonfolio-v1 | 118.2 | 1 | 0.7% | 0.9% | > >>> | anonfolio-v2-simple-order | 116.4 | 1.1 | -0.9% | 0.9% | > >>> | anonfolio-v2 | 118 | 1.2 | 0.5% | 1.0% | > >>> > >>> This is with 3 runs per reboot across 5 reboots, with first run after reboot > >>> trimmed (it's always a bit slower, I assume due to cold page cache). So 10 data > >>> points per kernel in total. > >>> > >>> I've rerun the test multiple times and see similar results each time. > >>> > >>> I've also run anonfolio-v2 with Kconfig FLEXIBLE_THP=disabled and in this case I > >>> see the same performance as baseline-4k. > >>> > >>> > >>>> 2. Do we have a theory why it regressed? > >>> > >>> I have a woolly hypothesis; I think Chromium is doing mmap/munmap in ways that > >>> mean when we fault, order-4 is often too big to fit in the VMA. So we fallback > >>> to order-0. I guess this is happening so often for this workload that the cost > >>> of doing the checks and fallback is outweighing the benefit of the memory that > >>> does end up with order-4 folios. > >>> > >>> I've sampled the memory in each bucket (once per second) while running and its > >>> roughly: > >>> > >>> 64K: 25% > >>> 32K: 15% > >>> 16K: 15% > >>> 4K: 45% > >>> > >>> 32K and 16K obviously fold into the 4K bucket with anonfolio-v2-simple-order. > >>> But potentially, I suspect there is lots of mmap/unmap for the smaller sizes and > >>> the 64K contents is more static - that's just a guess though. > >> So this is like out of vma range thing. > >> > >>> > >>>> Assuming no bugs, I don't see how a real regression could happen -- > >>>> falling back to order-0 isn't different from the original behavior. > >>>> Ryan, could you `perf record` and `cat /proc/vmstat` and share them? > >>> > >>> I can, but it will have to be a bit later in the week. I'll do some more test > >>> runs overnight so we have a larger number of runs - hopefully that might tell us > >>> that this is noise to a certain extent. > >>> > >>> I'd still like to hear a clear technical argument for why the bin-packing > >>> approach is not the correct one! > >> My understanding to Yu's (Yu, correct me if I am wrong) comments is that we > >> postpone this part of change and make basic anon large folio support in. Then > >> discuss which approach we should take. Maybe people will agree retry is the > >> choice, maybe other approach will be taken... > >> > >> For example, for this out of VMA range case, per VMA order should be considered. > >> We don't need make decision that the retry should be taken now. > > > > I've articulated the reasons in another email. Just summarize the most > > important point here: > > using more fallback orders makes a system reach equilibrium faster, at > > which point it can't allocate the order of arch_wants_pte_order() > > anymore. IOW, this best-fit policy can reduce the number of folios of > > the h/w prefered order for a system running long enough. > > Thanks for taking the time to write all the arguments down. I understand what > you are saying. If we are considering the whole system, then we also need to > think about the page cache though, and that will allocate multiple orders, so > you are still going to suffer fragmentation from that user. 1. page cache doesn't use the best-fit policy -- it has the advantage of having RA hit/miss numbers -- IOW, it doesn't try all orders without an estimated ROI. 2. page cache causes far less fragmentation in my experience: clean page cache gets reclaimed first under memory; unmapped page cache is less costly to migrate. Neither is true for anon, and what makes it worse is that heavy anon users usually enable zram/zswap: allocating memory (to store compressed data) under memory pressure makes reclaim/compaction even harder. > That said, I like the proposal patch posted where we have up to 3 orders that we > try in order of preference; hw-preferred, PAGE_ALLOC_COSTLY_ORDER and 0. That > feels like a good compromise that allows me to fulfil my objectives. I'm going > to pull this together into a v3 patch set and aim to post towards the end of the > week. > > Are you ok for me to add a Suggested-by: for you? (submitting-patches.rst says I > need your explicit permission). Thanks for asking. No need to worry about it -- it's been a great team work with you, Fengwei, Yang et al. I'm attaching a single patch containing all pieces I spelled out/implied/forgot to mention. It doesn't depend on other series -- I just stress-tested it on top of the latest mm-unstable. Please feel free to reuse any bits you see fit. Again no need to worry about Suggested-by. > On the regression front, I've done a much bigger test run and see the regression > is still present (although the mean has shifted a little bit). I've also built a > kernel based on anonfolio-lkml-v2 but where arch_wants_pte_order() returns > order-3. The aim was to test your hypothesis that 64K allocation is slow. This > kernel is performing even better, so I think that confirms your hypothesis: Great, thanks for confirming. > | kernel | runs_per_min | runs | sessions | > |:-------------------------------|---------------:|-------:|-----------:| > | baseline-4k | 0.0% | 75 | 15 | > | anonfolio-lkml-v1 | 1.0% | 75 | 15 | > | anonfolio-lkml-v2-simple-order | -0.4% | 75 | 15 | > | anonfolio-lkml-v2 | 0.9% | 75 | 15 | > | anonfolio-lkml-v2-32k | 1.4% | 10 | 5 | Since we are all committed to the effort long term, the last number is good enough for the initial step to conclude. Hopefully v3 can address all pending comments and get into mm-unstable.
On 03.07.23 15:53, Ryan Roberts wrote: > Hi All, > > This is v2 of a series to implement variable order, large folios for anonymous > memory. The objective of this is to improve performance by allocating larger > chunks of memory during anonymous page faults. See [1] for background. > > I've significantly reworked and simplified the patch set based on comments from > Yu Zhao (thanks for all your feedback!). I've also renamed the feature to > VARIABLE_THP, on Yu's advice. > > The last patch is for arm64 to explicitly override the default > arch_wants_pte_order() and is intended as an example. If this series is accepted > I suggest taking the first 4 patches through the mm tree and the arm64 change > could be handled through the arm64 tree separately. Neither has any build > dependency on the other. > > The one area where I haven't followed Yu's advice is in the determination of the > size of folio to use. It was suggested that I have a single preferred large > order, and if it doesn't fit in the VMA (due to exceeding VMA bounds, or there > being existing overlapping populated PTEs, etc) then fallback immediately to > order-0. It turned out that this approach caused a performance regression in the > Speedometer benchmark. With my v1 patch, there were significant quantities of > memory which could not be placed in the 64K bucket and were instead being > allocated for the 32K and 16K buckets. With the proposed simplification, that > memory ended up using the 4K bucket, so page faults increased by 2.75x compared > to the v1 patch (although due to the 64K bucket, this number is still a bit > lower than the baseline). So instead, I continue to calculate a folio order that > is somewhere between the preferred order and 0. (See below for more details). > > The patches are based on top of v6.4 plus Matthew Wilcox's set_ptes() series > [2], which is a hard dependency. I have a branch at [3]. > > > Changes since v1 [1] > -------------------- > > - removed changes to arch-dependent vma_alloc_zeroed_movable_folio() > - replaced with arch-independent alloc_anon_folio() > - follows THP allocation approach > - no longer retry with intermediate orders if allocation fails > - fallback directly to order-0 > - remove folio_add_new_anon_rmap_range() patch > - instead add its new functionality to folio_add_new_anon_rmap() > - remove batch-zap pte mappings optimization patch > - remove enabler folio_remove_rmap_range() patch too > - These offer real perf improvement so will submit separately > - simplify Kconfig > - single FLEXIBLE_THP option, which is independent of arch > - depends on TRANSPARENT_HUGEPAGE > - when enabled default to max anon folio size of 64K unless arch > explicitly overrides > - simplify changes to do_anonymous_page(): > - no more retry loop > > > Performance > ----------- > > Below results show 3 benchmarks; kernel compilation with 8 jobs, kernel > compilation with 80 jobs, and speedometer 2.0 (a javascript benchmark running in > Chromium). All cases are running on Ampere Altra with 1 NUMA node enabled, > Ubuntu 22.04 and XFS filesystem. Each benchmark is repeated 15 times over 5 > reboots and averaged. > > 'anonfolio-lkml-v1' is the v1 patchset at [1]. 'anonfolio-lkml-v2' is this v2 > patchset. 'anonfolio-lkml-v2-simple-order' is anonfolio-lkml-v2 but with the > order selection simplification that Yu Zhao suggested - I'm trying to justify > here why I did not follow the advice. > > > Kernel compilation with 8 jobs: > > | kernel | real-time | kern-time | user-time | > |:-------------------------------|------------:|------------:|------------:| > | baseline-4k | 0.0% | 0.0% | 0.0% | > | anonfolio-lkml-v1 | -5.3% | -42.9% | -0.6% | > | anonfolio-lkml-v2-simple-order | -4.4% | -36.5% | -0.4% | > | anonfolio-lkml-v2 | -4.8% | -38.6% | -0.6% | > > We can see that the simple-order approach is responsible for a regression of > 0.4%. > > > Kernel compilation with 80 jobs: > > | kernel | real-time | kern-time | user-time | > |:-------------------------------|------------:|------------:|------------:| > | baseline-4k | 0.0% | 0.0% | 0.0% | > | anonfolio-lkml-v1 | -4.6% | -45.7% | 1.4% | > | anonfolio-lkml-v2-simple-order | -4.7% | -40.2% | -0.1% | > | anonfolio-lkml-v2 | -5.0% | -42.6% | -0.3% | > > simple-order costs 0.3 % here. v2 is actually performing higher than v1 due to > fixing the v1 regression on user-time. > > > Speedometer 2.0: > > | kernel | runs_per_min | > |:-------------------------------|---------------:| > | baseline-4k | 0.0% | > | anonfolio-lkml-v1 | 0.7% | > | anonfolio-lkml-v2-simple-order | -0.9% | > | anonfolio-lkml-v2 | 0.5% | > > simple-order regresses performance by 0.9% vs the baseline, for a total negative > swing of 1.6% vs v1. This is fixed by keeping the more complex order selection > mechanism from v1. > > > The remaining (kernel time) performance gap between v1 and v2 for the above > benchmarks is due to the removal of the "batch zap" patch in v2. Adding that > back in gives us the performance back. I intend to submit that as a separate > series once this series is accepted. > > > [1] https://lore.kernel.org/linux-mm/20230626171430.3167004-1-ryan.roberts@arm.com/ > [2] https://lore.kernel.org/linux-mm/20230315051444.3229621-1-willy@infradead.org/ > [3] https://gitlab.arm.com/linux-arm/linux-rr/-/tree/features/granule_perf/anonfolio-lkml_v2 > > Thanks, > Ryan Hi Ryan, is page migration already working as expected (what about page compaction?), and do we handle migration -ENOMEM when allocating a target page: do we split an fallback to 4k page migration?
On 05/07/2023 20:38, David Hildenbrand wrote: > On 03.07.23 15:53, Ryan Roberts wrote: >> Hi All, >> >> This is v2 of a series to implement variable order, large folios for anonymous >> memory. The objective of this is to improve performance by allocating larger >> chunks of memory during anonymous page faults. See [1] for background. >> [...] >> Thanks, >> Ryan > > Hi Ryan, > > is page migration already working as expected (what about page compaction?), and > do we handle migration -ENOMEM when allocating a target page: do we split an > fallback to 4k page migration? > Hi David, All, This series aims to be the bare minimum to demonstrate allocation of large anon folios. As such, there is a laundry list of things that need to be done for this feature to play nicely with other features. My preferred route is to merge this with it's Kconfig defaulted to disabled, and its Kconfig description clearly shouting that it's EXPERIMENTAL with an explanation of why (similar to READ_ONLY_THP_FOR_FS). That said, I've put together a table of the items that I'm aware of that need attention. It would be great if people can review and add any missing items. Then we can hopefully parallelize the implementation work. David, I don't think the items you raised are covered - would you mind providing a bit more detail so I can add them to the list? (or just add them to the list yourself, if you prefer). --- - item: mlock description: >- Large, pte-mapped folios are ignored when mlock is requested. Code comment for mlock_vma_folio() says "...filter out pte mappings of THPs, which cannot be consistently counted: a pte mapping of the THP head cannot be distinguished by the page alone." location: - mlock_pte_range() - mlock_vma_folio() assignee: Yin, Fengwei - item: numa balancing description: >- Large, pte-mapped folios are ignored by numa-balancing code. Commit comment (e81c480): "We're going to have THP mapped with PTEs. It will confuse numabalancing. Let's skip them for now." location: - do_numa_page() assignee: <none> - item: madvise description: >- MADV_COLD, MADV_PAGEOUT, MADV_FREE: For large folios, code assumes exclusive only if mapcount==1, else skips remainder of operation. For large, pte-mapped folios, exclusive folios can have mapcount upto nr_pages and still be exclusive. Even better; don't split the folio if it fits entirely within the range? Discussion at https://lore.kernel.org/linux-mm/6cec6f68-248e-63b4-5615-9e0f3f819a0a@redhat.com/ talks about changing folio mapcounting - may help determine if exclusive without pgtable scan? location: - madvise_cold_or_pageout_pte_range() - madvise_free_pte_range() assignee: <none> - item: shrink_folio_list description: >- Raised by Yu Zhao; I can't see the problem in the code - need clarification location: - shrink_folio_list() assignee: <none> - item: compaction description: >- Raised at LSFMM: Compaction skips non-order-0 pages. Already problem for page-cache pages today. Is my understand correct? location: - <where?> assignee: <none> --- Thanks, Ryan
On 06.07.23 10:02, Ryan Roberts wrote: > On 05/07/2023 20:38, David Hildenbrand wrote: >> On 03.07.23 15:53, Ryan Roberts wrote: >>> Hi All, >>> >>> This is v2 of a series to implement variable order, large folios for anonymous >>> memory. The objective of this is to improve performance by allocating larger >>> chunks of memory during anonymous page faults. See [1] for background. >>> > > [...] > >>> Thanks, >>> Ryan >> >> Hi Ryan, >> >> is page migration already working as expected (what about page compaction?), and >> do we handle migration -ENOMEM when allocating a target page: do we split an >> fallback to 4k page migration? >> > > Hi David, All, Hi Ryan, thanks a lot for the list. But can you comment on the page migration part (IOW did you try it already)? For example, memory hotunplug, CMA, MCE handling, compaction all rely on page migration of something that was allocated using GFP_MOVABLE to actually work. Compaction seems to skip any higher-order folios, but the question is if the udnerlying migration itself works. If it already works: great! If not, this really has to be tackled early, because otherwise we'll be breaking the GFP_MOVABLE semantics. > > This series aims to be the bare minimum to demonstrate allocation of large anon > folios. As such, there is a laundry list of things that need to be done for this > feature to play nicely with other features. My preferred route is to merge this > with it's Kconfig defaulted to disabled, and its Kconfig description clearly > shouting that it's EXPERIMENTAL with an explanation of why (similar to > READ_ONLY_THP_FOR_FS). As long as we are not sure about the user space control and as long as basic functionality is not working (example, page migration), I would tend to not merge this early just for the sake of it. But yes, something like mlock can eventually be tackled later: as long as there is a runtime interface to disable it ;) > > That said, I've put together a table of the items that I'm aware of that need > attention. It would be great if people can review and add any missing items. > Then we can hopefully parallelize the implementation work. David, I don't think > the items you raised are covered - would you mind providing a bit more detail so > I can add them to the list? (or just add them to the list yourself, if you prefer). > > --- > > - item: > mlock > > description: >- > Large, pte-mapped folios are ignored when mlock is requested. Code comment > for mlock_vma_folio() says "...filter out pte mappings of THPs, which > cannot be consistently counted: a pte mapping of the THP head cannot be > distinguished by the page alone." > > location: > - mlock_pte_range() > - mlock_vma_folio() > > assignee: > Yin, Fengwei > > > - item: > numa balancing > > description: >- > Large, pte-mapped folios are ignored by numa-balancing code. Commit > comment (e81c480): "We're going to have THP mapped with PTEs. It will > confuse numabalancing. Let's skip them for now." > > location: > - do_numa_page() > > assignee: > <none> > > > - item: > madvise > > description: >- > MADV_COLD, MADV_PAGEOUT, MADV_FREE: For large folios, code assumes > exclusive only if mapcount==1, else skips remainder of operation. For > large, pte-mapped folios, exclusive folios can have mapcount upto nr_pages > and still be exclusive. Even better; don't split the folio if it fits > entirely within the range? Discussion at > > https://lore.kernel.org/linux-mm/6cec6f68-248e-63b4-5615-9e0f3f819a0a@redhat.com/ > talks about changing folio mapcounting - may help determine if exclusive > without pgtable scan? > > location: > - madvise_cold_or_pageout_pte_range() > - madvise_free_pte_range() > > assignee: > <none> > > > - item: > shrink_folio_list > > description: >- > Raised by Yu Zhao; I can't see the problem in the code - need > clarification > > location: > - shrink_folio_list() > > assignee: > <none> > > > - item: > compaction > > description: >- > Raised at LSFMM: Compaction skips non-order-0 pages. Already problem for > page-cache pages today. Is my understand correct? > > location: > - <where?> > > assignee: > <none> I'm still thinking about the whole mapcount thingy (and I burned way too much time on that yesterday), which is a big item for such a list and affects some of these items. A pagetable scan is pretty much irrelevant for order-2 pages. But once we're talking about higher orders we really don't want to do that. I'm preparing a writeup with users and challenges. Is swapping working as expected? zswap?
On Fri, Jul 07, 2023 at 01:40:53PM +0200, David Hildenbrand wrote: > On 06.07.23 10:02, Ryan Roberts wrote: > But can you comment on the page migration part (IOW did you try it already)? > > For example, memory hotunplug, CMA, MCE handling, compaction all rely on > page migration of something that was allocated using GFP_MOVABLE to actually > work. > > Compaction seems to skip any higher-order folios, but the question is if the > udnerlying migration itself works. > > If it already works: great! If not, this really has to be tackled early, > because otherwise we'll be breaking the GFP_MOVABLE semantics. I have looked at this a bit. _Migration_ should be fine. _Compaction_ is not. If you look at a function like folio_migrate_mapping(), it all seems appropriately folio-ised. There might be something in there that is slightly wrong, but that would just be a bug to fix, not a huge architectural problem. The problem comes in the callers of migrate_pages(). They pass a new_folio_t callback. alloc_migration_target() is the usual one passed and as far as I can tell is fine. I've seen no problems reported with it. compaction_alloc() is a disaster, and I don't know how to fix it. The compaction code has its own allocator which is populated with order-0 folios. How it populates that freelist is awful ... see split_map_pages() > Is swapping working as expected? zswap? Suboptimally. Swap will split folios in order to swap them. Somebody needs to fix that, but it should work.
On 07.07.23 15:12, Matthew Wilcox wrote: > On Fri, Jul 07, 2023 at 01:40:53PM +0200, David Hildenbrand wrote: >> On 06.07.23 10:02, Ryan Roberts wrote: >> But can you comment on the page migration part (IOW did you try it already)? >> >> For example, memory hotunplug, CMA, MCE handling, compaction all rely on >> page migration of something that was allocated using GFP_MOVABLE to actually >> work. >> >> Compaction seems to skip any higher-order folios, but the question is if the >> udnerlying migration itself works. >> >> If it already works: great! If not, this really has to be tackled early, >> because otherwise we'll be breaking the GFP_MOVABLE semantics. > > I have looked at this a bit. _Migration_ should be fine. _Compaction_ > is not. Thanks! Very nice if at least ordinary migration works. > > If you look at a function like folio_migrate_mapping(), it all seems > appropriately folio-ised. There might be something in there that is > slightly wrong, but that would just be a bug to fix, not a huge > architectural problem. > > The problem comes in the callers of migrate_pages(). They pass a > new_folio_t callback. alloc_migration_target() is the usual one passed > and as far as I can tell is fine. I've seen no problems reported with it. > > compaction_alloc() is a disaster, and I don't know how to fix it. > The compaction code has its own allocator which is populated with order-0 > folios. How it populates that freelist is awful ... see split_map_pages() Yeah, all that code was written under the assumption that we're moving order-0 pages (which is what the anon+pagecache pages part). From what I recall, we're allocating order-0 pages from the high memory addresses, so we can migrate from low memory addresses, effectively freeing up low memory addresses and filling high memory addresses. Adjusting that will be ... interesting. Instead of allocating order-0 pages from high addresses, we might want to allocate "as large as possible" ("grab what we can") from high addresses and then have our own kind of buddy for allocating from that pool a compaction destination page, depending on our source page. Nasty. What should always work is the split->migrate. But that's definitely not what we want in many cases. > >> Is swapping working as expected? zswap? > > Suboptimally. Swap will split folios in order to swap them. Somebody > needs to fix that, but it should work. Good! It would be great to have some kind of a feature matrix that tells us what works perfectly, sub-optimally, barely, not at all (and what has not been tested). Maybe (likely!) we'll also find things that are sub-optimal for ordinary THP (like swapping, not even sure about). I suspect that KSM should work mostly fine with flexible-thp. When deduplciating, we'll simply split the compound page and proceed as expected. But might be worth testing as well.
On 07/07/2023 14:24, David Hildenbrand wrote: > On 07.07.23 15:12, Matthew Wilcox wrote: >> On Fri, Jul 07, 2023 at 01:40:53PM +0200, David Hildenbrand wrote: >>> On 06.07.23 10:02, Ryan Roberts wrote: >>> But can you comment on the page migration part (IOW did you try it already)? >>> >>> For example, memory hotunplug, CMA, MCE handling, compaction all rely on >>> page migration of something that was allocated using GFP_MOVABLE to actually >>> work. >>> >>> Compaction seems to skip any higher-order folios, but the question is if the >>> udnerlying migration itself works. >>> >>> If it already works: great! If not, this really has to be tackled early, >>> because otherwise we'll be breaking the GFP_MOVABLE semantics. >> >> I have looked at this a bit. _Migration_ should be fine. _Compaction_ >> is not. > > Thanks! Very nice if at least ordinary migration works. That's good to hear - I hadn't personally investigated. > >> >> If you look at a function like folio_migrate_mapping(), it all seems >> appropriately folio-ised. There might be something in there that is >> slightly wrong, but that would just be a bug to fix, not a huge >> architectural problem. >> >> The problem comes in the callers of migrate_pages(). They pass a >> new_folio_t callback. alloc_migration_target() is the usual one passed >> and as far as I can tell is fine. I've seen no problems reported with it. >> >> compaction_alloc() is a disaster, and I don't know how to fix it. >> The compaction code has its own allocator which is populated with order-0 >> folios. How it populates that freelist is awful ... see split_map_pages() I think this compaction issue also affects large folios in the page cache? So really it is a pre-existing bug in the code base that needs to be fixed independently of large anon folios? Should I assume you are tackling this, Matthew? > > Yeah, all that code was written under the assumption that we're moving order-0 > pages (which is what the anon+pagecache pages part). > > From what I recall, we're allocating order-0 pages from the high memory > addresses, so we can migrate from low memory addresses, effectively freeing up > low memory addresses and filling high memory addresses. > > Adjusting that will be ... interesting. Instead of allocating order-0 pages from > high addresses, we might want to allocate "as large as possible" ("grab what we > can") from high addresses and then have our own kind of buddy for allocating > from that pool a compaction destination page, depending on our source page. Nasty. > > What should always work is the split->migrate. But that's definitely not what we > want in many cases. > >> >>> Is swapping working as expected? zswap? >> >> Suboptimally. Swap will split folios in order to swap them. Somebody >> needs to fix that, but it should work. > > Good! > > It would be great to have some kind of a feature matrix that tells us what works > perfectly, sub-optimally, barely, not at all (and what has not been tested). > Maybe (likely!) we'll also find things that are sub-optimal for ordinary THP > (like swapping, not even sure about). I'm building a list of known issues, but so far it has been based on code I've found during review and things raised by people in these threads. Are there test suites that explicitly test these features? If so I'll happily run them against large anon folios, but at the moment I'm ignorant I'm afraid. I have been trying to get mm selftests up and running, but I currently have a bunch of failures on arm64, even without any of my patches - somthing I'm working through. > > I suspect that KSM should work mostly fine with flexible-thp. When > deduplciating, we'll simply split the compound page and proceed as expected. But > might be worth testing as well. >
On 7 Jul 2023, at 9:24, David Hildenbrand wrote: > On 07.07.23 15:12, Matthew Wilcox wrote: >> On Fri, Jul 07, 2023 at 01:40:53PM +0200, David Hildenbrand wrote: >>> On 06.07.23 10:02, Ryan Roberts wrote: >>> But can you comment on the page migration part (IOW did you try it already)? >>> >>> For example, memory hotunplug, CMA, MCE handling, compaction all rely on >>> page migration of something that was allocated using GFP_MOVABLE to actually >>> work. >>> >>> Compaction seems to skip any higher-order folios, but the question is if the >>> udnerlying migration itself works. >>> >>> If it already works: great! If not, this really has to be tackled early, >>> because otherwise we'll be breaking the GFP_MOVABLE semantics. >> >> I have looked at this a bit. _Migration_ should be fine. _Compaction_ >> is not. > > Thanks! Very nice if at least ordinary migration works. > >> >> If you look at a function like folio_migrate_mapping(), it all seems >> appropriately folio-ised. There might be something in there that is >> slightly wrong, but that would just be a bug to fix, not a huge >> architectural problem. >> >> The problem comes in the callers of migrate_pages(). They pass a >> new_folio_t callback. alloc_migration_target() is the usual one passed >> and as far as I can tell is fine. I've seen no problems reported with it. >> >> compaction_alloc() is a disaster, and I don't know how to fix it. >> The compaction code has its own allocator which is populated with order-0 >> folios. How it populates that freelist is awful ... see split_map_pages() > > Yeah, all that code was written under the assumption that we're moving order-0 pages (which is what the anon+pagecache pages part). > > From what I recall, we're allocating order-0 pages from the high memory addresses, so we can migrate from low memory addresses, effectively freeing up low memory addresses and filling high memory addresses. > > Adjusting that will be ... interesting. Instead of allocating order-0 pages from high addresses, we might want to allocate "as large as possible" ("grab what we can") from high addresses and then have our own kind of buddy for allocating from that pool a compaction destination page, depending on our source page. Nasty. We probably do not need a pool, since before migration, we have isolated folios to be migrated and can come up with a stats on how many folios there are at each order. Then, we can isolate free pages based on the stats and do not split free pages all the way down to order-0. We can sort the source folios based on their orders and isolate free pages from largest order to smallest order. That could avoid a free page pool. -- Best Regards, Yan, Zi
On Mon, Jul 10, 2023 at 11:07:47AM +0100, Ryan Roberts wrote: > I think this compaction issue also affects large folios in the page cache? So > really it is a pre-existing bug in the code base that needs to be fixed > independently of large anon folios? Should I assume you are tackling this, Matthew? It does need to be fixed independently of large anon folios. Said fix should probably be backported to 6.1 once it's suitably stable. However, I'm not working on it. I have a lot of projects and this one's a missed-opportunity, not a show-stopper. Sounds like Zi Yan might be interested in tackling it though!
On Fri, Jul 07, 2023 at 02:12:01PM +0100, Matthew Wilcox wrote: > On Fri, Jul 07, 2023 at 01:40:53PM +0200, David Hildenbrand wrote: > > > Is swapping working as expected? zswap? > > Suboptimally. Swap will split folios in order to swap them. Wouldn't that mean if high order folios are used a lot but swap is also used, until this is fixed you wouldn't get the expected reclaim gains for high order folios and we'd need compaction more then? > Somebody needs to fix that, but it should work. As we look at shmem stuff it was on the path so something we have considered doing. Ie, it's on our team's list of items to help with but currently on a backburner. Luis
On Tue, Jul 11, 2023 at 02:11:19PM -0700, Luis Chamberlain wrote: > On Fri, Jul 07, 2023 at 02:12:01PM +0100, Matthew Wilcox wrote: > > On Fri, Jul 07, 2023 at 01:40:53PM +0200, David Hildenbrand wrote: > > > > > Is swapping working as expected? zswap? > > > > Suboptimally. Swap will split folios in order to swap them. > > Wouldn't that mean if high order folios are used a lot but swap is also > used, until this is fixed you wouldn't get the expected reclaim gains > for high order folios and we'd need compaction more then? They're split in shrink_folio_list(), so they stay intact until that point? > > Somebody needs to fix that, but it should work. > > As we look at shmem stuff it was on the path so something we have > considered doing. Ie, it's on our team's list of items to help with > but currently on a backburner. Something I was thinking about is that you'll need to prohibit swap devices or swap files being created on large block devices. Until we rewrite the entire swap subsystem ...
On 10/07/2023 17:53, Zi Yan wrote: > On 7 Jul 2023, at 9:24, David Hildenbrand wrote: > >> On 07.07.23 15:12, Matthew Wilcox wrote: >>> On Fri, Jul 07, 2023 at 01:40:53PM +0200, David Hildenbrand wrote: >>>> On 06.07.23 10:02, Ryan Roberts wrote: >>>> But can you comment on the page migration part (IOW did you try it already)? >>>> >>>> For example, memory hotunplug, CMA, MCE handling, compaction all rely on >>>> page migration of something that was allocated using GFP_MOVABLE to actually >>>> work. >>>> >>>> Compaction seems to skip any higher-order folios, but the question is if the >>>> udnerlying migration itself works. >>>> >>>> If it already works: great! If not, this really has to be tackled early, >>>> because otherwise we'll be breaking the GFP_MOVABLE semantics. >>> >>> I have looked at this a bit. _Migration_ should be fine. _Compaction_ >>> is not. >> >> Thanks! Very nice if at least ordinary migration works. >> >>> >>> If you look at a function like folio_migrate_mapping(), it all seems >>> appropriately folio-ised. There might be something in there that is >>> slightly wrong, but that would just be a bug to fix, not a huge >>> architectural problem. >>> >>> The problem comes in the callers of migrate_pages(). They pass a >>> new_folio_t callback. alloc_migration_target() is the usual one passed >>> and as far as I can tell is fine. I've seen no problems reported with it. >>> >>> compaction_alloc() is a disaster, and I don't know how to fix it. >>> The compaction code has its own allocator which is populated with order-0 >>> folios. How it populates that freelist is awful ... see split_map_pages() >> >> Yeah, all that code was written under the assumption that we're moving order-0 pages (which is what the anon+pagecache pages part). >> >> From what I recall, we're allocating order-0 pages from the high memory addresses, so we can migrate from low memory addresses, effectively freeing up low memory addresses and filling high memory addresses. >> >> Adjusting that will be ... interesting. Instead of allocating order-0 pages from high addresses, we might want to allocate "as large as possible" ("grab what we can") from high addresses and then have our own kind of buddy for allocating from that pool a compaction destination page, depending on our source page. Nasty. > > We probably do not need a pool, since before migration, we have isolated folios to > be migrated and can come up with a stats on how many folios there are at each order. > Then, we can isolate free pages based on the stats and do not split free pages > all the way down to order-0. We can sort the source folios based on their orders > and isolate free pages from largest order to smallest order. That could avoid > a free page pool. Hi Zi, I just wanted to check; is this something you are working on or planning to work on? I'm trying to maintain a list of all the items that need to get sorted for large anon folios. It would be great to put your name against it! ;-) > > -- > Best Regards, > Yan, Zi
On 19 Jul 2023, at 11:49, Ryan Roberts wrote: > On 10/07/2023 17:53, Zi Yan wrote: >> On 7 Jul 2023, at 9:24, David Hildenbrand wrote: >> >>> On 07.07.23 15:12, Matthew Wilcox wrote: >>>> On Fri, Jul 07, 2023 at 01:40:53PM +0200, David Hildenbrand wrote: >>>>> On 06.07.23 10:02, Ryan Roberts wrote: >>>>> But can you comment on the page migration part (IOW did you try it already)? >>>>> >>>>> For example, memory hotunplug, CMA, MCE handling, compaction all rely on >>>>> page migration of something that was allocated using GFP_MOVABLE to actually >>>>> work. >>>>> >>>>> Compaction seems to skip any higher-order folios, but the question is if the >>>>> udnerlying migration itself works. >>>>> >>>>> If it already works: great! If not, this really has to be tackled early, >>>>> because otherwise we'll be breaking the GFP_MOVABLE semantics. >>>> >>>> I have looked at this a bit. _Migration_ should be fine. _Compaction_ >>>> is not. >>> >>> Thanks! Very nice if at least ordinary migration works. >>> >>>> >>>> If you look at a function like folio_migrate_mapping(), it all seems >>>> appropriately folio-ised. There might be something in there that is >>>> slightly wrong, but that would just be a bug to fix, not a huge >>>> architectural problem. >>>> >>>> The problem comes in the callers of migrate_pages(). They pass a >>>> new_folio_t callback. alloc_migration_target() is the usual one passed >>>> and as far as I can tell is fine. I've seen no problems reported with it. >>>> >>>> compaction_alloc() is a disaster, and I don't know how to fix it. >>>> The compaction code has its own allocator which is populated with order-0 >>>> folios. How it populates that freelist is awful ... see split_map_pages() >>> >>> Yeah, all that code was written under the assumption that we're moving order-0 pages (which is what the anon+pagecache pages part). >>> >>> From what I recall, we're allocating order-0 pages from the high memory addresses, so we can migrate from low memory addresses, effectively freeing up low memory addresses and filling high memory addresses. >>> >>> Adjusting that will be ... interesting. Instead of allocating order-0 pages from high addresses, we might want to allocate "as large as possible" ("grab what we can") from high addresses and then have our own kind of buddy for allocating from that pool a compaction destination page, depending on our source page. Nasty. >> >> We probably do not need a pool, since before migration, we have isolated folios to >> be migrated and can come up with a stats on how many folios there are at each order. >> Then, we can isolate free pages based on the stats and do not split free pages >> all the way down to order-0. We can sort the source folios based on their orders >> and isolate free pages from largest order to smallest order. That could avoid >> a free page pool. > > Hi Zi, I just wanted to check; is this something you are working on or planning > to work on? I'm trying to maintain a list of all the items that need to get > sorted for large anon folios. It would be great to put your name against it! ;-) Sure. I can work on this one. -- Best Regards, Yan, Zi
On 19/07/2023 17:05, Zi Yan wrote: > On 19 Jul 2023, at 11:49, Ryan Roberts wrote: > >> On 10/07/2023 17:53, Zi Yan wrote: >>> On 7 Jul 2023, at 9:24, David Hildenbrand wrote: >>> >>>> On 07.07.23 15:12, Matthew Wilcox wrote: >>>>> On Fri, Jul 07, 2023 at 01:40:53PM +0200, David Hildenbrand wrote: >>>>>> On 06.07.23 10:02, Ryan Roberts wrote: >>>>>> But can you comment on the page migration part (IOW did you try it already)? >>>>>> >>>>>> For example, memory hotunplug, CMA, MCE handling, compaction all rely on >>>>>> page migration of something that was allocated using GFP_MOVABLE to actually >>>>>> work. >>>>>> >>>>>> Compaction seems to skip any higher-order folios, but the question is if the >>>>>> udnerlying migration itself works. >>>>>> >>>>>> If it already works: great! If not, this really has to be tackled early, >>>>>> because otherwise we'll be breaking the GFP_MOVABLE semantics. >>>>> >>>>> I have looked at this a bit. _Migration_ should be fine. _Compaction_ >>>>> is not. >>>> >>>> Thanks! Very nice if at least ordinary migration works. >>>> >>>>> >>>>> If you look at a function like folio_migrate_mapping(), it all seems >>>>> appropriately folio-ised. There might be something in there that is >>>>> slightly wrong, but that would just be a bug to fix, not a huge >>>>> architectural problem. >>>>> >>>>> The problem comes in the callers of migrate_pages(). They pass a >>>>> new_folio_t callback. alloc_migration_target() is the usual one passed >>>>> and as far as I can tell is fine. I've seen no problems reported with it. >>>>> >>>>> compaction_alloc() is a disaster, and I don't know how to fix it. >>>>> The compaction code has its own allocator which is populated with order-0 >>>>> folios. How it populates that freelist is awful ... see split_map_pages() >>>> >>>> Yeah, all that code was written under the assumption that we're moving order-0 pages (which is what the anon+pagecache pages part). >>>> >>>> From what I recall, we're allocating order-0 pages from the high memory addresses, so we can migrate from low memory addresses, effectively freeing up low memory addresses and filling high memory addresses. >>>> >>>> Adjusting that will be ... interesting. Instead of allocating order-0 pages from high addresses, we might want to allocate "as large as possible" ("grab what we can") from high addresses and then have our own kind of buddy for allocating from that pool a compaction destination page, depending on our source page. Nasty. >>> >>> We probably do not need a pool, since before migration, we have isolated folios to >>> be migrated and can come up with a stats on how many folios there are at each order. >>> Then, we can isolate free pages based on the stats and do not split free pages >>> all the way down to order-0. We can sort the source folios based on their orders >>> and isolate free pages from largest order to smallest order. That could avoid >>> a free page pool. >> >> Hi Zi, I just wanted to check; is this something you are working on or planning >> to work on? I'm trying to maintain a list of all the items that need to get >> sorted for large anon folios. It would be great to put your name against it! ;-) > > Sure. I can work on this one. Awesome - thanks! > > -- > Best Regards, > Yan, Zi