Message ID: 1550045191-27483-1-git-send-email-anshuman.khandual@arm.com
Series: mm: Introduce lazy exec permission setting on a page
On Wed, Feb 13, 2019 at 01:36:27PM +0530, Anshuman Khandual wrote: > Setting an exec permission on a page normally triggers I-cache invalidation > which might be expensive. I-cache invalidation is not mandatory on a given > page if there is no immediate exec access on it. Non-fault modification of > user page table from generic memory paths like migration can be improved if > setting of the exec permission on the page can be deferred till actual use. > There was a performance report [1] which highlighted the problem. [...] > [1] http://lists.infradead.org/pipermail/linux-arm-kernel/2018-December/620357.html FTR, this performance regression has been addressed by commit 132fdc379eb1 ("arm64: Do not issue IPIs for user executable ptes"). That said, I still think this patch series is valuable for further optimising the page migration path on arm64 (and can be extended to other architectures that currently require I/D cache maintenance for executable pages). BTW, if you are going to post new versions of this series, please include linux-arch and linux-arm-kernel.
On Wed 13-02-19 11:21:36, Catalin Marinas wrote: > On Wed, Feb 13, 2019 at 01:36:27PM +0530, Anshuman Khandual wrote: > > Setting an exec permission on a page normally triggers I-cache invalidation > > which might be expensive. I-cache invalidation is not mandatory on a given > > page if there is no immediate exec access on it. Non-fault modification of > > user page table from generic memory paths like migration can be improved if > > setting of the exec permission on the page can be deferred till actual use. > > There was a performance report [1] which highlighted the problem. > [...] > > [1] http://lists.infradead.org/pipermail/linux-arm-kernel/2018-December/620357.html > > FTR, this performance regression has been addressed by commit > 132fdc379eb1 ("arm64: Do not issue IPIs for user executable ptes"). That > said, I still think this patch series is valuable for further optimising > the page migration path on arm64 (and can be extended to other > architectures that currently require I/D cache maintenance for > executable pages). Are there any numbers to show the optimization impact?
On 2/13/19 12:06 AM, Anshuman Khandual wrote: > Setting an exec permission on a page normally triggers I-cache invalidation > which might be expensive. I-cache invalidation is not mandatory on a given > page if there is no immediate exec access on it. Non-fault modification of > user page table from generic memory paths like migration can be improved if > setting of the exec permission on the page can be deferred till actual use. > There was a performance report [1] which highlighted the problem. How does this happen? If the page was not executed, then it'll (presumably) be non-present which won't require icache invalidation. So, this would only be for pages that have been executed (and won't again before the next migration), *or* for pages that were mapped executable but never executed. Any idea which one it is? If it's pages that got mapped in but were never executed, how did that happen? Was it fault-around? If so, maybe it would just be simpler to not do fault-around for executable pages on these platforms.
On 02/13/2019 09:14 PM, Dave Hansen wrote: > On 2/13/19 12:06 AM, Anshuman Khandual wrote: >> Setting an exec permission on a page normally triggers I-cache invalidation >> which might be expensive. I-cache invalidation is not mandatory on a given >> page if there is no immediate exec access on it. Non-fault modification of >> user page table from generic memory paths like migration can be improved if >> setting of the exec permission on the page can be deferred till actual use. >> There was a performance report [1] which highlighted the problem. > > How does this happen? If the page was not executed, then it'll > (presumably) be non-present which won't require icache invalidation. > So, this would only be for pages that have been executed (and won't > again before the next migration), *or* for pages that were mapped > executable but never executed. I-cache invalidation happens while migrating a 'mapped and executable' page irrespective whether that page was really executed for being mapped there in the first place. > > Any idea which one it is? > I am not sure about this particular reported case. But was able to reproduce the problem through a test case where a buffer was mapped with R|W|X, get it faulted/mapped through write, migrate and then execute from it. > If it's pages that got mapped in but were never executed, how did that > happen? Was it fault-around? If so, maybe it would just be simpler to > not do fault-around for executable pages on these platforms. Page can get mapped through a different access (write) without being executed. Even if it got mapped through execution and subsequent invalidation, the invalidation does not have to be repeated again after migration without first getting an exec access subsequently. This series just tries to hold off the invalidation after migration till subsequent exec access.
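For readers trying to picture the scenario described above, a rough user-space
sketch of that kind of test might look like the following. This is a
hypothetical reconstruction rather than the actual test case from the report;
it assumes an aarch64 machine with at least two NUMA nodes and libnuma for
move_pages() (link with -lnuma), and it omits error handling.

```c
#define _GNU_SOURCE
#include <string.h>
#include <sys/mman.h>
#include <unistd.h>
#include <numaif.h>		/* move_pages(), MPOL_MF_MOVE */

int main(void)
{
	size_t len = getpagesize();
	unsigned int ret_insn = 0xd65f03c0;	/* aarch64 RET */
	int node = 1, status = -1;

	/* Buffer mapped R|W|X, as in the scenario described above. */
	void *buf = mmap(NULL, len, PROT_READ | PROT_WRITE | PROT_EXEC,
			 MAP_PRIVATE | MAP_ANONYMOUS, -1, 0);

	/* Fault the page in through a write; the pte is already executable. */
	memcpy(buf, &ret_insn, sizeof(ret_insn));
	__builtin___clear_cache((char *)buf, (char *)buf + sizeof(ret_insn));

	/* Migrate the page to another node; set_pte_at() runs in this path. */
	move_pages(0, 1, &buf, &node, &status, MPOL_MF_MOVE);

	/* First exec access after migration. */
	((void (*)(void))buf)();
	return 0;
}
```

With the series applied, it is this final call into the buffer that would take
the deferred exec fault and pay the I-cache maintenance cost, instead of the
move_pages() step.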
On 02/13/2019 09:08 PM, Michal Hocko wrote:
> On Wed 13-02-19 11:21:36, Catalin Marinas wrote:
>> On Wed, Feb 13, 2019 at 01:36:27PM +0530, Anshuman Khandual wrote:
>>> Setting an exec permission on a page normally triggers I-cache invalidation
>>> which might be expensive. I-cache invalidation is not mandatory on a given
>>> page if there is no immediate exec access on it. Non-fault modification of
>>> user page table from generic memory paths like migration can be improved if
>>> setting of the exec permission on the page can be deferred till actual use.
>>> There was a performance report [1] which highlighted the problem.
>> [...]
>>> [1] http://lists.infradead.org/pipermail/linux-arm-kernel/2018-December/620357.html
>>
>> FTR, this performance regression has been addressed by commit
>> 132fdc379eb1 ("arm64: Do not issue IPIs for user executable ptes"). That
>> said, I still think this patch series is valuable for further optimising
>> the page migration path on arm64 (and can be extended to other
>> architectures that currently require I/D cache maintenance for
>> executable pages).
>
> Are there any numbers to show the optimization impact?

This series transfers execution cost linearly with nr_pages from migration path
to subsequent exec access path for normal, THP and HugeTLB pages. The experiment
is on mainline kernel (1f947a7a011fcceb14cb912f548) along with some patches for
HugeTLB and THP migration enablement on arm64 platform.

A. [Normal Pages]

nr_pages    migration1    migration2    execfault1    execfault2

    1000      7.000000      3.000000     24.000000     31.000000
    5000     38.000000     18.000000    127.000000    153.000000
   10000     80.000000     40.000000    289.000000    343.000000
   15000    120.000000     60.000000    435.000000    514.000000
   19900    159.000000     79.000000    576.000000    681.000000

B. [THP Pages]

nr_pages    migration1    migration2    execfault1    execfault2

      10     22.000000      3.000000    131.000000    146.000000
      30     72.000000     15.000000    443.000000    503.000000
      50    121.000000     24.000000    739.000000    837.000000
     100    242.000000     49.000000   1485.000000   1673.000000
     199    473.000000     98.000000   2685.000000   3327.000000

C. [HugeTLB Pages]

nr_pages    migration1    migration2    execfault1    execfault2

      10     97.000000     79.000000    125.000000    144.000000
      30    292.000000    235.000000    408.000000    463.000000
      50    487.000000    392.000000    674.000000    777.000000
     100    995.000000    802.000000   1480.000000   1671.000000
     130   1300.000000   1048.000000   1925.000000   2172.000000

NOTE:
migration1: Execution time (ms) for migrating nr_pages without patches
migration2: Execution time (ms) for migrating nr_pages with patches
execfault1: Execution time (ms) for executing nr_pages without patches
execfault2: Execution time (ms) for executing nr_pages with patches
On Thu 14-02-19 11:34:09, Anshuman Khandual wrote: > > > On 02/13/2019 09:08 PM, Michal Hocko wrote: > > On Wed 13-02-19 11:21:36, Catalin Marinas wrote: > >> On Wed, Feb 13, 2019 at 01:36:27PM +0530, Anshuman Khandual wrote: > >>> Setting an exec permission on a page normally triggers I-cache invalidation > >>> which might be expensive. I-cache invalidation is not mandatory on a given > >>> page if there is no immediate exec access on it. Non-fault modification of > >>> user page table from generic memory paths like migration can be improved if > >>> setting of the exec permission on the page can be deferred till actual use. > >>> There was a performance report [1] which highlighted the problem. > >> [...] > >>> [1] http://lists.infradead.org/pipermail/linux-arm-kernel/2018-December/620357.html > >> > >> FTR, this performance regression has been addressed by commit > >> 132fdc379eb1 ("arm64: Do not issue IPIs for user executable ptes"). That > >> said, I still think this patch series is valuable for further optimising > >> the page migration path on arm64 (and can be extended to other > >> architectures that currently require I/D cache maintenance for > >> executable pages). > > > > Are there any numbers to show the optimization impact? > > This series transfers execution cost linearly with nr_pages from migration path > to subsequent exec access path for normal, THP and HugeTLB pages. The experiment > is on mainline kernel (1f947a7a011fcceb14cb912f548) along with some patches for > HugeTLB and THP migration enablement on arm64 platform. Please make sure that these numbers are in the changelog. I am also missing an explanation why this is an overal win. Why should we pay on the later access rather than the migration which is arguably a slower path. What is the usecase that benefits from the cost shift?
On Thu, Feb 14, 2019 at 09:38:44AM +0100, Michal Hocko wrote:
> On Thu 14-02-19 11:34:09, Anshuman Khandual wrote:
> > On 02/13/2019 09:08 PM, Michal Hocko wrote:
> > > Are there any numbers to show the optimization impact?
> >
> > This series transfers execution cost linearly with nr_pages from migration path
> > to subsequent exec access path for normal, THP and HugeTLB pages. The experiment
> > is on mainline kernel (1f947a7a011fcceb14cb912f548) along with some patches for
> > HugeTLB and THP migration enablement on arm64 platform.
>
> Please make sure that these numbers are in the changelog. I am also
> missing an explanation why this is an overal win. Why should we pay
> on the later access rather than the migration which is arguably a slower
> path. What is the usecase that benefits from the cost shift?

Originally the investigation started because of a regression we had
sending IPIs on each set_pte_at(PROT_EXEC). This has been fixed
separately, so the original value of this patchset has been diminished.

Trying to frame the problem, let's analyse the overall cost of migration
+ execute. Removing other invariants like cost of the initial mapping of
the pages or the mapping of new pages after migration, we have:

M - number of mapped executable pages just before migration
N - number of previously mapped pages that will be executed after
    migration (N <= M)
D - cost of migrating page data
I - cost of I-cache maintenance for a page
F - cost of an instruction fault (handle_mm_fault() + set_pte_at()
    without the actual I-cache maintenance)

Tc - total migration cost current kernel (including executing)
Tp - total migration cost patched kernel (including executing)

Tc = M * (D + I)
Tp = M * D + N * (F + I)

To be useful, we want this patchset to lead to:

Tp < Tc

Simplifying:

M * D + N * (F + I) < M * (D + I)
...
F < I * (M - N) / N

So the question is, in a *real-world* scenario, what proportion of the
mapped executable pages would still be executed from after migration.
I'd leave this as a task for Anshuman to investigate and come up with
some numbers (and it's fine if it's just in the noise, we won't need
this patchset).

Also note that there are ARM CPU implementations that don't need I-cache
maintenance (the I side can snoop the D side), so for those this patchset
introduces an additional cost. But we can make the decision in the arch
code via pte_mklazyexec(). We implemented something similar in arm64 KVM
(d0e22b4ac3ba "KVM: arm/arm64: Limit icache invalidation to prefetch
aborts") but the use-case was different: previously KVM considered all
pages executable though the vast majority were only data pages in guests.
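The posted patches are not reproduced in this archive, so purely as an
illustration of the mechanism being discussed: a non-fault path such as
migration would install the pte with the exec permission stripped, and the
first instruction fault afterwards would restore it, performing the I/D cache
maintenance only at that point. The sketch below borrows the pte_mklazyexec()
name Catalin mentions; the helper body (using arm64's set_pte_bit()/PTE_UXN
helpers) and the generic call site are assumptions for illustration, not the
actual series.

```c
/*
 * Sketch only, kernel context assumed. Defer exec permission (and hence
 * I-cache maintenance) until the first instruction fault after migration.
 */

/* arch/arm64 side (assumed implementation, not the posted patch) */
static inline pte_t pte_mklazyexec(pte_t pte)
{
	/* Strip user exec permission; no I-cache maintenance needed now. */
	return set_pte_bit(pte, __pgprot(PTE_UXN));
}

/* Generic migration path (assumed call site), just before set_pte_at(). */
static pte_t maybe_defer_exec(struct vm_area_struct *vma, pte_t pte)
{
	if (vma->vm_flags & VM_EXEC)
		pte = pte_mklazyexec(pte);
	return pte;
}

/*
 * A later exec access on the migrated page then takes an instruction
 * abort; the fault path would re-validate the pte with exec permission,
 * and set_pte_at() would perform the I/D cache maintenance for real.
 */
```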
On Thu 14-02-19 10:19:37, Catalin Marinas wrote:
> On Thu, Feb 14, 2019 at 09:38:44AM +0100, Michal Hocko wrote:
> > On Thu 14-02-19 11:34:09, Anshuman Khandual wrote:
> > > On 02/13/2019 09:08 PM, Michal Hocko wrote:
> > > > Are there any numbers to show the optimization impact?
> > >
> > > This series transfers execution cost linearly with nr_pages from migration path
> > > to subsequent exec access path for normal, THP and HugeTLB pages. The experiment
> > > is on mainline kernel (1f947a7a011fcceb14cb912f548) along with some patches for
> > > HugeTLB and THP migration enablement on arm64 platform.
> >
> > Please make sure that these numbers are in the changelog. I am also
> > missing an explanation why this is an overal win. Why should we pay
> > on the later access rather than the migration which is arguably a slower
> > path. What is the usecase that benefits from the cost shift?
>
> Originally the investigation started because of a regression we had
> sending IPIs on each set_pte_at(PROT_EXEC). This has been fixed
> separately, so the original value of this patchset has been diminished.
>
> Trying to frame the problem, let's analyse the overall cost of migration
> + execute. Removing other invariants like cost of the initial mapping of
> the pages or the mapping of new pages after migration, we have:
>
> M - number of mapped executable pages just before migration
> N - number of previously mapped pages that will be executed after
>     migration (N <= M)
> D - cost of migrating page data
> I - cost of I-cache maintenance for a page
> F - cost of an instruction fault (handle_mm_fault() + set_pte_at()
>     without the actual I-cache maintenance)
>
> Tc - total migration cost current kernel (including executing)
> Tp - total migration cost patched kernel (including executing)
>
> Tc = M * (D + I)
> Tp = M * D + N * (F + I)
>
> To be useful, we want this patchset to lead to:
>
> Tp < Tc
>
> Simplifying:
>
> M * D + N * (F + I) < M * (D + I)
> ...
> F < I * (M - N) / N
>
> So the question is, in a *real-world* scenario, what proportion of the
> mapped executable pages would still be executed from after migration.
> I'd leave this as a task for Anshuman to investigate and come up with
> some numbers (and it's fine if it's just in the noise, we won't need
> this patchset).

Yeah, betting on accessing only a smaller subset of the migrated memory
is something I figured out. But I am really missing a usecase or a
larger set of them to actually benefit from it. We have different
triggers for a migration. E.g. numa balancing. I would expect that
migrated pages are likely to be accessed after migration because
the primary reason to migrate them is that they are accessed from a
remote node. Then we have compaction which is a completely different
story. It is hard to assume any further access for migrated pages here.
Then we have an explicit move_pages syscall and I would expect this to
be somewhere in the middle. One would expect that the caller knows why
the memory is migrated and it will be used but again, we cannot really
assume anything.

This would suggest that this depends on the migration reason quite a
lot. So I would really like to see a more comprehensive analysis of
different workloads to see whether this is really worth it.

Thanks!
On 2/13/19 10:04 PM, Anshuman Khandual wrote:
>> Are there any numbers to show the optimization impact?
> This series transfers execution cost linearly with nr_pages from migration path
> to subsequent exec access path for normal, THP and HugeTLB pages. The experiment
> is on mainline kernel (1f947a7a011fcceb14cb912f548) along with some patches for
> HugeTLB and THP migration enablement on arm64 platform.
>
> A. [Normal Pages]
>
> nr_pages    migration1    migration2    execfault1    execfault2
>
>     1000      7.000000      3.000000     24.000000     31.000000
>     5000     38.000000     18.000000    127.000000    153.000000
>    10000     80.000000     40.000000    289.000000    343.000000
>    15000    120.000000     60.000000    435.000000    514.000000
>    19900    159.000000     79.000000    576.000000    681.000000

Do these numbers comprehend the increased fault costs or just the
decreased migration costs?
On 2/13/19 8:12 PM, Anshuman Khandual wrote: > On 02/13/2019 09:14 PM, Dave Hansen wrote: >> On 2/13/19 12:06 AM, Anshuman Khandual wrote: >>> Setting an exec permission on a page normally triggers I-cache invalidation >>> which might be expensive. I-cache invalidation is not mandatory on a given >>> page if there is no immediate exec access on it. Non-fault modification of >>> user page table from generic memory paths like migration can be improved if >>> setting of the exec permission on the page can be deferred till actual use. >>> There was a performance report [1] which highlighted the problem. >> >> How does this happen? If the page was not executed, then it'll >> (presumably) be non-present which won't require icache invalidation. >> So, this would only be for pages that have been executed (and won't >> again before the next migration), *or* for pages that were mapped >> executable but never executed. > I-cache invalidation happens while migrating a 'mapped and executable' page > irrespective whether that page was really executed for being mapped there > in the first place. Ahh, got it. I also assume that the Accessed bit on these platforms is also managed similar to how we do it on x86 such that it can't be used to drive invalidation decisions? >> Any idea which one it is? > > I am not sure about this particular reported case. But was able to reproduce > the problem through a test case where a buffer was mapped with R|W|X, get it > faulted/mapped through write, migrate and then execute from it. Could you make sure, please? Write and Execute at the same time are generally a "bad idea". Given the hardware, I'm not surprised that this problem pops up, but it would be great to find out if this is a real application, or a "doctor it hurts when I do this." >> If it's pages that got mapped in but were never executed, how did that >> happen? Was it fault-around? If so, maybe it would just be simpler to >> not do fault-around for executable pages on these platforms. > Page can get mapped through a different access (write) without being executed. > Even if it got mapped through execution and subsequent invalidation, the > invalidation does not have to be repeated again after migration without first > getting an exec access subsequently. This series just tries to hold off the > invalidation after migration till subsequent exec access. This set generally seems to be assuming an environment with "lots of migration, and not much execution". That seems like a kinda odd situation to me.
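For contrast with the R|W|X reproducer discussed above, the usual run-time
code generation pattern Dave is referring to keeps write and exec permissions
disjoint: the buffer starts read+write and is flipped to read+exec once the
code has been emitted. A minimal user-space sketch (illustrative only, error
handling omitted):

```c
#include <string.h>
#include <sys/mman.h>
#include <unistd.h>

/* Emit code into a writable, non-executable buffer ... */
void *emit_code(const void *insns, size_t len)
{
	size_t sz = getpagesize();
	void *buf = mmap(NULL, sz, PROT_READ | PROT_WRITE,
			 MAP_PRIVATE | MAP_ANONYMOUS, -1, 0);

	memcpy(buf, insns, len);
	__builtin___clear_cache((char *)buf, (char *)buf + len);

	/* ... then flip it to read+exec before running it (W^X). */
	mprotect(buf, sz, PROT_READ | PROT_EXEC);
	return buf;
}
```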
On 02/14/2019 05:58 PM, Michal Hocko wrote:
> On Thu 14-02-19 10:19:37, Catalin Marinas wrote:
>> On Thu, Feb 14, 2019 at 09:38:44AM +0100, Michal Hocko wrote:
>>> On Thu 14-02-19 11:34:09, Anshuman Khandual wrote:
>>>> On 02/13/2019 09:08 PM, Michal Hocko wrote:
>>>>> Are there any numbers to show the optimization impact?
>>>>
>>>> This series transfers execution cost linearly with nr_pages from migration path
>>>> to subsequent exec access path for normal, THP and HugeTLB pages. The experiment
>>>> is on mainline kernel (1f947a7a011fcceb14cb912f548) along with some patches for
>>>> HugeTLB and THP migration enablement on arm64 platform.
>>>
>>> Please make sure that these numbers are in the changelog. I am also
>>> missing an explanation why this is an overal win. Why should we pay
>>> on the later access rather than the migration which is arguably a slower
>>> path. What is the usecase that benefits from the cost shift?
>>
>> Originally the investigation started because of a regression we had
>> sending IPIs on each set_pte_at(PROT_EXEC). This has been fixed
>> separately, so the original value of this patchset has been diminished.
>>
>> Trying to frame the problem, let's analyse the overall cost of migration
>> + execute. Removing other invariants like cost of the initial mapping of
>> the pages or the mapping of new pages after migration, we have:
>>
>> M - number of mapped executable pages just before migration
>> N - number of previously mapped pages that will be executed after
>>     migration (N <= M)
>> D - cost of migrating page data
>> I - cost of I-cache maintenance for a page
>> F - cost of an instruction fault (handle_mm_fault() + set_pte_at()
>>     without the actual I-cache maintenance)
>>
>> Tc - total migration cost current kernel (including executing)
>> Tp - total migration cost patched kernel (including executing)
>>
>> Tc = M * (D + I)
>> Tp = M * D + N * (F + I)
>>
>> To be useful, we want this patchset to lead to:
>>
>> Tp < Tc
>>
>> Simplifying:
>>
>> M * D + N * (F + I) < M * (D + I)
>> ...
>> F < I * (M - N) / N
>>
>> So the question is, in a *real-world* scenario, what proportion of the
>> mapped executable pages would still be executed from after migration.
>> I'd leave this as a task for Anshuman to investigate and come up with
>> some numbers (and it's fine if it's just in the noise, we won't need
>> this patchset).
>
> Yeah, betting on accessing only a smaller subset of the migrated memory
> is something I figured out. But I am really missing a usecase or a
> larger set of them to actually benefit from it. We have different
> triggers for a migration. E.g. numa balancing. I would expect that
> migrated pages are likely to be accessed after migration because
> the primary reason to migrate them is that they are accessed from a
> remote node. Then we have compaction which is a completely different story.

That access might not have been an exec fault; it could have been a bunch of
write faults which triggered the NUMA migration. So NUMA triggered migration
does not necessarily mean continuing exec faults before and after migration.
Compaction might move around mapped pages with exec permission which might
not have any recent history of exec accesses before compaction or might not
even see any future exec access as well.

> It is hard to assume any further access for migrated pages here. Then we
> have an explicit move_pages syscall and I would expect this to be
> somewhere in the middle.
> One would expect that the caller knows why the
> memory is migrated and it will be used but again, we cannot really
> assume anything.

What if the caller knows that it won't be used ever again or in the near
future, and hence tries to migrate it to a different node which has less
expensive and slower memory. The kernel should not assume either way but can
decide to be conservative about spending time preparing for future exec
faults.

But being conservative during migration risks additional exec faults which
would have been avoided had the exec permission stayed on, followed by an
I-cache invalidation. Deferral of the I-cache invalidation requires removing
the exec permission completely (unless there is some magic which I am not
aware about), i.e. unmapping the page for exec permission and risking an exec
fault next time around.

This problem gets particularly amplified for mixed permission (WRITE | EXEC)
user space mappings where things like NUMA migration, compaction etc. probably
get triggered by write faults and the additional exec permission there never
really gets used.

>
> This would suggest that this depends on the migration reason quite a
> lot. So I would really like to see a more comprehensive analysis of
> different workloads to see whether this is really worth it.

Sure. Could you please give some more details on how to go about this and
what specifically you are looking for? User initiated migration through
system calls seems a bit tricky as an application can be written primarily
to benefit from this series. If real world applications can help give some
better insights, then I wonder which ones. Or do we need to understand more
about compaction and NUMA triggered migration, which are kernel driven?
Statistics from compaction/NUMA migration can reveal what ratio of the exec
enabled mappings gets exec faulted again later on after kernel driven
migrations (compaction/NUMA), which are more or less random without depending
too much on application behavior.

- Anshuman
On Fri 15-02-19 14:15:58, Anshuman Khandual wrote: > On 02/14/2019 05:58 PM, Michal Hocko wrote: > > It is hard to assume any further access for migrated pages here. Then we > > have an explicit move_pages syscall and I would expect this to be > > somewhere in the middle. One would expect that the caller knows why the > > memory is migrated and it will be used but again, we cannot really > > assume anything. > > What if the caller knows that it wont be used ever again or in near future > and hence trying to migrate to a different node which has less expensive and > slower memory. Kernel should not assume either way on it but can decide to > be conservative in spending time in preparing for future exec faults. > > But being conservative during migration risks additional exec faults which > would have been avoided if exec permission should have stayed on followed > by an I-cache invalidation. Deferral of the I-cache invalidation requires > removing the exec permission completely (unless there is some magic which > I am not aware about) i.e unmapping page for exec permission and risking > an exec fault next time around. > > This problem gets particularly amplified for mixed permission (WRITE | EXEC) > user space mappings where things like NUMA migration, compaction etc probably > gets triggered by write faults and additional exec permission there never > really gets used. Please quantify that and provide us with some _data_ > > This would suggest that this depends on the migration reason quite a > > lot. So I would really like to see a more comprehensive analysis of > > different workloads to see whether this is really worth it. > > Sure. Could you please give some more details on how to go about this and > what specifically you are looking for ? You are proposing an optimization without actually providing any justification. The overhead is not removed it is just shifted from one path to another. So you should have some pretty convincing arguments to back that shift as a general win. You can go an test on wider range of workloads and isolate the worst/best case behavior. I fully realize that this is tedious. Another option would be to define conditions when the optimization is going to be a huge win and have some convincing arguments that many/most workloads are falling into that category while pathological ones are not suffering much. This is no different from any other optimizations/heuristics we have. Btw. have you considered to have this optimization conditional based on the migration reason or vma flags?
On 02/15/2019 02:57 PM, Michal Hocko wrote: > On Fri 15-02-19 14:15:58, Anshuman Khandual wrote: >> On 02/14/2019 05:58 PM, Michal Hocko wrote: >>> It is hard to assume any further access for migrated pages here. Then we >>> have an explicit move_pages syscall and I would expect this to be >>> somewhere in the middle. One would expect that the caller knows why the >>> memory is migrated and it will be used but again, we cannot really >>> assume anything. >> >> What if the caller knows that it wont be used ever again or in near future >> and hence trying to migrate to a different node which has less expensive and >> slower memory. Kernel should not assume either way on it but can decide to >> be conservative in spending time in preparing for future exec faults. >> >> But being conservative during migration risks additional exec faults which >> would have been avoided if exec permission should have stayed on followed >> by an I-cache invalidation. Deferral of the I-cache invalidation requires >> removing the exec permission completely (unless there is some magic which >> I am not aware about) i.e unmapping page for exec permission and risking >> an exec fault next time around. >> >> This problem gets particularly amplified for mixed permission (WRITE | EXEC) >> user space mappings where things like NUMA migration, compaction etc probably >> gets triggered by write faults and additional exec permission there never >> really gets used. > > Please quantify that and provide us with some _data_> >>> This would suggest that this depends on the migration reason quite a >>> lot. So I would really like to see a more comprehensive analysis of >>> different workloads to see whether this is really worth it. >> >> Sure. Could you please give some more details on how to go about this and >> what specifically you are looking for ? > > You are proposing an optimization without actually providing any > justification. The overhead is not removed it is just shifted from one > path to another. So you should have some pretty convincing arguments > to back that shift as a general win. You can go an test on wider range > of workloads and isolate the worst/best case behavior. I fully realize > that this is tedious. Another option would be to define conditions when > the optimization is going to be a huge win and have some convincing Yeah conditional approach might narrow down the field and provide better probability for a general win. The system call (move_pages/mbind) based migrations from the user space are better placed for an win because the user might just want to put those pages aside for rare exec accesses in the future and the worst case cost for those deferral is not too high as well. A hint regarding probable rare exec access in the future for the kernel would have been better but I am afraid it would then require a new user interface. But I think lazy exec decision can be taken right away for MR_SYSCALL triggered migrations for VMAs with mixed permission ([VM_READ]|VM_WRITE|VM_EXEC) knowing the fact that in worst case the cost is just getting migrated. MR_NUMA_MISPLACED triggered migrations requires explicit tracking of fault type (exec/write/[read]) per VMA along with it's applicable permission to determine if exec permission deferral would be helpful or not. These stats can also be used for all other kernel or user initiated migrations like MR_COMPACTION, MR_MEMORY_FAILURE, MR_MEMORY_HOTPLUG and MR_CONTIG_RANGE. Would it be worth adding explicit fault type tracking per VMA ? 
Can it be used for some other purpose as well. > arguments that many/most workloads are falling into that category while > pathological ones are not suffering much. > > This is no different from any other optimizations/heuristics we have. Sure. Will think about this further. > > Btw. have you considered to have this optimization conditional based on > the migration reason or vma flags? Started considering it after our discussions here. It makes sense to look into the migration reason and the VMA flags right away but as I mentioned earlier VMA fault type stats can really help on this as well.
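Michal's suggestion of keying the behaviour off the migration reason or the
VMA flags, combined with the breakdown above, could translate into a gate
roughly like the one below. The helper name, its placement and the policy
choices are assumptions for illustration only; the migrate_reason values and
VMA flags are the ones named in the discussion.

```c
#include <linux/migrate.h>
#include <linux/mm.h>

/*
 * Sketch: only defer exec permission (and the I-cache maintenance) when
 * the migration was explicitly requested from user space (move_pages/
 * mbind) and the mapping mixes write and exec permissions, where an
 * immediate exec access after migration is least likely.
 */
static bool defer_exec_on_migrate(enum migrate_reason reason,
				  struct vm_area_struct *vma)
{
	if (!(vma->vm_flags & VM_EXEC))
		return false;		/* nothing to defer */

	switch (reason) {
	case MR_SYSCALL:
		return vma->vm_flags & VM_WRITE;
	case MR_NUMA_MISPLACED:		/* likely to be accessed again soon */
	case MR_COMPACTION:
	default:
		return false;
	}
}
```

The worst case for a wrong guess here mirrors the trade-off discussed above:
one extra exec fault per page that does get executed again after migration.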
On 02/14/2019 09:08 PM, Dave Hansen wrote: > On 2/13/19 10:04 PM, Anshuman Khandual wrote: >>> Are there any numbers to show the optimization impact? >> This series transfers execution cost linearly with nr_pages from migration path >> to subsequent exec access path for normal, THP and HugeTLB pages. The experiment >> is on mainline kernel (1f947a7a011fcceb14cb912f548) along with some patches for >> HugeTLB and THP migration enablement on arm64 platform. >> >> A. [Normal Pages] >> >> nr_pages migration1 migration2 execfault1 execfault2 >> >> 1000 7.000000 3.000000 24.000000 31.000000 >> 5000 38.000000 18.000000 127.000000 153.000000 >> 10000 80.000000 40.000000 289.000000 343.000000 >> 15000 120.000000 60.000000 435.000000 514.000000 >> 19900 159.000000 79.000000 576.000000 681.000000 > > Do these numbers comprehend the increased fault costs or just the > decreased migration costs? Both. It transfers cost from migration path to exec fault path.
On 02/14/2019 10:25 PM, Dave Hansen wrote: > On 2/13/19 8:12 PM, Anshuman Khandual wrote: >> On 02/13/2019 09:14 PM, Dave Hansen wrote: >>> On 2/13/19 12:06 AM, Anshuman Khandual wrote: >>>> Setting an exec permission on a page normally triggers I-cache invalidation >>>> which might be expensive. I-cache invalidation is not mandatory on a given >>>> page if there is no immediate exec access on it. Non-fault modification of >>>> user page table from generic memory paths like migration can be improved if >>>> setting of the exec permission on the page can be deferred till actual use. >>>> There was a performance report [1] which highlighted the problem. >>> >>> How does this happen? If the page was not executed, then it'll >>> (presumably) be non-present which won't require icache invalidation. >>> So, this would only be for pages that have been executed (and won't >>> again before the next migration), *or* for pages that were mapped >>> executable but never executed. >> I-cache invalidation happens while migrating a 'mapped and executable' page >> irrespective whether that page was really executed for being mapped there >> in the first place. > > Ahh, got it. I also assume that the Accessed bit on these platforms is > also managed similar to how we do it on x86 such that it can't be used > to drive invalidation decisions? Drive I-cache invalidation ? Could you please elaborate on this. Is not that the access bit mechanism is to identify dirty pages after write faults when it is SW updated or write accesses when HW updated. In SW updated method, given PTE goes through pte_young() during page fault. Then how to differentiate exec fault/access from an write fault/access and decide to invalidate the I-cache. Just being curious. > >>> Any idea which one it is? >> >> I am not sure about this particular reported case. But was able to reproduce >> the problem through a test case where a buffer was mapped with R|W|X, get it >> faulted/mapped through write, migrate and then execute from it. > > Could you make sure, please? The test in the report [1] does not create any explicit PROT_EXEC maps and just attempts to migrate all pages of the process (which has 10 child processes) including the exec pages. So the only exec mappings would be the primary text segment and the mapped shared glibc segment. Looks like the shared libraries have some mapped pages. $cat /proc/[PID]/numa_maps | grep libc ffffaa4c9000 default file=/lib/aarch64-linux-gnu/libc-2.28.so mapped=150 mapmax=57 N0=150 kernelpagesize_kB=4 ffffaa621000 default file=/lib/aarch64-linux-gnu/libc-2.28.so ffffaa630000 default file=/lib/aarch64-linux-gnu/libc-2.28.so anon=4 dirty=4 mapmax=11 N0=4 kernelpagesize_kB=4 ffffaa634000 default file=/lib/aarch64-linux-gnu/libc-2.28.so anon=2 dirty=2 mapmax=11 N0=2 kernelpagesize_kB=4 Will keep looking into this. > > Write and Execute at the same time are generally a "bad idea". Given But wont this be the case for all run-time generate code which gets written to a buffer before being executed from there. > the hardware, I'm not surprised that this problem pops up, but it would > be great to find out if this is a real application, or a "doctor it > hurts when I do this." Is not that a problem though :) > >>> If it's pages that got mapped in but were never executed, how did that >>> happen? Was it fault-around? If so, maybe it would just be simpler to >>> not do fault-around for executable pages on these platforms. >> Page can get mapped through a different access (write) without being executed. 
>> Even if it got mapped through execution and subsequent invalidation, the >> invalidation does not have to be repeated again after migration without first >> getting an exec access subsequently. This series just tries to hold off the >> invalidation after migration till subsequent exec access. > > This set generally seems to be assuming an environment with "lots of > migration, and not much execution". That seems like a kinda odd > situation to me. Irrespective of the reported problem which is user driven, there are many kernel triggered migrations which can accumulate I-cache invalidation cost over time on a memory heavy system with high number of exec enabled user pages. Will that be such a rare situation ! [1] http://lists.infradead.org/pipermail/linux-arm-kernel/2018-December/620357.html
On Mon, Feb 18, 2019 at 02:01:55PM +0530, Anshuman Khandual wrote: > On 02/14/2019 10:25 PM, Dave Hansen wrote: > > On 2/13/19 8:12 PM, Anshuman Khandual wrote: > >> On 02/13/2019 09:14 PM, Dave Hansen wrote: > >>> On 2/13/19 12:06 AM, Anshuman Khandual wrote: > >>>> Setting an exec permission on a page normally triggers I-cache invalidation > >>>> which might be expensive. I-cache invalidation is not mandatory on a given > >>>> page if there is no immediate exec access on it. Non-fault modification of > >>>> user page table from generic memory paths like migration can be improved if > >>>> setting of the exec permission on the page can be deferred till actual use. > >>>> There was a performance report [1] which highlighted the problem. > >>> > >>> How does this happen? If the page was not executed, then it'll > >>> (presumably) be non-present which won't require icache invalidation. > >>> So, this would only be for pages that have been executed (and won't > >>> again before the next migration), *or* for pages that were mapped > >>> executable but never executed. > >> I-cache invalidation happens while migrating a 'mapped and executable' page > >> irrespective whether that page was really executed for being mapped there > >> in the first place. > > > > Ahh, got it. I also assume that the Accessed bit on these platforms is > > also managed similar to how we do it on x86 such that it can't be used > > to drive invalidation decisions? > > Drive I-cache invalidation ? Could you please elaborate on this. Is not that > the access bit mechanism is to identify dirty pages after write faults when > it is SW updated or write accesses when HW updated. In SW updated method, given > PTE goes through pte_young() during page fault. Then how to differentiate exec > fault/access from an write fault/access and decide to invalidate the I-cache. > Just being curious. The access flag is used to identify young/old pages only (the dirty bit is used to track writes to a page). Depending on the Arm implementation, the access bit/flag could be managed by hardware transparently, so no fault taken to the kernel on accessing through an 'old' pte.
On 02/18/2019 02:34 PM, Catalin Marinas wrote: > On Mon, Feb 18, 2019 at 02:01:55PM +0530, Anshuman Khandual wrote: >> On 02/14/2019 10:25 PM, Dave Hansen wrote: >>> On 2/13/19 8:12 PM, Anshuman Khandual wrote: >>>> On 02/13/2019 09:14 PM, Dave Hansen wrote: >>>>> On 2/13/19 12:06 AM, Anshuman Khandual wrote: >>>>>> Setting an exec permission on a page normally triggers I-cache invalidation >>>>>> which might be expensive. I-cache invalidation is not mandatory on a given >>>>>> page if there is no immediate exec access on it. Non-fault modification of >>>>>> user page table from generic memory paths like migration can be improved if >>>>>> setting of the exec permission on the page can be deferred till actual use. >>>>>> There was a performance report [1] which highlighted the problem. >>>>> >>>>> How does this happen? If the page was not executed, then it'll >>>>> (presumably) be non-present which won't require icache invalidation. >>>>> So, this would only be for pages that have been executed (and won't >>>>> again before the next migration), *or* for pages that were mapped >>>>> executable but never executed. >>>> I-cache invalidation happens while migrating a 'mapped and executable' page >>>> irrespective whether that page was really executed for being mapped there >>>> in the first place. >>> >>> Ahh, got it. I also assume that the Accessed bit on these platforms is >>> also managed similar to how we do it on x86 such that it can't be used >>> to drive invalidation decisions? >> >> Drive I-cache invalidation ? Could you please elaborate on this. Is not that >> the access bit mechanism is to identify dirty pages after write faults when >> it is SW updated or write accesses when HW updated. In SW updated method, given >> PTE goes through pte_young() during page fault. Then how to differentiate exec >> fault/access from an write fault/access and decide to invalidate the I-cache. >> Just being curious. > > The access flag is used to identify young/old pages only (the dirty bit > is used to track writes to a page). Depending on the Arm implementation, > the access bit/flag could be managed by hardware transparently, so no > fault taken to the kernel on accessing through an 'old' pte. Then there is no way to identify an exec fault with either of the facilities of access/reference bit or dirty bit whether managed by SW or HW. Still wondering about previous comment where Dave mentioned how it can be used for I-cache invalidation.
On 2/18/19 12:31 AM, Anshuman Khandual wrote: >> Ahh, got it. I also assume that the Accessed bit on these platforms is >> also managed similar to how we do it on x86 such that it can't be used >> to drive invalidation decisions? > > Drive I-cache invalidation ? Could you please elaborate on this. Is not that > the access bit mechanism is to identify dirty pages after write faults when > it is SW updated or write accesses when HW updated. In SW updated method, given > PTE goes through pte_young() during page fault. Then how to differentiate exec > fault/access from an write fault/access and decide to invalidate the I-cache. > Just being curious. Let's say this was on x86 where the Accessed bit is set by the hardware on any access. Let's also say that Linux invalidated the TLB any time that bit was cleared in software (it doesn't, but let's pretend it did). In that case, if we needed to do icache invalidation, we could optimize it by only invalidating the icache when we see the Accessed bit set. That's because any execution would first set the Accessed bit before the icache was populated. So, my question >>>> Any idea which one it is? >>> >>> I am not sure about this particular reported case. But was able to reproduce >>> the problem through a test case where a buffer was mapped with R|W|X, get it >>> faulted/mapped through write, migrate and then execute from it. >> >> Could you make sure, please? > > The test in the report [1] does not create any explicit PROT_EXEC maps and just > attempts to migrate all pages of the process (which has 10 child processes) > including the exec pages. So the only exec mappings would be the primary text > segment and the mapped shared glibc segment. Looks like the shared libraries > have some mapped pages. Yeah, but the executable ones are also read-only in your example: > $cat /proc/[PID]/numa_maps | grep libc > > ffffaa4c9000 default file=/lib/aarch64-linux-gnu/libc-2.28.so mapped=150 mapmax=57 N0=150 kernelpagesize_kB=4 ^ These are all page-cache, executable and read-only. > ffffaa621000 default file=/lib/aarch64-linux-gnu/libc-2.28.so > ffffaa630000 default file=/lib/aarch64-linux-gnu/libc-2.28.so anon=4 dirty=4 mapmax=11 N0=4 kernelpagesize_kB=4 > ffffaa634000 default file=/lib/aarch64-linux-gnu/libc-2.28.so anon=2 dirty=2 mapmax=11 N0=2 kernelpagesize_kB=4 This last one is the only read-write one and it's not executable. >> Write and Execute at the same time are generally a "bad idea". Given > > But wont this be the case for all run-time generate code which gets written to a > buffer before being executed from there. No. They usually are r=1,w=1,x=0, then transition to r=1,w=0,x=1. It's never simultaneously executable and writable. >> the hardware, I'm not surprised that this problem pops up, but it would >> be great to find out if this is a real application, or a "doctor it >> hurts when I do this." > > Is not that a problem though :) The point is that it's not a real-world problem. You can certainly expose this, but do *real* apps do this rather than something entirely synthetic? >> This set generally seems to be assuming an environment with "lots of >> migration, and not much execution". That seems like a kinda odd >> situation to me. > > Irrespective of the reported problem which is user driven, there are many kernel > triggered migrations which can accumulate I-cache invalidation cost over time on > a memory heavy system with high number of exec enabled user pages. Will that be > such a rare situation ! 
> > [1] http://lists.infradead.org/pipermail/linux-arm-kernel/2018-December/620357.html

I translate "trivial C application" to "highly synthetic microbenchmark". I
suspect what's happening here is that somebody wrote a micro that worked well
on x86, although it was being rather stupid. Somebody got an arm system, and
voila: it's slower. Someone says "Oh, this arm system is slower than x86!"

Again, the big question: do you have real-world applications with writable,
executable pages? The kernel essentially has *zero* of these because they're
such a massive security risk. Adding this feature will encourage folks to
replicate this massive security risk in userspace. Seems like a bad idea.
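Dave's Accessed-bit idea earlier in this message, in sketch form. This is
hypothetical: it assumes x86-like hardware-managed access bits that are only
cleared together with a TLB invalidation, which, as Catalin notes above, is
not how the access flag is currently used on arm64.

```c
#include <linux/mm.h>

/*
 * Sketch of the idea above: skip I-cache invalidation for ptes whose
 * accessed bit was never set, on the reasoning that the CPU would have
 * marked the page accessed before any of it could reach the I-cache.
 * Hypothetical only; this is not what Linux does today.
 */
static bool migration_needs_icache_flush(struct vm_area_struct *vma, pte_t pte)
{
	if (!(vma->vm_flags & VM_EXEC))
		return false;
	return pte_young(pte);
}
```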