Message ID | 534EE6F2.3050005@linaro.org (mailing list archive) |
---|---|
State | New, archived |
Headers | show |
From: David Long <dave.long@linaro.org> Date: Wed, 16 Apr 2014 16:24:18 -0400 > On 04/16/14 15:37, David Miller wrote: >> From: Oleg Nesterov <oleg@redhat.com> >> Date: Wed, 16 Apr 2014 21:18:25 +0200 >> >>> The last question... area->page = alloc_page(GFP_HIGHUSER), and I am >>> not sure that arch/arm/mm/highmem.c:kmap_atomic() can't break the >>> aliasing, __fix_to_virt() in this case will use the same (per-cpu) idx. >>> >>> Looks like, __kunmap_atomic()->__cpuc_flush_dcache_area() should take >>> care, but could you please ack/nack my understanding? >> >> Good point, it might therefore make sense to use a low-mem page. >> > > The following test code seems to have the same problems with stale user > icache. It works if I put the dcache flush back in. Am I missing > something? Weird, if we store to the kernel side it should be just a matter of clearing the I-cache out. There should be no D-cache aliasing whatsoever. Maybe you could print out area->vaddr and page_to_virt(area->page) so we can see if area->vaddr is choosen correctly? Although I notice that flush_cache_user_range() on ARM flushes both D and I caches. And I think that's what userspace ends up triggering when it uses the system call that exists to support self-modifying and JIT code generators. An ARM expert will have to chime in...
On 16 April 2014 14:21, David Miller <davem@davemloft.net> wrote: > From: David Long <dave.long@linaro.org> > Date: Wed, 16 Apr 2014 16:24:18 -0400 > >> On 04/16/14 15:37, David Miller wrote: >>> From: Oleg Nesterov <oleg@redhat.com> >>> Date: Wed, 16 Apr 2014 21:18:25 +0200 >>> >>>> The last question... area->page = alloc_page(GFP_HIGHUSER), and I am >>>> not sure that arch/arm/mm/highmem.c:kmap_atomic() can't break the >>>> aliasing, __fix_to_virt() in this case will use the same (per-cpu) idx. >>>> >>>> Looks like, __kunmap_atomic()->__cpuc_flush_dcache_area() should take >>>> care, but could you please ack/nack my understanding? >>> >>> Good point, it might therefore make sense to use a low-mem page. >>> >> >> The following test code seems to have the same problems with stale user >> icache. It works if I put the dcache flush back in. Am I missing >> something? > > Weird, if we store to the kernel side it should be just a matter of > clearing the I-cache out. There should be no D-cache aliasing > whatsoever. I don't think there is dcache aliasing, but dcache should be flushed anyway otherwise icache will not see updated values even it is invalidated. I.e dcache should be flushed up to dcache/icache unification layer. I.e icache does not read anything from dcache directly. It reads from L2. As it was noted below flush_cache_user_range will take care of both, but just __flush_icache_all is not enough. I will try to add non-aliased mapping logic from Dave's patch on top of previously posted patch (which goes under "write directly"), has default/arm split, etc ... Thanks, Victor > Maybe you could print out area->vaddr and > page_to_virt(area->page) so we can see if area->vaddr is choosen > correctly? > > Although I notice that flush_cache_user_range() on ARM flushes both D > and I caches. And I think that's what userspace ends up triggering > when it uses the system call that exists to support self-modifying and > JIT code generators. > > An ARM expert will have to chime in...
On Wed, Apr 16, 2014 at 05:21:53PM -0400, David Miller wrote: > Weird, if we store to the kernel side it should be just a matter of > clearing the I-cache out. There should be no D-cache aliasing > whatsoever. Maybe you could print out area->vaddr and > page_to_virt(area->page) so we can see if area->vaddr is choosen > correctly? > > Although I notice that flush_cache_user_range() on ARM flushes both D > and I caches. And I think that's what userspace ends up triggering > when it uses the system call that exists to support self-modifying and > JIT code generators. > > An ARM expert will have to chime in... So, David's patch is against the existing kernel (I checked the blob ID in the patch.) It looks like David just replaced flush_dcache_page() with flush_icache_all() as a hack. So my question is... between flush_dcache_page() and flush_icache_all(), what was the intermediary (if any) that was attempted? Now, I'm going to fill in some details for DaveM. The type of the I/D L1 caches found on any particular architecture version of CPU can be: Arch D-cache I-cache ARMv7 VIPT nonaliasing VIVT ASID tagged PIPT ------------------------------------------------- ARMv6 VIPT nonalising VIPT nonaliasing VIPT aliasing VIPT aliasing ------------------------------------------------- ARMv5 VIVT VIVT &older (For ARMv6, each can be either/or, though practically, there's no point to I-aliasing and D-nonaliasing since it implies the I-cache is bigger than the D-cache.) Our I-caches don't snoop/see the D-cache at all - so writes need to be pushed out to what we call the "point of unification" where the I and D streams meet. For anything we care about, that's normally the L2 cache - L1 cache is harvard, L2 cache is unified. Hence, we don't care which D-alias (if any) the data is written, so long as it's pushed out of the L1 data cache so that it's visible to the L1 instruction cache. If we're writing via a different mapping to that which is being executed, I think the safest thing to do is to flush it out of the L1 D-cache at the address it was written, and then flush any stale line from the L1 I-cache using the user address. This is quite a unique requirement, and we don't have anything which covers it. The closest you could get is to that using existing calls is: 1. write the new instruction 2. flush_dcache_page() 3. flush_cache_user_range() using the user address and I think that should be safe on all the above cache types.
On 04/16/14 18:25, Russell King - ARM Linux wrote: > On Wed, Apr 16, 2014 at 05:21:53PM -0400, David Miller wrote: >> Weird, if we store to the kernel side it should be just a matter of >> clearing the I-cache out. There should be no D-cache aliasing >> whatsoever. Maybe you could print out area->vaddr and >> page_to_virt(area->page) so we can see if area->vaddr is choosen >> correctly? >> >> Although I notice that flush_cache_user_range() on ARM flushes both D >> and I caches. And I think that's what userspace ends up triggering >> when it uses the system call that exists to support self-modifying and >> JIT code generators. >> >> An ARM expert will have to chime in... > > So, David's patch is against the existing kernel (I checked the blob ID > in the patch.) Sorry, that patch was against my uprobes-v7 branch which means it was v3.14-rc5 plus the uprobes work you pulled from me for v3.15. Thanks for reminding me it's time to update my repo. > It looks like David just replaced flush_dcache_page() with > flush_icache_all() as a hack. So my question is... between > flush_dcache_page() and flush_icache_all(), what was the intermediary > (if any) that was attempted? The other combinations I tried were: 1) existing dcache flush followed by flush_icache_all, which works; 2) existing dcache flush followed by: flush_icache_range(xol_vaddr, sizeof uprobe->arch.ixol); ...which also worked (xol_vaddr is the beginning of the two instruction out-of-line sequence, and the sizeof works out to be 8). I didn't bother trying flush_icache_user_range() because that is #define'd to be just flush_dcache_page() on ARM, which I don't understand at all. > > Now, I'm going to fill in some details for DaveM. The type of the I/D > L1 caches found on any particular architecture version of CPU can be: > > Arch D-cache I-cache > ARMv7 VIPT nonaliasing VIVT ASID tagged > PIPT > ------------------------------------------------- > ARMv6 VIPT nonalising VIPT nonaliasing > VIPT aliasing VIPT aliasing > ------------------------------------------------- > ARMv5 VIVT VIVT > &older > > (For ARMv6, each can be either/or, though practically, there's no point > to I-aliasing and D-nonaliasing since it implies the I-cache is bigger > than the D-cache.) > > Our I-caches don't snoop/see the D-cache at all - so writes need to be > pushed out to what we call the "point of unification" where the I and D > streams meet. For anything we care about, that's normally the L2 cache - > L1 cache is harvard, L2 cache is unified. > > Hence, we don't care which D-alias (if any) the data is written, so long > as it's pushed out of the L1 data cache so that it's visible to the L1 > instruction cache. > > If we're writing via a different mapping to that which is being executed, > I think the safest thing to do is to flush it out of the L1 D-cache at > the address it was written, and then flush any stale line from the L1 > I-cache using the user address. This is quite a unique requirement, and > we don't have anything which covers it. The closest you could get is > to that using existing calls is: > > 1. write the new instruction > 2. flush_dcache_page() > 3. flush_cache_user_range() using the user address > > and I think that should be safe on all the above cache types. > OK, still needing the dcache flush makes sense to me. I thought I was reading from (the other) David that it shouldn't be necessary, but I could not understand why. -dl
On 04/16/14 18:25, Russell King - ARM Linux wrote: > Our I-caches don't snoop/see the D-cache at all - so writes need to be > pushed out to what we call the "point of unification" where the I and D > streams meet. For anything we care about, that's normally the L2 cache - > L1 cache is harvard, L2 cache is unified. > > Hence, we don't care which D-alias (if any) the data is written, so long > as it's pushed out of the L1 data cache so that it's visible to the L1 > instruction cache. > > If we're writing via a different mapping to that which is being executed, > I think the safest thing to do is to flush it out of the L1 D-cache at > the address it was written, and then flush any stale line from the L1 > I-cache using the user address. This is quite a unique requirement, and > we don't have anything which covers it. The closest you could get is > to that using existing calls is: > > 1. write the new instruction > 2. flush_dcache_page() > 3. flush_cache_user_range() using the user address > > and I think that should be safe on all the above cache types. > It doesn't feel to me like we yet have a clear consensus on the appropriate near or long-term fix for this problem. I'm worried time is short to get a fix in for v3.15. I'm not sure how elegant that fix needs to be. I've gotten good test runs using a modified/simplified version of Victor's arch callback and a slight variation of Russell's sequence of operations from above: void arch_uprobe_copy_ixol(struct page *page, unsigned long vaddr, const void *src, int len) { void *kaddr = kmap_atomic(page); #ifdef CONFIG_SMP preempt_disable(); #endif memcpy(kaddr + (vaddr & ~PAGE_MASK), src, len); clean_dcache_area(kaddr, len); flush_cache_user_range(vaddr, vaddr + len); #ifdef CONFIG_SMP preempt_enable(); #endif kunmap_atomic(kaddr); } I have to say using clean_dcache_area() to write back the two words that changed (and rest of the cache line of course) seems more appropriate than flushing a whole page. Are there implications in doing that which makes this a bad idea though? At any rate, for v3.15 do we want to persue the more complex solutions with "congruent" mappings and use of copy_to_user(), or just something like the above (plus the rest of Victors v3 patch)? I'm sure Oleg is even less happy than me about yet another arch_ callback but we can hold out the hope that a more elegant solution can follow in the next release. One that might introduce risk we can't accept in v3.15 right now (e.g.: mapping the xol area writeable for all architectures). I have also tested (somewhat) both Victor's unmodified v3 and v4 patches on exynos 5250 and found them to work. Thanks, -dl
On Mon, Apr 21, 2014 at 9:16 AM, David Long <dave.long@linaro.org> wrote: > > void arch_uprobe_copy_ixol(struct page *page, unsigned long vaddr, > const void *src, int len) > { > void *kaddr = kmap_atomic(page); > > #ifdef CONFIG_SMP > preempt_disable(); > #endif That #ifdef seems totally pointless. Make it unconditional, or remove it entirely (the kmap_atomic() already causes us to be non-preemptible in practice). Linus
On 21 April 2014 09:16, David Long <dave.long@linaro.org> wrote: > On 04/16/14 18:25, Russell King - ARM Linux wrote: > >> Our I-caches don't snoop/see the D-cache at all - so writes need to be >> pushed out to what we call the "point of unification" where the I and D >> streams meet. For anything we care about, that's normally the L2 cache - >> L1 cache is harvard, L2 cache is unified. >> >> Hence, we don't care which D-alias (if any) the data is written, so long >> as it's pushed out of the L1 data cache so that it's visible to the L1 >> instruction cache. >> >> If we're writing via a different mapping to that which is being executed, >> I think the safest thing to do is to flush it out of the L1 D-cache at >> the address it was written, and then flush any stale line from the L1 >> I-cache using the user address. This is quite a unique requirement, and >> we don't have anything which covers it. The closest you could get is >> to that using existing calls is: >> >> 1. write the new instruction >> 2. flush_dcache_page() >> 3. flush_cache_user_range() using the user address >> >> and I think that should be safe on all the above cache types. >> > > It doesn't feel to me like we yet have a clear consensus on the appropriate > near or long-term fix for this problem. I'm worried time is short to get a > fix in for v3.15. I'm not sure how elegant that fix needs to be. I've gotten > good test runs using a modified/simplified version of Victor's arch callback > and a slight variation of Russell's sequence of operations from above: > > void arch_uprobe_copy_ixol(struct page *page, unsigned long vaddr, > const void *src, int len) > { > void *kaddr = kmap_atomic(page); > > #ifdef CONFIG_SMP > preempt_disable(); > #endif > memcpy(kaddr + (vaddr & ~PAGE_MASK), src, len); > clean_dcache_area(kaddr, len); > flush_cache_user_range(vaddr, vaddr + len); I wonder would flush_cache_user_range addresses other cores icache invaludation issue? We discussed that user-land task while doing uprobes single step could be migrated to another core. So if ARM cpu that has cache_ops_need_broadcast() to true and migration happens, other core icache may still has old stale instruction by this address. Note ARM ptrace breakpoint write code in flush_ptrace_access deals with such situation. Given that cache_ops_need_broadcast() true is not typical and cache invalidation in this case could be slow, but we would like to be functionally correct even in such situations, rather than experience very very rare mysterious crash of user-land process under the trace. > #ifdef CONFIG_SMP > preempt_enable(); > #endif > kunmap_atomic(kaddr); > } > > > I have to say using clean_dcache_area() to write back the two words that changed > (and rest of the cache line of course) seems more appropriate than flushing a > whole page. Are there implications in doing that which makes this a bad idea > though? > > At any rate, for v3.15 do we want to persue the more complex solutions with > "congruent" mappings and use of copy_to_user(), or just something like the above > (plus the rest of Victors v3 patch)? I'm sure Oleg is even less happy than me > about yet another arch_ callback but we can hold out the hope that a more elegant > solution can follow in the next release. One that might introduce risk we can't > accept in v3.15 right now (e.g.: mapping the xol area writeable for all > architectures). > > I have also tested (somewhat) both Victor's unmodified v3 and v4 patches on > exynos 5250 and found them to work. It seems to me that we cannot find common solution for this issue and arch specific callback should be introduced. If arch callback is introduced I will be more comfortable to keep current behavior as default and as far as ARM specific implementation is concerned it would be good to have code/logic sharing with code that deals with ptrace breakpoint write. IMHO such solution would be something like V3 [1] version of the patch, Note [1] also needs to address Linus's comment about removing '#ifdef CONFIG_SMP' in code sequence similar to above. Thanks, Victor [1] http://lists.infradead.org/pipermail/linux-arm-kernel/2014-April/247793.html > Thanks, > -dl >
diff --git a/kernel/events/uprobes.c b/kernel/events/uprobes.c index 04709b6..10ad973 100644 --- a/kernel/events/uprobes.c +++ b/kernel/events/uprobes.c @@ -34,6 +34,7 @@ #include <linux/ptrace.h> /* user_enable_single_step */ #include <linux/kdebug.h> /* notifier mechanism */ #include "../../mm/internal.h" /* munlock_vma_page */ +#include <linux/mman.h> #include <linux/percpu-rwsem.h> #include <linux/task_work.h> @@ -1141,7 +1142,7 @@ static int xol_add_vma(struct mm_struct *mm, struct xol_area *area) if (!area->vaddr) { /* Try to map as high as possible, this is only a hint. */ area->vaddr = get_unmapped_area(NULL, TASK_SIZE - PAGE_SIZE, - PAGE_SIZE, 0, 0); + PAGE_SIZE, page_to_pfn(area->page), MAP_SHARED); if (area->vaddr & ~PAGE_MASK) { ret = area->vaddr; goto fail; @@ -1175,7 +1176,7 @@ static struct xol_area *__create_xol_area(unsigned long vaddr) if (!area->bitmap) goto free_area; - area->page = alloc_page(GFP_HIGHUSER); + area->page = alloc_page(GFP_USER); if (!area->page) goto free_bitmap; @@ -1299,11 +1300,8 @@ static unsigned long xol_get_insn_slot(struct uprobe *uprobe) /* Initialize the slot */ copy_to_page(area->page, xol_vaddr, &uprobe->arch.ixol, sizeof(uprobe->arch.ixol)); - /* - * We probably need flush_icache_user_range() but it needs vma. - * This should work on supported architectures too. - */ - flush_dcache_page(area->page); +/* Temporary hard-core icache flush for testing */ + __flush_icache_all(); return xol_vaddr; }