| Message ID | 20190823052335.572133-1-songliubraving@fb.com (mailing list archive) |
|---|---|
| State | New, archived |
| Series | x86/mm: Do not split_large_page() for set_kernel_text_rw() |
On Thu, Aug 22, 2019 at 10:23:35PM -0700, Song Liu wrote:
> As 4k pages check was removed from cpa [1], set_kernel_text_rw() leads to
> split_large_page() for all kernel text pages. This means a single kprobe
> will put all kernel text in 4k pages:
>
> root@ ~# grep ffff81000000- /sys/kernel/debug/page_tables/kernel
> 0xffffffff81000000-0xffffffff82400000 20M ro PSE x pmd
>
> root@ ~# echo ONE_KPROBE >> /sys/kernel/debug/tracing/kprobe_events
> root@ ~# echo 1 > /sys/kernel/debug/tracing/events/kprobes/enable
>
> root@ ~# grep ffff81000000- /sys/kernel/debug/page_tables/kernel
> 0xffffffff81000000-0xffffffff82400000 20M ro x pte
>
> To fix this issue, introduce CPA_FLIP_TEXT_RW to bypass "Text RO" check
> in static_protections().
>
> Two helper functions set_text_rw() and set_text_ro() are added to flip
> _PAGE_RW bit for kernel text.
>
> [1] commit 585948f4f695 ("x86/mm/cpa: Avoid the 4k pages check completely")

ARGH; so this is because ftrace flips the whole kernel range to RW and
back for giggles? I'm thinking _that_ is a bug, it's a clear W^X
violation.
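[For context, the flip Peter refers to comes from the x86 ftrace arch hooks
of that era. The sketch below is reconstructed from memory of
arch/x86/kernel/ftrace.c around v5.2 and is only an approximation; exact
contents differ between kernel versions.]

/*
 * Approximate sketch (not verbatim) of the pre-text_poke() x86 ftrace
 * hooks under discussion.  Before patching any fentry/mcount sites,
 * ftrace makes the *whole* kernel text (and module text) writable, and
 * flips it back afterwards -- the W^X violation Peter objects to.
 */
int ftrace_arch_code_modify_prepare(void)
{
	set_kernel_text_rw();
	set_all_modules_text_rw();
	return 0;
}

int ftrace_arch_code_modify_post_process(void)
{
	set_all_modules_text_ro();
	set_kernel_text_ro();
	return 0;
}

[The text_poke()-based conversion Peter mentions later in the thread removes
exactly this pattern.]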
Cc: Steven Rostedt and Suresh Siddha

Hi Peter,

> On Aug 23, 2019, at 2:36 AM, Peter Zijlstra <peterz@infradead.org> wrote:
>
> On Thu, Aug 22, 2019 at 10:23:35PM -0700, Song Liu wrote:
>> As 4k pages check was removed from cpa [1], set_kernel_text_rw() leads to
>> split_large_page() for all kernel text pages. This means a single kprobe
>> will put all kernel text in 4k pages:
>>
>> root@ ~# grep ffff81000000- /sys/kernel/debug/page_tables/kernel
>> 0xffffffff81000000-0xffffffff82400000 20M ro PSE x pmd
>>
>> root@ ~# echo ONE_KPROBE >> /sys/kernel/debug/tracing/kprobe_events
>> root@ ~# echo 1 > /sys/kernel/debug/tracing/events/kprobes/enable
>>
>> root@ ~# grep ffff81000000- /sys/kernel/debug/page_tables/kernel
>> 0xffffffff81000000-0xffffffff82400000 20M ro x pte
>>
>> To fix this issue, introduce CPA_FLIP_TEXT_RW to bypass "Text RO" check
>> in static_protections().
>>
>> Two helper functions set_text_rw() and set_text_ro() are added to flip
>> _PAGE_RW bit for kernel text.
>>
>> [1] commit 585948f4f695 ("x86/mm/cpa: Avoid the 4k pages check completely")
>
> ARGH; so this is because ftrace flips the whole kernel range to RW and
> back for giggles? I'm thinking _that_ is a bug, it's a clear W^X
> violation.

Thanks for your comments. Yes, it is related to ftrace, as we have
CONFIG_KPROBES_ON_FTRACE. However, after digging around, I am not sure
what is the expected behavior.

Kernel text region has two mappings to it. For x86_64 and four-level
page table, there are:

1. kernel identity mapping, from 0xffff888000100000;
2. kernel text mapping, from 0xffffffff81000000.

Per comments in arch/x86/mm/init_64.c:set_kernel_text_rw():

	/*
	 * Make the kernel identity mapping for text RW. Kernel text
	 * mapping will always be RO. Refer to the comment in
	 * static_protections() in pageattr.c
	 */
	set_memory_rw(start, (end - start) >> PAGE_SHIFT);

kprobe (with CONFIG_KPROBES_ON_FTRACE) should work on kernel identity
mapping. However, my experiment shows that kprobe actually operates on
the kernel text mapping (0xffffffff81000000-). It is the same w/ and w/o
CONFIG_KPROBES_ON_FTRACE. Therefore, I am not sure whether the comment is
outdated (it is 10 years old), or whether kprobes is doing something wrong.

More information about the issue we are looking at:

We found with 5.2 kernel (no CONFIG_PAGE_TABLE_ISOLATION, w/
CONFIG_KPROBES_ON_FTRACE), a single kprobe will split _all_ PMDs in
kernel text mapping into pte-mapped pages. This increases iTLB
miss rate from about 300 per million instructions to about 700 per
million instructions (for the application I test with).

Per bisect, we found this behavior happens after commit 585948f4f695
("x86/mm/cpa: Avoid the 4k pages check completely"). That's why I
proposed this PATCH to fix/workaround this issue. However, per
Peter's comment and my study of the code, this doesn't seem the
real problem or the only here.

I also tested that the PMD split issue doesn't happen w/o
CONFIG_KPROBES_ON_FTRACE.

In summary, I have the following questions:

1. Which mapping should kprobes work on? Kernel identity mapping or
   kernel text mapping?
2. ftrace causes split of PMD-mapped kernel text. How should we fix this?

Thanks,
Song
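[For reference, the function whose comment is quoted above looked roughly
like this at the time. This is a simplified sketch: apart from the quoted
comment and the set_memory_rw() call, which are visible in the patch below,
the details are reconstructed from memory and may not match a given kernel
version exactly.]

void set_kernel_text_rw(void)
{
	unsigned long start = PFN_ALIGN(_text);
	unsigned long end = PFN_ALIGN(__stop___ex_table);

	if (!kernel_set_to_readonly)
		return;

	/*
	 * Make the kernel identity mapping for text RW. Kernel text
	 * mapping will always be RO. Refer to the comment in
	 * static_protections() in pageattr.c
	 */
	set_memory_rw(start, (end - start) >> PAGE_SHIFT);
}

[Note that start/end are derived from _text, i.e. they are addresses in the
high kernel text mapping; the CPA code also processes the direct-map alias,
which is presumably what the identity-mapping comment refers to. That gap
between comment and behavior is the confusion discussed in this thread.]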
On Mon, Aug 26, 2019 at 04:40:23AM +0000, Song Liu wrote:
> Cc: Steven Rostedt and Suresh Siddha
>
> Hi Peter,
>
> > On Aug 23, 2019, at 2:36 AM, Peter Zijlstra <peterz@infradead.org> wrote:
> >
> > On Thu, Aug 22, 2019 at 10:23:35PM -0700, Song Liu wrote:
> >> As 4k pages check was removed from cpa [1], set_kernel_text_rw() leads to
> >> split_large_page() for all kernel text pages. This means a single kprobe
> >> will put all kernel text in 4k pages:
> >>
> >> root@ ~# grep ffff81000000- /sys/kernel/debug/page_tables/kernel
> >> 0xffffffff81000000-0xffffffff82400000 20M ro PSE x pmd
> >>
> >> root@ ~# echo ONE_KPROBE >> /sys/kernel/debug/tracing/kprobe_events
> >> root@ ~# echo 1 > /sys/kernel/debug/tracing/events/kprobes/enable
> >>
> >> root@ ~# grep ffff81000000- /sys/kernel/debug/page_tables/kernel
> >> 0xffffffff81000000-0xffffffff82400000 20M ro x pte
> >>
> >> To fix this issue, introduce CPA_FLIP_TEXT_RW to bypass "Text RO" check
> >> in static_protections().
> >>
> >> Two helper functions set_text_rw() and set_text_ro() are added to flip
> >> _PAGE_RW bit for kernel text.
> >>
> >> [1] commit 585948f4f695 ("x86/mm/cpa: Avoid the 4k pages check completely")
> >
> > ARGH; so this is because ftrace flips the whole kernel range to RW and
> > back for giggles? I'm thinking _that_ is a bug, it's a clear W^X
> > violation.
>
> Thanks for your comments. Yes, it is related to ftrace, as we have
> CONFIG_KPROBES_ON_FTRACE. However, after digging around, I am not sure
> what is the expected behavior.

It changed recently; that is we got a lot more strict wrt W^X mappings.
IIRC ftrace is the only known violator of W^X at this time.

> Kernel text region has two mappings to it. For x86_64 and four-level
> page table, there are:
>
> 1. kernel identity mapping, from 0xffff888000100000;
> 2. kernel text mapping, from 0xffffffff81000000.

Right; AFAICT this is so that kernel text fits in s32 immediates.

> Per comments in arch/x86/mm/init_64.c:set_kernel_text_rw():
>
> 	/*
> 	 * Make the kernel identity mapping for text RW. Kernel text
> 	 * mapping will always be RO. Refer to the comment in
> 	 * static_protections() in pageattr.c
> 	 */
> 	set_memory_rw(start, (end - start) >> PAGE_SHIFT);

So only the high mapping is ever executable; the identity map should not
be. Both should be RO.

> kprobe (with CONFIG_KPROBES_ON_FTRACE) should work on kernel identity
> mapping.

Please provide more information; kprobes shouldn't be touching either
mapping. That is, afaict kprobes uses text_poke() which uses a temporary
mapping (in 'userspace' even) to alias the high text mapping.

I'm also not sure how it would then result in any 4k text maps. Yes the
alias is 4k, but it should not affect the actual high text map in any
way.

kprobes also allocates executable slots, but it does that in the module
range (afaict), so that, again, should not affect the high text mapping.

> We found with 5.2 kernel (no CONFIG_PAGE_TABLE_ISOLATION, w/
> CONFIG_KPROBES_ON_FTRACE), a single kprobe will split _all_ PMDs in
> kernel text mapping into pte-mapped pages. This increases iTLB
> miss rate from about 300 per million instructions to about 700 per
> million instructions (for the application I test with).
>
> Per bisect, we found this behavior happens after commit 585948f4f695
> ("x86/mm/cpa: Avoid the 4k pages check completely"). That's why I
> proposed this PATCH to fix/workaround this issue. However, per
> Peter's comment and my study of the code, this doesn't seem the
> real problem or the only here.
>
> I also tested that the PMD split issue doesn't happen w/o
> CONFIG_KPROBES_ON_FTRACE.

Right, because then ftrace doesn't flip the whole kernel map writable;
which it _really_ should stop doing anyway.

But I'm still wondering what causes that first 4k split...
On Fri, 23 Aug 2019 11:36:37 +0200
Peter Zijlstra <peterz@infradead.org> wrote:

> On Thu, Aug 22, 2019 at 10:23:35PM -0700, Song Liu wrote:
> > As 4k pages check was removed from cpa [1], set_kernel_text_rw() leads to
> > split_large_page() for all kernel text pages. This means a single kprobe
> > will put all kernel text in 4k pages:
> >
> > root@ ~# grep ffff81000000- /sys/kernel/debug/page_tables/kernel
> > 0xffffffff81000000-0xffffffff82400000 20M ro PSE x pmd
> >
> > root@ ~# echo ONE_KPROBE >> /sys/kernel/debug/tracing/kprobe_events
> > root@ ~# echo 1 > /sys/kernel/debug/tracing/events/kprobes/enable
> >
> > root@ ~# grep ffff81000000- /sys/kernel/debug/page_tables/kernel
> > 0xffffffff81000000-0xffffffff82400000 20M ro x pte
> >
> > To fix this issue, introduce CPA_FLIP_TEXT_RW to bypass "Text RO" check
> > in static_protections().
> >
> > Two helper functions set_text_rw() and set_text_ro() are added to flip
> > _PAGE_RW bit for kernel text.
> >
> > [1] commit 585948f4f695 ("x86/mm/cpa: Avoid the 4k pages check completely")
>
> ARGH; so this is because ftrace flips the whole kernel range to RW and
> back for giggles? I'm thinking _that_ is a bug, it's a clear W^X
> violation.

Since ftrace did this way before text_poke existed and way before
anybody cared (back in 2007), it's not really a bug.

Anyway, I believe Nadav has some patches that converts ftrace to use
the shadow page modification trick somewhere.

Or we also need the text_poke batch processing (did that get upstream?).
Mapping in 40,000 pages one at a time is noticeable from a human
standpoint.

--
Steve
On Mon, Aug 26, 2019 at 07:33:08AM -0400, Steven Rostedt wrote:
> Anyway, I believe Nadav has some patches that converts ftrace to use
> the shadow page modification trick somewhere.
>
> Or we also need the text_poke batch processing (did that get upstream?).

It did. And I just did that patch; I'll send out in a bit. It seems to
work, but this is the very first time I've looked at this code.
> On Aug 26, 2019, at 2:23 AM, Peter Zijlstra <peterz@infradead.org> wrote:
>
> So only the high mapping is ever executable; the identity map should not
> be. Both should be RO.
>
>> kprobe (with CONFIG_KPROBES_ON_FTRACE) should work on kernel identity
>> mapping.
>
> Please provide more information; kprobes shouldn't be touching either
> mapping. That is, afaict kprobes uses text_poke() which uses a temporary
> mapping (in 'userspace' even) to alias the high text mapping.

kprobe without CONFIG_KPROBES_ON_FTRACE uses text_poke(). But kprobe with
CONFIG_KPROBES_ON_FTRACE uses another path. The split happens with
set_kernel_text_rw() -> ... -> __change_page_attr() -> split_large_page().
The split is introduced by commit 585948f4f695. do_split in
__change_page_attr() becomes true after commit 585948f4f695. This patch
tries to fix/workaround this part.

> I'm also not sure how it would then result in any 4k text maps. Yes the
> alias is 4k, but it should not affect the actual high text map in any
> way.

I am confused by the alias logic. set_kernel_text_rw() makes the high map
rw, and split the PMD in the high map.

> kprobes also allocates executable slots, but it does that in the module
> range (afaict), so that, again, should not affect the high text mapping.
>
>> We found with 5.2 kernel (no CONFIG_PAGE_TABLE_ISOLATION, w/
>> CONFIG_KPROBES_ON_FTRACE), a single kprobe will split _all_ PMDs in
>> kernel text mapping into pte-mapped pages. This increases iTLB
>> miss rate from about 300 per million instructions to about 700 per
>> million instructions (for the application I test with).
>>
>> Per bisect, we found this behavior happens after commit 585948f4f695
>> ("x86/mm/cpa: Avoid the 4k pages check completely"). That's why I
>> proposed this PATCH to fix/workaround this issue. However, per
>> Peter's comment and my study of the code, this doesn't seem the
>> real problem or the only here.
>>
>> I also tested that the PMD split issue doesn't happen w/o
>> CONFIG_KPROBES_ON_FTRACE.
>
> Right, because then ftrace doesn't flip the whole kernel map writable;
> which it _really_ should stop doing anyway.
>
> But I'm still wondering what causes that first 4k split...

Please see above.

Thanks,
Song
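[For readers following the CPA side of this: the "Text RO" check that the
proposed CPA_FLIP_TEXT_RW flag skips is protect_kernel_text_ro() in
arch/x86/mm/pageattr.c. The sketch below is an approximation from memory;
the bounds, config guards, and exact condition are assumptions, not
authoritative, but it shows the general shape of the check being bypassed.]

/*
 * Approximate sketch of protect_kernel_text_ro() (details are assumptions).
 * static_protections() treats the returned bits as forbidden and clears
 * them from the requested protection, so returning _PAGE_RW here means
 * "this range may not be made writable".
 */
static pgprotval_t protect_kernel_text_ro(unsigned long start,
					  unsigned long end)
{
	unsigned long t_end = (unsigned long)__end_rodata_hpage_align - 1;
	unsigned long t_start = (unsigned long)_text;
	unsigned int level;

	if (!kernel_set_to_readonly || !overlaps(start, end, t_start, t_end))
		return 0;

	/*
	 * Only enforce the !RW mapping while the region is still mapped
	 * with large pages; once it is 4K-mapped the restriction is not
	 * applied.
	 */
	if (lookup_address(start, &level) && (level != PG_LEVEL_4K))
		return _PAGE_RW;

	return 0;
}

[With CPA_FLIP_TEXT_RW set, static_protections() skips this check entirely,
which is how the patch intends to keep set_kernel_text_rw() from forcing
split_large_page() on the text mapping.]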
> On Aug 26, 2019, at 4:33 AM, Steven Rostedt <rostedt@goodmis.org> wrote:
>
> On Fri, 23 Aug 2019 11:36:37 +0200
> Peter Zijlstra <peterz@infradead.org> wrote:
>
>> On Thu, Aug 22, 2019 at 10:23:35PM -0700, Song Liu wrote:
>>> As 4k pages check was removed from cpa [1], set_kernel_text_rw() leads to
>>> split_large_page() for all kernel text pages. This means a single kprobe
>>> will put all kernel text in 4k pages:
>>>
>>> root@ ~# grep ffff81000000- /sys/kernel/debug/page_tables/kernel
>>> 0xffffffff81000000-0xffffffff82400000 20M ro PSE x pmd
>>>
>>> root@ ~# echo ONE_KPROBE >> /sys/kernel/debug/tracing/kprobe_events
>>> root@ ~# echo 1 > /sys/kernel/debug/tracing/events/kprobes/enable
>>>
>>> root@ ~# grep ffff81000000- /sys/kernel/debug/page_tables/kernel
>>> 0xffffffff81000000-0xffffffff82400000 20M ro x pte
>>>
>>> To fix this issue, introduce CPA_FLIP_TEXT_RW to bypass "Text RO" check
>>> in static_protections().
>>>
>>> Two helper functions set_text_rw() and set_text_ro() are added to flip
>>> _PAGE_RW bit for kernel text.
>>>
>>> [1] commit 585948f4f695 ("x86/mm/cpa: Avoid the 4k pages check completely")
>>
>> ARGH; so this is because ftrace flips the whole kernel range to RW and
>> back for giggles? I'm thinking _that_ is a bug, it's a clear W^X
>> violation.
>
> Since ftrace did this way before text_poke existed and way before
> anybody cared (back in 2007), it's not really a bug.
>
> Anyway, I believe Nadav has some patches that converts ftrace to use
> the shadow page modification trick somewhere.

For the record - here is my previous patch:
https://lkml.org/lkml/2018/12/5/211
On Mon, 26 Aug 2019 15:41:24 +0000
Nadav Amit <namit@vmware.com> wrote:

> > Anyway, I believe Nadav has some patches that converts ftrace to use
> > the shadow page modification trick somewhere.
>
> For the record - here is my previous patch:
> https://lkml.org/lkml/2018/12/5/211

FYI, when referencing older patches, please use lkml.kernel.org or
lore.kernel.org, lkml.org is slow and obsolete.

ie. http://lkml.kernel.org/r/20181205013408.47725-9-namit@vmware.com

--
Steve
On Mon, Aug 26, 2019 at 03:41:24PM +0000, Nadav Amit wrote:
> For the record - here is my previous patch:
> https://lkml.org/lkml/2018/12/5/211

Thanks!
> On Aug 26, 2019, at 8:56 AM, Steven Rostedt <rostedt@goodmis.org> wrote:
>
> On Mon, 26 Aug 2019 15:41:24 +0000
> Nadav Amit <namit@vmware.com> wrote:
>
>>> Anyway, I believe Nadav has some patches that converts ftrace to use
>>> the shadow page modification trick somewhere.
>>
>> For the record - here is my previous patch:
>> https://lkml.org/lkml/2018/12/5/211
>
> FYI, when referencing older patches, please use lkml.kernel.org or
> lore.kernel.org, lkml.org is slow and obsolete.
>
> ie. http://lkml.kernel.org/r/20181205013408.47725-9-namit@vmware.com

Will do so next time.
> On Aug 26, 2019, at 8:08 AM, Song Liu <songliubraving@fb.com> wrote:
>
>> On Aug 26, 2019, at 2:23 AM, Peter Zijlstra <peterz@infradead.org> wrote:
>>
>> So only the high mapping is ever executable; the identity map should not
>> be. Both should be RO.
>>
>>> kprobe (with CONFIG_KPROBES_ON_FTRACE) should work on kernel identity
>>> mapping.
>>
>> Please provide more information; kprobes shouldn't be touching either
>> mapping. That is, afaict kprobes uses text_poke() which uses a temporary
>> mapping (in 'userspace' even) to alias the high text mapping.
>
> kprobe without CONFIG_KPROBES_ON_FTRACE uses text_poke(). But kprobe with
> CONFIG_KPROBES_ON_FTRACE uses another path. The split happens with
> set_kernel_text_rw() -> ... -> __change_page_attr() -> split_large_page().
> The split is introduced by commit 585948f4f695. do_split in
> __change_page_attr() becomes true after commit 585948f4f695. This patch
> tries to fix/workaround this part.
>
>> I'm also not sure how it would then result in any 4k text maps. Yes the
>> alias is 4k, but it should not affect the actual high text map in any
>> way.
>
> I am confused by the alias logic. set_kernel_text_rw() makes the high map
> rw, and split the PMD in the high map.
>
>> kprobes also allocates executable slots, but it does that in the module
>> range (afaict), so that, again, should not affect the high text mapping.
>>
>>> We found with 5.2 kernel (no CONFIG_PAGE_TABLE_ISOLATION, w/
>>> CONFIG_KPROBES_ON_FTRACE), a single kprobe will split _all_ PMDs in
>>> kernel text mapping into pte-mapped pages. This increases iTLB
>>> miss rate from about 300 per million instructions to about 700 per
>>> million instructions (for the application I test with).
>>>
>>> Per bisect, we found this behavior happens after commit 585948f4f695
>>> ("x86/mm/cpa: Avoid the 4k pages check completely"). That's why I
>>> proposed this PATCH to fix/workaround this issue. However, per
>>> Peter's comment and my study of the code, this doesn't seem the
>>> real problem or the only here.
>>>
>>> I also tested that the PMD split issue doesn't happen w/o
>>> CONFIG_KPROBES_ON_FTRACE.
>>
>> Right, because then ftrace doesn't flip the whole kernel map writable;
>> which it _really_ should stop doing anyway.
>>
>> But I'm still wondering what causes that first 4k split...
>
> Please see above.

Another data point: we can repro the issue on Linus's master with just
ftrace:

# start with PMD mapped
root@virt-test:~# grep ffff81000000- /sys/kernel/debug/page_tables/kernel
0xffffffff81000000-0xffffffff81c00000 12M ro PSE x pmd

# enable single ftrace
root@virt-test:~# echo consume_skb > /sys/kernel/debug/tracing/set_ftrace_filter
root@virt-test:~# echo function > /sys/kernel/debug/tracing/current_tracer

# now the text is PTE mapped
root@virt-test:~# grep ffff81000000- /sys/kernel/debug/page_tables/kernel
0xffffffff81000000-0xffffffff81c00000 12M ro x pte

Song
diff --git a/arch/x86/mm/init_64.c b/arch/x86/mm/init_64.c
index a6b5c653727b..5745fdcc429e 100644
--- a/arch/x86/mm/init_64.c
+++ b/arch/x86/mm/init_64.c
@@ -1276,7 +1276,7 @@ void set_kernel_text_rw(void)
 	 * mapping will always be RO. Refer to the comment in
 	 * static_protections() in pageattr.c
 	 */
-	set_memory_rw(start, (end - start) >> PAGE_SHIFT);
+	set_text_rw(start, (end - start) >> PAGE_SHIFT);
 }
 
 void set_kernel_text_ro(void)
@@ -1293,7 +1293,7 @@ void set_kernel_text_ro(void)
 	/*
 	 * Set the kernel identity mapping for text RO.
 	 */
-	set_memory_ro(start, (end - start) >> PAGE_SHIFT);
+	set_text_ro(start, (end - start) >> PAGE_SHIFT);
 }
 
 void mark_rodata_ro(void)
diff --git a/arch/x86/mm/mm_internal.h b/arch/x86/mm/mm_internal.h
index eeae142062ed..65b84b471770 100644
--- a/arch/x86/mm/mm_internal.h
+++ b/arch/x86/mm/mm_internal.h
@@ -24,4 +24,8 @@ void update_cache_mode_entry(unsigned entry, enum page_cache_mode cache);
 
 extern unsigned long tlb_single_page_flush_ceiling;
 
+int set_text_rw(unsigned long addr, int numpages);
+
+int set_text_ro(unsigned long addr, int numpages);
+
 #endif /* __X86_MM_INTERNAL_H */
diff --git a/arch/x86/mm/pageattr.c b/arch/x86/mm/pageattr.c
index 6a9a77a403c9..44a885df776d 100644
--- a/arch/x86/mm/pageattr.c
+++ b/arch/x86/mm/pageattr.c
@@ -66,6 +66,7 @@ static DEFINE_SPINLOCK(cpa_lock);
 #define CPA_ARRAY 2
 #define CPA_PAGES_ARRAY 4
 #define CPA_NO_CHECK_ALIAS 8 /* Do not search for aliases */
+#define CPA_FLIP_TEXT_RW 0x10 /* allow flip _PAGE_RW for kernel text */
 
 #ifdef CONFIG_PROC_FS
 static unsigned long direct_pages_count[PG_LEVEL_NUM];
@@ -516,7 +517,7 @@ static inline void check_conflict(int warnlvl, pgprot_t prot, pgprotval_t val,
  */
 static inline pgprot_t static_protections(pgprot_t prot, unsigned long start,
 					  unsigned long pfn, unsigned long npg,
-					  int warnlvl)
+					  int warnlvl, unsigned int cpa_flags)
 {
 	pgprotval_t forbidden, res;
 	unsigned long end;
@@ -535,9 +536,11 @@ static inline pgprot_t static_protections(pgprot_t prot, unsigned long start,
 	check_conflict(warnlvl, prot, res, start, end, pfn, "Text NX");
 	forbidden = res;
 
-	res = protect_kernel_text_ro(start, end);
-	check_conflict(warnlvl, prot, res, start, end, pfn, "Text RO");
-	forbidden |= res;
+	if (!(cpa_flags & CPA_FLIP_TEXT_RW)) {
+		res = protect_kernel_text_ro(start, end);
+		check_conflict(warnlvl, prot, res, start, end, pfn, "Text RO");
+		forbidden |= res;
+	}
 
 	/* Check the PFN directly */
 	res = protect_pci_bios(pfn, pfn + npg - 1);
@@ -819,7 +822,7 @@ static int __should_split_large_page(pte_t *kpte, unsigned long address,
 	 * extra conditional required here.
 	 */
 	chk_prot = static_protections(old_prot, lpaddr, old_pfn, numpages,
-				      CPA_CONFLICT);
+				      CPA_CONFLICT, cpa->flags);
 
 	if (WARN_ON_ONCE(pgprot_val(chk_prot) != pgprot_val(old_prot))) {
 		/*
@@ -855,7 +858,7 @@ static int __should_split_large_page(pte_t *kpte, unsigned long address,
 	 * protection requirement in the large page.
 	 */
 	new_prot = static_protections(req_prot, lpaddr, old_pfn, numpages,
-				      CPA_DETECT);
+				      CPA_DETECT, cpa->flags);
 
 	/*
 	 * If there is a conflict, split the large page.
@@ -906,7 +909,7 @@ static void split_set_pte(struct cpa_data *cpa, pte_t *pte, unsigned long pfn,
 	if (!cpa->force_static_prot)
 		goto set;
 
-	prot = static_protections(ref_prot, address, pfn, npg, CPA_PROTECT);
+	prot = static_protections(ref_prot, address, pfn, npg, CPA_PROTECT, 0);
 
 	if (pgprot_val(prot) == pgprot_val(ref_prot))
 		goto set;
@@ -1504,7 +1507,7 @@ static int __change_page_attr(struct cpa_data *cpa, int primary)
 		cpa_inc_4k_install();
 
 		new_prot = static_protections(new_prot, address, pfn, 1,
-					      CPA_PROTECT);
+					      CPA_PROTECT, 0);
 
 		new_prot = pgprot_clear_protnone_bits(new_prot);
 
@@ -1707,7 +1710,7 @@ static int change_page_attr_set_clr(unsigned long *addr, int numpages,
 	cpa.curpage = 0;
 	cpa.force_split = force_split;
 
-	if (in_flag & (CPA_ARRAY | CPA_PAGES_ARRAY))
+	if (in_flag & (CPA_ARRAY | CPA_PAGES_ARRAY | CPA_FLIP_TEXT_RW))
 		cpa.flags |= in_flag;
 
 	/* No alias checking for _NX bit modifications */
@@ -1983,11 +1986,24 @@ int set_memory_ro(unsigned long addr, int numpages)
 	return change_page_attr_clear(&addr, numpages, __pgprot(_PAGE_RW), 0);
 }
 
+int set_text_ro(unsigned long addr, int numpages)
+{
+	return change_page_attr_set_clr(&addr, numpages, __pgprot(0),
+					__pgprot(_PAGE_RW), 0, CPA_FLIP_TEXT_RW,
+					NULL);
+}
+
 int set_memory_rw(unsigned long addr, int numpages)
 {
 	return change_page_attr_set(&addr, numpages, __pgprot(_PAGE_RW), 0);
 }
 
+int set_text_rw(unsigned long addr, int numpages)
+{
+	return change_page_attr_set_clr(&addr, numpages, __pgprot(_PAGE_RW),
+					__pgprot(0), 0, CPA_FLIP_TEXT_RW, NULL);
+}
+
 int set_memory_np(unsigned long addr, int numpages)
 {
 	return change_page_attr_clear(&addr, numpages, __pgprot(_PAGE_PRESENT), 0);
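[For completeness, a minimal usage sketch of the two new helpers. The caller
below is hypothetical and not part of the patch; the real users introduced
here are set_kernel_text_rw()/set_kernel_text_ro() in the init_64.c hunk
above.]

/*
 * Hypothetical example only -- not part of the patch.  It shows how the
 * new helpers would be used to temporarily make a range of kernel text
 * writable without tripping the "Text RO" check in static_protections().
 */
static int example_patch_text_range(unsigned long addr, int numpages)
{
	int ret;

	ret = set_text_rw(addr, numpages);	/* _PAGE_RW set, Text RO check skipped */
	if (ret)
		return ret;

	/* ... modify the code in [addr, addr + numpages * PAGE_SIZE) ... */

	return set_text_ro(addr, numpages);	/* back to read-only */
}

[The flag travels through the in_flag path of change_page_attr_set_clr(), so
it lands in cpa.flags and reaches static_protections() via the hunks above.]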
As 4k pages check was removed from cpa [1], set_kernel_text_rw() leads to
split_large_page() for all kernel text pages. This means a single kprobe
will put all kernel text in 4k pages:

root@ ~# grep ffff81000000- /sys/kernel/debug/page_tables/kernel
0xffffffff81000000-0xffffffff82400000 20M ro PSE x pmd

root@ ~# echo ONE_KPROBE >> /sys/kernel/debug/tracing/kprobe_events
root@ ~# echo 1 > /sys/kernel/debug/tracing/events/kprobes/enable

root@ ~# grep ffff81000000- /sys/kernel/debug/page_tables/kernel
0xffffffff81000000-0xffffffff82400000 20M ro x pte

To fix this issue, introduce CPA_FLIP_TEXT_RW to bypass "Text RO" check
in static_protections().

Two helper functions set_text_rw() and set_text_ro() are added to flip
_PAGE_RW bit for kernel text.

[1] commit 585948f4f695 ("x86/mm/cpa: Avoid the 4k pages check completely")

Fixes: 585948f4f695 ("x86/mm/cpa: Avoid the 4k pages check completely")
Cc: stable@vger.kernel.org # v4.20+
Cc: Thomas Gleixner <tglx@linutronix.de>
Cc: Dave Hansen <dave.hansen@intel.com>
Cc: Peter Zijlstra <peterz@infradead.org>
Signed-off-by: Song Liu <songliubraving@fb.com>
---
 arch/x86/mm/init_64.c     |  4 ++--
 arch/x86/mm/mm_internal.h |  4 ++++
 arch/x86/mm/pageattr.c    | 34 +++++++++++++++++++++++++---------
 3 files changed, 31 insertions(+), 11 deletions(-)