Message ID | 20230629082032.3481237-1-guoren@kernel.org (mailing list archive) |
---|---|
State | New, archived |
Series | riscv: pageattr: Fixup synchronization problem between init_mm and active_mm |
Context | Check | Description |
---|---|---|
conchuod/cover_letter | success | Single patches do not need cover letters |
conchuod/tree_selection | success | Guessed tree name to be fixes at HEAD b104dbedbe61 |
conchuod/fixes_present | success | Fixes tag present in non-next series |
conchuod/maintainers_pattern | success | MAINTAINERS pattern errors before the patch: 6 and now 6 |
conchuod/verify_signedoff | success | Signed-off-by tag matches author and committer |
conchuod/kdoc | success | Errors and warnings before: 0 this patch: 0 |
conchuod/build_rv64_clang_allmodconfig | success | Errors and warnings before: 8 this patch: 8 |
conchuod/module_param | success | Was 0 now: 0 |
conchuod/build_rv64_gcc_allmodconfig | success | Errors and warnings before: 8 this patch: 8 |
conchuod/build_rv32_defconfig | success | Build OK |
conchuod/dtb_warn_rv64 | success | Errors and warnings before: 20 this patch: 20 |
conchuod/header_inline | success | No static functions without inline keyword in header files |
conchuod/checkpatch | success | total: 0 errors, 0 warnings, 0 checks, 13 lines checked |
conchuod/build_rv64_nommu_k210_defconfig | success | Build OK |
conchuod/verify_fixes | success | Fixes tag looks correct |
conchuod/build_rv64_nommu_virt_defconfig | success | Build OK |
Hi Guo,

On 29/06/2023 10:20, guoren@kernel.org wrote:
> From: Guo Ren <guoren@linux.alibaba.com>
>
> The machine_kexec() uses set_memory_x to add the executable attribute to the
> page table entry of control_code_buffer. It only modifies the init_mm but not
> the current->active_mm. The current kexec process won't use init_mm directly,
> and it depends on minor_pagefault, which is removed by commit 7d3332be011e4

Is the removal of minor_pagefault an issue? I'm not sure I understand
this part of the changelog.

> ("riscv: mm: Pre-allocate PGD entries for vmalloc/modules area") of 64BIT. So,
> when it met pud mapping on an MMU_SV39 machine, it caused the following:
>
> kexec_core: Starting new kernel
> Will call new kernel at 00300000 from hart id 0
> FDT image at 747c7000
> Bye...
> Unable to handle kernel paging request at virtual address ffffffda23b0d000
> Oops [#1]
> Modules linked in:
> CPU: 0 PID: 53 Comm: uinit Not tainted 6.4.0-rc6 #15
> Hardware name: Sophgo Mango (DT)
> epc : 0xffffffda23b0d000
> ra : machine_kexec+0xa6/0xb0
> epc : ffffffda23b0d000 ra : ffffffff80008272 sp : ffffffc80c173d10
> gp : ffffffff8150e1e0 tp : ffffffd9073d2c40 t0 : 0000000000000000
> t1 : 0000000000000042 t2 : 6567616d69205444 s0 : ffffffc80c173d50
> s1 : ffffffd9076c4800 a0 : ffffffd9076c4800 a1 : 0000000000300000
> a2 : 00000000747c7000 a3 : 0000000000000000 a4 : ffffffd800000000
> a5 : 0000000000000000 a6 : ffffffd903619c40 a7 : ffffffffffffffff
> s2 : ffffffda23b0d000 s3 : 0000000000300000 s4 : 00000000747c7000
> s5 : 0000000000000000 s6 : 0000000000000000 s7 : 0000000000000000
> s8 : 0000000000000000 s9 : 0000000000000000 s10: 0000000000000000
> s11: 0000003f940001a0 t3 : ffffffff815351af t4 : ffffffff815351af
> t5 : ffffffff815351b0 t6 : ffffffc80c173b50
> status: 0000000200000100 badaddr: ffffffda23b0d000 cause: 000000000000000c
>
> Yes, Using set_memory_x API after boot has the limitation, and at least we
> should synchronize the current->active_mm to fix the problem.
>
> Fixes: d3ab332a5021 ("riscv: add ARCH_HAS_SET_MEMORY support")
> Signed-off-by: Guo Ren <guoren@linux.alibaba.com>
> Signed-off-by: Guo Ren <guoren@kernel.org>
> ---
>  arch/riscv/mm/pageattr.c | 7 +++++++
>  1 file changed, 7 insertions(+)
>
> diff --git a/arch/riscv/mm/pageattr.c b/arch/riscv/mm/pageattr.c
> index ea3d61de065b..23d169c4ee81 100644
> --- a/arch/riscv/mm/pageattr.c
> +++ b/arch/riscv/mm/pageattr.c
> @@ -123,6 +123,13 @@ static int __set_memory(unsigned long addr, int numpages, pgprot_t set_mask,
>  				    &masks);
>  	mmap_write_unlock(&init_mm);
>
> +	if (current->active_mm != &init_mm) {
> +		mmap_write_lock(current->active_mm);
> +		ret = walk_page_range_novma(current->active_mm, start, end,
> +					    &pageattr_ops, NULL, &masks);
> +		mmap_write_unlock(current->active_mm);
> +	}
> +
>  	flush_tlb_kernel_range(start, end);
>
>  	return ret;

I don't understand: any page table inherits the entries of
swapper_pg_dir (see pgd_alloc()), so any kernel page table entry is
"automatically" synchronized, so why should we synchronize one 4K entry
explicitly? A PGD entry would need to be synced, but not a PTE entry.
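The pgd_alloc() inheritance mechanism referenced here can be sketched with a small user-space model (a hedged illustration only; all names below are invented and this is not the kernel implementation). It shows why entries copied at PGD-allocation time keep lower-level changes visible through the shared next-level tables, while a top-level entry written into init_mm afterwards is not propagated to an already-allocated PGD:

```c
#include <stdint.h>
#include <string.h>

#define PTRS_PER_PGD 512

/* A "PGD entry" is modeled as a pointer to a next-level table. */
typedef uint64_t pte_t;

static pte_t lower_table[512];          /* shared next-level table */
static pte_t *init_pgd[PTRS_PER_PGD];   /* stand-in for swapper_pg_dir */

/* Model of pgd_alloc(): a new mm copies the kernel entries of init_mm. */
static void pgd_alloc_model(pte_t **task_pgd)
{
	memcpy(task_pgd, init_pgd, sizeof(init_pgd));
}

/* Returns 1 if the model behaves as described in the thread. */
static int pgd_inherit_demo(void)
{
	static pte_t *task_pgd[PTRS_PER_PGD];

	init_pgd[256] = lower_table;    /* entry present before pgd_alloc() */
	pgd_alloc_model(task_pgd);

	/* Lower-level change after allocation: visible via both PGDs,
	 * since both top-level entries point at the same table. */
	lower_table[3] = 0x42;
	if (task_pgd[256][3] != 0x42)
		return 0;

	/* Top-level change after allocation (e.g. a new leaf PGD entry
	 * written into init_mm only): NOT seen by the existing PGD. */
	init_pgd[300] = lower_table;
	return task_pgd[300] == NULL;   /* stale: still empty */
}
```

This model only illustrates the sharing-vs-copying distinction the thread turns on; the real kernel copies entries in pgd_alloc() via memcpy of the kernel half of swapper_pg_dir.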
On Mon, Jul 3, 2023 at 6:17 PM Alexandre Ghiti <alex@ghiti.fr> wrote:
>
> Hi Guo,
>
> On 29/06/2023 10:20, guoren@kernel.org wrote:
> > From: Guo Ren <guoren@linux.alibaba.com>
> >
> > The machine_kexec() uses set_memory_x to add the executable attribute to the
> > page table entry of control_code_buffer. It only modifies the init_mm but not
> > the current->active_mm. The current kexec process won't use init_mm directly,
> > and it depends on minor_pagefault, which is removed by commit 7d3332be011e4
>
> Is the removal of minor_pagefault an issue? I'm not sure I understand
> this part of the changelog.

I use two different work-around patches to answer your question:

1st:
-----
diff --git a/arch/riscv/mm/fault.c b/arch/riscv/mm/fault.c
index 705d63a59aec..b8b200c81606 100644
--- a/arch/riscv/mm/fault.c
+++ b/arch/riscv/mm/fault.c
@@ -249,7 +249,7 @@ void handle_page_fault(struct pt_regs *regs)
 	 * only copy the information from the master page table,
 	 * nothing more.
 	 */
-	if (unlikely((addr >= VMALLOC_START) && (addr < VMALLOC_END))) {
+	if (unlikely(addr >= 0x8000000000000000UL)) {
 		vmalloc_fault(regs, code, addr);
 		return;
 	}
------

2nd:
------
diff --git a/arch/riscv/mm/init.c b/arch/riscv/mm/init.c
index 8e65f0a953e5..270f50852886 100644
--- a/arch/riscv/mm/init.c
+++ b/arch/riscv/mm/init.c
@@ -1387,7 +1387,7 @@ static void __init create_linear_mapping_page_table(void)
 		if (end >= __pa(PAGE_OFFSET) + memory_limit)
 			end = __pa(PAGE_OFFSET) + memory_limit;
 
-		create_linear_mapping_range(start, end, 0);
+		create_linear_mapping_range(start, end, PMD_SIZE);
 	}
 
 #ifdef CONFIG_STRICT_KERNEL_RWX
------

The removal of minor_pagefault could be an issue, but in this case it
is the VMALLOC_START/END range check that prevents the minor pagefault
in the first place. I didn't say commit 7d3332be011e4 is the problem.

> > ("riscv: mm: Pre-allocate PGD entries for vmalloc/modules area") of 64BIT. So,
> > when it met pud mapping on an MMU_SV39 machine, it caused the following:
> >
> > [...]
>
> I don't understand: any page table inherits the entries of
> swapper_pg_dir (see pgd_alloc()), so any kernel page table entry is
> "automatically" synchronized, so why should we synchronize one 4K entry
> explicitly? A PGD entry would need to be synced, but not a PTE entry.

The purpose of the second walk_page_range_novma() is to synchronize the
pgd entries. I'm a bit lazy here, I agree: it's unnecessary to write
the lower-level entries again. So I would use a simple pgd entry
synchronization, as vmalloc_fault did, in the next version of the
patch, all right?

--
Best Regards
Guo Ren
On 04/07/2023 04:25, Guo Ren wrote:
> On Mon, Jul 3, 2023 at 6:17 PM Alexandre Ghiti <alex@ghiti.fr> wrote:
>> Is the removal of minor_pagefault an issue? I'm not sure I understand
>> this part of the changelog.
> I use two different work-around patches to answer your question:
>
> [...]
>
> The removal of minor_pagefault could be an issue, but in this case
> it's the VMALLOC_START/END which prevents the minor_pagefault at
> first. I didn't say commit 7d3332be011e4 is the problem.

Sorry, I still don't understand what you mean here and why you mention
the minor pagefault, could you explain again please?

>> I don't understand: any page table inherits the entries of
>> swapper_pg_dir (see pgd_alloc()), so any kernel page table entry is
>> "automatically" synchronized, so why should we synchronize one 4K entry
>> explicitly? A PGD entry would need to be synced, but not a PTE entry.
> The purpose of the second walk_page_range_novma() is for pgd's entries
> synchronization. I'm a bit lazy here, I agree, it's unnecessary to
> write lower level entries again. So I would use a simple pgd entries
> synchronization from vmalloc_fault in the next version of patch, all
> right?

But vmalloc_fault was removed by commit 7d3332be011e4 for CONFIG_64BIT,
so I don't get it: why would we need to synchronize a PGD entry in your
case? Where does this new PGD come from? And the trap address is
ffffffda23b0d000, which lies in the direct mapping, so why do you
mention vmalloc_fault at all?

Sorry if I'm missing something, hope you can clarify things!

Thanks,

Alex
On Wed, Jul 5, 2023 at 3:01 PM Alexandre Ghiti <alex@ghiti.fr> wrote:
>
> [...]
>
> But vmalloc_fault was removed by commit 7d3332be011e4 for CONFIG_64BIT,
> so I don't get it: why would we need to synchronize a PGD entry in your
> case? Where does this new PGD come from? And the trap address is
> ffffffda23b0d000, which lies in the direct mapping, so why do you
> mention vmalloc_fault at all?

The machine_kexec() uses set_memory_x to change the direct mapping
attributes from RW to RWX. But set_memory_x only changes the init_mm's
attributes, not current->active_mm's, so when kexec jumps into the
control_code_buffer an instruction page fault happens, and since there
is no minor pagefault handler for it anymore, the kernel panics.

I found the bug on an MMU_SV39 machine, where the direct mapping uses
1GB PUD mappings, i.e. leaf pgd entries. This patch solves the problem
by synchronizing the pgd entries between init_mm and
current->active_mm.

> Sorry if I'm missing something, hope you can clarify things!
>
> Thanks,
>
> Alex

--
Best Regards
Guo Ren
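The Sv39 split of the faulting address makes the 1GB-leaf point concrete; a small sketch of the index math (per the RISC-V privileged spec's Sv39 layout; this is illustrative, not kernel code). With a 1GB leaf mapping the walk stops at the top level, so only VPN[2] is consulted, and that entry sits in the per-process PGD page that was copied at pgd_alloc() time:

```c
#include <stdint.h>

/* Sv39 virtual address split: VPN[2] selects a 1GB region (top level,
 * i.e. the PGD on riscv), VPN[1] a 2MB region, VPN[0] a 4KB page. */
static unsigned int sv39_vpn(uint64_t va, int level)
{
	return (unsigned int)((va >> (12 + 9 * level)) & 0x1ff);
}
```

For badaddr ffffffda23b0d000 from the oops, sv39_vpn(va, 2) identifies the single top-level slot whose permissions differ between init_mm (updated by set_memory_x) and the stale copy in current->active_mm.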
+ fixed Paul's address

On 05/07/2023 11:15, Guo Ren wrote:
> [...]
>
> The machine_kexec() uses set_memory_x to modify the direct mapping
> attributes from RW to RWX. But set_memory_x only changes the init_mm's
> attributes, not current->active_mm, so when kexec jumps into
> control_buffer, the instruction page fault happens, and there is no
> minor_pagefault for it, then panic.
>
> I found the bug on an MMU_sv39 machine, and the direct mapping used a
> 1GB PUD, the pgd entries. This patch could solve the problem by
> synchronizing a PGD entry between init_mm and current->active_mm.

Ok, I get it now, thanks. So a few thoughts:

- The trap address is in the linear mapping, so there won't be any sync
  in the trap handler. Even if we implemented such behavior, waiting for
  a fault to synchronize those mappings is error-prone (see commit
  7d3332be011e4).
- This is Fixes: 3335068f8721 ("riscv: Use PUD/P4D/PGD pages for the
  linear mapping").
- That only happens if we use the largest possible mapping size for the
  current mode: one simple solution is to restrict ourselves to the
  second largest size (2MB for sv39, 1GB for sv48, 512GB for sv57).
- Your solution only works in this case, but we should fix the larger
  problem: we need to sync all the page tables if such things happen,
  not just the "calling" one.
- set_memory* changing the protection of a whole PUD mapping for a
  single page is a problem: if we set_memory_ro() a single page, we end
  up with a 1GB linear mapping set to read-only (which will obviously be
  problematic!).
- Can we really use set_memory* on the direct mapping? Maybe we should
  simply not, and the solution is to fix machine_kexec() so that it
  remaps this page outside the linear mapping.
- At the moment, we do not deal with the linear mapping in set_memory*
  (for example, when used with modules, we protect the module mapping in
  the vmalloc region but don't do anything with its linear mapping
  alias; I have had that on my todo list for a while now).

The best very short-term solution to me is to restrict ourselves to the
second largest size (easy to hack that into best_map_size()). And to me
the long-term solution is to split those mappings: that fixes all the
issues mentioned above plus the thing with the modules (but I have to
dig a bit more, no other arch seems to do that).

That's interesting, thanks for bringing that up, let me know what you
think, I'll keep digging!

Alex
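The short-term idea above, restricting best_map_size() to the second largest size, can be sketched in user space as follows (a simplified, hypothetical model for Sv39; the real kernel function also considers the virtual address and remaining range, and the capped variant here is an assumption about the proposed hack, not the actual implementation):

```c
#include <stdint.h>

#define PAGE_SIZE 0x1000ULL
#define PMD_SIZE  0x200000ULL     /* 2MB */
#define PUD_SIZE  0x40000000ULL   /* 1GB: largest Sv39 leaf, lives in the PGD */

/* Capped variant: never return PUD_SIZE, so the linear mapping never
 * uses leaf PGD entries, and a later set_memory_*() call never has to
 * split or synchronize a top-level entry across page tables. */
static uint64_t best_map_size_capped(uint64_t pa, uint64_t size)
{
	if (!(pa & (PMD_SIZE - 1)) && size >= PMD_SIZE)
		return PMD_SIZE;
	return PAGE_SIZE;
}
```

With this cap, a 1GB-aligned, 1GB-sized region still gets mapped with 2MB leaves, trading some TLB reach for page tables whose top level never diverges between init_mm and other mms.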
On Wed, Jul 5, 2023 at 8:06 PM Alexandre Ghiti <alex@ghiti.fr> wrote: > > + fixed Paul's address > > > On 05/07/2023 11:15, Guo Ren wrote: > > On Wed, Jul 5, 2023 at 3:01 PM Alexandre Ghiti <alex@ghiti.fr> wrote: > >> > >> On 04/07/2023 04:25, Guo Ren wrote: > >>> On Mon, Jul 3, 2023 at 6:17 PM Alexandre Ghiti <alex@ghiti.fr> wrote: > >>>> Hi Guo, > >>>> > >>>> On 29/06/2023 10:20, guoren@kernel.org wrote: > >>>>> From: Guo Ren <guoren@linux.alibaba.com> > >>>>> > >>>>> The machine_kexec() uses set_memory_x to add the executable attribute to the > >>>>> page table entry of control_code_buffer. It only modifies the init_mm but not > >>>>> the current->active_mm. The current kexec process won't use init_mm directly, > >>>>> and it depends on minor_pagefault, which is removed by commit 7d3332be011e4 > >>>> Is the removal of minor_pagefault an issue? I'm not sure I understand > >>>> this part of the changelog. > >>> I use two different work-around patches to answer your question: > >>> 1st: > >>> ----- > >>> diff --git a/arch/riscv/mm/fault.c b/arch/riscv/mm/fault.c > >>> index 705d63a59aec..b8b200c81606 100644 > >>> --- a/arch/riscv/mm/fault.c > >>> +++ b/arch/riscv/mm/fault.c > >>> @@ -249,7 +249,7 @@ void handle_page_fault(struct pt_regs *regs) > >>> * only copy the information from the master page table, > >>> * nothing more. 
> >>> */ > >>> - if (unlikely((addr >= VMALLOC_START) && (addr < VMALLOC_END))) { > >>> + if (unlikely(addr >= 0x8000000000000000UL)) { > >>> vmalloc_fault(regs, code, addr); > >>> return; > >>> } > >>> ------ > >>> > >>> 2nd: > >>> ------ > >>> diff --git a/arch/riscv/mm/init.c b/arch/riscv/mm/init.c > >>> index 8e65f0a953e5..270f50852886 100644 > >>> --- a/arch/riscv/mm/init.c > >>> +++ b/arch/riscv/mm/init.c > >>> @@ -1387,7 +1387,7 @@ static void __init create_linear_mapping_page_table(void) > >>> if (end >= __pa(PAGE_OFFSET) + memory_limit) > >>> end = __pa(PAGE_OFFSET) + memory_limit; > >>> > >>> - create_linear_mapping_range(start, end, 0); > >>> + create_linear_mapping_range(start, end, PMD_SIZE); > >>> } > >>> > >>> #ifdef CONFIG_STRICT_KERNEL_RWX > >>> ----- > >>> > >>> The removal of minor_pagefault could be an issue, but in this case > >>> it's the VMALLOC_START/END which prevents the minor_pagefault at > >>> first. I didn't say commit 7d3332be011e4 is the problem. > >> > >> Sorry I still don't understand what you mean here and why you mention > >> the minor pagefault, could you explain again please? > >> > >> > >>>>> ("riscv: mm: Pre-allocate PGD entries for vmalloc/modules area") of 64BIT. So, > >>>>> when it met pud mapping on an MMU_SV39 machine, it caused the following: > >>>>> > >>>>> kexec_core: Starting new kernel > >>>>> Will call new kernel at 00300000 from hart id 0 > >>>>> FDT image at 747c7000 > >>>>> Bye... 
> >>>>> Unable to handle kernel paging request at virtual address ffffffda23b0d000
> >>>>> Oops [#1]
> >>>>> Modules linked in:
> >>>>> CPU: 0 PID: 53 Comm: uinit Not tainted 6.4.0-rc6 #15
> >>>>> Hardware name: Sophgo Mango (DT)
> >>>>> epc : 0xffffffda23b0d000
> >>>>> ra : machine_kexec+0xa6/0xb0
> >>>>> epc : ffffffda23b0d000 ra : ffffffff80008272 sp : ffffffc80c173d10
> >>>>>  gp : ffffffff8150e1e0 tp : ffffffd9073d2c40 t0 : 0000000000000000
> >>>>>  t1 : 0000000000000042 t2 : 6567616d69205444 s0 : ffffffc80c173d50
> >>>>>  s1 : ffffffd9076c4800 a0 : ffffffd9076c4800 a1 : 0000000000300000
> >>>>>  a2 : 00000000747c7000 a3 : 0000000000000000 a4 : ffffffd800000000
> >>>>>  a5 : 0000000000000000 a6 : ffffffd903619c40 a7 : ffffffffffffffff
> >>>>>  s2 : ffffffda23b0d000 s3 : 0000000000300000 s4 : 00000000747c7000
> >>>>>  s5 : 0000000000000000 s6 : 0000000000000000 s7 : 0000000000000000
> >>>>>  s8 : 0000000000000000 s9 : 0000000000000000 s10: 0000000000000000
> >>>>>  s11: 0000003f940001a0 t3 : ffffffff815351af t4 : ffffffff815351af
> >>>>>  t5 : ffffffff815351b0 t6 : ffffffc80c173b50
> >>>>> status: 0000000200000100 badaddr: ffffffda23b0d000 cause: 000000000000000c
> >>>>>
> >>>>> Yes, Using set_memory_x API after boot has the limitation, and at least we
> >>>>> should synchronize the current->active_mm to fix the problem.
> >>>>>
> >>>>> Fixes: d3ab332a5021 ("riscv: add ARCH_HAS_SET_MEMORY support")
> >>>>> Signed-off-by: Guo Ren <guoren@linux.alibaba.com>
> >>>>> Signed-off-by: Guo Ren <guoren@kernel.org>
> >>>>> ---
> >>>>>   arch/riscv/mm/pageattr.c | 7 +++++++
> >>>>>   1 file changed, 7 insertions(+)
> >>>>>
> >>>>> diff --git a/arch/riscv/mm/pageattr.c b/arch/riscv/mm/pageattr.c
> >>>>> index ea3d61de065b..23d169c4ee81 100644
> >>>>> --- a/arch/riscv/mm/pageattr.c
> >>>>> +++ b/arch/riscv/mm/pageattr.c
> >>>>> @@ -123,6 +123,13 @@ static int __set_memory(unsigned long addr, int numpages, pgprot_t set_mask,
> >>>>>                                &masks);
> >>>>>        mmap_write_unlock(&init_mm);
> >>>>>
> >>>>> +     if (current->active_mm != &init_mm) {
> >>>>> +             mmap_write_lock(current->active_mm);
> >>>>> +             ret = walk_page_range_novma(current->active_mm, start, end,
> >>>>> +                                         &pageattr_ops, NULL, &masks);
> >>>>> +             mmap_write_unlock(current->active_mm);
> >>>>> +     }
> >>>>> +
> >>>>>        flush_tlb_kernel_range(start, end);
> >>>>>
> >>>>>        return ret;
> >>>> I don't understand: any page table inherits the entries of
> >>>> swapper_pg_dir (see pgd_alloc()), so any kernel page table entry is
> >>>> "automatically" synchronized, so why should we synchronize one 4K entry
> >>>> explicitly? A PGD entry would need to be synced, but not a PTE entry.
> >>> The purpose of the second walk_page_range_novma() is for pgd's entries
> >>> synchronization. I'm a bit lazy here, I agree, it's unnecessary to
> >>> write lower level entries again. So I would use a simple pgd entries
> >>> synchronization from vmalloc_fault in the next version of patch, all
> >>> right?
> >>
> >> But vmalloc_fault was removed by commit 7d3332be011e4 for CONFIG_64BIT,
> >> so I don't get it: why would we need to synchronize a PGD entry in your
> >> case? Where does this new PGD come from? And the trap address is
> >> ffffffda23b0d000, which lies in the direct mapping, so why do you
> >> mention vmalloc_fault at all?
> > The machine_kexec() uses set_memory_x to modify the direct mapping
> > attributes from RW to RWX. But set_memory_x only changes the init_mm's
> > attributes, not current->active_mm, so when kexec jumps into
> > control_buffer, the instruction page fault happens, and there is no
> > minor_pagefault for it, then panic.
> >
> > I found the bug on an MMU_sv39 machine, and the direct mapping used a
> > 1GB PUD, the pgd entries. This patch could solve the problem by
> > synchronizing a PGD entry between init_mm and current->active_mm.
>
> Ok I get it now, thanks. So a few thoughts:
>
> - the trap address is in the linear mapping, so there won't be any sync
> in the trap handler. Even if we implemented such behavior, waiting for a
> fault to synchronize those mappings is error-prone (see commit
> 7d3332be011e4)
>
> - this is Fixes: 3335068f8721 ("riscv: Use PUD/P4D/PGD pages for the
> linear mapping")
>
> - that only happens if we use the largest mapping possible size for the
> current mode: one simple solution is to restrict to the second largest
> size (2MB for sv39, 1GB for sv48, 512GB for sv57).
>
> - your solution only works in this case, but we should fix the larger
> problem: we need to sync all the page tables if such things happen, not
> just the "calling" one.
>
> - set_memory* changing the protection of a whole PUD mapping for a
> single page is a problem: if we set_memory_ro() a single page, we end up
> with a 1GB linear mapping set to read only (which will obviously be
> problematic!)
>
> - can we really use set_memory* on the direct mapping? Maybe we should
> simply not, and the solution is to fix machine_kexec() so it remaps this
> page outside the linear mapping.

Okay, I agree to limit the usage of set_memory*. And I would let
machine_kexec() remap control_buffer_page in the next version of the
patch instead of modifying the linear mapping.
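Alex's point about whole-PUD protection changes can be illustrated with a toy model: if a huge leaf covering many pages is not split before a single-page protection change, the change hits every page under that leaf. The sketch below (plain C, not kernel code; all names are illustrative, not the actual riscv implementation) shows the split-before-change idea, where the huge leaf is first broken into per-page entries so only the target page is affected:

```c
/* Toy model of splitting a huge leaf mapping before changing the
 * protection of a single page inside it. NOT kernel code; names
 * (struct region, split, set_page_flags) are illustrative only. */
#include <assert.h>
#include <stdbool.h>

#define PAGES 8

struct region {
	bool huge;                /* one leaf covering all PAGES? */
	unsigned long huge_flags; /* flags while mapped as one huge leaf */
	unsigned long pte[PAGES]; /* per-page flags after a split */
};

/* Break the huge leaf into per-page entries with identical flags. */
static void split(struct region *r)
{
	int i;

	if (!r->huge)
		return;
	for (i = 0; i < PAGES; i++)
		r->pte[i] = r->huge_flags;
	r->huge = false;
}

/* Change one page's flags; split first so neighbours keep theirs. */
static void set_page_flags(struct region *r, int page, unsigned long flags)
{
	split(r);
	r->pte[page] = flags;
}
```

Without the split() call, the only option would be rewriting huge_flags, which is exactly the "set_memory_ro() one page, get a 1GB read-only mapping" problem described above.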
> >
> > - at the moment, we do not deal with the linear mapping in set_memory*
> > (for example, when used with modules, we protect the module mapping in
> > the vmalloc region but don't do anything with its linear mapping alias,
> > I have that on my todo list for a moment now).
>
> The best very short-term solution to me is to restrict to the second
> largest size (easy to hack that in best_map_size()).
>
> And to me the long term solution is to split those mappings: that fixes
> all the issues mentioned above + the thing with the modules (but I have
> to dig a bit more, no other archs seems to do that).
>
> That's interesting, thanks for bringing that up, let me know what you
> think, I'll keep digging!
>
> Alex
>
>
> >> Sorry if I'm missing something, hope you can clarify things!
> >>
> >> Thanks,
> >>
> >> Alex
> >>
> >>
> >>>
> >>> --
> >>> Best Regards
> >>> Guo Ren
> >>>
> >>> _______________________________________________
> >>> linux-riscv mailing list
> >>> linux-riscv@lists.infradead.org
> >>> http://lists.infradead.org/mailman/listinfo/linux-riscv
> >
> >
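The short-term restriction Alex suggests for best_map_size() can be sketched as follows (plain C, not the actual arch/riscv/mm/init.c code; the constants are sv39-style examples and the TOY_* names are invented for this sketch). The idea is simply to never return the top-level block size, so the linear mapping can never contain a top-level huge leaf:

```c
/* Sketch of capping the linear-mapping block size at the second
 * largest size (2MB for sv39). NOT the real best_map_size(); the
 * TOY_* constants and this simplified logic are illustrative only. */
#include <assert.h>
#include <stdint.h>

#define TOY_PAGE_SIZE ((uint64_t)4096)
#define TOY_PMD_SIZE  ((uint64_t)2 * 1024 * 1024)        /* 2 MiB */
#define TOY_PUD_SIZE  ((uint64_t)1024 * 1024 * 1024)     /* 1 GiB: never used */

static uint64_t best_map_size(uint64_t pa, uint64_t va, uint64_t size)
{
	/* Cap at TOY_PMD_SIZE: the TOY_PUD_SIZE case is skipped entirely,
	 * which is the proposed short-term fix. */
	if (!((pa | va) & (TOY_PMD_SIZE - 1)) && size >= TOY_PMD_SIZE)
		return TOY_PMD_SIZE;
	return TOY_PAGE_SIZE;
}
```

With this cap, even a region that is 1GB-aligned and 1GB-sized gets mapped with 2MB blocks, so a later set_memory_*() on one page only ever shares a leaf with a 2MB neighbourhood rather than 1GB.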
diff --git a/arch/riscv/mm/pageattr.c b/arch/riscv/mm/pageattr.c
index ea3d61de065b..23d169c4ee81 100644
--- a/arch/riscv/mm/pageattr.c
+++ b/arch/riscv/mm/pageattr.c
@@ -123,6 +123,13 @@ static int __set_memory(unsigned long addr, int numpages, pgprot_t set_mask,
 			    &masks);
 	mmap_write_unlock(&init_mm);
 
+	if (current->active_mm != &init_mm) {
+		mmap_write_lock(current->active_mm);
+		ret = walk_page_range_novma(current->active_mm, start, end,
+					    &pageattr_ops, NULL, &masks);
+		mmap_write_unlock(current->active_mm);
+	}
+
 	flush_tlb_kernel_range(start, end);
 
 	return ret;
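Alex's pgd_alloc() observation in the thread — that lower-level tables are shared, so only a top-level (PGD) entry ever needs syncing — can be illustrated with a toy two-level model (plain C, not kernel code; struct and function names are invented for this sketch). Once the top-level slot is copied, both "mm"s point at the same lower-level table, so any later leaf change made through one is visible through the other with no further synchronization:

```c
/* Toy two-level page-table model. NOT kernel code: toy_mm, toy_table
 * and sync_pgd_entry are illustrative names for this sketch only. */
#include <assert.h>
#include <stddef.h>

#define ENTRIES 4

struct toy_pte   { unsigned long flags; };
struct toy_table { struct toy_pte pte[ENTRIES]; };
struct toy_mm    { struct toy_table *pgd[ENTRIES]; };

/* Copy one top-level slot: the lower-level table becomes shared. */
static void sync_pgd_entry(struct toy_mm *dst, const struct toy_mm *src, int i)
{
	dst->pgd[i] = src->pgd[i];
}

/* Set a leaf attribute through one mm only (pretend bit 0 is X). */
static void set_exec(struct toy_mm *mm, int pgd_i, int pte_i)
{
	mm->pgd[pgd_i]->pte[pte_i].flags |= 1;
}

static unsigned long read_flags(const struct toy_mm *mm, int pgd_i, int pte_i)
{
	return mm->pgd[pgd_i]->pte[pte_i].flags;
}
```

This is also why the failure in the thread only appears with huge leaf mappings at the top level: a 1GB PUD leaf lives *in* the top-level entry itself rather than behind a shared lower-level table, so modifying it in init_mm does not propagate to page tables that copied the old entry.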