diff mbox series

Linux Xen PV CPA W^X violation false-positives

Message ID 20240124165401.35784-1-jandryuk@gmail.com (mailing list archive)
State New, archived
Headers show
Series Linux Xen PV CPA W^X violation false-positives | expand

Commit Message

Jason Andryuk Jan. 24, 2024, 4:54 p.m. UTC
Xen PV domains show CPA W^X violations like:

CPA detected W^X violation: 0000000000000064 -> 0000000000000067 range: 0xffff888000010000 - 0xffff888000010fff PFN 10
WARNING: CPU: 0 PID: 30 at arch/x86/mm/pat/set_memory.c:613 __change_page_attr_set_clr+0x113a/0x11c0
Modules linked in: xt_physdev xt_MASQUERADE iptable_nat nf_nat nf_conntrack libcrc32c nf_defrag_ipv4 ip_tables x_tables xen_argo(O)
CPU: 0 PID: 30 Comm: kworker/0:2 Tainted: G           O       6.1.38 #1
Workqueue: events bpf_prog_free_deferred
RIP: e030:__change_page_attr_set_clr+0x113a/0x11c0
Code: 4c 89 f1 4c 89 e2 4c 89 d6 4c 89 8d 70 ff ff ff 4d 8d 86 ff 0f 00 00 48 c7 c7 f0 3c da 81 c6 05 d0 0e 0e 01 01 e8 f6 71 00 00 <0f> 0b 4c 8b 8d 70 ff ff ff e9 2a fd ff ff 48 8b 85 60 ff ff ff 48
RSP: e02b:ffffc90000367c48 EFLAGS: 00010282
RAX: 0000000000000000 RBX: 000ffffffffef064 RCX: 0000000000000000
RDX: 0000000000000003 RSI: 00000000fffff7ff RDI: 00000000ffffffff
RBP: ffffc90000367d48 R08: 0000000000000000 R09: ffffc90000367aa0
R10: 0000000000000001 R11: 0000000000000001 R12: 0000000000000067
R13: 0000000000000001 R14: ffff888000010000 R15: ffffc90000367d60
FS:  0000000000000000(0000) GS:ffff88800b800000(0000) knlGS:0000000000000000
CS:  e030 DS: 0000 ES: 0000 CR0: 0000000080050033
CR2: 00007fdbaeda01c0 CR3: 0000000004312000 CR4: 0000000000050660
Call Trace:
 <TASK>
 ? show_regs.cold+0x1a/0x1f
 ? __change_page_attr_set_clr+0x113a/0x11c0
 ? __warn+0x7b/0xc0
 ? __change_page_attr_set_clr+0x113a/0x11c0
 ? report_bug+0x111/0x1a0
 ? handle_bug+0x4d/0xa0
 ? exc_invalid_op+0x19/0x70
 ? asm_exc_invalid_op+0x1b/0x20
 ? __change_page_attr_set_clr+0x113a/0x11c0
 ? __change_page_attr_set_clr+0x113a/0x11c0
 ? debug_smp_processor_id+0x17/0x20
 ? ___cache_free+0x2e/0x1e0
 ? _raw_spin_unlock+0x1e/0x40
 ? __purge_vmap_area_lazy+0x2ea/0x6b0
 set_direct_map_default_noflush+0x7c/0xa0
 __vunmap+0x1ac/0x280
 __vfree+0x1d/0x60
 vfree+0x27/0x40
 __bpf_prog_free+0x44/0x50
 bpf_prog_free_deferred+0x104/0x120
 process_one_work+0x1ca/0x3d0
 ? process_one_work+0x3d0/0x3d0
 worker_thread+0x45/0x3c0
 ? process_one_work+0x3d0/0x3d0
 kthread+0xe2/0x110
 ? kthread_complete_and_exit+0x20/0x20
 ret_from_fork+0x1f/0x30
 </TASK>
---[ end trace 0000000000000000 ]---

Xen provides a set of page tables that the guest executes out of when it
starts.  The L1 entries are shared between level2_ident_pgt and
level2_kernel_pgt, and xen_setup_kernel_pagetable() sets the NX bit in
the level2_ident_pgt entries.  verify_rwx() only checks the l1 entry and
reports a false-positive violation.

Here is a dump of some kernel virtual addresses and the corresponding
L1 and L2 entries:
This is the start of the directmap (ident) and they have NX (bit 63) set
in the PMD.
ndvm-pv (1): [    0.466778] va=ffff888000000000 pte=0010000000000027 level: 1
ndvm-pv (1): [    0.466788] va=ffff888000000000 pmd=800000000242c067 level: 2
Directmap for kernel text:
ndvm-pv (1): [    0.466795] va=ffff888001000000 pte=0010000001000065 level: 1
ndvm-pv (1): [    0.466801] va=ffff888001000000 pmd=8000000002434067 level: 2
ndvm-pv (1): [    0.466807] va=ffff888001010000 pte=0010000001010065 level: 1
ndvm-pv (1): [    0.466814] va=ffff888001010000 pmd=8000000002434067 level: 2
The start of the kernel text highmap is unmapped:
ndvm-pv (1): [    0.466820] va=ffffffff80000000 pte=0000000000000000 level: 3
ndvm-pv (1): [    0.466826] va=ffffffff80000000 pmd=0000000000000000 level: 3
Kernel PMD for .text has NX bit clear
ndvm-pv (1): [    0.466832] va=ffffffff81000000 pte=0010000001000065 level: 1
ndvm-pv (1): [    0.466838] va=ffffffff81000000 pmd=0000000002434067 level: 2
Kernel PTE for rodata_end has NX bit set
ndvm-pv (1): [    0.466846] va=ffffffff81e62000 pte=8010000001e62025 level: 1
ndvm-pv (1): [    0.466874] va=ffffffff81e62000 pmd=000000000243b067 level: 2
Directmap of rodata_end
ndvm-pv (1): [    0.466907] va=ffff888001e62000 pte=8010000001e62025 level: 1
ndvm-pv (1): [    0.466913] va=ffff888001e62000 pmd=800000000243b067 level: 2
Directmap of a low RAM address
ndvm-pv (1): [    0.466920] va=ffff888000010000 pte=0010000000010027 level: 1
ndvm-pv (1): [    0.466926] va=ffff888000010000 pmd=800000000242c067 level: 2
Directmap of another RAM address close to but below kernel text
ndvm-pv (1): [    0.466932] va=ffff88800096c000 pte=001000000096c027 level: 1
ndvm-pv (1): [    0.466938] va=ffff88800096c000 pmd=8000000002430067 level: 2

Here are some L2 entries showing the differing NX bits for l2_ident vs.
l2_kernel while they point at the same L1 addresses
ndvm-pv (1): [    0.466944]  l2_ident[  0] pmd=800000000242c067
ndvm-pv (1): [    0.466949]  l2_ident[  1] pmd=800000000242d067
ndvm-pv (1): [    0.466955]  l2_ident[  8] pmd=8000000002434067
ndvm-pv (1): [    0.466959]  l2_ident[  9] pmd=8000000002435067
ndvm-pv (1): [    0.466964]  l2_ident[ 14] pmd=800000000243a067
ndvm-pv (1): [    0.466969]  l2_ident[ 15] pmd=800000000243b067
ndvm-pv (1): [    0.466974] l2_kernel[  8] pmd=0000000002434067
ndvm-pv (1): [    0.466979] l2_kernel[  9] pmd=0000000002435067
ndvm-pv (1): [    0.466984] l2_kernel[ 14] pmd=000000000243a067
ndvm-pv (1): [    0.466989] l2_kernel[ 15] pmd=000000000243b067

One option is to add a fallback check for verify_rwx() to check the PMD
permissions to silence the warning.  Something like below.  I think it's
not readily generalizable as it hardcodes checking the PMD.  That works
for Xen where L1 PTEs are always used, but wouldn't work for Non-Xen.

The other option would be to duplicate L1 page tables.  Xen PV doesn't
support large pages, so the kernel highmap can't use large pages.  The
increased memory would add up though.

Regards,
Jason
---
 arch/x86/mm/pat/set_memory.c | 14 ++++++++++++++
 1 file changed, 14 insertions(+)

Comments

Jürgen Groß March 27, 2024, 11:46 a.m. UTC | #1
On 24.01.24 17:54, Jason Andryuk wrote:
> Xen PV domains show CPA W^X violations like:
> 
> CPA detected W^X violation: 0000000000000064 -> 0000000000000067 range: 0xffff888000010000 - 0xffff888000010fff PFN 10
> WARNING: CPU: 0 PID: 30 at arch/x86/mm/pat/set_memory.c:613 __change_page_attr_set_clr+0x113a/0x11c0
> Modules linked in: xt_physdev xt_MASQUERADE iptable_nat nf_nat nf_conntrack libcrc32c nf_defrag_ipv4 ip_tables x_tables xen_argo(O)
> CPU: 0 PID: 30 Comm: kworker/0:2 Tainted: G           O       6.1.38 #1
> Workqueue: events bpf_prog_free_deferred
> RIP: e030:__change_page_attr_set_clr+0x113a/0x11c0
> Code: 4c 89 f1 4c 89 e2 4c 89 d6 4c 89 8d 70 ff ff ff 4d 8d 86 ff 0f 00 00 48 c7 c7 f0 3c da 81 c6 05 d0 0e 0e 01 01 e8 f6 71 00 00 <0f> 0b 4c 8b 8d 70 ff ff ff e9 2a fd ff ff 48 8b 85 60 ff ff ff 48
> RSP: e02b:ffffc90000367c48 EFLAGS: 00010282
> RAX: 0000000000000000 RBX: 000ffffffffef064 RCX: 0000000000000000
> RDX: 0000000000000003 RSI: 00000000fffff7ff RDI: 00000000ffffffff
> RBP: ffffc90000367d48 R08: 0000000000000000 R09: ffffc90000367aa0
> R10: 0000000000000001 R11: 0000000000000001 R12: 0000000000000067
> R13: 0000000000000001 R14: ffff888000010000 R15: ffffc90000367d60
> FS:  0000000000000000(0000) GS:ffff88800b800000(0000) knlGS:0000000000000000
> CS:  e030 DS: 0000 ES: 0000 CR0: 0000000080050033
> CR2: 00007fdbaeda01c0 CR3: 0000000004312000 CR4: 0000000000050660
> Call Trace:
>   <TASK>
>   ? show_regs.cold+0x1a/0x1f
>   ? __change_page_attr_set_clr+0x113a/0x11c0
>   ? __warn+0x7b/0xc0
>   ? __change_page_attr_set_clr+0x113a/0x11c0
>   ? report_bug+0x111/0x1a0
>   ? handle_bug+0x4d/0xa0
>   ? exc_invalid_op+0x19/0x70
>   ? asm_exc_invalid_op+0x1b/0x20
>   ? __change_page_attr_set_clr+0x113a/0x11c0
>   ? __change_page_attr_set_clr+0x113a/0x11c0
>   ? debug_smp_processor_id+0x17/0x20
>   ? ___cache_free+0x2e/0x1e0
>   ? _raw_spin_unlock+0x1e/0x40
>   ? __purge_vmap_area_lazy+0x2ea/0x6b0
>   set_direct_map_default_noflush+0x7c/0xa0
>   __vunmap+0x1ac/0x280
>   __vfree+0x1d/0x60
>   vfree+0x27/0x40
>   __bpf_prog_free+0x44/0x50
>   bpf_prog_free_deferred+0x104/0x120
>   process_one_work+0x1ca/0x3d0
>   ? process_one_work+0x3d0/0x3d0
>   worker_thread+0x45/0x3c0
>   ? process_one_work+0x3d0/0x3d0
>   kthread+0xe2/0x110
>   ? kthread_complete_and_exit+0x20/0x20
>   ret_from_fork+0x1f/0x30
>   </TASK>
> ---[ end trace 0000000000000000 ]---
> 
> Xen provides a set of page tables that the guest executes out of when it
> starts.  The L1 entries are shared between level2_ident_pgt and
> level2_kernel_pgt, and xen_setup_kernel_pagetable() sets the NX bit in
> the level2_ident_pgt entries.  verify_rwx() only checks the l1 entry and
> reports a false-positive violation.
> 
> Here is a dump of some kernel virtual addresses and the corresponding
> L1 and L2 entries:
> This is the start of the directmap (ident) and they have NX (bit 63) set
> in the PMD.
> ndvm-pv (1): [    0.466778] va=ffff888000000000 pte=0010000000000027 level: 1
> ndvm-pv (1): [    0.466788] va=ffff888000000000 pmd=800000000242c067 level: 2
> Directmap for kernel text:
> ndvm-pv (1): [    0.466795] va=ffff888001000000 pte=0010000001000065 level: 1
> ndvm-pv (1): [    0.466801] va=ffff888001000000 pmd=8000000002434067 level: 2
> ndvm-pv (1): [    0.466807] va=ffff888001010000 pte=0010000001010065 level: 1
> ndvm-pv (1): [    0.466814] va=ffff888001010000 pmd=8000000002434067 level: 2
> The start of the kernel text highmap is unmapped:
> ndvm-pv (1): [    0.466820] va=ffffffff80000000 pte=0000000000000000 level: 3
> ndvm-pv (1): [    0.466826] va=ffffffff80000000 pmd=0000000000000000 level: 3
> Kernel PMD for .text has NX bit clear
> ndvm-pv (1): [    0.466832] va=ffffffff81000000 pte=0010000001000065 level: 1
> ndvm-pv (1): [    0.466838] va=ffffffff81000000 pmd=0000000002434067 level: 2
> Kernel PTE for rodata_end has NX bit set
> ndvm-pv (1): [    0.466846] va=ffffffff81e62000 pte=8010000001e62025 level: 1
> ndvm-pv (1): [    0.466874] va=ffffffff81e62000 pmd=000000000243b067 level: 2
> Directmap of rodata_end
> ndvm-pv (1): [    0.466907] va=ffff888001e62000 pte=8010000001e62025 level: 1
> ndvm-pv (1): [    0.466913] va=ffff888001e62000 pmd=800000000243b067 level: 2
> Directmap of a low RAM address
> ndvm-pv (1): [    0.466920] va=ffff888000010000 pte=0010000000010027 level: 1
> ndvm-pv (1): [    0.466926] va=ffff888000010000 pmd=800000000242c067 level: 2
> Directmap of another RAM address close to but below kernel text
> ndvm-pv (1): [    0.466932] va=ffff88800096c000 pte=001000000096c027 level: 1
> ndvm-pv (1): [    0.466938] va=ffff88800096c000 pmd=8000000002430067 level: 2
> 
> Here are some L2 entries showing the differing NX bits for l2_ident vs.
> l2_kernel while they point at the same L1 addresses
> ndvm-pv (1): [    0.466944]  l2_ident[  0] pmd=800000000242c067
> ndvm-pv (1): [    0.466949]  l2_ident[  1] pmd=800000000242d067
> ndvm-pv (1): [    0.466955]  l2_ident[  8] pmd=8000000002434067
> ndvm-pv (1): [    0.466959]  l2_ident[  9] pmd=8000000002435067
> ndvm-pv (1): [    0.466964]  l2_ident[ 14] pmd=800000000243a067
> ndvm-pv (1): [    0.466969]  l2_ident[ 15] pmd=800000000243b067
> ndvm-pv (1): [    0.466974] l2_kernel[  8] pmd=0000000002434067
> ndvm-pv (1): [    0.466979] l2_kernel[  9] pmd=0000000002435067
> ndvm-pv (1): [    0.466984] l2_kernel[ 14] pmd=000000000243a067
> ndvm-pv (1): [    0.466989] l2_kernel[ 15] pmd=000000000243b067
> 
> One option is to add a fallback check for verify_rwx() to check the PMD
> permissions to silence the warning.  Something like below.  I think it's
> not readily generalizable as it hardcodes checking the PMD.  That works
> for Xen where L1 PTEs are always used, but wouldn't work for Non-Xen.

I don't think this would be a real issue, as it is only Xen PV code setting
the NX bit is PMD entries.

And in case it really becomes an issue, the higher level page tables could be
checked, too.

> The other option would be to duplicate L1 page tables.  Xen PV doesn't
> support large pages, so the kernel highmap can't use large pages.  The
> increased memory would add up though.

Indeed, so I don't think this would be a good idea.

> 
> Regards,
> Jason
> ---
>   arch/x86/mm/pat/set_memory.c | 14 ++++++++++++++
>   1 file changed, 14 insertions(+)
> 
> diff --git a/arch/x86/mm/pat/set_memory.c b/arch/x86/mm/pat/set_memory.c
> index e9b448d1b1b7..904129b411ee 100644
> --- a/arch/x86/mm/pat/set_memory.c
> +++ b/arch/x86/mm/pat/set_memory.c
> @@ -641,6 +641,20 @@ static inline pgprot_t verify_rwx(pgprot_t old, pgprot_t new, unsigned long star
>   	if ((pgprot_val(new) & (_PAGE_RW | _PAGE_NX)) != _PAGE_RW)
>   		return new;
>   
> +	if ((pgprot_val(new) & (_PAGE_RW | _PAGE_NX)) == _PAGE_RW) {

This if is a little bit strange, as the condition can't ever be false.

I'd rather test "if (npg == 1)" as this is the case where a PMD entry is really
existing, ...

> +		pmd_t *pmd = lookup_pmd_address(start);
> +
> +		if (pmd && pmd_val(*pmd) & _PAGE_NX) {

... removing the need to test for pmd to be not NULL.

> +			pr_debug_once("CPA PMD 0x%016lx NX prevents PTE W^X violation: %016llx -> %016llx range: 0x%016lx - 0x%016lx PFN %lx\n",
> +				      pmd_flags(*pmd),
> +				      (unsigned long long)pgprot_val(old),
> +				      (unsigned long long)pgprot_val(new),
> +				      start, end, pfn);

I'd scratch that pr_debug(), as it doesn't really help.

> +
> +			return new;
> +		}
> +	}
> +
>   	end = start + npg * PAGE_SIZE - 1;
>   	WARN_ONCE(1, "CPA detected W^X violation: %016llx -> %016llx range: 0x%016lx - 0x%016lx PFN %lx\n",
>   		  (unsigned long long)pgprot_val(old),

Jason, do you want to send a V2 with your Signed-off, or would you like me to
try upstreaming the patch?


Juergen
Jason Andryuk March 28, 2024, 1:24 a.m. UTC | #2
On Wed, Mar 27, 2024 at 7:46 AM Jürgen Groß <jgross@suse.com> wrote:
>
> On 24.01.24 17:54, Jason Andryuk wrote:
> > +
> > +                     return new;
> > +             }
> > +     }
> > +
> >       end = start + npg * PAGE_SIZE - 1;
> >       WARN_ONCE(1, "CPA detected W^X violation: %016llx -> %016llx range: 0x%016lx - 0x%016lx PFN %lx\n",
> >                 (unsigned long long)pgprot_val(old),
>
> Jason, do you want to send a V2 with your Signed-off, or would you like me to
> try upstreaming the patch?

Hi Jürgen,

Yes, please upstream your approach.  I wasn't sure how to deal with
it, so it was more of a bug report.

Thanks,
Jason
Jürgen Groß March 28, 2024, 1 p.m. UTC | #3
Hi Jason,

On 28.03.24 02:24, Jason Andryuk wrote:
> On Wed, Mar 27, 2024 at 7:46 AM Jürgen Groß <jgross@suse.com> wrote:
>>
>> On 24.01.24 17:54, Jason Andryuk wrote:
>>> +
>>> +                     return new;
>>> +             }
>>> +     }
>>> +
>>>        end = start + npg * PAGE_SIZE - 1;
>>>        WARN_ONCE(1, "CPA detected W^X violation: %016llx -> %016llx range: 0x%016lx - 0x%016lx PFN %lx\n",
>>>                  (unsigned long long)pgprot_val(old),
>>
>> Jason, do you want to send a V2 with your Signed-off, or would you like me to
>> try upstreaming the patch?
> 
> Hi Jürgen,
> 
> Yes, please upstream your approach.  I wasn't sure how to deal with
> it, so it was more of a bug report.

The final solution was a bit more complicated, as there are some
corner cases to be considered. OTOH it is now complete by looking
at all used translation entries.

Are you able to test the attached patch? I don't see the original
issue and can only verify the patch doesn't cause any regression.


Juergen
Jason Andryuk March 29, 2024, 1:25 a.m. UTC | #4
On Thu, Mar 28, 2024 at 9:00 AM Jürgen Groß <jgross@suse.com> wrote:
>
> Hi Jason,
>
> On 28.03.24 02:24, Jason Andryuk wrote:
> > On Wed, Mar 27, 2024 at 7:46 AM Jürgen Groß <jgross@suse.com> wrote:
> >>
> >> On 24.01.24 17:54, Jason Andryuk wrote:
> >>> +
> >>> +                     return new;
> >>> +             }
> >>> +     }
> >>> +
> >>>        end = start + npg * PAGE_SIZE - 1;
> >>>        WARN_ONCE(1, "CPA detected W^X violation: %016llx -> %016llx range: 0x%016lx - 0x%016lx PFN %lx\n",
> >>>                  (unsigned long long)pgprot_val(old),
> >>
> >> Jason, do you want to send a V2 with your Signed-off, or would you like me to
> >> try upstreaming the patch?
> >
> > Hi Jürgen,
> >
> > Yes, please upstream your approach.  I wasn't sure how to deal with
> > it, so it was more of a bug report.
>
> The final solution was a bit more complicated, as there are some
> corner cases to be considered. OTOH it is now complete by looking
> at all used translation entries.
>
> Are you able to test the attached patch? I don't see the original
> issue and can only verify the patch doesn't cause any regression.

I'm no longer involved with OpenXT, but I reached out to some of the
developers.  Hopefully they try this out and respond here.

The backtrace in this thread is from BPF - I don't know how that was
triggered.  The other case I saw was in dom0.  That looked like it was
from freeing a module's (nouveau) .init section.  I don't seem to be
able to reproduce that on a non-OpenXT box.

Thanks,
Jason
Anthony PERARD April 8, 2024, 10:22 a.m. UTC | #5
On Thu, Mar 28, 2024 at 02:00:14PM +0100, Jürgen Groß wrote:
> Hi Jason,
> 
> On 28.03.24 02:24, Jason Andryuk wrote:
> > On Wed, Mar 27, 2024 at 7:46 AM Jürgen Groß <jgross@suse.com> wrote:
> > > 
> > > On 24.01.24 17:54, Jason Andryuk wrote:
> > > > +
> > > > +                     return new;
> > > > +             }
> > > > +     }
> > > > +
> > > >        end = start + npg * PAGE_SIZE - 1;
> > > >        WARN_ONCE(1, "CPA detected W^X violation: %016llx -> %016llx range: 0x%016lx - 0x%016lx PFN %lx\n",
> > > >                  (unsigned long long)pgprot_val(old),
> > > 
> > > Jason, do you want to send a V2 with your Signed-off, or would you like me to
> > > try upstreaming the patch?
> > 
> > Hi Jürgen,
> > 
> > Yes, please upstream your approach.  I wasn't sure how to deal with
> > it, so it was more of a bug report.
> 
> The final solution was a bit more complicated, as there are some
> corner cases to be considered. OTOH it is now complete by looking
> at all used translation entries.
> 
> Are you able to test the attached patch? I don't see the original
> issue and can only verify the patch doesn't cause any regression.
> 
> 
> Juergen

Hi Jürgen,

I gave a try to the patch in this email with osstest, and I can't find a
single "CPA detected W^X violation" log entry, when there's seems to be
many in osstest in general, from dom0 it seems as it's on the host
serial console usually.

http://logs.test-lab.xenproject.org/osstest/logs/185252/

If you look in several "serial-$host.log*" files, there will be the
"CPA detected" message, but they happen on previous test run.

I did an other smaller run before this one, and same thing:
http://logs.test-lab.xenproject.org/osstest/logs/185186/

And this other run as well, which I failed to setup properly with lots
of broken, but no failure due to the patch and I can't find any "CPA
detected" messages.
http://logs.test-lab.xenproject.org/osstest/logs/185248/

I hope that helps?

Cheers,


> From fd25a67d92e44b61d05d92658b23d026202a1656 Mon Sep 17 00:00:00 2001
> From: Juergen Gross <jgross@suse.com>
> To: x86@kernel.org
> To: linux-kernel@vger.kernel.org
> To: xen-devel@lists.xenproject.org
> Cc: Thomas Gleixner <tglx@linutronix.de>
> Cc: Ingo Molnar <mingo@redhat.com>
> Cc: Borislav Petkov <bp@alien8.de>
> Cc: Dave Hansen <dave.hansen@linux.intel.com>
> Cc: "H. Peter Anvin" <hpa@zytor.com>
> Cc: Andy Lutomirski <luto@kernel.org>
> Cc: Peter Zijlstra <peterz@infradead.org>
> Date: Thu, 28 Mar 2024 12:24:48 +0100
> Subject: [PATCH] x86/pat: fix W^X violation false-positives when running as
>  Xen PV guest
> 
> When running as Xen PV guest in some cases W^X violation WARN()s have
> been observed. Those WARN()s are produced by verify_rwx(), which looks
> into the PTE to verify that writable kernel pages have the NX bit set
> in order to avoid code modifications of the kernel by rogue code.
> 
> As the NX bits of all levels of translation entries are or-ed and the
> RW bits of all levels are and-ed, looking just into the PTE isn't enough
> for the decision that a writable page is executable, too. When running
> as a Xen PV guest, kernel initialization will set the NX bit in PMD
> entries of the initial page tables covering the .data segment.
> 
> When finding the PTE to have set the RW bit but no NX bit, higher level
> entries must be looked at. Only when all levels have the RW bit set and
> no NX bit set, the W^X violation should be flagged.
> 
> Additionally show_fault_oops() has a similar problem: it will issue the
> "kernel tried to execute NX-protected page" message only if it finds
> the NX bit set in the leaf translation entry, while any NX bit in
> non-leaf entries are being ignored for issuing the message.
> 
> Modify lookup_address_in_pgd() to return the effective NX and RW bit
> values of the non-leaf translation entries and evaluate those as well
> in verify_rwx() and show_fault_oops().
> 
> Fixes: 652c5bf380ad ("x86/mm: Refuse W^X violations")
> Reported-by: Jason Andryuk <jandryuk@gmail.com>
> Signed-off-by: Juergen Gross <jgross@suse.com>
> ---
>  arch/x86/include/asm/pgtable_types.h |  2 +-
>  arch/x86/kernel/sev.c                |  3 +-
>  arch/x86/mm/fault.c                  |  7 ++--
>  arch/x86/mm/pat/set_memory.c         | 56 +++++++++++++++++++++-------
>  arch/x86/virt/svm/sev.c              |  3 +-
>  5 files changed, 52 insertions(+), 19 deletions(-)
> 
> diff --git a/arch/x86/include/asm/pgtable_types.h b/arch/x86/include/asm/pgtable_types.h
> index 0b748ee16b3d..91ab538d3872 100644
> --- a/arch/x86/include/asm/pgtable_types.h
> +++ b/arch/x86/include/asm/pgtable_types.h
> @@ -565,7 +565,7 @@ static inline void update_page_count(int level, unsigned long pages) { }
>   */
>  extern pte_t *lookup_address(unsigned long address, unsigned int *level);
>  extern pte_t *lookup_address_in_pgd(pgd_t *pgd, unsigned long address,
> -				    unsigned int *level);
> +				    unsigned int *level, bool *nx, bool *rw);
>  extern pmd_t *lookup_pmd_address(unsigned long address);
>  extern phys_addr_t slow_virt_to_phys(void *__address);
>  extern int __init kernel_map_pages_in_pgd(pgd_t *pgd, u64 pfn,
> diff --git a/arch/x86/kernel/sev.c b/arch/x86/kernel/sev.c
> index b59b09c2f284..e833183d1adb 100644
> --- a/arch/x86/kernel/sev.c
> +++ b/arch/x86/kernel/sev.c
> @@ -515,12 +515,13 @@ static enum es_result vc_slow_virt_to_phys(struct ghcb *ghcb, struct es_em_ctxt
>  	unsigned long va = (unsigned long)vaddr;
>  	unsigned int level;
>  	phys_addr_t pa;
> +	bool nx, rw;
>  	pgd_t *pgd;
>  	pte_t *pte;
>  
>  	pgd = __va(read_cr3_pa());
>  	pgd = &pgd[pgd_index(va)];
> -	pte = lookup_address_in_pgd(pgd, va, &level);
> +	pte = lookup_address_in_pgd(pgd, va, &level, &nx, &rw);
>  	if (!pte) {
>  		ctxt->fi.vector     = X86_TRAP_PF;
>  		ctxt->fi.cr2        = vaddr;
> diff --git a/arch/x86/mm/fault.c b/arch/x86/mm/fault.c
> index 622d12ec7f08..eb8e897a5653 100644
> --- a/arch/x86/mm/fault.c
> +++ b/arch/x86/mm/fault.c
> @@ -514,18 +514,19 @@ show_fault_oops(struct pt_regs *regs, unsigned long error_code, unsigned long ad
>  
>  	if (error_code & X86_PF_INSTR) {
>  		unsigned int level;
> +		bool nx, rw;
>  		pgd_t *pgd;
>  		pte_t *pte;
>  
>  		pgd = __va(read_cr3_pa());
>  		pgd += pgd_index(address);
>  
> -		pte = lookup_address_in_pgd(pgd, address, &level);
> +		pte = lookup_address_in_pgd(pgd, address, &level, &nx, &rw);
>  
> -		if (pte && pte_present(*pte) && !pte_exec(*pte))
> +		if (pte && pte_present(*pte) && (!pte_exec(*pte) || nx))
>  			pr_crit("kernel tried to execute NX-protected page - exploit attempt? (uid: %d)\n",
>  				from_kuid(&init_user_ns, current_uid()));
> -		if (pte && pte_present(*pte) && pte_exec(*pte) &&
> +		if (pte && pte_present(*pte) && pte_exec(*pte) && !nx &&
>  				(pgd_flags(*pgd) & _PAGE_USER) &&
>  				(__read_cr4() & X86_CR4_SMEP))
>  			pr_crit("unable to execute userspace code (SMEP?) (uid: %d)\n",
> diff --git a/arch/x86/mm/pat/set_memory.c b/arch/x86/mm/pat/set_memory.c
> index 80c9037ffadf..baa4dc4748e9 100644
> --- a/arch/x86/mm/pat/set_memory.c
> +++ b/arch/x86/mm/pat/set_memory.c
> @@ -619,7 +619,8 @@ static inline pgprot_t static_protections(pgprot_t prot, unsigned long start,
>   * Validate strict W^X semantics.
>   */
>  static inline pgprot_t verify_rwx(pgprot_t old, pgprot_t new, unsigned long start,
> -				  unsigned long pfn, unsigned long npg)
> +				  unsigned long pfn, unsigned long npg,
> +				  bool nx, bool rw)
>  {
>  	unsigned long end;
>  
> @@ -641,6 +642,10 @@ static inline pgprot_t verify_rwx(pgprot_t old, pgprot_t new, unsigned long star
>  	if ((pgprot_val(new) & (_PAGE_RW | _PAGE_NX)) != _PAGE_RW)
>  		return new;
>  
> +	/* Non-leaf translation entries can disable writing or execution. */
> +	if (!rw || nx)
> +		return new;
> +
>  	end = start + npg * PAGE_SIZE - 1;
>  	WARN_ONCE(1, "CPA detected W^X violation: %016llx -> %016llx range: 0x%016lx - 0x%016lx PFN %lx\n",
>  		  (unsigned long long)pgprot_val(old),
> @@ -660,17 +665,22 @@ static inline pgprot_t verify_rwx(pgprot_t old, pgprot_t new, unsigned long star
>   * Return a pointer to the entry and the level of the mapping.
>   */
>  pte_t *lookup_address_in_pgd(pgd_t *pgd, unsigned long address,
> -			     unsigned int *level)
> +			     unsigned int *level, bool *nx, bool *rw)
>  {
>  	p4d_t *p4d;
>  	pud_t *pud;
>  	pmd_t *pmd;
>  
>  	*level = PG_LEVEL_NONE;
> +	*nx = false;
> +	*rw = true;
>  
>  	if (pgd_none(*pgd))
>  		return NULL;
>  
> +	*nx |= pgd_flags(*pgd) & _PAGE_NX;
> +	*rw &= pgd_flags(*pgd) & _PAGE_RW;
> +
>  	p4d = p4d_offset(pgd, address);
>  	if (p4d_none(*p4d))
>  		return NULL;
> @@ -679,6 +689,9 @@ pte_t *lookup_address_in_pgd(pgd_t *pgd, unsigned long address,
>  	if (p4d_leaf(*p4d) || !p4d_present(*p4d))
>  		return (pte_t *)p4d;
>  
> +	*nx |= p4d_flags(*p4d) & _PAGE_NX;
> +	*rw &= p4d_flags(*p4d) & _PAGE_RW;
> +
>  	pud = pud_offset(p4d, address);
>  	if (pud_none(*pud))
>  		return NULL;
> @@ -687,6 +700,9 @@ pte_t *lookup_address_in_pgd(pgd_t *pgd, unsigned long address,
>  	if (pud_leaf(*pud) || !pud_present(*pud))
>  		return (pte_t *)pud;
>  
> +	*nx |= pud_flags(*pud) & _PAGE_NX;
> +	*rw &= pud_flags(*pud) & _PAGE_RW;
> +
>  	pmd = pmd_offset(pud, address);
>  	if (pmd_none(*pmd))
>  		return NULL;
> @@ -695,6 +711,9 @@ pte_t *lookup_address_in_pgd(pgd_t *pgd, unsigned long address,
>  	if (pmd_leaf(*pmd) || !pmd_present(*pmd))
>  		return (pte_t *)pmd;
>  
> +	*nx |= pmd_flags(*pmd) & _PAGE_NX;
> +	*rw &= pmd_flags(*pmd) & _PAGE_RW;
> +
>  	*level = PG_LEVEL_4K;
>  
>  	return pte_offset_kernel(pmd, address);
> @@ -710,18 +729,24 @@ pte_t *lookup_address_in_pgd(pgd_t *pgd, unsigned long address,
>   */
>  pte_t *lookup_address(unsigned long address, unsigned int *level)
>  {
> -	return lookup_address_in_pgd(pgd_offset_k(address), address, level);
> +	bool nx, rw;
> +
> +	return lookup_address_in_pgd(pgd_offset_k(address), address, level,
> +				     &nx, &rw);
>  }
>  EXPORT_SYMBOL_GPL(lookup_address);
>  
>  static pte_t *_lookup_address_cpa(struct cpa_data *cpa, unsigned long address,
> -				  unsigned int *level)
> +				  unsigned int *level, bool *nx, bool *rw)
>  {
> -	if (cpa->pgd)
> -		return lookup_address_in_pgd(cpa->pgd + pgd_index(address),
> -					       address, level);
> +	pgd_t *pgd;
> +
> +	if (!cpa->pgd)
> +		pgd = pgd_offset_k(address);
> +	else
> +		pgd = cpa->pgd + pgd_index(address);
>  
> -	return lookup_address(address, level);
> +	return lookup_address_in_pgd(pgd, address, level, nx, rw);
>  }
>  
>  /*
> @@ -849,12 +874,13 @@ static int __should_split_large_page(pte_t *kpte, unsigned long address,
>  	pgprot_t old_prot, new_prot, req_prot, chk_prot;
>  	pte_t new_pte, *tmp;
>  	enum pg_level level;
> +	bool nx, rw;
>  
>  	/*
>  	 * Check for races, another CPU might have split this page
>  	 * up already:
>  	 */
> -	tmp = _lookup_address_cpa(cpa, address, &level);
> +	tmp = _lookup_address_cpa(cpa, address, &level, &nx, &rw);
>  	if (tmp != kpte)
>  		return 1;
>  
> @@ -965,7 +991,8 @@ static int __should_split_large_page(pte_t *kpte, unsigned long address,
>  	new_prot = static_protections(req_prot, lpaddr, old_pfn, numpages,
>  				      psize, CPA_DETECT);
>  
> -	new_prot = verify_rwx(old_prot, new_prot, lpaddr, old_pfn, numpages);
> +	new_prot = verify_rwx(old_prot, new_prot, lpaddr, old_pfn, numpages,
> +			      nx, rw);
>  
>  	/*
>  	 * If there is a conflict, split the large page.
> @@ -1046,6 +1073,7 @@ __split_large_page(struct cpa_data *cpa, pte_t *kpte, unsigned long address,
>  	pte_t *pbase = (pte_t *)page_address(base);
>  	unsigned int i, level;
>  	pgprot_t ref_prot;
> +	bool nx, rw;
>  	pte_t *tmp;
>  
>  	spin_lock(&pgd_lock);
> @@ -1053,7 +1081,7 @@ __split_large_page(struct cpa_data *cpa, pte_t *kpte, unsigned long address,
>  	 * Check for races, another CPU might have split this page
>  	 * up for us already:
>  	 */
> -	tmp = _lookup_address_cpa(cpa, address, &level);
> +	tmp = _lookup_address_cpa(cpa, address, &level, &nx, &rw);
>  	if (tmp != kpte) {
>  		spin_unlock(&pgd_lock);
>  		return 1;
> @@ -1594,10 +1622,11 @@ static int __change_page_attr(struct cpa_data *cpa, int primary)
>  	int do_split, err;
>  	unsigned int level;
>  	pte_t *kpte, old_pte;
> +	bool nx, rw;
>  
>  	address = __cpa_addr(cpa, cpa->curpage);
>  repeat:
> -	kpte = _lookup_address_cpa(cpa, address, &level);
> +	kpte = _lookup_address_cpa(cpa, address, &level, &nx, &rw);
>  	if (!kpte)
>  		return __cpa_process_fault(cpa, address, primary);
>  
> @@ -1619,7 +1648,8 @@ static int __change_page_attr(struct cpa_data *cpa, int primary)
>  		new_prot = static_protections(new_prot, address, pfn, 1, 0,
>  					      CPA_PROTECT);
>  
> -		new_prot = verify_rwx(old_prot, new_prot, address, pfn, 1);
> +		new_prot = verify_rwx(old_prot, new_prot, address, pfn, 1,
> +				      nx, rw);
>  
>  		new_prot = pgprot_clear_protnone_bits(new_prot);
>  
> diff --git a/arch/x86/virt/svm/sev.c b/arch/x86/virt/svm/sev.c
> index cffe1157a90a..9131d80cff36 100644
> --- a/arch/x86/virt/svm/sev.c
> +++ b/arch/x86/virt/svm/sev.c
> @@ -338,12 +338,13 @@ void snp_dump_hva_rmpentry(unsigned long hva)
>  {
>  	unsigned long paddr;
>  	unsigned int level;
> +	bool nx, rw;
>  	pgd_t *pgd;
>  	pte_t *pte;
>  
>  	pgd = __va(read_cr3_pa());
>  	pgd += pgd_index(hva);
> -	pte = lookup_address_in_pgd(pgd, hva, &level);
> +	pte = lookup_address_in_pgd(pgd, hva, &level, &nx, &rw);
>  
>  	if (!pte) {
>  		pr_err("Can't dump RMP entry for HVA %lx: no PTE/PFN found\n", hva);
> -- 
> 2.35.3
>
Anthony PERARD April 8, 2024, 10:25 a.m. UTC | #6
On Mon, Apr 08, 2024 at 11:22:59AM +0100, Anthony PERARD wrote:
> On Thu, Mar 28, 2024 at 02:00:14PM +0100, Jürgen Groß wrote:
> > Hi Jason,
> > 
> > On 28.03.24 02:24, Jason Andryuk wrote:
> > > On Wed, Mar 27, 2024 at 7:46 AM Jürgen Groß <jgross@suse.com> wrote:
> > > > 
> > > > On 24.01.24 17:54, Jason Andryuk wrote:
> > > > > +
> > > > > +                     return new;
> > > > > +             }
> > > > > +     }
> > > > > +
> > > > >        end = start + npg * PAGE_SIZE - 1;
> > > > >        WARN_ONCE(1, "CPA detected W^X violation: %016llx -> %016llx range: 0x%016lx - 0x%016lx PFN %lx\n",
> > > > >                  (unsigned long long)pgprot_val(old),
> > > > 
> > > > Jason, do you want to send a V2 with your Signed-off, or would you like me to
> > > > try upstreaming the patch?
> > > 
> > > Hi Jürgen,
> > > 
> > > Yes, please upstream your approach.  I wasn't sure how to deal with
> > > it, so it was more of a bug report.
> > 
> > The final solution was a bit more complicated, as there are some
> > corner cases to be considered. OTOH it is now complete by looking
> > at all used translation entries.
> > 
> > Are you able to test the attached patch? I don't see the original
> > issue and can only verify the patch doesn't cause any regression.
> > 
> > 
> > Juergen
> 
> Hi Jürgen,
> 
> I gave a try to the patch in this email with osstest, and I can't find a
> single "CPA detected W^X violation" log entry, when there's seems to be
> many in osstest in general, from dom0 it seems as it's on the host
> serial console usually.
> 
> http://logs.test-lab.xenproject.org/osstest/logs/185252/
> 
> If you look in several "serial-$host.log*" files, there will be the
> "CPA detected" message, but they happen on previous test run.
> 
> I did an other smaller run before this one, and same thing:
> http://logs.test-lab.xenproject.org/osstest/logs/185186/
> 
> And this other run as well, which I failed to setup properly with lots
> of broken, but no failure due to the patch and I can't find any "CPA
> detected" messages.
> http://logs.test-lab.xenproject.org/osstest/logs/185248/
> 
> I hope that helps?

BTW, I did apply the patch to Linux 6.1:
https://xenbits.xen.org/gitweb/?p=people/aperard/linux.git;a=shortlog;h=refs/heads/9b5a105a-650c-4b33-9f4e-03612fb6701c@suse.com
Jürgen Groß April 9, 2024, 9:26 a.m. UTC | #7
On 08.04.24 12:22, Anthony PERARD wrote:
> On Thu, Mar 28, 2024 at 02:00:14PM +0100, Jürgen Groß wrote:
>> Hi Jason,
>>
>> On 28.03.24 02:24, Jason Andryuk wrote:
>>> On Wed, Mar 27, 2024 at 7:46 AM Jürgen Groß <jgross@suse.com> wrote:
>>>>
>>>> On 24.01.24 17:54, Jason Andryuk wrote:
>>>>> +
>>>>> +                     return new;
>>>>> +             }
>>>>> +     }
>>>>> +
>>>>>         end = start + npg * PAGE_SIZE - 1;
>>>>>         WARN_ONCE(1, "CPA detected W^X violation: %016llx -> %016llx range: 0x%016lx - 0x%016lx PFN %lx\n",
>>>>>                   (unsigned long long)pgprot_val(old),
>>>>
>>>> Jason, do you want to send a V2 with your Signed-off, or would you like me to
>>>> try upstreaming the patch?
>>>
>>> Hi Jürgen,
>>>
>>> Yes, please upstream your approach.  I wasn't sure how to deal with
>>> it, so it was more of a bug report.
>>
>> The final solution was a bit more complicated, as there are some
>> corner cases to be considered. OTOH it is now complete by looking
>> at all used translation entries.
>>
>> Are you able to test the attached patch? I don't see the original
>> issue and can only verify the patch doesn't cause any regression.
>>
>>
>> Juergen
> 
> Hi Jürgen,
> 
> I gave a try to the patch in this email with osstest, and I can't find a
> single "CPA detected W^X violation" log entry, when there's seems to be
> many in osstest in general, from dom0 it seems as it's on the host
> serial console usually.
> 
> http://logs.test-lab.xenproject.org/osstest/logs/185252/
> 
> If you look in several "serial-$host.log*" files, there will be the
> "CPA detected" message, but they happen on previous test run.
> 
> I did an other smaller run before this one, and same thing:
> http://logs.test-lab.xenproject.org/osstest/logs/185186/
> 
> And this other run as well, which I failed to setup properly with lots
> of broken, but no failure due to the patch and I can't find any "CPA
> detected" messages.
> http://logs.test-lab.xenproject.org/osstest/logs/185248/
> 
> I hope that helps?

Yes, it does. Thanks for testing.


Juergen
diff mbox series

Patch

diff --git a/arch/x86/mm/pat/set_memory.c b/arch/x86/mm/pat/set_memory.c
index e9b448d1b1b7..904129b411ee 100644
--- a/arch/x86/mm/pat/set_memory.c
+++ b/arch/x86/mm/pat/set_memory.c
@@ -641,6 +641,20 @@  static inline pgprot_t verify_rwx(pgprot_t old, pgprot_t new, unsigned long star
 	if ((pgprot_val(new) & (_PAGE_RW | _PAGE_NX)) != _PAGE_RW)
 		return new;
 
+	if ((pgprot_val(new) & (_PAGE_RW | _PAGE_NX)) == _PAGE_RW) {
+		pmd_t *pmd = lookup_pmd_address(start);
+
+		if (pmd && pmd_val(*pmd) & _PAGE_NX) {
+			pr_debug_once("CPA PMD 0x%016lx NX prevents PTE W^X violation: %016llx -> %016llx range: 0x%016lx - 0x%016lx PFN %lx\n",
+				      pmd_flags(*pmd),
+				      (unsigned long long)pgprot_val(old),
+				      (unsigned long long)pgprot_val(new),
+				      start, end, pfn);
+
+			return new;
+		}
+	}
+
 	end = start + npg * PAGE_SIZE - 1;
 	WARN_ONCE(1, "CPA detected W^X violation: %016llx -> %016llx range: 0x%016lx - 0x%016lx PFN %lx\n",
 		  (unsigned long long)pgprot_val(old),