
[Part2,RFC,v4,07/40] x86/sev: Split the physmap when adding the page in RMP table

Message ID 20210707183616.5620-8-brijesh.singh@amd.com (mailing list archive)
State Not Applicable
Delegated to: Herbert Xu
Series Add AMD Secure Nested Paging (SEV-SNP) Hypervisor Support

Commit Message

Brijesh Singh July 7, 2021, 6:35 p.m. UTC
The integrity guarantee of SEV-SNP is enforced through the RMP table.
The RMP is used in conjunction with standard x86 and IOMMU page
tables to enforce memory restrictions and page access rights. The
RMP is indexed by system physical address, and is checked at the end
of CPU and IOMMU table walks. The RMP check is enforced as soon as
SEV-SNP is enabled globally in the system. Not every memory access
requires an RMP check. In particular, read accesses from the
hypervisor do not require RMP checks because data confidentiality
is already protected via memory encryption. When hardware encounters
an RMP check failure, it raises a page-fault exception. The RMP bit in
the page-fault error code can be used to determine whether the fault
was due to an RMP check failure.
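
For illustration, a fault handler could key off that bit as in the small
sketch below; X86_PF_RMP (bit 31 of the page-fault error code) is defined
elsewhere in the SNP patches, so treat the name as an assumption here:

	/* Illustrative only: X86_PF_RMP is assumed to be defined by the
	 * SNP patches as bit 31 of the page-fault error code. */
	static bool fault_is_rmp_violation(unsigned long error_code)
	{
		return !!(error_code & X86_PF_RMP);
	}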

A write from the hypervisor goes through the RMP checks. When the
hypervisor writes to a page, hardware checks that the assigned
bit in the corresponding RMP entry is zero (i.e. the page is shared). If
the page table entry that produces the sPA indicates that the target page
size is a large page, then the RMP entries for all 4KB pages constituting
that large page must have the assigned bit clear. If any of those entries
has the assigned bit set, hardware raises an RMP violation. To resolve it,
split the page table entry leading to the target page into 4K entries.
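
Put differently, the hardware check for a host write through a 2M mapping
can be modeled as in the sketch below. This is conceptual only -- the real
check is performed by hardware during the table walk -- and
rmp_entry_assigned() is a hypothetical helper used for the example:

	/* Conceptual model of the hardware check, not kernel code. */
	static bool host_write_allowed_2m(unsigned long spa_2m_base)
	{
		unsigned int i;

		/* All 512 constituent 4K RMP entries must be unassigned. */
		for (i = 0; i < 512; i++)
			if (rmp_entry_assigned(spa_2m_base + i * PAGE_SIZE))
				return false;	/* would raise an RMP violation */

		return true;
	}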

This poses a challenge in the Linux memory model. The Linux kernel
creates a direct mapping of all physical memory -- referred to as
the physmap. The physmap may contain a valid mapping of guest-owned pages.
During the page table walk, a host access may run into the situation
where one of the pages within a large page is owned by the guest (i.e.
its assigned bit is set in the RMP). A write to a non-guest page within
that large page will then raise an RMP violation. Call set_memory_4k() to
split the physmap before adding the page to the RMP table. This ensures
that pages added to the RMP table are mapped as 4K in the physmap.

Signed-off-by: Brijesh Singh <brijesh.singh@amd.com>
---
 arch/x86/kernel/sev.c | 6 ++++++
 1 file changed, 6 insertions(+)

Comments

Sean Christopherson July 14, 2021, 10:25 p.m. UTC | #1
On Wed, Jul 07, 2021, Brijesh Singh wrote:
> The integrity guarantee of SEV-SNP is enforced through the RMP table.
> The RMP is used in conjuntion with standard x86 and IOMMU page
> tables to enforce memory restrictions and page access rights. The
> RMP is indexed by system physical address, and is checked at the end
> of CPU and IOMMU table walks. The RMP check is enforced as soon as
> SEV-SNP is enabled globally in the system. Not every memory access
> requires an RMP check. In particular, the read accesses from the
> hypervisor do not require RMP checks because the data confidentiality
> is already protected via memory encryption. When hardware encounters
> an RMP checks failure, it raise a page-fault exception. The RMP bit in
> fault error code can be used to determine if the fault was due to an
> RMP checks failure.
> 
> A write from the hypervisor goes through the RMP checks. When the
> hypervisor writes to pages, hardware checks to ensures that the assigned
> bit in the RMP is zero (i.e page is shared). If the page table entry that
> gives the sPA indicates that the target page size is a large page, then
> all RMP entries for the 4KB constituting pages of the target must have the
> assigned bit 0. If one of entry does not have assigned bit 0 then hardware
> will raise an RMP violation. To resolve it, split the page table entry
> leading to target page into 4K.

Isn't the above just saying:

  All RMP entries covered by a large page must match the shared vs. encrypted
  state of the page, e.g. host large pages must have assigned=0 for all relevant
  RMP entries.

> This poses a challenge in the Linux memory model. The Linux kernel
> creates a direct mapping of all the physical memory -- referred to as
> the physmap. The physmap may contain a valid mapping of guest owned pages.
> During the page table walk, the host access may get into the situation
> where one of the pages within the large page is owned by the guest (i.e
> assigned bit is set in RMP). A write to a non-guest within the large page
> will raise an RMP violation. Call set_memory_4k() to split the physmap
> before adding the page in the RMP table. This ensures that the pages
> added in the RMP table are used as 4K in the physmap.
> 
> Signed-off-by: Brijesh Singh <brijesh.singh@amd.com>
> ---
>  arch/x86/kernel/sev.c | 6 ++++++
>  1 file changed, 6 insertions(+)
> 
> diff --git a/arch/x86/kernel/sev.c b/arch/x86/kernel/sev.c
> index 949efe530319..a482e01f880a 100644
> --- a/arch/x86/kernel/sev.c
> +++ b/arch/x86/kernel/sev.c
> @@ -2375,6 +2375,12 @@ int rmpupdate(struct page *page, struct rmpupdate *val)
>  	if (!cpu_feature_enabled(X86_FEATURE_SEV_SNP))
>  		return -ENXIO;
>  
> +	ret = set_memory_4k((unsigned long)page_to_virt(page), 1);

IIUC, this shatters the direct map for a page that's assigned to an SNP guest, and
the large pages are never recovered?

I believe a better approach would be to do something similar to memfd_secret[*],
which encountered a similar problem with the direct map.  Instead of forcing the
direct map to be forever 4k, unmap the direct map when making a page guest private,
and restore the direct map when it's made shared (or freed).

I thought memfd_secret had also solved the problem of restoring large pages in
the direct map, but at a glance I can't tell if that's actually implemented
anywhere.  But, even if it's not currently implemented, I think it makes sense
to mimic the memfd_secret approach so that both features can benefit if large
page preservation/restoration is ever added.

[*] https://lkml.kernel.org/r/20210518072034.31572-5-rppt@kernel.org

> +	if (ret) {
> +		pr_err("Failed to split physical address 0x%lx (%d)\n", spa, ret);
> +		return ret;
> +	}
> +
>  	/* Retry if another processor is modifying the RMP entry. */
>  	do {
>  		/* Binutils version 2.36 supports the RMPUPDATE mnemonic. */
> -- 
> 2.17.1
>
Brijesh Singh July 15, 2021, 5:05 p.m. UTC | #2
On 7/14/21 5:25 PM, Sean Christopherson wrote:
>> A write from the hypervisor goes through the RMP checks. When the
>> hypervisor writes to pages, hardware checks to ensures that the assigned
>> bit in the RMP is zero (i.e page is shared). If the page table entry that
>> gives the sPA indicates that the target page size is a large page, then
>> all RMP entries for the 4KB constituting pages of the target must have the
>> assigned bit 0. If one of entry does not have assigned bit 0 then hardware
>> will raise an RMP violation. To resolve it, split the page table entry
>> leading to target page into 4K.
> 
> Isn't the above just saying:
> 
>    All RMP entries covered by a large page must match the shared vs. encrypted
>    state of the page, e.g. host large pages must have assigned=0 for all relevant
>    RMP entries.
> 

Yes.


>> @@ -2375,6 +2375,12 @@ int rmpupdate(struct page *page, struct rmpupdate *val)
>>   	if (!cpu_feature_enabled(X86_FEATURE_SEV_SNP))
>>   		return -ENXIO;
>>   
>> +	ret = set_memory_4k((unsigned long)page_to_virt(page), 1);
> 
> IIUC, this shatters the direct map for page that's assigned to an SNP guest, and
> the large pages are never recovered?
> 
> I believe a better approach would be to do something similar to memfd_secret[*],
> which encountered a similar problem with the direct map.  Instead of forcing the
> direct map to be forever 4k, unmap the direct map when making a page guest private,
> and restore the direct map when it's made shared (or freed).
> 
> I thought memfd_secret had also solved the problem of restoring large pages in
> the direct map, but at a glance I can't tell if that's actually implemented
> anywhere.  But, even if it's not currently implemented, I think it makes sense
> to mimic the memfd_secret approach so that both features can benefit if large
> page preservation/restoration is ever added.
> 

thanks for the memfd_secret pointer. At the lowest level it shares the
same logic to split the physmap: we both end up calling
change_page_attrs_set_clr(), which splits the page and updates the page
table attributes.

Given this, I believe that if change_page_attrs_set_clr() is enhanced in
the future to track the splitting of pages and restore them later, it
should work transparently.

thanks
Sean Christopherson July 15, 2021, 5:51 p.m. UTC | #3
On Thu, Jul 15, 2021, Brijesh Singh wrote:
> 
> On 7/14/21 5:25 PM, Sean Christopherson wrote:
> > > @@ -2375,6 +2375,12 @@ int rmpupdate(struct page *page, struct rmpupdate *val)
> > >   	if (!cpu_feature_enabled(X86_FEATURE_SEV_SNP))
> > >   		return -ENXIO;
> > > +	ret = set_memory_4k((unsigned long)page_to_virt(page), 1);
> > 
> > IIUC, this shatters the direct map for page that's assigned to an SNP guest, and
> > the large pages are never recovered?
> > 
> > I believe a better approach would be to do something similar to memfd_secret[*],
> > which encountered a similar problem with the direct map.  Instead of forcing the
> > direct map to be forever 4k, unmap the direct map when making a page guest private,
> > and restore the direct map when it's made shared (or freed).
> > 
> > I thought memfd_secret had also solved the problem of restoring large pages in
> > the direct map, but at a glance I can't tell if that's actually implemented
> > anywhere.  But, even if it's not currently implemented, I think it makes sense
> > to mimic the memfd_secret approach so that both features can benefit if large
> > page preservation/restoration is ever added.
> > 
> 
> thanks for the memfd_secrets pointer. At the lowest level it shares the
> same logic to split the physmap. We both end up calling to
> change_page_attrs_set_clr() which split the page and updates the page
> table attributes.
> 
> Given this, I believe in future if the change_page_attrs_set_clr() is
> enhanced to track the splitting of the pages and restore it later then it
> should work transparently.

But something actually needs to initiate the restore.  If the RMPUPDATE path just
forces 4k pages then there will never be a restore.  And zapping the direct map
for private pages is a good thing, e.g. it prevents the kernel from reading garbage,
which IIUC isn't enforced by the RMP?
Brijesh Singh July 15, 2021, 6:14 p.m. UTC | #4
On 7/15/21 12:51 PM, Sean Christopherson wrote:
> On Thu, Jul 15, 2021, Brijesh Singh wrote:
>>
>> On 7/14/21 5:25 PM, Sean Christopherson wrote:
>>>> @@ -2375,6 +2375,12 @@ int rmpupdate(struct page *page, struct rmpupdate *val)
>>>>    	if (!cpu_feature_enabled(X86_FEATURE_SEV_SNP))
>>>>    		return -ENXIO;
>>>> +	ret = set_memory_4k((unsigned long)page_to_virt(page), 1);
>>>
>>> IIUC, this shatters the direct map for page that's assigned to an SNP guest, and
>>> the large pages are never recovered?
>>>
>>> I believe a better approach would be to do something similar to memfd_secret[*],
>>> which encountered a similar problem with the direct map.  Instead of forcing the
>>> direct map to be forever 4k, unmap the direct map when making a page guest private,
>>> and restore the direct map when it's made shared (or freed).
>>>
>>> I thought memfd_secret had also solved the problem of restoring large pages in
>>> the direct map, but at a glance I can't tell if that's actually implemented
>>> anywhere.  But, even if it's not currently implemented, I think it makes sense
>>> to mimic the memfd_secret approach so that both features can benefit if large
>>> page preservation/restoration is ever added.
>>>
>>
>> thanks for the memfd_secrets pointer. At the lowest level it shares the
>> same logic to split the physmap. We both end up calling to
>> change_page_attrs_set_clr() which split the page and updates the page
>> table attributes.
>>
>> Given this, I believe in future if the change_page_attrs_set_clr() is
>> enhanced to track the splitting of the pages and restore it later then it
>> should work transparently.
> 
> But something actually needs to initiate the restore.  If the RMPUDATE path just
> force 4k pages then there will never be a restore.  And zapping the direct map
> for private pages is a good thing, e.g. prevents the kernel from reading garbage,
> which IIUC isn't enforced by the RMP?
> 

Yes, something needs to initiate the restore. Since restore support
is not present today, it's difficult to say how it will look. I am
thinking that the restore thread may use some kind of notifier to check
with the caller whether it's safe to restore the page ranges. In the case
of SEV-SNP, the SNP-registered notifier will reject the restore if the
guest is running.

memfd_secret uses set_direct_map_{invalid,default}_noflush(), which is
designed to remove/add the present bit in the direct map. We can't use
these, because in our case the page may get accessed by KVM
(e.g. kvm_guest_write, kvm_guest_map, etc.).

thanks
Sean Christopherson July 15, 2021, 6:39 p.m. UTC | #5
On Thu, Jul 15, 2021, Brijesh Singh wrote:
> The memfd_secrets uses the set_direct_map_{invalid,default}_noflush() and it
> is designed to remove/add the present bit in the direct map. We can't use
> them, because in our case the page may get accessed by the KVM (e.g
> kvm_guest_write, kvm_guest_map etc).

But KVM should never access a guest private page, i.e. the direct map should
always be restored to PRESENT before KVM attempts to access the page.
Brijesh Singh July 15, 2021, 7:38 p.m. UTC | #6
On 7/15/21 1:39 PM, Sean Christopherson wrote:
> On Thu, Jul 15, 2021, Brijesh Singh wrote:
>> The memfd_secrets uses the set_direct_map_{invalid,default}_noflush() and it
>> is designed to remove/add the present bit in the direct map. We can't use
>> them, because in our case the page may get accessed by the KVM (e.g
>> kvm_guest_write, kvm_guest_map etc).
> 
> But KVM should never access a guest private page, i.e. the direct map should
> always be restored to PRESENT before KVM attempts to access the page.
> 

Yes, KVM should *never* access the guest private pages. So, we could
potentially enhance RMPUPDATE() to check the assigned bit and act
accordingly.

Are you thinking something along the lines of this:

int rmpupdate(struct page *page, struct rmpupdate *val)
{
	...

	/*
	 * If the page is getting assigned in the RMP entry then unmap
	 * it from the direct map before it's added in the RMP table.
	 */
	if (val->assigned)
		set_direct_map_invalid_noflush(page_to_virt(page), 1);

	...

	/*
	 * If the page is getting unassigned then restore the mapping
	 * in the direct map after it's removed from the RMP table.
	 */
	if (!val->assigned)
		set_direct_map_default_noflush(page_to_virt(page), 1);

	...
}

thanks
Sean Christopherson July 15, 2021, 10:01 p.m. UTC | #7
On Thu, Jul 15, 2021, Brijesh Singh wrote:
> 
> 
> On 7/15/21 1:39 PM, Sean Christopherson wrote:
> > On Thu, Jul 15, 2021, Brijesh Singh wrote:
> > > The memfd_secrets uses the set_direct_map_{invalid,default}_noflush() and it
> > > is designed to remove/add the present bit in the direct map. We can't use
> > > them, because in our case the page may get accessed by the KVM (e.g
> > > kvm_guest_write, kvm_guest_map etc).
> > 
> > But KVM should never access a guest private page, i.e. the direct map should
> > always be restored to PRESENT before KVM attempts to access the page.
> > 
> 
> Yes, KVM should *never* access the guest private pages. So, we could
> potentially enhance the RMPUPDATE() to check for the assigned and act
> accordingly.
> 
> Are you thinking something along the line of this:
> 
> int rmpupdate(struct page *page, struct rmpupdate *val)
> {
> 	...
> 	
> 	/*
> 	 * If page is getting assigned in the RMP entry then unmap
> 	 * it from the direct map before its added in the RMP table.
> 	 */
> 	if (val.assigned)
> 		set_direct_map_invalid_noflush(page_to_virt(page), 1);
> 
> 	...
> 
> 	/*
> 	 * If the page is getting unassigned then restore the mapping
> 	 * in the direct map after its removed from the RMP table.
> 	 */
> 	if (!val.assigned)
> 		set_direct_map_default_noflush(page_to_virt(page), 1);
> 	
> 	...
> }

Yep.

However, looking at the KVM usage, rmpupdate() appears to be broken.  When
handling a page state change, the guest can specify a 2mb page.  In that case,
rmpupdate() will be called once for a 2mb page, but this flow assumes a single
4kb page.  The current code works because set_memory_4k() will cause the entire
2mb page to be shattered, but it's technically wrong and switching to the above
would cause problems.
Brijesh Singh July 15, 2021, 10:11 p.m. UTC | #8
On 7/15/21 5:01 PM, Sean Christopherson wrote:
> On Thu, Jul 15, 2021, Brijesh Singh wrote:
>>
>> On 7/15/21 1:39 PM, Sean Christopherson wrote:
>>> On Thu, Jul 15, 2021, Brijesh Singh wrote:
>>>> The memfd_secrets uses the set_direct_map_{invalid,default}_noflush() and it
>>>> is designed to remove/add the present bit in the direct map. We can't use
>>>> them, because in our case the page may get accessed by the KVM (e.g
>>>> kvm_guest_write, kvm_guest_map etc).
>>> But KVM should never access a guest private page, i.e. the direct map should
>>> always be restored to PRESENT before KVM attempts to access the page.
>>>
>> Yes, KVM should *never* access the guest private pages. So, we could
>> potentially enhance the RMPUPDATE() to check for the assigned and act
>> accordingly.
>>
>> Are you thinking something along the line of this:
>>
>> int rmpupdate(struct page *page, struct rmpupdate *val)
>> {
>> 	...
>> 	
>> 	/*
>> 	 * If page is getting assigned in the RMP entry then unmap
>> 	 * it from the direct map before its added in the RMP table.
>> 	 */
>> 	if (val.assigned)
>> 		set_direct_map_invalid_noflush(page_to_virt(page), 1);
>>
>> 	...
>>
>> 	/*
>> 	 * If the page is getting unassigned then restore the mapping
>> 	 * in the direct map after its removed from the RMP table.
>> 	 */
>> 	if (!val.assigned)
>> 		set_direct_map_default_noflush(page_to_virt(page), 1);
>> 	
>> 	...
>> }
> Yep.
>
> However, looking at the KVM usage, rmpupdate() appears to be broken.  When
> handling a page state change, the guest can specify a 2mb page.  In that case,
> rmpupdate() will be called once for a 2mb page, but this flow assumes a single
> 4kb page.  The current code works because set_memory_4k() will cause the entire
> 2mb page to be shattered, but it's technically wrong and switching to the above
> would cause problems.


Yep, this was just an example to make sure I am able to follow you
correctly. In the actual patch I am going to read the page size from the
RMPUPDATE structure and calculate npages for the
set_direct_map_default(...) call. As you said, it was not needed in the
case of set_memory_4k() because that function forcibly splits the large
page, whereas set_direct_map_default() first checks whether a split is
required; if not, it skips the split and just updates the attributes.
-Brijesh
Vlastimil Babka July 30, 2021, 11:31 a.m. UTC | #9
On 7/15/21 9:38 PM, Brijesh Singh wrote:
> 
> 
> On 7/15/21 1:39 PM, Sean Christopherson wrote:
>> On Thu, Jul 15, 2021, Brijesh Singh wrote:
>>> The memfd_secrets uses the set_direct_map_{invalid,default}_noflush() and it
>>> is designed to remove/add the present bit in the direct map. We can't use
>>> them, because in our case the page may get accessed by the KVM (e.g
>>> kvm_guest_write, kvm_guest_map etc).
>>
>> But KVM should never access a guest private page, i.e. the direct map should
>> always be restored to PRESENT before KVM attempts to access the page.
>>
> 
> Yes, KVM should *never* access the guest private pages. So, we could potentially
> enhance the RMPUPDATE() to check for the assigned and act accordingly.

I think I'm not the first one suggesting it, but IMHO the best solution would be
to leave the direct map alone (except maybe in some debugging/development mode where
you could do the unmapping to catch unexpected host accesses), and once there's
a situation with the host accessing a shared page of the guest, it would temporarily
kmap() it outside of the direct map. Shouldn't these situations be rare enough,
and already recognizable due to the need to set up the page sharing in the first
place, that this approach would be feasible?
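
For illustration, one way to realize that temporary mapping on x86-64 (where
kmap() of a lowmem page just returns its direct-map address) is a short-lived
vmap() of the shared page, so the access goes through its own 4K mapping
rather than a possibly 2M direct-map entry. This is only a sketch of the
idea, not code from this series, and host_copy_from_shared() is a
hypothetical helper name:

	/* Sketch: read a guest-shared page via a temporary 4K mapping. */
	static int host_copy_from_shared(struct page *page, void *dst, size_t len)
	{
		void *va = vmap(&page, 1, VM_MAP, PAGE_KERNEL);

		if (!va)
			return -ENOMEM;

		memcpy(dst, va, len);
		vunmap(va);
		return 0;
	}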

> Are you thinking something along the line of this:
> 
> int rmpupdate(struct page *page, struct rmpupdate *val)
> {
>     ...
>     
>     /*
>      * If page is getting assigned in the RMP entry then unmap
>      * it from the direct map before its added in the RMP table.
>      */
>     if (val.assigned)
>         set_direct_map_invalid_noflush(page_to_virt(page), 1);
> 
>     ...
> 
>     /*
>      * If the page is getting unassigned then restore the mapping
>      * in the direct map after its removed from the RMP table.
>      */
>     if (!val.assigned)
>         set_direct_map_default_noflush(page_to_virt(page), 1);
>     
>     ...
> }
> 
> thanks
Brijesh Singh July 30, 2021, 4:10 p.m. UTC | #10
On 7/30/21 6:31 AM, Vlastimil Babka wrote:
> On 7/15/21 9:38 PM, Brijesh Singh wrote:
>>
>> On 7/15/21 1:39 PM, Sean Christopherson wrote:
>>> On Thu, Jul 15, 2021, Brijesh Singh wrote:
>>>> The memfd_secrets uses the set_direct_map_{invalid,default}_noflush() and it
>>>> is designed to remove/add the present bit in the direct map. We can't use
>>>> them, because in our case the page may get accessed by the KVM (e.g
>>>> kvm_guest_write, kvm_guest_map etc).
>>> But KVM should never access a guest private page, i.e. the direct map should
>>> always be restored to PRESENT before KVM attempts to access the page.
>>>
>> Yes, KVM should *never* access the guest private pages. So, we could potentially
>> enhance the RMPUPDATE() to check for the assigned and act accordingly.
> I think I'm not the first one suggesting it, but IMHO the best solution would be
> to leave direct map alone (except maybe in some debugging/development mode where
> you could do the unmapping to catch unexpected host accesses), and once there's
> a situation with host accessing a shared page of the guest, it would temporarily
> kmap() it outside of the direct map. Shouldn't these situations be rare enough,
> and already recognizable due to the need to set up the page sharing in the first
> place, that this approach would be feasible?

The host accessing a guest shared page is not rare; some shared pages are
accessed on every VM-Entry and VM-Exit. However, there are a limited
number of such pages, so we could track them and temporarily kmap() them
outside of the direct map. The main challenge is when host pages are added
as 'private' in the RMP table.

e.g

page = alloc_page(GFP_KERNEL_ACCOUNT);

// firmware context page

rmp_make_firmware(page_to_pfn(page), PG_LEVEL_4K);

or

// VMSA page

rmp_make_private(page_to_pfn(page), 0, sev->asid, PG_LEVEL_4K);

The page is added as 4K in the RMP entry; if the pfn is part of a large
mapping in the direct map then hardware will raise an RMP violation, and
to resolve the fault we must split the direct map.

thanks

>
>> Are you thinking something along the line of this:
>>
>> int rmpupdate(struct page *page, struct rmpupdate *val)
>> {
>>     ...
>>     
>>     /*
>>      * If page is getting assigned in the RMP entry then unmap
>>      * it from the direct map before its added in the RMP table.
>>      */
>>     if (val.assigned)
>>         set_direct_map_invalid_noflush(page_to_virt(page), 1);
>>
>>     ...
>>
>>     /*
>>      * If the page is getting unassigned then restore the mapping
>>      * in the direct map after its removed from the RMP table.
>>      */
>>     if (!val.assigned)
>>         set_direct_map_default_noflush(page_to_virt(page), 1);
>>     
>>     ...
>> }
>>
>> thanks

Patch

diff --git a/arch/x86/kernel/sev.c b/arch/x86/kernel/sev.c
index 949efe530319..a482e01f880a 100644
--- a/arch/x86/kernel/sev.c
+++ b/arch/x86/kernel/sev.c
@@ -2375,6 +2375,12 @@  int rmpupdate(struct page *page, struct rmpupdate *val)
 	if (!cpu_feature_enabled(X86_FEATURE_SEV_SNP))
 		return -ENXIO;
 
+	ret = set_memory_4k((unsigned long)page_to_virt(page), 1);
+	if (ret) {
+		pr_err("Failed to split physical address 0x%lx (%d)\n", spa, ret);
+		return ret;
+	}
+
 	/* Retry if another processor is modifying the RMP entry. */
 	do {
 		/* Binutils version 2.36 supports the RMPUPDATE mnemonic. */