[v17,19/35] arch/mm: Export direct {un,}map functions

Message ID	20240222-gunyah-v17-19-1e9da6763d38@quicinc.com (mailing list archive)
State	New, archived
Headers	Return-Path: <linux-arm-kernel-bounces+linux-arm-kernel=archiver.kernel.org@lists.infradead.org>
	From: Elliot Berman <quic_eberman@quicinc.com>
	Date: Thu, 22 Feb 2024 15:16:42 -0800
	Subject: [PATCH v17 19/35] arch/mm: Export direct {un,}map functions
	MIME-Version: 1.0
	Message-ID: <20240222-gunyah-v17-19-1e9da6763d38@quicinc.com>
	References: <20240222-gunyah-v17-0-1e9da6763d38@quicinc.com>
	In-Reply-To: <20240222-gunyah-v17-0-1e9da6763d38@quicinc.com>
	To: Alex Elder <elder@linaro.org>, Srinivas Kandagatla <srinivas.kandagatla@linaro.org>, Murali Nalajal <quic_mnalajal@quicinc.com>, Trilok Soni <quic_tsoni@quicinc.com>, Srivatsa Vaddagiri <quic_svaddagi@quicinc.com>, Carl van Schaik <quic_cvanscha@quicinc.com>, Philip Derrin <quic_pderrin@quicinc.com>, Prakruthi Deepak Heragu <quic_pheragu@quicinc.com>, Jonathan Corbet <corbet@lwn.net>, Rob Herring <robh+dt@kernel.org>, Krzysztof Kozlowski <krzysztof.kozlowski+dt@linaro.org>, Conor Dooley <conor+dt@kernel.org>, Catalin Marinas <catalin.marinas@arm.com>, Will Deacon <will@kernel.org>, Konrad Dybcio <konrad.dybcio@linaro.org>, Bjorn Andersson <andersson@kernel.org>, Dmitry Baryshkov <dmitry.baryshkov@linaro.org>, "Fuad Tabba" <tabba@google.com>, Sean Christopherson <seanjc@google.com>, "Andrew Morton" <akpm@linux-foundation.org>
	CC: <linux-arm-msm@vger.kernel.org>, <linux-doc@vger.kernel.org>, <linux-kernel@vger.kernel.org>, <devicetree@vger.kernel.org>, <linux-arm-kernel@lists.infradead.org>, <linux-mm@kvack.org>, Elliot Berman <quic_eberman@quicinc.com>
	Precedence: list
	Content-Type: text/plain; charset="us-ascii"
	Content-Transfer-Encoding: 7bit
	Sender: "linux-arm-kernel" <linux-arm-kernel-bounces@lists.infradead.org>
	Errors-To: linux-arm-kernel-bounces+linux-arm-kernel=archiver.kernel.org@lists.infradead.org
Series	Drivers for Gunyah hypervisor

Elliot Berman Feb. 22, 2024, 11:16 p.m. UTC

Firmware and hypervisor drivers can donate system heap memory to their
respective firmware/hypervisor entities. Those drivers should unmap the
pages from the kernel's logical map before doing so.

Export can_set_direct_map, set_direct_map_invalid_noflush, and
set_direct_map_default_noflush.

Signed-off-by: Elliot Berman <quic_eberman@quicinc.com>
---
 arch/arm64/mm/pageattr.c | 3 +++
 1 file changed, 3 insertions(+)

Christoph Hellwig Feb. 23, 2024, 7:09 a.m. UTC | #1

On Thu, Feb 22, 2024 at 03:16:42PM -0800, Elliot Berman wrote:
> Firmware and hypervisor drivers can donate system heap memory to their
> respective firmware/hypervisor entities. Those drivers should unmap the
> pages from the kernel's logical map before doing so.
> 
> Export can_set_direct_map, set_direct_map_invalid_noflush, and
> set_direct_map_default_noflush.

Err, no they should not.  And not using such super low-level interfaces
from modular code.

Elliot Berman Feb. 24, 2024, 12:37 a.m. UTC | #2

On Thu, Feb 22, 2024 at 11:09:40PM -0800, Christoph Hellwig wrote:
> On Thu, Feb 22, 2024 at 03:16:42PM -0800, Elliot Berman wrote:
> > Firmware and hypervisor drivers can donate system heap memory to their
> > respective firmware/hypervisor entities. Those drivers should unmap the
> > pages from the kernel's logical map before doing so.
> > 
> > Export can_set_direct_map, set_direct_map_invalid_noflush, and
> > set_direct_map_default_noflush.
> 
> Err, no they should not.  And not using such super low-level interfaces
> from modular code.

Hi Christoph,

We've observed a few times that Linux can unintentionally access a page
we've unmapped from the host's stage 2 page table via an unaligned load from
an adjacent page. The stage 2 is managed by Gunyah. There are a few
scenarios where, even though we allocate and own a page from buddy,
someone else could try to access the page without going through the
hypervisor driver. One such instance we know about is
load_unaligned_zeropad() via path_lookupat() [1].

load_unaligned_zeropad() could be called near the end of a page. If the
next page isn't mapped by the kernel in the stage 1 page tables, then
the access to the unmapped page from load_unaligned_zeropad() will
land in __do_kernel_fault(), call fixup_exception(), and fill the
remainder of the load with zeroes. If the page in question is mapped in
stage 1 but was unmapped from stage 2, then the access lands back in
Linux in do_sea(), leading to a panic().

Our preference would be to add fixup_exception() to S2 PTW errors for
two reasons:
1. It's cheaper performance-wise: we've already manipulated the S2
   page table and prevented intentional access to the page, because
   pKVM/Gunyah drivers know that access to the page has been lost.
2. Page-granular S1 mappings only happen on arm64 with rodata=full.

In an off-list discussion with the Android pkvm folks, their preference
was to have the pages unmapped from stage 1. I've gone with that
approach to get started but welcome discussion on the best approach.

The Android (downstream) implementation of arm64 pkvm is currently
implementing a hack where s2 ptw faults are given back to the host as s1
ptw faults (i.e. __do_kernel_fault() gets called and not do_sea()) --
allowing the kernel to fix up the exception.

arm64 pKVM will also face this issue when implementing guest_memfd or
when donating more memory to the hyp for s2 page tables, etc. As far as
I can tell, this isn't an issue for arm64 pKVM today because memory
isn't being dynamically donated to the hypervisor.

Thanks,
Elliot

[1]:
path_lookupat+0x340/0x3228
filename_lookup+0xbc/0x1c0
__arm64_sys_newfstatat+0xb0/0x4a0
invoke_syscall+0x58/0x118
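
The zero-fill outcome described in the message above can be illustrated with a small userspace model. This is only a sketch under stated assumptions, not kernel code: model_load_zeropad() and its parameters are hypothetical names, and "mapped_len" stands in for the boundary between a mapped page and an unmapped neighbour. Bytes of the 8-byte word that fall past the boundary read as zero, which is the result fixup_exception() arranges for load_unaligned_zeropad() on a stage 1 fault.

```c
#include <stddef.h>
#include <stdint.h>
#include <string.h>

/*
 * Toy userspace model (hypothetical, not kernel code): read an 8-byte word
 * starting at buf[off], where only the first mapped_len bytes of buf are
 * treated as mapped. Bytes past the mapped region are not touched and read
 * as zero, mirroring the zero-fill result described for
 * load_unaligned_zeropad().
 */
static uint64_t model_load_zeropad(const unsigned char *buf, size_t mapped_len,
                                   size_t off)
{
    unsigned char word[sizeof(uint64_t)] = { 0 };
    size_t avail = off < mapped_len ? mapped_len - off : 0;
    uint64_t val;

    if (avail > sizeof(word))
        avail = sizeof(word);

    memcpy(word, buf + off, avail);  /* copy only the accessible bytes */
    memcpy(&val, word, sizeof(val)); /* reassemble in native byte order */
    return val;
}
```

For a 16-byte buffer filled with 0xab, a load at offset 13 yields three 0xab bytes followed by five zero bytes in memory order. The stage 2 case in the message has no equivalent here: the model covers only the path where the fault is fixed up.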

Christoph Hellwig Feb. 26, 2024, 11:06 a.m. UTC | #3

The point is that we can't just allow modules to unmap data from
the kernel mapping, no matter how noble your intentions are.

David Hildenbrand Feb. 26, 2024, 11:53 a.m. UTC | #4

On 26.02.24 12:06, Christoph Hellwig wrote:
> The point is that we can't just allow modules to unmap data from
> the kernel mapping, no matter how noble your intentions are.

I absolutely agree.

Elliot Berman Feb. 26, 2024, 5:27 p.m. UTC | #5

On Mon, Feb 26, 2024 at 12:53:48PM +0100, David Hildenbrand wrote:
> On 26.02.24 12:06, Christoph Hellwig wrote:
> > The point is that we can't just allow modules to unmap data from
> > the kernel mapping, no matter how noble your intentions are.
> 
> I absolutely agree.
> 

Hi David and Christoph,

Is your preference that we make Gunyah builtin-only, or that we add
fixups for S2 PTW errors (or something else)?

Also, do you extend that preference to modifying S2 mappings? This would
require any hypervisor driver that supports confidential compute
usecases to only ever be builtin.

Is your concern about unmapping data from kernel mapping, then module
being unloaded, and then having no way to recover the mapping? Would a
permanent module be better? The primary reason we were wanting to have
it as module was to avoid having driver in memory if you're not a Gunyah
guest.

Thanks,
Elliot

David Hildenbrand Feb. 27, 2024, 9:49 a.m. UTC | #6

On 26.02.24 18:27, Elliot Berman wrote:
> Hi David and Christoph,
> 
> Is your preference that we make Gunyah builtin-only, or that we add
> fixups for S2 PTW errors (or something else)?

Having that built into the kernel certainly does sound better than 
exposing that functionality to arbitrary OOT modules. But still, this 
feels like it is using a "too-low-level" interface.

> 
> Also, do you extend that preference to modifying S2 mappings? This would
> require any hypervisor driver that supports confidential compute
> usecases to only ever be builtin.
> 
> Is your concern about unmapping data from kernel mapping, then module
> being unloaded, and then having no way to recover the mapping? Would a
> permanent module be better? The primary reason we were wanting to have
> it as module was to avoid having driver in memory if you're not a Gunyah
> guest.

What I didn't grasp from this patch description: is the area where a 
driver would unmap/remap that memory somehow known ahead of time and 
limited?

How would the driver obtain that memory it would try to unmap/remap the 
direct map of? Simply allocate some pages and then unmap the direct map?

For example, we do have mm/secretmem.c, where we unmap the directmap on 
allocation and remap when freeing a page. A nice abstraction on 
alloc/free, so one cannot really do a lot of harm.

Further, we enlightened the remainder of the system about secretmem, 
such that we can detect that the directmap is no longer there. As one 
example, see the secretmem_active() check in kernel/power/hibernate.c.

A similar abstraction would make sense (I remember a discussion about 
having secretmem functionality in guest_memfd, would that help?), but 
the question is "which" memory you want to unmap the direct map of, and 
how the driver became "owner" of that memory such that it would really 
be allowed to mess with the directmap.
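
The alloc/free pairing described above for mm/secretmem.c can be sketched as a toy userspace model. Everything here is an assumption made for illustration: the model_* names, the fixed page pool, and the counter are hypothetical and do not correspond to real kernel interfaces. The point is only that callers never toggle the mapping directly; they get a page that is already unmapped and hand it back through a function that restores the mapping first, while a counter lets the rest of the system ask whether any page is currently unmapped.

```c
#include <stdbool.h>
#include <stddef.h>

#define MODEL_NR_PAGES 8

/* Hypothetical per-page state; stands in for real page-table state. */
struct model_page {
    bool allocated;
    bool in_direct_map;
};

static struct model_page model_pages[MODEL_NR_PAGES];
static size_t model_nr_unmapped; /* pages currently absent from the map */

static void model_init(void)
{
    for (size_t i = 0; i < MODEL_NR_PAGES; i++) {
        model_pages[i].allocated = false;
        model_pages[i].in_direct_map = true;
    }
    model_nr_unmapped = 0;
}

/* Allocate a page and drop it from the modelled direct map in one step. */
static struct model_page *model_alloc_unmapped(void)
{
    for (size_t i = 0; i < MODEL_NR_PAGES; i++) {
        struct model_page *p = &model_pages[i];

        if (!p->allocated) {
            p->allocated = true;
            p->in_direct_map = false;
            model_nr_unmapped++;
            return p;
        }
    }
    return NULL; /* pool exhausted */
}

/* Restore the mapping before the page goes back to the pool. */
static void model_free_remapped(struct model_page *p)
{
    if (!p || !p->allocated)
        return;
    p->in_direct_map = true;
    p->allocated = false;
    model_nr_unmapped--;
}

/* Lets other code ask whether any page is currently unmapped. */
static bool model_unmapped_active(void)
{
    return model_nr_unmapped != 0;
}
```

Because the only way to obtain an unmapped page is model_alloc_unmapped(), ownership of the page and permission to change its mapping are established at the same point, which is the question raised at the end of the message.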

Elliot Berman March 1, 2024, 1:35 a.m. UTC | #7

On Tue, Feb 27, 2024 at 10:49:32AM +0100, David Hildenbrand wrote:
> Having that built into the kernel certainly does sound better than exposing
> that functionality to arbitrary OOT modules. But still, this feels like it
> is using a "too-low-level" interface.
> 

What are your thoughts about fixing up the stage-2 fault instead? I
think this gives mmu-based isolation a slight speed boost because we
avoid modifying the kernel mapping. The hypervisor driver (KVM or Gunyah)
knows that the page isn't mapped. Whether we get an S2 or S1 fault, the
kernel is likely going to crash, except in the rare case where we want
to fix up the exception. In that case, we can modify the S2 fault handler
to call fixup_exception() when appropriate.

> > 
> > Also, do you extend that preference to modifying S2 mappings? This would
> > require any hypervisor driver that supports confidential compute
> > usecases to only ever be builtin.
> > 
> > Is your concern about unmapping data from kernel mapping, then module
> > being unloaded, and then having no way to recover the mapping? Would a
> > permanent module be better? The primary reason we were wanting to have
> > it as module was to avoid having driver in memory if you're not a Gunyah
> > guest.
> 
> What I didn't grasp from this patch description: is the area where a driver
> would unmap/remap that memory somehow known ahead of time and limited?
> 
> How would the driver obtain that memory it would try to unmap/remap the
> direct map of? Simply allocate some pages and then unmap the direct map?

That's correct.

> 
> For example, we do have mm/secretmem.c, where we unmap the directmap on
> allocation and remap when freeing a page. A nice abstraction on alloc/free,
> so one cannot really do a lot of harm.
> 
> Further, we enlightened the remainder of the system about secretmem, such
> that we can detect that the directmap is no longer there. As one example,
> see the secretmem_active() check in kernel/power/hibernate.c.
> 

I'll take a look at this. guest_memfd might be able to use PM notifiers here
instead, but I'll dig in the archives to see why secretmem isn't using that.

> A similar abstraction would make sense (I remember a discussion about having
> secretmem functionality in guest_memfd, would that help?), but the question
> is "which" memory you want to unmap the direct map of, and how the driver
> became "owner" of that memory such that it would really be allowed to mess
> with the directmap.
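
The stage-2 fixup idea raised in this reply can be summarised as a small decision model. This is a sketch under stated assumptions rather than the arm64 fault path: the enum values and model_handle_fault() are hypothetical names, and the sea_tries_fixup flag models the proposed change of consulting the exception fixup table when a stage 2 miss is delivered to the kernel as an SEA.

```c
#include <stdbool.h>

/* Hypothetical fault classes, for this model only. */
enum model_fault {
    MODEL_FAULT_S1_TRANSLATION, /* page absent from the stage 1 tables */
    MODEL_FAULT_S2_AS_SEA,      /* stage 2 miss delivered to EL1 as an SEA */
};

enum model_outcome {
    MODEL_FIXED_UP, /* fixup entry applied, execution continues */
    MODEL_FATAL,    /* no recovery, kernel dies */
};

/*
 * has_extable_entry: the faulting instruction has a fixup entry, as the
 *                    load in load_unaligned_zeropad() does.
 * sea_tries_fixup:   models the proposal of also consulting the fixup
 *                    table on the SEA path.
 */
static enum model_outcome model_handle_fault(enum model_fault fault,
                                             bool has_extable_entry,
                                             bool sea_tries_fixup)
{
    switch (fault) {
    case MODEL_FAULT_S1_TRANSLATION:
        return has_extable_entry ? MODEL_FIXED_UP : MODEL_FATAL;
    case MODEL_FAULT_S2_AS_SEA:
        if (sea_tries_fixup && has_extable_entry)
            return MODEL_FIXED_UP;
        return MODEL_FATAL;
    }
    return MODEL_FATAL;
}
```

With sea_tries_fixup set to false the model reproduces the behaviour reported earlier in the thread: the same load recovers on a stage 1 fault and is fatal on a stage 2 miss.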

Quentin Perret March 4, 2024, 1:10 p.m. UTC | #8

On Friday 23 Feb 2024 at 16:37:23 (-0800), Elliot Berman wrote:
> arm64 pKVM will also face this issue when implementing guest_memfd or
> when donating more memory to the hyp for s2 page tables, etc. As far as
> I can tell, this isn't an issue for arm64 pKVM today because memory
> isn't being dynamically donated to the hypervisor.

FWIW pKVM already donates memory dynamically to the hypervisor, to store
e.g. guest VM metadata and page-tables, and we've never seen that
problem as far as I can recall.

A key difference is that pKVM injects a data abort back into the kernel
in case of a stage-2 fault, so the whole EXTABLE trick/hack in
load_unaligned_zeropad() should work fine out of the box.

As discussed offline, Gunyah injecting an SEA into the kernel is
questionable, but I understand that the architecture is a bit lacking in
this department, and that's probably the next best thing.

Could the Gunyah driver allocate from a CMA region instead? That would
surely simplify unmapping from EL1 stage-1 (similar to how drivers
usually donate memory to TZ).

Thanks,
Quentin

Elliot Berman March 4, 2024, 11:37 p.m. UTC | #9

On Mon, Mar 04, 2024 at 01:10:48PM +0000, Quentin Perret wrote:
> Could the Gunyah driver allocate from a CMA region instead? That would
> surely simplify unmapping from EL1 stage-1 (similar to how drivers
> usually donate memory to TZ).

In my opinion, CMA is overly restrictive because we'd have to define the
region up front and we don't know how much memory the virtual machines
the user wants to launch will need.

Thanks,
Elliot

Quentin Perret March 5, 2024, 3:30 p.m. UTC | #10

On Monday 04 Mar 2024 at 15:37:41 (-0800), Elliot Berman wrote:
> In my opinion, CMA is overly restrictive because we'd have to define the
> region up front and we don't know how much memory the virtual machines
> the user wants to launch will need.

I was thinking of using CMA to allocate pages needed to store guest
metadata and such at EL2, but not to back the actual guest pages
themselves. That still means overallocating somehow, but that should
hopefully be much smaller and be less of a problem?

For the actual guest pages, the gunyah variant of guestmem will have to
unmap the pages from the direct map itself, but I'd be personally happy
with making that part non-modular to avoid the issue Christoph and
others have raised.

Thanks,
Quentin

Elliot Berman March 5, 2024, 8:26 p.m. UTC | #11

On Tue, Mar 05, 2024 at 03:30:58PM +0000, Quentin Perret wrote:
> I was thinking of using CMA to allocate pages needed to store guest
> metadata and such at EL2, but not to back the actual guest pages
> themselves. That still means overallocating somehow, but that should
> hopefully be much smaller and be less of a problem?

Ah, I understand the context now. Yes, we might need to use a CMA region
when donating memory to Gunyah if we have to ensure the memory is
unmapped from stage 1, since we wouldn't use guest_memfd for that.

> 
> For the actual guest pages, the gunyah variant of guestmem will have to
> unmap the pages from the direct map itself, but I'd be personally happy

I still disagree that this is a Gunyah-specific problem. As far as we
can tell, Arm doesn't specify how EL2 can tell EL1 that its S2 page tables
couldn't give a valid translation of the IPA from stage 1. IMO,
downstream/Android pKVM is violating the spec for ESR_EL1 by using the
S1PTW bit (which is res0 for everyone except EL2 [1]) and this means
that guests need to be pKVM-enlightened. If we are adding pKVM
enlightenment in the exception handlers, can we add Gunyah enlightenment to
handle the same?

Thanks,
Elliot 

[1]: https://developer.arm.com/documentation/ddi0601/2023-12/AArch64-Registers/ESR-EL1--Exception-Syndrome-Register--EL1-?lang=en#fieldset_0-24_0_16-7_7

Quentin Perret March 6, 2024, 12:05 p.m. UTC | #12

On Tuesday 05 Mar 2024 at 12:26:59 (-0800), Elliot Berman wrote:
> I still disagree that this is a Gunyah-specific problem. As far as we
> can tell, Arm doesn't specify how EL2 can tell EL1 its S2 page tables
> couldn't give a valid translation of the IPA from stage 1. IMO,
> downstream/Android pKVM is violating spec for ESR_EL1 by using the
> S1PTW bit (which is res0 for everyone except EL2 [1]) and this means
> that guests need to be pKVM-enlightened.

Not really, in pKVM we have a very clear distinction between host Linux
and guests, and only the host needs to be enlightened. But luckily,
since pKVM is part of Linux, this is pretty much an internal kernel
thing, so we're very flexible and if the S1PTW trick ever conflicts
with something else (e.g. NV) we can fairly easily switch to another
approach. We can tolerate non-architectural tricks like that between
pKVM and host Linux because that is not ABI, but we certainly can't do
that for guests.

> If we are adding pKVM
> enlightenment in the exception handlers, can we add Gunyah enlightenment to
> handle the same?

If you mean extending the Linux SEA handler so it does what Gunyah
wants, then I'm personally not supportive of that idea since the
'contract' between Linux and Gunyah _is_ the architecture.

The only ways I could see Gunyah delegate stage-2 fault handling to
Linux cleanly is:

 - either talk to Arm to introduce a new ESR specifically for this,
   which doesn't sound entirely crazy to me;

 - or have Gunyah and Linux negotiate in software the location of the
   handlers. That probably means SDEI or equivalent which is a can of
   worms in itself I presume, and I'm not sure how feasible it would be
   for this handler to live in the Gunyah driver (that too probably
   requires exporting kernel symbols we don't want to export).

Thanks,
Quentin

Elliot Berman March 8, 2024, 7:55 p.m. UTC | #13

On Wed, Mar 06, 2024 at 12:05:38PM +0000, Quentin Perret wrote:
> On Tuesday 05 Mar 2024 at 12:26:59 (-0800), Elliot Berman wrote:
> > I still disagree that this is a Gunyah-specific problem. As far as we
> > can tell, Arm doesn't specify how EL2 can tell EL1 its S2 page tables
> > couldn't give a valid translation of the IPA from stage 1. IMO,
> > downstream/Android pKVM is violating spec for ESR_EL1 by using the
> > S1PTW bit (which is res0 for everyone except EL2 [1]) and this means
> > that guests need to be pKVM-enlightened.
> 
> Not really, in pKVM we have a very clear distinction between host Linux
> and guests, and only the host needs to be enlightened. But luckily,
> since pKVM is part of Linux, this is pretty much an internal kernel
> thing, so we're very flexible and if the S1PTW trick ever conflicts
> with something else (e.g. NV) we can fairly easily switch to another
> approach. We can tolerate non-architectural tricks like that between
> pKVM and host Linux because that is not ABI, but we certainly can't do
> that for guests.
> 
> > If we are adding pKVM
> > enlightenment in the exception handlers, can we add Gunyah enlightenment to
> > handle the same?
> 
> If you mean extending the Linux SEA handler so it does what Gunyah
> wants, then I'm personally not supportive of that idea since the
> 'contract' between Linux and Gunyah _is_ the architecture.

Fair enough. We're building out more use cases where we want to allocate
memory from buddy and donate it to some entity which unmaps it from
Linux (some entity = Gunyah or Qualcomm firmware). Video DRM is an
example we're working on. I imagine OP-TEE might eventually have
use-cases as well since pKVM is doing the same. David expressed concerns
about exporting the direct unmap functions. What kind of
framework/restrictions do we want to have instead? I don't think making
drivers like Gunyah a builtin-only module [1] (even a refactored/small
portion) is the best approach, but maybe that is what we want to do.

Thanks,
Elliot

[1]: qcom_scm_assign_mem (d/firmware/qcom/qcom_scm.ko) is an example of
a module that would have to become builtin as we upstream use cases that
lend buddy-allocated memory to firmware

Elliot Berman July 31, 2024, 10:21 p.m. UTC | #14

I wanted to revive this thread based on the mm alignment discussion for
guest_memfd.

Gunyah's guest_memfd allocates memory via filemap_alloc_folio, identical
to KVM's guest_memfd. There's a possibility of a stage-2 fault when
memory is donated to a guest VM and Linux incidentally tries to access the
donated memory with an unaligned access. This access will cause the kernel
to panic as it expects to be able to access all memory which has been
mapped in stage 1. We don't want to disallow unaligned access simply
because Gunyah drivers are enabled.

There are two options I see to prevent the stage-2 fault from crashing
the kernel: we can fix up the stage-2 fault or ensure that Linux has a
S1 table consistent with S2.

To do the latter, the obvious solution seemed to be using the
set_direct_map functions, but you and Christoph have valid concerns
about exporting this to modules since it's a low-level API. One way to
avoid exporting the symbols is to make Gunyah a built-in, but I'd like
to find a better solution.
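
As a rough illustration of the two options, here is a toy userspace model (not kernel code). The names `page_model`, `host_access`, `insn_has_fixup` and `s2_fixup` are hypothetical and exist only for this sketch. It assumes that a stage-1 fault on an instruction with an exception-table entry is already recoverable, and that a stage-2 fault is fatal unless the stage-2 handler is taught to consult the same table.

```c
#include <assert.h>
#include <stdbool.h>
#include <stddef.h>

/* Hypothetical view of one page from host Linux (names are illustrative). */
struct page_model {
	bool s1_mapped; /* still present in the stage-1 direct map */
	bool s2_mapped; /* host still has stage-2 access, i.e. not donated */
};

enum access_result { ACCESS_OK, ACCESS_FIXED_UP, ACCESS_PANIC };

/*
 * insn_has_fixup: the faulting instruction has an exception-table entry,
 *                 as load_unaligned_zeropad() does.
 * s2_fixup:       option 1, the stage-2 fault handler also consults the
 *                 exception table (an assumption of this model).
 */
static enum access_result host_access(const struct page_model *p,
				      bool insn_has_fixup, bool s2_fixup)
{
	assert(p != NULL);

	/* Stage-1 fault: the existing fixup path applies. */
	if (!p->s1_mapped)
		return insn_has_fixup ? ACCESS_FIXED_UP : ACCESS_PANIC;

	/* Stage-2 fault: fatal unless option 1 is in place. */
	if (!p->s2_mapped)
		return (insn_has_fixup && s2_fixup) ? ACCESS_FIXED_UP
						    : ACCESS_PANIC;

	return ACCESS_OK;
}
```

In this model a donated page that is still mapped at stage 1 panics on access unless `s2_fixup` is set (the first option), while clearing `s1_mapped` at donation time (the second option) routes the same access through the ordinary stage-1 fixup.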

One way I can think of is to create a "guest_memfd library" that both
KVM and Gunyah can use. It abstracts the common bits between the 2 into
a built-in module and can be the one to call the set_direct_map
functions. I also think the abstraction will help keep KVM
guest_memfd cleaner once we start supporting huge folios (and splitting
them). Do KVM and mm folks also see value in using a library-fied
guest_memfd?
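
A minimal userspace sketch of the shape such a library could take, assuming a fixed pool of pages tracked by index. Every name here (`gmem_lib_donate`, `gmem_lib_reclaim`, `gmem_lib_host_can_access`, `MODEL_NR_PAGES`) is hypothetical and not an existing kernel interface. The point it tries to show is only that the low-level unmap/remap primitives stay private to the built-in part, and a hypervisor driver sees just a paired donate/reclaim interface.

```c
#include <assert.h>
#include <stdbool.h>
#include <stddef.h>

#define MODEL_NR_PAGES 8

/* Toy stand-in for the direct map: true means the page is unmapped. */
static bool direct_unmapped[MODEL_NR_PAGES];
/* Pages currently owned by a guest. */
static bool donated[MODEL_NR_PAGES];

/*
 * Private primitives. In the kernel these would correspond to the
 * set_direct_map functions, which would remain unexported.
 */
static void model_unmap_direct(size_t pfn) { direct_unmapped[pfn] = true; }
static void model_remap_direct(size_t pfn) { direct_unmapped[pfn] = false; }

/* Donate a page to a guest; returns 0 on success, -1 on misuse. */
static int gmem_lib_donate(size_t pfn)
{
	if (pfn >= MODEL_NR_PAGES || donated[pfn])
		return -1;
	model_unmap_direct(pfn);
	donated[pfn] = true;
	return 0;
}

/* Take a page back from a guest; returns 0 on success, -1 on misuse. */
static int gmem_lib_reclaim(size_t pfn)
{
	if (pfn >= MODEL_NR_PAGES || !donated[pfn])
		return -1;
	donated[pfn] = false;
	model_remap_direct(pfn);
	return 0;
}

/* The host may touch a page only while it is in the direct map. */
static bool gmem_lib_host_can_access(size_t pfn)
{
	assert(pfn >= MODEL_NR_PAGES || donated[pfn] == direct_unmapped[pfn]);
	return pfn < MODEL_NR_PAGES && !direct_unmapped[pfn];
}
```

Because a driver can only reach the direct map through the paired calls, it cannot unmap memory it has not donated or leave a reclaimed page unmapped, which is the kind of restriction the alloc/free abstraction in mm/secretmem.c provides.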

Thanks,
Elliot

On Thu, Feb 29, 2024 at 05:35:45PM -0800, Elliot Berman wrote:
> On Tue, Feb 27, 2024 at 10:49:32AM +0100, David Hildenbrand wrote:
> > On 26.02.24 18:27, Elliot Berman wrote:
> > > On Mon, Feb 26, 2024 at 12:53:48PM +0100, David Hildenbrand wrote:
> > > > On 26.02.24 12:06, Christoph Hellwig wrote:
> > > > > The point is that we can't just allow modules to unmap data from
> > > > > the kernel mapping, no matter how noble your intentions are.
> > > > 
> > > > I absolutely agree.
> > > > 
> > > 
> > > Hi David and Christoph,
> > > 
> > > Are your preferences that we should make Gunyah builtin only or should add
> > > fixing up S2 PTW errors (or something else)?
> > 
> > Having that built into the kernel certainly does sound better than exposing
> > that functionality to arbitrary OOT modules. But still, this feels like it
> > is using a "too-low-level" interface.
> > 
> 
> What are your thoughts about fixing up the stage-2 fault instead? I
> think this gives mmu-based isolation a slight speed boost because we
> avoid modifying kernel mapping. The hypervisor driver (KVM or Gunyah)
> knows that the page isn't mapped. Whether we get S2 or S1 fault, the
> kernel is likely going to crash, except in the rare case where we want
> to fix the exception. In that case, we can modify the S2 fault handler
> to call fixup_exception() when appropriate.
> 
> > > 
> > > Also, do you extend that preference to modifying S2 mappings? This would
> > > require any hypervisor driver that supports confidential compute
> > > usecases to only ever be builtin.
> > > 
> > > Is your concern about unmapping data from kernel mapping, then module
> > > being unloaded, and then having no way to recover the mapping? Would a
> > > permanent module be better? The primary reason we were wanting to have
> > > it as module was to avoid having driver in memory if you're not a Gunyah
> > > guest.
> > 
> > What I didn't grasp from this patch description: is the area where a driver
> > would unmap/remap that memory somehow known ahead of time and limited?
> > 
> > How would the driver obtain that memory it would try to unmap/remap the
> > direct map of? Simply allocate some pages and then unmap the direct map?
> 
> That's correct.
> 
> > 
> > For example, we do have mm/secretmem.c, where we unmap the directmap on
> > allocation and remap when freeing a page. A nice abstraction on alloc/free,
> > so one cannot really do a lot of harm.
> > 
> > Further, we enlightened the remainder of the system about secretmem, such
> > that we can detect that the directmap is no longer there. As one example,
> > see the secretmem_active() check in kernel/power/hibernate.c.
> > 
> 
> I'll take a look at this. guest_memfd might be able to use PM notifiers here
> instead, but I'll dig in the archives to see why secretmem isn't using that.
> 
> > A similar abstraction would make sense (I remember a discussion about having
> > secretmem functionality in guest_memfd, would that help?), but the question
> > is "which" memory you want to unmap the direct map of, and how the driver
> > became "owner" of that memory such that it would really be allowed to mess
> > with the directmap.
>

David Hildenbrand Aug. 1, 2024, 6:56 a.m. UTC | #15

On 01.08.24 00:21, Elliot Berman wrote:
> I wanted to revive this thread based on the mm alignment discussion for
> guest_memfd.
> 
> Gunyah's guest_memfd allocates memory via filemap_alloc_folio, identical
> to KVM's guest_memfd. There's a possibility of a stage-2 fault when
> memory is donated to a guest VM and Linux incidentally tries to access the
> donated memory with an unaligned access. This access will cause the kernel
> to panic as it expects to be able to access all memory which has been
> mapped in stage 1. We don't want to disallow unaligned access simply
> because Gunyah drivers are enabled.
> 
> There are two options I see to prevent the stage-2 fault from crashing
> the kernel: we can fix up the stage-2 fault or ensure that Linux has a
> S1 table consistent with S2.
> 
> To do the latter, the obvious solution seemed to be using the
> set_direct_map functions, but you and Christoph have valid concerns
> about exporting this to modules since it's a low-level API. One way to
> avoid exporting the symbols is to make Gunyah a built-in, but I'd like
> to find a better solution.
> 
> One way I can think of is to create a "guest_memfd library" that both
> KVM and Gunyah can use. It abstracts the common bits between the 2 into
> a built-in module and can be the one to call the set_direct_map
> functions. I also think the abstraction will help keep KVM
> guest_memfd cleaner once we start supporting huge folios (and splitting
> them). Do KVM and mm folks also see value in using a library-fied
> guest_memfd?

Without knowing about any details, this sounds like the right approach 
to me!
