Message ID | 20210402152645.26680-8-kirill.shutemov@linux.intel.com (mailing list archive) |
---|---|
State | New, archived |
Headers | show |
Series | TDX and guest memory unmapping | expand |
On 02.04.21 17:26, Kirill A. Shutemov wrote: > TDX architecture aims to provide resiliency against confidentiality and > integrity attacks. Towards this goal, the TDX architecture helps enforce > the enabling of memory integrity for all TD-private memory. > > The CPU memory controller computes the integrity check value (MAC) for > the data (cache line) during writes, and it stores the MAC with the > memory as meta-data. A 28-bit MAC is stored in the ECC bits. > > Checking of memory integrity is performed during memory reads. If > integrity check fails, CPU poisones cache line. > > On a subsequent consumption (read) of the poisoned data by software, > there are two possible scenarios: > > - Core determines that the execution can continue and it treats > poison with exception semantics signaled as a #MCE > > - Core determines execution cannot continue,and it does an unbreakable > shutdown > > For more details, see Chapter 14 of Intel TDX Module EAS[1] > > As some of integrity check failures may lead to system shutdown host > kernel must not allow any writes to TD-private memory. This requirment > clashes with KVM design: KVM expects the guest memory to be mapped into > host userspace (e.g. QEMU). So what you are saying is that if QEMU would write to such memory, it could crash the kernel? What a broken design. "As some of integrity check failures may lead to system shutdown host" -- usually we expect to recover from an MCE by killing the affected process, which would be the right thing to do here. How can it happen that "Core determines execution cannot continue,and it does an unbreakable shutdown". Who is "Core"? CPU "core", MM "core" ? And why would it decide to do a shutdown instead of just killing the process?
On Tue, Apr 06, 2021 at 09:44:07AM +0200, David Hildenbrand wrote: > On 02.04.21 17:26, Kirill A. Shutemov wrote: > > TDX architecture aims to provide resiliency against confidentiality and > > integrity attacks. Towards this goal, the TDX architecture helps enforce > > the enabling of memory integrity for all TD-private memory. > > > > The CPU memory controller computes the integrity check value (MAC) for > > the data (cache line) during writes, and it stores the MAC with the > > memory as meta-data. A 28-bit MAC is stored in the ECC bits. > > > > Checking of memory integrity is performed during memory reads. If > > integrity check fails, CPU poisones cache line. > > > > On a subsequent consumption (read) of the poisoned data by software, > > there are two possible scenarios: > > > > - Core determines that the execution can continue and it treats > > poison with exception semantics signaled as a #MCE > > > > - Core determines execution cannot continue,and it does an unbreakable > > shutdown > > > > For more details, see Chapter 14 of Intel TDX Module EAS[1] > > > > As some of integrity check failures may lead to system shutdown host > > kernel must not allow any writes to TD-private memory. This requirment > > clashes with KVM design: KVM expects the guest memory to be mapped into > > host userspace (e.g. QEMU). > > So what you are saying is that if QEMU would write to such memory, it could > crash the kernel? What a broken design. Cannot disagree. #MCE for integrity check is very questionable. But I'm not CPU engineer. > "As some of integrity check failures may lead to system shutdown host" -- > usually we expect to recover from an MCE by killing the affected process, > which would be the right thing to do here. In the most cases that's what happen. > How can it happen that "Core determines execution cannot continue,and it > does an unbreakable shutdown". Who is "Core"? CPU "core", MM "core" ? CPU core. > And why would it decide to do a shutdown instead of just killing the > process? <As I said, I'm not CPU engineer. Below is my understanding of the issue.> If the CPU handles long flow instruction (involves microcode and doing multiple memory accesses), consuming poison somewhere in the middle leads to CPU not being able to get back into sane state and the only option is system shutdown.
On 4/6/21 12:44 AM, David Hildenbrand wrote: > On 02.04.21 17:26, Kirill A. Shutemov wrote: >> TDX architecture aims to provide resiliency against confidentiality and >> integrity attacks. Towards this goal, the TDX architecture helps enforce >> the enabling of memory integrity for all TD-private memory. >> >> The CPU memory controller computes the integrity check value (MAC) for >> the data (cache line) during writes, and it stores the MAC with the >> memory as meta-data. A 28-bit MAC is stored in the ECC bits. >> >> Checking of memory integrity is performed during memory reads. If >> integrity check fails, CPU poisones cache line. >> >> On a subsequent consumption (read) of the poisoned data by software, >> there are two possible scenarios: >> >> - Core determines that the execution can continue and it treats >> poison with exception semantics signaled as a #MCE >> >> - Core determines execution cannot continue,and it does an unbreakable >> shutdown >> >> For more details, see Chapter 14 of Intel TDX Module EAS[1] >> >> As some of integrity check failures may lead to system shutdown host >> kernel must not allow any writes to TD-private memory. This requirment >> clashes with KVM design: KVM expects the guest memory to be mapped into >> host userspace (e.g. QEMU). > > So what you are saying is that if QEMU would write to such memory, it > could crash the kernel? What a broken design. IMNHO, the broken design is mapping the memory to userspace in the first place. Why the heck would you actually expose something with the MMU to a context that can't possibly meaningfully access or safely write to it? This started with SEV. QEMU creates normal memory mappings with the SEV C-bit (encryption) disabled. The kernel plumbs those into NPT, but when those are instantiated, they have the C-bit set. So, we have mismatched mappings. Where does that lead? The two mappings not only differ in the encryption bit, causing one side to read gibberish if the other writes: they're not even cache coherent. That's the situation *TODAY*, even ignoring TDX. BTW, I'm pretty sure I know the answer to the "why would you expose this to userspace" question: it's what QEMU/KVM did alreadhy for non-encrypted memory, so this was the quickest way to get SEV working. So, I don't like the #MC either. But, this series is a step in the right direction for TDX *AND* SEV.
On 06.04.21 16:33, Dave Hansen wrote: > On 4/6/21 12:44 AM, David Hildenbrand wrote: >> On 02.04.21 17:26, Kirill A. Shutemov wrote: >>> TDX architecture aims to provide resiliency against confidentiality and >>> integrity attacks. Towards this goal, the TDX architecture helps enforce >>> the enabling of memory integrity for all TD-private memory. >>> >>> The CPU memory controller computes the integrity check value (MAC) for >>> the data (cache line) during writes, and it stores the MAC with the >>> memory as meta-data. A 28-bit MAC is stored in the ECC bits. >>> >>> Checking of memory integrity is performed during memory reads. If >>> integrity check fails, CPU poisones cache line. >>> >>> On a subsequent consumption (read) of the poisoned data by software, >>> there are two possible scenarios: >>> >>> - Core determines that the execution can continue and it treats >>> poison with exception semantics signaled as a #MCE >>> >>> - Core determines execution cannot continue,and it does an unbreakable >>> shutdown >>> >>> For more details, see Chapter 14 of Intel TDX Module EAS[1] >>> >>> As some of integrity check failures may lead to system shutdown host >>> kernel must not allow any writes to TD-private memory. This requirment >>> clashes with KVM design: KVM expects the guest memory to be mapped into >>> host userspace (e.g. QEMU). >> >> So what you are saying is that if QEMU would write to such memory, it >> could crash the kernel? What a broken design. > > IMNHO, the broken design is mapping the memory to userspace in the first > place. Why the heck would you actually expose something with the MMU to > a context that can't possibly meaningfully access or safely write to it? I'd say the broken design is being able to crash the machine via a simple memory write, instead of only crashing a single process in case you're doing something nasty. From the evaluation of the problem it feels like this was a CPU design workaround: instead of properly cleaning up when it gets tricky within the core, just crash the machine. And that's a CPU "feature", not a kernel "feature". Now we have to fix broken HW in the kernel - once again. However, you raise a valid point: it does not make too much sense to to map this into user space. Not arguing against that; but crashing the machine is just plain ugly. I wonder: why do we even *want* a VMA/mmap describing that memory? Sounds like: for hacking support for that memory type into QEMU/KVM. This all feels wrong, but I cannot really tell how it could be better. That memory can really only be used (right now?) with hardware virtualization from some point on. From that point on (right from the start?), there should be no VMA/mmap/page tables for user space anymore. Or am I missing something? Is there still valid user space access? > > This started with SEV. QEMU creates normal memory mappings with the SEV > C-bit (encryption) disabled. The kernel plumbs those into NPT, but when > those are instantiated, they have the C-bit set. So, we have mismatched > mappings. Where does that lead? The two mappings not only differ in > the encryption bit, causing one side to read gibberish if the other > writes: they're not even cache coherent. > > That's the situation *TODAY*, even ignoring TDX. > > BTW, I'm pretty sure I know the answer to the "why would you expose this > to userspace" question: it's what QEMU/KVM did alreadhy for > non-encrypted memory, so this was the quickest way to get SEV working. > Yes, I guess so. It was the fastest way to "hack" it into QEMU. Would we ever even want a VMA/mmap/process page tables for that memory? How could user space ever do something *not so nasty* with that memory (in the current context of VMs)?
On 4/6/21 9:33 AM, Dave Hansen wrote: > On 4/6/21 12:44 AM, David Hildenbrand wrote: >> On 02.04.21 17:26, Kirill A. Shutemov wrote: >>> TDX architecture aims to provide resiliency against confidentiality and >>> integrity attacks. Towards this goal, the TDX architecture helps enforce >>> the enabling of memory integrity for all TD-private memory. >>> >>> The CPU memory controller computes the integrity check value (MAC) for >>> the data (cache line) during writes, and it stores the MAC with the >>> memory as meta-data. A 28-bit MAC is stored in the ECC bits. >>> >>> Checking of memory integrity is performed during memory reads. If >>> integrity check fails, CPU poisones cache line. >>> >>> On a subsequent consumption (read) of the poisoned data by software, >>> there are two possible scenarios: >>> >>> - Core determines that the execution can continue and it treats >>> poison with exception semantics signaled as a #MCE >>> >>> - Core determines execution cannot continue,and it does an unbreakable >>> shutdown >>> >>> For more details, see Chapter 14 of Intel TDX Module EAS[1] >>> >>> As some of integrity check failures may lead to system shutdown host >>> kernel must not allow any writes to TD-private memory. This requirment >>> clashes with KVM design: KVM expects the guest memory to be mapped into >>> host userspace (e.g. QEMU). >> >> So what you are saying is that if QEMU would write to such memory, it >> could crash the kernel? What a broken design. > > IMNHO, the broken design is mapping the memory to userspace in the first > place. Why the heck would you actually expose something with the MMU to > a context that can't possibly meaningfully access or safely write to it? > > This started with SEV. QEMU creates normal memory mappings with the SEV > C-bit (encryption) disabled. The kernel plumbs those into NPT, but when > those are instantiated, they have the C-bit set. So, we have mismatched > mappings. Where does that lead? The two mappings not only differ in > the encryption bit, causing one side to read gibberish if the other > writes: they're not even cache coherent. QEMU is running on the hypervisor side, so even if the C-bit is set for its memory mappings, it would use the hypervisor key to access the memory, not the guest key. So it doesn't matter from a QEMU perspective whether it creates mappings with or without the C-bit. The C-bit in the NPT is only used if the guest is accessing the memory as shared/un-encrypted, in which case the the hypervisor key is then used. The latest EPYC hardware provides cache coherency for encrypted / non-encrypted accesses (X86_FEATURE_SME_COHERENT). > > That's the situation *TODAY*, even ignoring TDX. > > BTW, I'm pretty sure I know the answer to the "why would you expose this > to userspace" question: it's what QEMU/KVM did alreadhy for > non-encrypted memory, so this was the quickest way to get SEV working. > > So, I don't like the #MC either. But, this series is a step in the > right direction for TDX *AND* SEV. So, yes, this is a step in the right direction. Thanks, Tom >
On Tue, Apr 06, 2021 at 04:57:46PM +0200, David Hildenbrand wrote: > On 06.04.21 16:33, Dave Hansen wrote: > > On 4/6/21 12:44 AM, David Hildenbrand wrote: > > > On 02.04.21 17:26, Kirill A. Shutemov wrote: > > > > TDX architecture aims to provide resiliency against confidentiality and > > > > integrity attacks. Towards this goal, the TDX architecture helps enforce > > > > the enabling of memory integrity for all TD-private memory. > > > > > > > > The CPU memory controller computes the integrity check value (MAC) for > > > > the data (cache line) during writes, and it stores the MAC with the > > > > memory as meta-data. A 28-bit MAC is stored in the ECC bits. > > > > > > > > Checking of memory integrity is performed during memory reads. If > > > > integrity check fails, CPU poisones cache line. > > > > > > > > On a subsequent consumption (read) of the poisoned data by software, > > > > there are two possible scenarios: > > > > > > > > - Core determines that the execution can continue and it treats > > > > poison with exception semantics signaled as a #MCE > > > > > > > > - Core determines execution cannot continue,and it does an unbreakable > > > > shutdown > > > > > > > > For more details, see Chapter 14 of Intel TDX Module EAS[1] > > > > > > > > As some of integrity check failures may lead to system shutdown host > > > > kernel must not allow any writes to TD-private memory. This requirment > > > > clashes with KVM design: KVM expects the guest memory to be mapped into > > > > host userspace (e.g. QEMU). > > > > > > So what you are saying is that if QEMU would write to such memory, it > > > could crash the kernel? What a broken design. > > > > IMNHO, the broken design is mapping the memory to userspace in the first > > place. Why the heck would you actually expose something with the MMU to > > a context that can't possibly meaningfully access or safely write to it? > > I'd say the broken design is being able to crash the machine via a simple > memory write, instead of only crashing a single process in case you're doing > something nasty. From the evaluation of the problem it feels like this was a > CPU design workaround: instead of properly cleaning up when it gets tricky > within the core, just crash the machine. And that's a CPU "feature", not a > kernel "feature". Now we have to fix broken HW in the kernel - once again. > > However, you raise a valid point: it does not make too much sense to to map > this into user space. Not arguing against that; but crashing the machine is > just plain ugly. > > I wonder: why do we even *want* a VMA/mmap describing that memory? Sounds > like: for hacking support for that memory type into QEMU/KVM. > > This all feels wrong, but I cannot really tell how it could be better. That > memory can really only be used (right now?) with hardware virtualization > from some point on. From that point on (right from the start?), there should > be no VMA/mmap/page tables for user space anymore. > > Or am I missing something? Is there still valid user space access? There is. For IO (e.g. virtio) the guest mark a range of memory as shared (or unencrypted for AMD SEV). The range is not pre-defined. > > This started with SEV. QEMU creates normal memory mappings with the SEV > > C-bit (encryption) disabled. The kernel plumbs those into NPT, but when > > those are instantiated, they have the C-bit set. So, we have mismatched > > mappings. Where does that lead? The two mappings not only differ in > > the encryption bit, causing one side to read gibberish if the other > > writes: they're not even cache coherent. > > > > That's the situation *TODAY*, even ignoring TDX. > > > > BTW, I'm pretty sure I know the answer to the "why would you expose this > > to userspace" question: it's what QEMU/KVM did alreadhy for > > non-encrypted memory, so this was the quickest way to get SEV working. > > > > Yes, I guess so. It was the fastest way to "hack" it into QEMU. > > Would we ever even want a VMA/mmap/process page tables for that memory? How > could user space ever do something *not so nasty* with that memory (in the > current context of VMs)? In the future, the memory should be still managable by host MM: migration, swapping, etc. But it's long way there. For now, the guest memory effectively pinned on the host.
> On 7 Apr 2021, at 15:16, Kirill A. Shutemov <kirill@shutemov.name> wrote: > > On Tue, Apr 06, 2021 at 04:57:46PM +0200, David Hildenbrand wrote: >> On 06.04.21 16:33, Dave Hansen wrote: >>> On 4/6/21 12:44 AM, David Hildenbrand wrote: >>>> On 02.04.21 17:26, Kirill A. Shutemov wrote: >>>>> TDX architecture aims to provide resiliency against confidentiality and >>>>> integrity attacks. Towards this goal, the TDX architecture helps enforce >>>>> the enabling of memory integrity for all TD-private memory. >>>>> >>>>> The CPU memory controller computes the integrity check value (MAC) for >>>>> the data (cache line) during writes, and it stores the MAC with the >>>>> memory as meta-data. A 28-bit MAC is stored in the ECC bits. >>>>> >>>>> Checking of memory integrity is performed during memory reads. If >>>>> integrity check fails, CPU poisones cache line. >>>>> >>>>> On a subsequent consumption (read) of the poisoned data by software, >>>>> there are two possible scenarios: >>>>> >>>>> - Core determines that the execution can continue and it treats >>>>> poison with exception semantics signaled as a #MCE >>>>> >>>>> - Core determines execution cannot continue,and it does an unbreakable >>>>> shutdown >>>>> >>>>> For more details, see Chapter 14 of Intel TDX Module EAS[1] >>>>> >>>>> As some of integrity check failures may lead to system shutdown host >>>>> kernel must not allow any writes to TD-private memory. This requirment >>>>> clashes with KVM design: KVM expects the guest memory to be mapped into >>>>> host userspace (e.g. QEMU). >>>> >>>> So what you are saying is that if QEMU would write to such memory, it >>>> could crash the kernel? What a broken design. >>> >>> IMNHO, the broken design is mapping the memory to userspace in the first >>> place. Why the heck would you actually expose something with the MMU to >>> a context that can't possibly meaningfully access or safely write to it? >> >> I'd say the broken design is being able to crash the machine via a simple >> memory write, instead of only crashing a single process in case you're doing >> something nasty. From the evaluation of the problem it feels like this was a >> CPU design workaround: instead of properly cleaning up when it gets tricky >> within the core, just crash the machine. And that's a CPU "feature", not a >> kernel "feature". Now we have to fix broken HW in the kernel - once again. >> >> However, you raise a valid point: it does not make too much sense to to map >> this into user space. Not arguing against that; but crashing the machine is >> just plain ugly. >> >> I wonder: why do we even *want* a VMA/mmap describing that memory? Sounds >> like: for hacking support for that memory type into QEMU/KVM. >> >> This all feels wrong, but I cannot really tell how it could be better. That >> memory can really only be used (right now?) with hardware virtualization >> from some point on. From that point on (right from the start?), there should >> be no VMA/mmap/page tables for user space anymore. >> >> Or am I missing something? Is there still valid user space access? > > There is. For IO (e.g. virtio) the guest mark a range of memory as shared > (or unencrypted for AMD SEV). The range is not pre-defined. > >>> This started with SEV. QEMU creates normal memory mappings with the SEV >>> C-bit (encryption) disabled. The kernel plumbs those into NPT, but when >>> those are instantiated, they have the C-bit set. So, we have mismatched >>> mappings. Where does that lead? The two mappings not only differ in >>> the encryption bit, causing one side to read gibberish if the other >>> writes: they're not even cache coherent. >>> >>> That's the situation *TODAY*, even ignoring TDX. >>> >>> BTW, I'm pretty sure I know the answer to the "why would you expose this >>> to userspace" question: it's what QEMU/KVM did alreadhy for >>> non-encrypted memory, so this was the quickest way to get SEV working. >>> >> >> Yes, I guess so. It was the fastest way to "hack" it into QEMU. >> >> Would we ever even want a VMA/mmap/process page tables for that memory? How >> could user space ever do something *not so nasty* with that memory (in the >> current context of VMs)? > > In the future, the memory should be still managable by host MM: migration, > swapping, etc. But it's long way there. For now, the guest memory > effectively pinned on the host. Is there even a theoretical way to restore an encrypted page e.g. from (host) swap without breaking the integrity check? Or will that only be possible with assistance from within the encrypted enclave? > > -- > Kirill A. Shutemov >
Christophe de Dinechin <cdupontd@redhat.com> writes: > Is there even a theoretical way to restore an encrypted page e.g. from (host) > swap without breaking the integrity check? Or will that only be possible with > assistance from within the encrypted enclave? Only the later. You would need balloning. It's in principle possible, but currently not implemented. In general host swap without balloning is usually a bad idea anyways because it often just swaps a lot of cache data that could easily be thrown away instead. -andi
On 07.04.21 15:16, Kirill A. Shutemov wrote: > On Tue, Apr 06, 2021 at 04:57:46PM +0200, David Hildenbrand wrote: >> On 06.04.21 16:33, Dave Hansen wrote: >>> On 4/6/21 12:44 AM, David Hildenbrand wrote: >>>> On 02.04.21 17:26, Kirill A. Shutemov wrote: >>>>> TDX architecture aims to provide resiliency against confidentiality and >>>>> integrity attacks. Towards this goal, the TDX architecture helps enforce >>>>> the enabling of memory integrity for all TD-private memory. >>>>> >>>>> The CPU memory controller computes the integrity check value (MAC) for >>>>> the data (cache line) during writes, and it stores the MAC with the >>>>> memory as meta-data. A 28-bit MAC is stored in the ECC bits. >>>>> >>>>> Checking of memory integrity is performed during memory reads. If >>>>> integrity check fails, CPU poisones cache line. >>>>> >>>>> On a subsequent consumption (read) of the poisoned data by software, >>>>> there are two possible scenarios: >>>>> >>>>> - Core determines that the execution can continue and it treats >>>>> poison with exception semantics signaled as a #MCE >>>>> >>>>> - Core determines execution cannot continue,and it does an unbreakable >>>>> shutdown >>>>> >>>>> For more details, see Chapter 14 of Intel TDX Module EAS[1] >>>>> >>>>> As some of integrity check failures may lead to system shutdown host >>>>> kernel must not allow any writes to TD-private memory. This requirment >>>>> clashes with KVM design: KVM expects the guest memory to be mapped into >>>>> host userspace (e.g. QEMU). >>>> >>>> So what you are saying is that if QEMU would write to such memory, it >>>> could crash the kernel? What a broken design. >>> >>> IMNHO, the broken design is mapping the memory to userspace in the first >>> place. Why the heck would you actually expose something with the MMU to >>> a context that can't possibly meaningfully access or safely write to it? >> >> I'd say the broken design is being able to crash the machine via a simple >> memory write, instead of only crashing a single process in case you're doing >> something nasty. From the evaluation of the problem it feels like this was a >> CPU design workaround: instead of properly cleaning up when it gets tricky >> within the core, just crash the machine. And that's a CPU "feature", not a >> kernel "feature". Now we have to fix broken HW in the kernel - once again. >> >> However, you raise a valid point: it does not make too much sense to to map >> this into user space. Not arguing against that; but crashing the machine is >> just plain ugly. >> >> I wonder: why do we even *want* a VMA/mmap describing that memory? Sounds >> like: for hacking support for that memory type into QEMU/KVM. >> >> This all feels wrong, but I cannot really tell how it could be better. That >> memory can really only be used (right now?) with hardware virtualization >> from some point on. From that point on (right from the start?), there should >> be no VMA/mmap/page tables for user space anymore. >> >> Or am I missing something? Is there still valid user space access? > > There is. For IO (e.g. virtio) the guest mark a range of memory as shared > (or unencrypted for AMD SEV). The range is not pre-defined. > Ah right, rings a bell. One obvious alternative would be to let user space only explicitly map what is shared and can be safely accessed, instead of doing it the other way around. But that obviously requires more thought/work and clashes with future MM changes you discuss below. >>> This started with SEV. QEMU creates normal memory mappings with the SEV >>> C-bit (encryption) disabled. The kernel plumbs those into NPT, but when >>> those are instantiated, they have the C-bit set. So, we have mismatched >>> mappings. Where does that lead? The two mappings not only differ in >>> the encryption bit, causing one side to read gibberish if the other >>> writes: they're not even cache coherent. >>> >>> That's the situation *TODAY*, even ignoring TDX. >>> >>> BTW, I'm pretty sure I know the answer to the "why would you expose this >>> to userspace" question: it's what QEMU/KVM did alreadhy for >>> non-encrypted memory, so this was the quickest way to get SEV working. >>> >> >> Yes, I guess so. It was the fastest way to "hack" it into QEMU. >> >> Would we ever even want a VMA/mmap/process page tables for that memory? How >> could user space ever do something *not so nasty* with that memory (in the >> current context of VMs)? > > In the future, the memory should be still managable by host MM: migration, > swapping, etc. But it's long way there. For now, the guest memory I was involved in the s390x implementation where this already works, simply because whenever encrypted memory is read/written from the hypervisor, you simple read/write the encrypted data; once the VM accesses that memory, it reads/writes unencrypted memory. For this reason, migration, swapping etc. works fairly naturally. I do wonder how x86-64 wants to tackle that; In the far future, will it be valid to again read/write encrypted memory, especially from user space? > effectively pinned on the host. Right, I remember that limitation for SEV. Thanks!
On Wed, Apr 07, 2021 at 04:09:35PM +0200, David Hildenbrand wrote: > On 07.04.21 15:16, Kirill A. Shutemov wrote: > > On Tue, Apr 06, 2021 at 04:57:46PM +0200, David Hildenbrand wrote: > > > On 06.04.21 16:33, Dave Hansen wrote: > > > > On 4/6/21 12:44 AM, David Hildenbrand wrote: > > > > > On 02.04.21 17:26, Kirill A. Shutemov wrote: > > > > > > TDX architecture aims to provide resiliency against confidentiality and > > > > > > integrity attacks. Towards this goal, the TDX architecture helps enforce > > > > > > the enabling of memory integrity for all TD-private memory. > > > > > > > > > > > > The CPU memory controller computes the integrity check value (MAC) for > > > > > > the data (cache line) during writes, and it stores the MAC with the > > > > > > memory as meta-data. A 28-bit MAC is stored in the ECC bits. > > > > > > > > > > > > Checking of memory integrity is performed during memory reads. If > > > > > > integrity check fails, CPU poisones cache line. > > > > > > > > > > > > On a subsequent consumption (read) of the poisoned data by software, > > > > > > there are two possible scenarios: > > > > > > > > > > > > - Core determines that the execution can continue and it treats > > > > > > poison with exception semantics signaled as a #MCE > > > > > > > > > > > > - Core determines execution cannot continue,and it does an unbreakable > > > > > > shutdown > > > > > > > > > > > > For more details, see Chapter 14 of Intel TDX Module EAS[1] > > > > > > > > > > > > As some of integrity check failures may lead to system shutdown host > > > > > > kernel must not allow any writes to TD-private memory. This requirment > > > > > > clashes with KVM design: KVM expects the guest memory to be mapped into > > > > > > host userspace (e.g. QEMU). > > > > > > > > > > So what you are saying is that if QEMU would write to such memory, it > > > > > could crash the kernel? What a broken design. > > > > > > > > IMNHO, the broken design is mapping the memory to userspace in the first > > > > place. Why the heck would you actually expose something with the MMU to > > > > a context that can't possibly meaningfully access or safely write to it? > > > > > > I'd say the broken design is being able to crash the machine via a simple > > > memory write, instead of only crashing a single process in case you're doing > > > something nasty. From the evaluation of the problem it feels like this was a > > > CPU design workaround: instead of properly cleaning up when it gets tricky > > > within the core, just crash the machine. And that's a CPU "feature", not a > > > kernel "feature". Now we have to fix broken HW in the kernel - once again. > > > > > > However, you raise a valid point: it does not make too much sense to to map > > > this into user space. Not arguing against that; but crashing the machine is > > > just plain ugly. > > > > > > I wonder: why do we even *want* a VMA/mmap describing that memory? Sounds > > > like: for hacking support for that memory type into QEMU/KVM. > > > > > > This all feels wrong, but I cannot really tell how it could be better. That > > > memory can really only be used (right now?) with hardware virtualization > > > from some point on. From that point on (right from the start?), there should > > > be no VMA/mmap/page tables for user space anymore. > > > > > > Or am I missing something? Is there still valid user space access? > > > > There is. For IO (e.g. virtio) the guest mark a range of memory as shared > > (or unencrypted for AMD SEV). The range is not pre-defined. > > > > Ah right, rings a bell. One obvious alternative would be to let user space > only explicitly map what is shared and can be safely accessed, instead of > doing it the other way around. But that obviously requires more thought/work > and clashes with future MM changes you discuss below. IIUC, HyperV's VMBus uses pre-defined range that communicated through ACPI. KVM/virtio can do the same in theory, but it would require changes in the existing driver model. > > > > This started with SEV. QEMU creates normal memory mappings with the SEV > > > > C-bit (encryption) disabled. The kernel plumbs those into NPT, but when > > > > those are instantiated, they have the C-bit set. So, we have mismatched > > > > mappings. Where does that lead? The two mappings not only differ in > > > > the encryption bit, causing one side to read gibberish if the other > > > > writes: they're not even cache coherent. > > > > > > > > That's the situation *TODAY*, even ignoring TDX. > > > > > > > > BTW, I'm pretty sure I know the answer to the "why would you expose this > > > > to userspace" question: it's what QEMU/KVM did alreadhy for > > > > non-encrypted memory, so this was the quickest way to get SEV working. > > > > > > > > > > Yes, I guess so. It was the fastest way to "hack" it into QEMU. > > > > > > Would we ever even want a VMA/mmap/process page tables for that memory? How > > > could user space ever do something *not so nasty* with that memory (in the > > > current context of VMs)? > > > > In the future, the memory should be still managable by host MM: migration, > > swapping, etc. But it's long way there. For now, the guest memory > > I was involved in the s390x implementation where this already works, simply > because whenever encrypted memory is read/written from the hypervisor, you > simple read/write the encrypted data; once the VM accesses that memory, it > reads/writes unencrypted memory. For this reason, migration, swapping etc. > works fairly naturally. In TDX case, the encryption tied to the physical address of the encrypted block. Moving the block to other place in memory would produce garbage. It's done intentionally to protected against replay attack. > I do wonder how x86-64 wants to tackle that; In the far future, will it be > valid to again read/write encrypted memory, especially from user space? > It would require assistance from the guest and/or TDX module.
On 02.04.21 17:26, Kirill A. Shutemov wrote: > TDX architecture aims to provide resiliency against confidentiality and > integrity attacks. Towards this goal, the TDX architecture helps enforce > the enabling of memory integrity for all TD-private memory. > > The CPU memory controller computes the integrity check value (MAC) for > the data (cache line) during writes, and it stores the MAC with the > memory as meta-data. A 28-bit MAC is stored in the ECC bits. > > Checking of memory integrity is performed during memory reads. If > integrity check fails, CPU poisones cache line. > > On a subsequent consumption (read) of the poisoned data by software, > there are two possible scenarios: > > - Core determines that the execution can continue and it treats > poison with exception semantics signaled as a #MCE > > - Core determines execution cannot continue,and it does an unbreakable > shutdown > > For more details, see Chapter 14 of Intel TDX Module EAS[1] > > As some of integrity check failures may lead to system shutdown host > kernel must not allow any writes to TD-private memory. This requirment > clashes with KVM design: KVM expects the guest memory to be mapped into > host userspace (e.g. QEMU). > > This patch aims to start discussion on how we can approach the issue. > > For now I intentionally keep TDX out of picture here and try to find a > generic way to unmap KVM guest memory from host userspace. Hopefully, it > makes the patch more approachable. And anyone can try it out. > > To the proposal: > > Looking into existing codepaths I've discovered that we already have > semantics we want. That's PG_hwpoison'ed pages and SWP_HWPOISON swap > entries in page tables: > > - If an application touches a page mapped with the SWP_HWPOISON, it will > get SIGBUS. > > - GUP will fail with -EFAULT; > > Access the poisoned memory via page cache doesn't match required > semantics right now, but it shouldn't be too hard to make it work: > access to poisoned dirty pages should give -EIO or -EHWPOISON. > > My idea is that we can mark page as poisoned when we make it TD-private > and replace all PTEs that map the page with SWP_HWPOISON. It looks quite hacky (well, what did I expect from an RFC :) ) you can no longer distinguish actually poisoned pages from "temporarily poisoned" pages. FOLL_ALLOW_POISONED sounds especially nasty and dangerous - "I want to read/write a poisoned page, trust me, I know what I am doing". Storing the state for each individual page initially sounded like the right thing to do, but I wonder if we couldn't handle this on a per-VMA level. You can just remember the handful of shared ranges internally like you do right now AFAIU. From what I get, you want a way to 1. Unmap pages from the user space page tables. 2. Disallow re-faulting of the protected pages into the page tables. On user space access, you want to deliver some signal (e.g., SIGBUS). 3. Allow selected users to still grab the pages (esp. KVM to fault them into the page tables). 4. Allow access to currently shared specific pages from user space. Right now, you achieve 1. Via try_to_unmap() 2. TestSetPageHWPoison 3. TBD (e.g., FOLL_ALLOW_POISONED) 4. ClearPageHWPoison() If we could bounce all writes to shared pages through the kernel, things could end up a little easier. Some very rough idea: We could let user space setup VM memory as mprotect(PROT_READ) (+ PROT_KERNEL_WRITE?), and after activating protected memory (I assume via a KVM ioctl), make sure the VMAs cannot be set to PROT_WRITE anymore. This would already properly unmap and deliver a SIGSEGV when trying to write from user space. You could then still access the pages, e.g., via FOLL_FORCE or a new fancy flag that allows to write with VM_MAYWRITE|VM_DENYUSERWRITE. This would allow an ioctl to write page content and to map the pages into NPTs. As an extension, we could think about (re?)mapping some shared pages read|write. The question is how to synchronize with user space. I have no idea how expensive would be bouncing writes (and reads?) through the kernel. Did you ever experiment with that/evaluate that?
David Hildenbrand <david@redhat.com> writes: > I have no idea how expensive would be bouncing writes (and reads?) > through the kernel. Did you ever experiment with that/evaluate that? I would expect it to be quite expensive, as in virtio IO performance tanking. -Andi
On Wed, Apr 07, 2021 at 04:55:54PM +0200, David Hildenbrand wrote: > On 02.04.21 17:26, Kirill A. Shutemov wrote: > > TDX architecture aims to provide resiliency against confidentiality and > > integrity attacks. Towards this goal, the TDX architecture helps enforce > > the enabling of memory integrity for all TD-private memory. > > > > The CPU memory controller computes the integrity check value (MAC) for > > the data (cache line) during writes, and it stores the MAC with the > > memory as meta-data. A 28-bit MAC is stored in the ECC bits. > > > > Checking of memory integrity is performed during memory reads. If > > integrity check fails, CPU poisones cache line. > > > > On a subsequent consumption (read) of the poisoned data by software, > > there are two possible scenarios: > > > > - Core determines that the execution can continue and it treats > > poison with exception semantics signaled as a #MCE > > > > - Core determines execution cannot continue,and it does an unbreakable > > shutdown > > > > For more details, see Chapter 14 of Intel TDX Module EAS[1] > > > > As some of integrity check failures may lead to system shutdown host > > kernel must not allow any writes to TD-private memory. This requirment > > clashes with KVM design: KVM expects the guest memory to be mapped into > > host userspace (e.g. QEMU). > > > > This patch aims to start discussion on how we can approach the issue. > > > > For now I intentionally keep TDX out of picture here and try to find a > > generic way to unmap KVM guest memory from host userspace. Hopefully, it > > makes the patch more approachable. And anyone can try it out. > > > > To the proposal: > > > > Looking into existing codepaths I've discovered that we already have > > semantics we want. That's PG_hwpoison'ed pages and SWP_HWPOISON swap > > entries in page tables: > > > > - If an application touches a page mapped with the SWP_HWPOISON, it will > > get SIGBUS. > > > > - GUP will fail with -EFAULT; > > > > Access the poisoned memory via page cache doesn't match required > > semantics right now, but it shouldn't be too hard to make it work: > > access to poisoned dirty pages should give -EIO or -EHWPOISON. > > > > My idea is that we can mark page as poisoned when we make it TD-private > > and replace all PTEs that map the page with SWP_HWPOISON. > > It looks quite hacky (well, what did I expect from an RFC :) ) you can no > longer distinguish actually poisoned pages from "temporarily poisoned" > pages. FOLL_ALLOW_POISONED sounds especially nasty and dangerous - "I want > to read/write a poisoned page, trust me, I know what I am doing". > > Storing the state for each individual page initially sounded like the right > thing to do, but I wonder if we couldn't handle this on a per-VMA level. You > can just remember the handful of shared ranges internally like you do right > now AFAIU. per-VMA would not fly for file-backed (e.g. tmpfs) memory. We may need to combine PG_hwpoison with VMA flag. Maybe per-inode tracking would also be required. Or per-memslot. I donno. Need more experiments. Note, I use PG_hwpoison now, but if we find a show-stopper issue where we would see confusion with a real poison, we can switch to new flags and a new swap_type(). I have not seen a reason yet. > From what I get, you want a way to > > 1. Unmap pages from the user space page tables. Plain unmap would not work for some use-cases. Some CSPs want to preallocate memory in a specific way. It's a way to provide a fine-grained NUMA policy. The existing mapping has to be converted. > 2. Disallow re-faulting of the protected pages into the page tables. On user > space access, you want to deliver some signal (e.g., SIGBUS). Note that userspace mapping is the only source of pfn's for VM's shadow mapping. The fault should be allow, but lead to non-present PTE that still encodes pfn. > 3. Allow selected users to still grab the pages (esp. KVM to fault them into > the page tables). As long as fault leads to non-present PTEs we are fine. Usespace still may want to mlock() some of guest memory. There's no reason to prevent this. > 4. Allow access to currently shared specific pages from user space. > > Right now, you achieve > > 1. Via try_to_unmap() > 2. TestSetPageHWPoison > 3. TBD (e.g., FOLL_ALLOW_POISONED) > 4. ClearPageHWPoison() > > > If we could bounce all writes to shared pages through the kernel, things > could end up a little easier. Some very rough idea: > > We could let user space setup VM memory as > mprotect(PROT_READ) (+ PROT_KERNEL_WRITE?), and after activating protected > memory (I assume via a KVM ioctl), make sure the VMAs cannot be set to > PROT_WRITE anymore. This would already properly unmap and deliver a SIGSEGV > when trying to write from user space. > > You could then still access the pages, e.g., via FOLL_FORCE or a new fancy > flag that allows to write with VM_MAYWRITE|VM_DENYUSERWRITE. This would > allow an ioctl to write page content and to map the pages into NPTs. > > As an extension, we could think about (re?)mapping some shared pages > read|write. The question is how to synchronize with user space. > > I have no idea how expensive would be bouncing writes (and reads?) through > the kernel. Did you ever experiment with that/evaluate that? It's going to be double bounce buffer: on the guest we force swiotlb to make it go through shared region. I don't think it's a good idea. There are a number of way to share a memory. It's going to be decided by the way we get these pages unmapped in the first place.
>> It looks quite hacky (well, what did I expect from an RFC :) ) you can no >> longer distinguish actually poisoned pages from "temporarily poisoned" >> pages. FOLL_ALLOW_POISONED sounds especially nasty and dangerous - "I want >> to read/write a poisoned page, trust me, I know what I am doing". >> >> Storing the state for each individual page initially sounded like the right >> thing to do, but I wonder if we couldn't handle this on a per-VMA level. You >> can just remember the handful of shared ranges internally like you do right >> now AFAIU. > > per-VMA would not fly for file-backed (e.g. tmpfs) memory. We may need to > combine PG_hwpoison with VMA flag. Maybe per-inode tracking would also be > required. Or per-memslot. I donno. Need more experiments. Indeed. > > Note, I use PG_hwpoison now, but if we find a show-stopper issue where we > would see confusion with a real poison, we can switch to new flags and > a new swap_type(). I have not seen a reason yet. I think we'll want a dedicate mechanism to cleanly mark pages as "protected". Finding a page flag you can use will be the problematic part, but should not be impossible if we have a good reason to do so (even if it means making the feature mutually exclusive with other features). > >> From what I get, you want a way to >> >> 1. Unmap pages from the user space page tables. > > Plain unmap would not work for some use-cases. Some CSPs want to > preallocate memory in a specific way. It's a way to provide a fine-grained > NUMA policy. > > The existing mapping has to be converted. > >> 2. Disallow re-faulting of the protected pages into the page tables. On user >> space access, you want to deliver some signal (e.g., SIGBUS). > > Note that userspace mapping is the only source of pfn's for VM's shadow > mapping. The fault should be allow, but lead to non-present PTE that still > encodes pfn. Makes sense, but I guess that's the part still to be implemented (see next comment). > >> 3. Allow selected users to still grab the pages (esp. KVM to fault them into >> the page tables). > > As long as fault leads to non-present PTEs we are fine. Usespace still may > want to mlock() some of guest memory. There's no reason to prevent this. I'm curious, even get_user_pages() will lead to a present PTE as is, no? So that will need modifications I assume. (although I think it fundamentally differs to the way get_user_pages() works - trigger a fault first, then lookup the PTE in the page tables). >> 4. Allow access to currently shared specific pages from user space. >> >> Right now, you achieve >> >> 1. Via try_to_unmap() >> 2. TestSetPageHWPoison >> 3. TBD (e.g., FOLL_ALLOW_POISONED) >> 4. ClearPageHWPoison() >> >> >> If we could bounce all writes to shared pages through the kernel, things >> could end up a little easier. Some very rough idea: >> >> We could let user space setup VM memory as >> mprotect(PROT_READ) (+ PROT_KERNEL_WRITE?), and after activating protected >> memory (I assume via a KVM ioctl), make sure the VMAs cannot be set to >> PROT_WRITE anymore. This would already properly unmap and deliver a SIGSEGV >> when trying to write from user space. >> >> You could then still access the pages, e.g., via FOLL_FORCE or a new fancy >> flag that allows to write with VM_MAYWRITE|VM_DENYUSERWRITE. This would >> allow an ioctl to write page content and to map the pages into NPTs. >> >> As an extension, we could think about (re?)mapping some shared pages >> read|write. The question is how to synchronize with user space. >> >> I have no idea how expensive would be bouncing writes (and reads?) through >> the kernel. Did you ever experiment with that/evaluate that? > > It's going to be double bounce buffer: on the guest we force swiotlb to > make it go through shared region. I don't think it's a good idea. So if it's already slow, do we really care? ;) > > There are a number of way to share a memory. It's going to be decided by > the way we get these pages unmapped in the first place. I agree that shared memory can be somewhat problematic and would require tracking it per page.
On Fri, Apr 09, 2021 at 03:50:42PM +0200, David Hildenbrand wrote: > > > 3. Allow selected users to still grab the pages (esp. KVM to fault them into > > > the page tables). > > > > As long as fault leads to non-present PTEs we are fine. Usespace still may > > want to mlock() some of guest memory. There's no reason to prevent this. > > I'm curious, even get_user_pages() will lead to a present PTE as is, no? So > that will need modifications I assume. (although I think it fundamentally > differs to the way get_user_pages() works - trigger a fault first, then > lookup the PTE in the page tables). For now, the patch has two step poisoning: first fault in, on the add to shadow PTE -- poison. By the time VM has chance to use the page it's poisoned and unmapped from the host userspace.
On 09.04.21 16:12, Kirill A. Shutemov wrote: > On Fri, Apr 09, 2021 at 03:50:42PM +0200, David Hildenbrand wrote: >>>> 3. Allow selected users to still grab the pages (esp. KVM to fault them into >>>> the page tables). >>> >>> As long as fault leads to non-present PTEs we are fine. Usespace still may >>> want to mlock() some of guest memory. There's no reason to prevent this. >> >> I'm curious, even get_user_pages() will lead to a present PTE as is, no? So >> that will need modifications I assume. (although I think it fundamentally >> differs to the way get_user_pages() works - trigger a fault first, then >> lookup the PTE in the page tables). > > For now, the patch has two step poisoning: first fault in, on the add to > shadow PTE -- poison. By the time VM has chance to use the page it's > poisoned and unmapped from the host userspace. IIRC, this then assumes that while a page is protected, it will remain mapped into the NPT; because, there is no way to remap into NPT later because the pages have already been poisoned.
diff --git a/arch/x86/kvm/Kconfig b/arch/x86/kvm/Kconfig index 7ac592664c52..b7db1c455e7c 100644 --- a/arch/x86/kvm/Kconfig +++ b/arch/x86/kvm/Kconfig @@ -46,6 +46,7 @@ config KVM select KVM_GENERIC_DIRTYLOG_READ_PROTECT select KVM_VFIO select SRCU + select HAVE_KVM_PROTECTED_MEMORY help Support hosting fully virtualized guest machines using hardware virtualization extensions. You will need a fairly recent diff --git a/arch/x86/kvm/cpuid.c b/arch/x86/kvm/cpuid.c index 38172ca627d3..1457692c1080 100644 --- a/arch/x86/kvm/cpuid.c +++ b/arch/x86/kvm/cpuid.c @@ -796,7 +796,8 @@ static inline int __do_cpuid_func(struct kvm_cpuid_array *array, u32 function) (1 << KVM_FEATURE_PV_SEND_IPI) | (1 << KVM_FEATURE_POLL_CONTROL) | (1 << KVM_FEATURE_PV_SCHED_YIELD) | - (1 << KVM_FEATURE_ASYNC_PF_INT); + (1 << KVM_FEATURE_ASYNC_PF_INT) | + (1 << KVM_FEATURE_MEM_PROTECTED); if (sched_info_on()) entry->eax |= (1 << KVM_FEATURE_STEAL_TIME); diff --git a/arch/x86/kvm/mmu/mmu.c b/arch/x86/kvm/mmu/mmu.c index 6d16481aa29d..53a69c8c59f1 100644 --- a/arch/x86/kvm/mmu/mmu.c +++ b/arch/x86/kvm/mmu/mmu.c @@ -43,6 +43,7 @@ #include <linux/hash.h> #include <linux/kern_levels.h> #include <linux/kthread.h> +#include <linux/rmap.h> #include <asm/page.h> #include <asm/memtype.h> @@ -2758,7 +2759,8 @@ static void direct_pte_prefetch(struct kvm_vcpu *vcpu, u64 *sptep) if (sp->role.level > PG_LEVEL_4K) return; - __direct_pte_prefetch(vcpu, sp, sptep); + if (!vcpu->kvm->mem_protected) + __direct_pte_prefetch(vcpu, sp, sptep); } static int host_pfn_mapping_level(struct kvm_vcpu *vcpu, gfn_t gfn, @@ -3723,6 +3725,17 @@ static int direct_page_fault(struct kvm_vcpu *vcpu, gpa_t gpa, u32 error_code, if (handle_abnormal_pfn(vcpu, is_tdp ? 0 : gpa, gfn, pfn, ACC_ALL, &r)) return r; + if (vcpu->kvm->mem_protected && unlikely(!is_noslot_pfn(pfn)) && + !gfn_is_shared(vcpu->kvm, gfn)) { + struct page *page = pfn_to_page(pfn); + lock_page(page); + VM_BUG_ON_PAGE(!PageSwapBacked(page) && !PageReserved(page), page); + /* Recheck gfn_is_shared() under page lock */ + if (!gfn_is_shared(vcpu->kvm, gfn) && !TestSetPageHWPoison(page)) + try_to_unmap(page, TTU_IGNORE_MLOCK); + unlock_page(page); + } + r = RET_PF_RETRY; spin_lock(&vcpu->kvm->mmu_lock); if (mmu_notifier_retry(vcpu->kvm, mmu_seq)) diff --git a/arch/x86/kvm/mmu/paging_tmpl.h b/arch/x86/kvm/mmu/paging_tmpl.h index 50e268eb8e1a..26b0494a1207 100644 --- a/arch/x86/kvm/mmu/paging_tmpl.h +++ b/arch/x86/kvm/mmu/paging_tmpl.h @@ -397,8 +397,14 @@ static int FNAME(walk_addr_generic)(struct guest_walker *walker, goto error; ptep_user = (pt_element_t __user *)((void *)host_addr + offset); - if (unlikely(__get_user(pte, ptep_user))) - goto error; + if (vcpu->kvm->mem_protected) { + if (copy_from_guest(vcpu->kvm, &pte, host_addr + offset, + sizeof(pte))) + goto error; + } else { + if (unlikely(__get_user(pte, ptep_user))) + goto error; + } walker->ptep_user[walker->level - 1] = ptep_user; trace_kvm_mmu_paging_element(pte, walker->level); diff --git a/arch/x86/kvm/x86.c b/arch/x86/kvm/x86.c index 1b404e4d7dd8..f8183386abe7 100644 --- a/arch/x86/kvm/x86.c +++ b/arch/x86/kvm/x86.c @@ -8170,6 +8170,12 @@ int kvm_emulate_hypercall(struct kvm_vcpu *vcpu) kvm_sched_yield(vcpu->kvm, a0); ret = 0; break; + case KVM_HC_ENABLE_MEM_PROTECTED: + ret = kvm_protect_memory(vcpu->kvm); + break; + case KVM_HC_MEM_SHARE: + ret = kvm_share_memory(vcpu->kvm, a0, a1); + break; default: ret = -KVM_ENOSYS; break; diff --git a/include/linux/kvm_host.h b/include/linux/kvm_host.h index f3b1013fb22c..f941bcbefb79 100644 --- a/include/linux/kvm_host.h +++ b/include/linux/kvm_host.h @@ -436,6 +436,8 @@ static inline int kvm_arch_vcpu_memslots_id(struct kvm_vcpu *vcpu) } #endif +#define KVM_NR_SHARED_RANGES 32 + /* * Note: * memslots are not sorted by id anymore, please use id_to_memslot() @@ -513,6 +515,9 @@ struct kvm { pid_t userspace_pid; unsigned int max_halt_poll_ns; u32 dirty_ring_size; + bool mem_protected; + int nr_shared_ranges; + struct range shared_ranges[KVM_NR_SHARED_RANGES]; }; #define kvm_err(fmt, ...) \ @@ -709,6 +714,10 @@ void kvm_arch_flush_shadow_all(struct kvm *kvm); void kvm_arch_flush_shadow_memslot(struct kvm *kvm, struct kvm_memory_slot *slot); +int kvm_protect_memory(struct kvm *kvm); +int kvm_share_memory(struct kvm *kvm, unsigned long gfn, unsigned long npages); +bool gfn_is_shared(struct kvm *kvm, unsigned long gfn); + int gfn_to_page_many_atomic(struct kvm_memory_slot *slot, gfn_t gfn, struct page **pages, int nr_pages); @@ -718,6 +727,9 @@ unsigned long gfn_to_hva_prot(struct kvm *kvm, gfn_t gfn, bool *writable); unsigned long gfn_to_hva_memslot(struct kvm_memory_slot *slot, gfn_t gfn); unsigned long gfn_to_hva_memslot_prot(struct kvm_memory_slot *slot, gfn_t gfn, bool *writable); +int copy_from_guest(struct kvm *kvm, void *data, unsigned long hva, int len); +int copy_to_guest(struct kvm *kvm, unsigned long hva, const void *data, int len); + void kvm_release_page_clean(struct page *page); void kvm_release_page_dirty(struct page *page); void kvm_set_page_accessed(struct page *page); diff --git a/include/linux/swapops.h b/include/linux/swapops.h index d9b7c9132c2f..520589b12fb3 100644 --- a/include/linux/swapops.h +++ b/include/linux/swapops.h @@ -323,6 +323,16 @@ static inline int is_hwpoison_entry(swp_entry_t entry) return swp_type(entry) == SWP_HWPOISON; } +static inline unsigned long hwpoison_entry_to_pfn(swp_entry_t entry) +{ + return swp_offset(entry); +} + +static inline struct page *hwpoison_entry_to_page(swp_entry_t entry) +{ + return pfn_to_page(hwpoison_entry_to_pfn(entry)); +} + static inline void num_poisoned_pages_inc(void) { atomic_long_inc(&num_poisoned_pages); @@ -345,6 +355,16 @@ static inline int is_hwpoison_entry(swp_entry_t swp) return 0; } +static inline unsigned long hwpoison_entry_to_pfn(swp_entry_t entry) +{ + return 0; +} + +static inline struct page *hwpoison_entry_to_page(swp_entry_t entry) +{ + return NULL; +} + static inline void num_poisoned_pages_inc(void) { } diff --git a/include/uapi/linux/kvm_para.h b/include/uapi/linux/kvm_para.h index 09d36683ee0a..743e621111f0 100644 --- a/include/uapi/linux/kvm_para.h +++ b/include/uapi/linux/kvm_para.h @@ -17,6 +17,7 @@ #define KVM_E2BIG E2BIG #define KVM_EPERM EPERM #define KVM_EOPNOTSUPP 95 +#define KVM_EINTR EINTR #define KVM_HC_VAPIC_POLL_IRQ 1 #define KVM_HC_MMU_OP 2 diff --git a/mm/gup.c b/mm/gup.c index e4c224cd9661..ce4fdf213455 100644 --- a/mm/gup.c +++ b/mm/gup.c @@ -384,22 +384,31 @@ static struct page *follow_page_pte(struct vm_area_struct *vma, ptep = pte_offset_map_lock(mm, pmd, address, &ptl); pte = *ptep; if (!pte_present(pte)) { - swp_entry_t entry; + swp_entry_t entry = pte_to_swp_entry(pte); + + if (pte_none(pte)) + goto no_page; + /* * KSM's break_ksm() relies upon recognizing a ksm page * even while it is being migrated, so for that case we * need migration_entry_wait(). */ - if (likely(!(flags & FOLL_MIGRATION))) - goto no_page; - if (pte_none(pte)) - goto no_page; - entry = pte_to_swp_entry(pte); - if (!is_migration_entry(entry)) - goto no_page; - pte_unmap_unlock(ptep, ptl); - migration_entry_wait(mm, pmd, address); - goto retry; + if (is_migration_entry(entry) && (flags & FOLL_MIGRATION)) { + pte_unmap_unlock(ptep, ptl); + migration_entry_wait(mm, pmd, address); + goto retry; + } + + if (is_hwpoison_entry(entry)) { + page = hwpoison_entry_to_page(entry); + if (PageHWPoison(page) /* && (flags & FOLL_ALLOW_POISONED) */) { + get_page(page); + goto out; + } + } + + goto no_page; } if ((flags & FOLL_NUMA) && pte_protnone(pte)) goto no_page; diff --git a/mm/memory.c b/mm/memory.c index feff48e1465a..524dce15a087 100644 --- a/mm/memory.c +++ b/mm/memory.c @@ -767,6 +767,9 @@ copy_nonpresent_pte(struct mm_struct *dst_mm, struct mm_struct *src_mm, pte = pte_swp_mkuffd_wp(pte); set_pte_at(src_mm, addr, src_pte, pte); } + } else if (is_hwpoison_entry(entry)) { + page = hwpoison_entry_to_page(entry); + get_page(page); } set_pte_at(dst_mm, addr, dst_pte, pte); return 0; @@ -1305,6 +1308,9 @@ static unsigned long zap_pte_range(struct mmu_gather *tlb, page = migration_entry_to_page(entry); rss[mm_counter(page)]--; + + } else if (is_hwpoison_entry(entry)) { + put_page(hwpoison_entry_to_page(entry)); } if (unlikely(!free_swap_and_cache(entry))) print_bad_pte(vma, addr, ptent, NULL); @@ -3274,7 +3280,43 @@ vm_fault_t do_swap_page(struct vm_fault *vmf) vmf->page = device_private_entry_to_page(entry); ret = vmf->page->pgmap->ops->migrate_to_ram(vmf); } else if (is_hwpoison_entry(entry)) { - ret = VM_FAULT_HWPOISON; + page = hwpoison_entry_to_page(entry); + + locked = lock_page_or_retry(page, vma->vm_mm, vmf->flags); + if (!locked) { + ret = VM_FAULT_RETRY; + goto out; + } + + vmf->pte = pte_offset_map_lock(vma->vm_mm, vmf->pmd, + vmf->address, &vmf->ptl); + + if (unlikely(!pte_same(*vmf->pte, vmf->orig_pte))) { + ret = 0; + } else if (PageHWPoison(page)) { + ret = VM_FAULT_HWPOISON; + } else { + /* + * The page is unpoisoned. Replace hwpoison + * entry with a present PTE. + */ + + inc_mm_counter(vma->vm_mm, mm_counter(page)); + pte = mk_pte(page, vma->vm_page_prot); + + if (PageAnon(page)) { + page_add_anon_rmap(page, vma, + vmf->address, false); + } else { + page_add_file_rmap(page, false); + } + + set_pte_at(vma->vm_mm, vmf->address, + vmf->pte, pte); + } + + pte_unmap_unlock(vmf->pte, vmf->ptl); + unlock_page(page); } else { print_bad_pte(vma, vmf->address, vmf->orig_pte, NULL); ret = VM_FAULT_SIGBUS; @@ -3282,7 +3324,6 @@ vm_fault_t do_swap_page(struct vm_fault *vmf) goto out; } - delayacct_set_flag(DELAYACCT_PF_SWAPIN); page = lookup_swap_cache(entry, vma, vmf->address); swapcache = page; diff --git a/mm/page_vma_mapped.c b/mm/page_vma_mapped.c index 86e3a3688d59..8fffae175104 100644 --- a/mm/page_vma_mapped.c +++ b/mm/page_vma_mapped.c @@ -93,10 +93,12 @@ static bool check_pte(struct page_vma_mapped_walk *pvmw) return false; entry = pte_to_swp_entry(*pvmw->pte); - if (!is_migration_entry(entry)) + if (is_migration_entry(entry)) + pfn = migration_entry_to_pfn(entry); + else if (is_hwpoison_entry(entry)) + pfn = hwpoison_entry_to_pfn(entry); + else return false; - - pfn = migration_entry_to_pfn(entry); } else if (is_swap_pte(*pvmw->pte)) { swp_entry_t entry; diff --git a/mm/rmap.c b/mm/rmap.c index 08c56aaf72eb..f08d1fc28522 100644 --- a/mm/rmap.c +++ b/mm/rmap.c @@ -1575,7 +1575,7 @@ static bool try_to_unmap_one(struct page *page, struct vm_area_struct *vma, dec_mm_counter(mm, mm_counter(page)); set_pte_at(mm, address, pvmw.pte, pteval); } - + get_page(page); } else if (pte_unused(pteval) && !userfaultfd_armed(vma)) { /* * The guest indicated that the page content is of no diff --git a/mm/shmem.c b/mm/shmem.c index 7c6b6d8f6c39..d29a0c9be19c 100644 --- a/mm/shmem.c +++ b/mm/shmem.c @@ -1832,6 +1832,13 @@ static int shmem_getpage_gfp(struct inode *inode, pgoff_t index, if (page) hindex = page->index; + + if (page && PageHWPoison(page)) { + unlock_page(page); + put_page(page); + return -EIO; + } + if (page && sgp == SGP_WRITE) mark_page_accessed(page); diff --git a/virt/kvm/Kconfig b/virt/kvm/Kconfig index 1c37ccd5d402..50d7422386aa 100644 --- a/virt/kvm/Kconfig +++ b/virt/kvm/Kconfig @@ -63,3 +63,6 @@ config HAVE_KVM_NO_POLL config KVM_XFER_TO_GUEST_WORK bool + +config HAVE_KVM_PROTECTED_MEMORY + bool diff --git a/virt/kvm/kvm_main.c b/virt/kvm/kvm_main.c index 8367d88ce39b..f182c54bfa34 100644 --- a/virt/kvm/kvm_main.c +++ b/virt/kvm/kvm_main.c @@ -51,6 +51,7 @@ #include <linux/io.h> #include <linux/lockdep.h> #include <linux/kthread.h> +#include <linux/rmap.h> #include <asm/processor.h> #include <asm/ioctl.h> @@ -2333,19 +2334,85 @@ static int next_segment(unsigned long len, int offset) return len; } -static int __kvm_read_guest_page(struct kvm_memory_slot *slot, gfn_t gfn, - void *data, int offset, int len) +int copy_from_guest(struct kvm *kvm, void *data, unsigned long hva, int len) +{ + int offset = offset_in_page(hva); + struct page *page; + int npages, seg; + void *vaddr; + + if (!IS_ENABLED(CONFIG_HAVE_KVM_PROTECTED_MEMORY) || + !kvm->mem_protected) { + return __copy_from_user(data, (void __user *)hva, len); + } + + might_fault(); + kasan_check_write(data, len); + check_object_size(data, len, false); + + while ((seg = next_segment(len, offset)) != 0) { + npages = get_user_pages_unlocked(hva, 1, &page, 0); + if (npages != 1) + return -EFAULT; + + vaddr = kmap_atomic(page); + memcpy(data, vaddr + offset, seg); + kunmap_atomic(vaddr); + + put_page(page); + len -= seg; + hva += seg; + data += seg; + offset = 0; + } + + return 0; +} + +int copy_to_guest(struct kvm *kvm, unsigned long hva, const void *data, int len) +{ + int offset = offset_in_page(hva); + struct page *page; + int npages, seg; + void *vaddr; + + if (!IS_ENABLED(CONFIG_HAVE_KVM_PROTECTED_MEMORY) || + !kvm->mem_protected) { + return __copy_to_user((void __user *)hva, data, len); + } + + might_fault(); + kasan_check_read(data, len); + check_object_size(data, len, true); + + while ((seg = next_segment(len, offset)) != 0) { + npages = get_user_pages_unlocked(hva, 1, &page, FOLL_WRITE); + if (npages != 1) + return -EFAULT; + + vaddr = kmap_atomic(page); + memcpy(vaddr + offset, data, seg); + kunmap_atomic(vaddr); + + put_page(page); + len -= seg; + hva += seg; + data += seg; + offset = 0; + } + + return 0; +} + +static int __kvm_read_guest_page(struct kvm *kvm, struct kvm_memory_slot *slot, + gfn_t gfn, void *data, int offset, int len) { - int r; unsigned long addr; addr = gfn_to_hva_memslot_prot(slot, gfn, NULL); if (kvm_is_error_hva(addr)) return -EFAULT; - r = __copy_from_user(data, (void __user *)addr + offset, len); - if (r) - return -EFAULT; - return 0; + return copy_from_guest(kvm, data, addr + offset, len); } int kvm_read_guest_page(struct kvm *kvm, gfn_t gfn, void *data, int offset, @@ -2353,7 +2420,7 @@ int kvm_read_guest_page(struct kvm *kvm, gfn_t gfn, void *data, int offset, { struct kvm_memory_slot *slot = gfn_to_memslot(kvm, gfn); - return __kvm_read_guest_page(slot, gfn, data, offset, len); + return __kvm_read_guest_page(kvm, slot, gfn, data, offset, len); } EXPORT_SYMBOL_GPL(kvm_read_guest_page); @@ -2362,7 +2429,7 @@ int kvm_vcpu_read_guest_page(struct kvm_vcpu *vcpu, gfn_t gfn, void *data, { struct kvm_memory_slot *slot = kvm_vcpu_gfn_to_memslot(vcpu, gfn); - return __kvm_read_guest_page(slot, gfn, data, offset, len); + return __kvm_read_guest_page(vcpu->kvm, slot, gfn, data, offset, len); } EXPORT_SYMBOL_GPL(kvm_vcpu_read_guest_page); @@ -2444,7 +2511,8 @@ static int __kvm_write_guest_page(struct kvm *kvm, addr = gfn_to_hva_memslot(memslot, gfn); if (kvm_is_error_hva(addr)) return -EFAULT; - r = __copy_to_user((void __user *)addr + offset, data, len); + + r = copy_to_guest(kvm, addr + offset, data, len); if (r) return -EFAULT; mark_page_dirty_in_slot(kvm, memslot, gfn); @@ -2581,7 +2649,7 @@ int kvm_write_guest_offset_cached(struct kvm *kvm, struct gfn_to_hva_cache *ghc, if (unlikely(!ghc->memslot)) return kvm_write_guest(kvm, gpa, data, len); - r = __copy_to_user((void __user *)ghc->hva + offset, data, len); + r = copy_to_guest(kvm, ghc->hva + offset, data, len); if (r) return -EFAULT; mark_page_dirty_in_slot(kvm, ghc->memslot, gpa >> PAGE_SHIFT); @@ -2602,7 +2670,6 @@ int kvm_read_guest_offset_cached(struct kvm *kvm, struct gfn_to_hva_cache *ghc, unsigned long len) { struct kvm_memslots *slots = kvm_memslots(kvm); - int r; gpa_t gpa = ghc->gpa + offset; BUG_ON(len + offset > ghc->len); @@ -2618,11 +2685,7 @@ int kvm_read_guest_offset_cached(struct kvm *kvm, struct gfn_to_hva_cache *ghc, if (unlikely(!ghc->memslot)) return kvm_read_guest(kvm, gpa, data, len); - r = __copy_from_user(data, (void __user *)ghc->hva + offset, len); - if (r) - return -EFAULT; - - return 0; + return copy_from_guest(kvm, data, ghc->hva + offset, len); } EXPORT_SYMBOL_GPL(kvm_read_guest_offset_cached); @@ -2688,6 +2751,73 @@ void kvm_vcpu_mark_page_dirty(struct kvm_vcpu *vcpu, gfn_t gfn) } EXPORT_SYMBOL_GPL(kvm_vcpu_mark_page_dirty); +int kvm_protect_memory(struct kvm *kvm) +{ + if (mmap_write_lock_killable(kvm->mm)) + return -KVM_EINTR; + kvm->mem_protected = true; + kvm_arch_flush_shadow_all(kvm); + mmap_write_unlock(kvm->mm); + + return 0; +} + +bool gfn_is_shared(struct kvm *kvm, unsigned long gfn) +{ + bool ret = false; + int i; + + spin_lock(&kvm->mmu_lock); + for (i = 0; i < kvm->nr_shared_ranges; i++) { + if (gfn < kvm->shared_ranges[i].start) + continue; + if (gfn >= kvm->shared_ranges[i].end) + continue; + + ret = true; + break; + } + spin_unlock(&kvm->mmu_lock); + + return ret; +} + +int kvm_share_memory(struct kvm *kvm, unsigned long gfn, unsigned long npages) +{ + unsigned long end = gfn + npages; + + if (!npages) + return 0; + + /* + * Out of slots. + * Still worth to proceed: the new range may merge with an existing + * one. + */ + WARN_ON_ONCE(kvm->nr_shared_ranges == ARRAY_SIZE(kvm->shared_ranges)); + + spin_lock(&kvm->mmu_lock); + kvm->nr_shared_ranges = add_range_with_merge(kvm->shared_ranges, + ARRAY_SIZE(kvm->shared_ranges), + kvm->nr_shared_ranges, gfn, end); + kvm->nr_shared_ranges = clean_sort_range(kvm->shared_ranges, + ARRAY_SIZE(kvm->shared_ranges)); + spin_unlock(&kvm->mmu_lock); + + for (; gfn < end; gfn++) { + struct page *page = gfn_to_page(kvm, gfn); + + if (page == KVM_ERR_PTR_BAD_PAGE) + continue; + lock_page(page); + ClearPageHWPoison(page); + unlock_page(page); + put_page(page); + } + + return 0; +} + void kvm_sigset_activate(struct kvm_vcpu *vcpu) { if (!vcpu->sigset_active)