Message ID | 20240602142303.3263551-1-kirill.shutemov@linux.intel.com (mailing list archive) |
---|---|
State | Handled Elsewhere, archived |
On Sun, Jun 02, 2024 at 05:23:03PM +0300, Kirill A. Shutemov wrote:
> +			/*
> +			 * The only thing one can do at this point on failure
> +			 * is panic. It is reasonable to proceed.

It makes even less sense now: panic() means "all stops and we die" and
you say it is reasonable to proceed.

I'm confused.

On Mon, Jun 03, 2024 at 10:37:54AM +0200, Borislav Petkov wrote:
> On Sun, Jun 02, 2024 at 05:23:03PM +0300, Kirill A. Shutemov wrote:
> > +			/*
> > +			 * The only thing one can do at this point on failure
> > +			 * is panic. It is reasonable to proceed.
>
> It makes even less sense now: panic() means "all stops and we die" and
> you say it is reasonable to proceed.
>
> I'm confused.

Right. What about the comment below?

	/*
	 * One possible reason for the failure is if kexec raced
	 * with memory conversion. In this case shared bit in
	 * page table got set (or not cleared) during
	 * shared<->private conversion, but the page is actually
	 * private. So this failure is not going to affect the
	 * kexec'ed kernel.
	 *
	 * The only thing one can do at this point on failure
	 * at this point is panic. In absence of better options,
	 * it is reasonable to proceed, hoping the failure is a
	 * benign shared bit mismatch due to the race.
	 *
	 * Also, even if the failure is real and the page cannot
	 * be touched as private, the kdump kernel will boot
	 * fine as it uses pre-reserved memory. What happens
	 * next depends on what the dumping process does and
	 * there's a reasonable chance to produce useful dump
	 * on crash.
	 *
	 * Regardless, the print leaves a trace in the log to
	 * give a clue for debug.
	 */

On 6/4/24 08:32, Kirill A. Shutemov wrote:
> What about the comment below?
>
> 	/*
> 	 * One possible reason for the failure is if kexec raced
> 	 * with memory conversion. In this case shared bit in
> 	 * page table got set (or not cleared) during
> 	 * shared<->private conversion, but the page is actually
> 	 * private. So this failure is not going to affect the
> 	 * kexec'ed kernel.
> 	 *
> 	 * The only thing one can do at this point on failure
> 	 * at this point is panic. In absence of better options,
> 	 * it is reasonable to proceed, hoping the failure is a
> 	 * benign shared bit mismatch due to the race.
> 	 *
> 	 * Also, even if the failure is real and the page cannot
> 	 * be touched as private, the kdump kernel will boot
> 	 * fine as it uses pre-reserved memory. What happens
> 	 * next depends on what the dumping process does and
> 	 * there's a reasonable chance to produce useful dump
> 	 * on crash.
> 	 *
> 	 * Regardless, the print leaves a trace in the log to
> 	 * give a clue for debug.
> 	 */

It's rambling too much for my taste.

Let's boil this down to what matters:

 1. Failures to change encryption status here can lead a future kernel
    to touch shared memory with a private mapping
 2. That causes an immediate unrecoverable guest shutdown (right?)
 3. kdump kernels should not be affected since they have their own
    memory ranges and its encryption status is not being tweaked here
 4. The pr_err() may help make some sense out of #2 when it happens

I'm not sure the reason behind the failed conversion is important here.

I wouldn't mention panic().

We don't need to opine about what the next kernel might or might not do.

On Tue, Jun 04, 2024 at 08:47:22AM -0700, Dave Hansen wrote:
> On 6/4/24 08:32, Kirill A. Shutemov wrote:
> > What about the comment below?
> >
> > 	/*
> > 	 * One possible reason for the failure is if kexec raced
> > 	 * with memory conversion. In this case shared bit in
> > 	 * page table got set (or not cleared) during
> > 	 * shared<->private conversion, but the page is actually
> > 	 * private. So this failure is not going to affect the
> > 	 * kexec'ed kernel.
> > 	 *
> > 	 * The only thing one can do at this point on failure
> > 	 * at this point is panic. In absence of better options,
> > 	 * it is reasonable to proceed, hoping the failure is a
> > 	 * benign shared bit mismatch due to the race.
> > 	 *
> > 	 * Also, even if the failure is real and the page cannot
> > 	 * be touched as private, the kdump kernel will boot
> > 	 * fine as it uses pre-reserved memory. What happens
> > 	 * next depends on what the dumping process does and
> > 	 * there's a reasonable chance to produce useful dump
> > 	 * on crash.
> > 	 *
> > 	 * Regardless, the print leaves a trace in the log to
> > 	 * give a clue for debug.
> > 	 */
>
> It's rambling too much for my taste.
>
> Let's boil this down to what matters:
>
>  1. Failures to change encryption status here can lead a future kernel
>     to touch shared memory with a private mapping
>  2. That causes an immediate unrecoverable guest shutdown (right?)

Right.

>  3. kdump kernels should not be affected since they have their own
>     memory ranges and its encryption status is not being tweaked here
>  4. The pr_err() may help make some sense out of #2 when it happens
>
> I'm not sure the reason behind the failed conversion is important here.

The important part is that failure can be benign. It explains "can" in #1.
But okay.

> I wouldn't mention panic().
>
> We don't need to opine about what the next kernel might or might not do.

Is this any better?

	/*
	 * If tdx_enc_status_changed() fails, it leaves memory
	 * in an unknown state. If the memory remains shared,
	 * it can result in an unrecoverable guest shutdown on
	 * the first accessed through a private mapping.
	 *
	 * The kdump kernel boot is not impacted as it uses
	 * a pre-reserved memory range that is always private.
	 * However, gathering crash information could lead to
	 * a crash if it accesses unconverted memory through
	 * a private mapping.
	 *
	 * pr_err() may assist in understanding such crashes.
	 */

On Tue, Jun 04, 2024 at 07:14:00PM +0300, Kirill A. Shutemov wrote:
> 	/*
> 	 * If tdx_enc_status_changed() fails, it leaves memory
> 	 * in an unknown state. If the memory remains shared,
> 	 * it can result in an unrecoverable guest shutdown on
> 	 * the first accessed through a private mapping.

"access"

So this sentence above can go too, right?

Because that comment is in tdx_kexec_finish() and we're basically going
off to kexec. So can a guest even access it through a private mapping?
We're shutting down so nothing is running anymore...

> 	 * The kdump kernel boot is not impacted as it uses
> 	 * a pre-reserved memory range that is always private.
> 	 * However, gathering crash information could lead to
> 	 * a crash if it accesses unconverted memory through
> 	 * a private mapping.

When does the kexec kernel even get such a private mapping? It is not
even up yet...

> 	 * pr_err() may assist in understanding such crashes.

"Print error info in order to leave bread crumbs for debugging." is what
I'd say.

Thx.

On Tue, Jun 04, 2024 at 08:05:54PM +0200, Borislav Petkov wrote:
> On Tue, Jun 04, 2024 at 07:14:00PM +0300, Kirill A. Shutemov wrote:
> > 	/*
> > 	 * If tdx_enc_status_changed() fails, it leaves memory
> > 	 * in an unknown state. If the memory remains shared,
> > 	 * it can result in an unrecoverable guest shutdown on
> > 	 * the first accessed through a private mapping.
>
> "access"

Okay.

> So this sentence above can go too, right?

I don't think so.

> Because that comment is in tdx_kexec_finish() and we're basically going
> off to kexec. So can a guest even access it through a private mapping?
> We're shutting down so nothing is running anymore...

This kernel can't. But the next kernel can.

If a page can be accessed via private mapping is determined by the
presence in Secure EPT. This state persist across kexec.

> > 	 * The kdump kernel boot is not impacted as it uses
> > 	 * a pre-reserved memory range that is always private.
> > 	 * However, gathering crash information could lead to
> > 	 * a crash if it accesses unconverted memory through
> > 	 * a private mapping.
>
> When does the kexec kernel even get such a private mapping? It is not
> even up yet...

Crash kernel provides access to this memory via /proc/vmcore. Crash kernel
will assume all memory there is private.

> > 	 * pr_err() may assist in understanding such crashes.
>
> "Print error info in order to leave bread crumbs for debugging." is what
> I'd say.

Okay.

On Wed, Jun 05, 2024 at 03:21:42PM +0300, Kirill A. Shutemov wrote:
> If a page can be accessed via private mapping is determined by the
> presence in Secure EPT. This state persist across kexec.

I just love it how I tickle out details each time I touch this comment
because we three can't write a single concise and self-contained
explanation. :-(

Ok, next version:

"Private mappings persist across kexec. If tdx_enc_status_changed() fails
in the first kernel, it leaves memory in an unknown state.

If that memory remains shared, accessing it in the *next* kernel through
a private mapping will result in an unrecoverable guest shutdown.

The kdump kernel boot is not impacted as it uses a pre-reserved memory
range that is always private. However, gathering crash information
could lead to a crash if it accesses unconverted memory through
a private mapping which is possible when accessing that memory through
/proc/vmcore, for example.

In all cases, print error info in order to leave enough bread crumbs for
debugging."

I think this is getting in the right direction as it actually makes
sense now.

On Wed, Jun 05, 2024 at 06:24:19PM +0200, Borislav Petkov wrote:
> On Wed, Jun 05, 2024 at 03:21:42PM +0300, Kirill A. Shutemov wrote:
> > If a page can be accessed via private mapping is determined by the
> > presence in Secure EPT. This state persist across kexec.
>
> I just love it how I tickle out details each time I touch this comment
> because we three can't write a single concise and self-contained
> explanation. :-(
>
> Ok, next version:
>
> "Private mappings persist across kexec. If tdx_enc_status_changed() fails

s/Private mappings persist /Memory encryption state persists /

> in the first kernel, it leaves memory in an unknown state.
>
> If that memory remains shared, accessing it in the *next* kernel through
> a private mapping will result in an unrecoverable guest shutdown.
>
> The kdump kernel boot is not impacted as it uses a pre-reserved memory
> range that is always private. However, gathering crash information
> could lead to a crash if it accesses unconverted memory through
> a private mapping which is possible when accessing that memory through
> /proc/vmcore, for example.
>
> In all cases, print error info in order to leave enough bread crumbs for
> debugging."
>
> I think this is getting in the right direction as it actually makes
> sense now.

Otherwise looks good to me.

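For readers following the thread, here is a minimal userspace sketch of the "report and proceed" behaviour the comment is trying to describe. It is not the kernel code: struct toy_page, toy_make_private() and toy_kexec_finish() are hypothetical names invented for illustration, and toy_make_private() merely stands in for tdx_enc_status_changed(addr, pages, true). The point is only that a failed conversion is logged and the walk continues, and that the page is still counted as found.

```c
#include <assert.h>
#include <stdbool.h>
#include <stddef.h>
#include <stdio.h>

/* Hypothetical model of one page in the direct mapping. */
struct toy_page {
	bool shared;		/* mapped as shared in the toy page table */
	bool convert_ok;	/* whether the toy conversion will succeed */
};

/* Hypothetical stand-in for tdx_enc_status_changed(addr, pages, true). */
static bool toy_make_private(struct toy_page *p)
{
	if (!p->convert_ok)
		return false;

	p->shared = false;
	return true;
}

/*
 * Walk all pages and try to convert every shared one back to private.
 * Returns the number of shared pages found; *failed is set to the number
 * of pages whose conversion failed and which therefore stay shared.
 */
static long toy_kexec_finish(struct toy_page *pages, size_t n, long *failed)
{
	long found = 0;

	assert(pages && failed);
	*failed = 0;

	for (size_t i = 0; i < n; i++) {
		if (!pages[i].shared)
			continue;

		if (!toy_make_private(&pages[i])) {
			/* Leave bread crumbs for debugging and keep going. */
			fprintf(stderr, "Failed to unshare page %zu\n", i);
			(*failed)++;
		}

		found++;
	}

	return found;
}
```

In this toy model a page that fails conversion simply remains marked shared; in the real guest, as discussed above, touching such a page through a private mapping in the next kernel is what leads to the unrecoverable shutdown.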
diff --git a/arch/x86/coco/tdx/tdx.c b/arch/x86/coco/tdx/tdx.c
index 979891e97d83..afd71bc6eb02 100644
--- a/arch/x86/coco/tdx/tdx.c
+++ b/arch/x86/coco/tdx/tdx.c
@@ -7,6 +7,7 @@
 #include <linux/cpufeature.h>
 #include <linux/export.h>
 #include <linux/io.h>
+#include <linux/kexec.h>
 #include <asm/coco.h>
 #include <asm/tdx.h>
 #include <asm/vmx.h>
@@ -14,6 +15,7 @@
 #include <asm/insn.h>
 #include <asm/insn-eval.h>
 #include <asm/pgtable.h>
+#include <asm/set_memory.h>
 
 /* MMIO direction */
 #define EPT_READ	0
@@ -831,6 +833,91 @@ static int tdx_enc_status_change_finish(unsigned long vaddr, int numpages,
 	return 0;
 }
 
+/* Stop new private<->shared conversions */
+static void tdx_kexec_begin(bool crash)
+{
+	/*
+	 * Crash kernel reaches here with interrupts disabled: can't wait for
+	 * conversions to finish.
+	 *
+	 * If race happened, just report and proceed.
+	 */
+	if (!set_memory_enc_stop_conversion(!crash))
+		pr_warn("Failed to stop shared<->private conversions\n");
+}
+
+/* Walk direct mapping and convert all shared memory back to private */
+static void tdx_kexec_finish(void)
+{
+	unsigned long addr, end;
+	long found = 0, shared;
+
+	lockdep_assert_irqs_disabled();
+
+	addr = PAGE_OFFSET;
+	end = PAGE_OFFSET + get_max_mapped();
+
+	while (addr < end) {
+		unsigned long size;
+		unsigned int level;
+		pte_t *pte;
+
+		pte = lookup_address(addr, &level);
+		size = page_level_size(level);
+
+		if (pte && pte_decrypted(*pte)) {
+			int pages = size / PAGE_SIZE;
+
+			/*
+			 * Touching memory with shared bit set triggers implicit
+			 * conversion to shared.
+			 *
+			 * Make sure nobody touches the shared range from
+			 * now on.
+			 */
+			set_pte(pte, __pte(0));
+
+			/*
+			 * The only thing one can do at this point on failure
+			 * is panic. It is reasonable to proceed.
+			 *
+			 * Also, even if the failure is real and the page cannot
+			 * be touched as private, the kdump kernel will boot
+			 * fine as it uses pre-reserved memory. What happens
+			 * next depends on what the dumping process does and
+			 * there's a reasonable chance to produce useful dump
+			 * on crash.
+			 *
+			 * Regardless, the print leaves a trace in the log to
+			 * give a clue for debug.
+			 *
+			 * One possible reason for the failure is if kdump raced
+			 * with memory conversion. In this case shared bit in
+			 * page table got set (or not cleared) during
+			 * shared<->private conversion, but the page is actually
+			 * private. So this failure is not going to affect the
+			 * kexec'ed kernel.
+			 */
+			if (!tdx_enc_status_changed(addr, pages, true)) {
+				pr_err("Failed to unshare range %#lx-%#lx\n",
+				       addr, addr + size);
+			}
+
+			found += pages;
+		}
+
+		addr += size;
+	}
+
+	__flush_tlb_all();
+
+	shared = atomic_long_read(&nr_shared);
+	if (shared != found) {
+		pr_err("shared page accounting is off\n");
+		pr_err("nr_shared = %ld, nr_found = %ld\n", shared, found);
+	}
+}
+
 void __init tdx_early_init(void)
 {
 	struct tdx_module_args args = {
@@ -890,6 +977,9 @@ void __init tdx_early_init(void)
 	x86_platform.guest.enc_cache_flush_required = tdx_cache_flush_required;
 	x86_platform.guest.enc_tlb_flush_required = tdx_tlb_flush_required;
 
+	x86_platform.guest.enc_kexec_begin = tdx_kexec_begin;
+	x86_platform.guest.enc_kexec_finish = tdx_kexec_finish;
+
 	/*
 	 * TDX intercepts the RDMSR to read the X2APIC ID in the parallel
 	 * bringup low level code. That raises #VE which cannot be handled
diff --git a/arch/x86/include/asm/pgtable.h b/arch/x86/include/asm/pgtable.h
index 65b8e5bb902c..e39311a89bf4 100644
--- a/arch/x86/include/asm/pgtable.h
+++ b/arch/x86/include/asm/pgtable.h
@@ -140,6 +140,11 @@ static inline int pte_young(pte_t pte)
 	return pte_flags(pte) & _PAGE_ACCESSED;
 }
 
+static inline bool pte_decrypted(pte_t pte)
+{
+	return cc_mkdec(pte_val(pte)) == pte_val(pte);
+}
+
 #define pmd_dirty pmd_dirty
 static inline bool pmd_dirty(pmd_t pmd)
 {
diff --git a/arch/x86/include/asm/set_memory.h b/arch/x86/include/asm/set_memory.h
index 9aee31862b4a..d490db38db9e 100644
--- a/arch/x86/include/asm/set_memory.h
+++ b/arch/x86/include/asm/set_memory.h
@@ -49,8 +49,11 @@ int set_memory_wb(unsigned long addr, int numpages);
 int set_memory_np(unsigned long addr, int numpages);
 int set_memory_p(unsigned long addr, int numpages);
 int set_memory_4k(unsigned long addr, int numpages);
+
+bool set_memory_enc_stop_conversion(bool wait);
 int set_memory_encrypted(unsigned long addr, int numpages);
 int set_memory_decrypted(unsigned long addr, int numpages);
+
 int set_memory_np_noalias(unsigned long addr, int numpages);
 int set_memory_nonglobal(unsigned long addr, int numpages);
 int set_memory_global(unsigned long addr, int numpages);
diff --git a/arch/x86/mm/pat/set_memory.c b/arch/x86/mm/pat/set_memory.c
index a7a7a6c6a3fb..2a548b65ef5f 100644
--- a/arch/x86/mm/pat/set_memory.c
+++ b/arch/x86/mm/pat/set_memory.c
@@ -2227,12 +2227,47 @@ static int __set_memory_enc_pgtable(unsigned long addr, int numpages, bool enc)
 	return ret;
 }
 
+/*
+ * The lock serializes conversions between private and shared memory.
+ *
+ * It is taken for read on conversion. A write lock guarantees that no
+ * concurrent conversions are in progress.
+ */
+static DECLARE_RWSEM(mem_enc_lock);
+
+/*
+ * Stop new private<->shared conversions.
+ *
+ * Taking the exclusive mem_enc_lock waits for in-flight conversions to complete.
+ * The lock is not released to prevent new conversions from being started.
+ *
+ * If sleep is not allowed, as in a crash scenario, try to take the lock.
+ * Failure indicates that there is a race with the conversion.
+ */
+bool set_memory_enc_stop_conversion(bool wait)
+{
+	if (!wait)
+		return down_write_trylock(&mem_enc_lock);
+
+	down_write(&mem_enc_lock);
+
+	return true;
+}
+
 static int __set_memory_enc_dec(unsigned long addr, int numpages, bool enc)
 {
-	if (cc_platform_has(CC_ATTR_MEM_ENCRYPT))
-		return __set_memory_enc_pgtable(addr, numpages, enc);
+	int ret = 0;
 
-	return 0;
+	if (cc_platform_has(CC_ATTR_MEM_ENCRYPT)) {
+		if (!down_read_trylock(&mem_enc_lock))
+			return -EBUSY;
+
+		ret = __set_memory_enc_pgtable(addr, numpages, enc);
+
+		up_read(&mem_enc_lock);
+	}
+
+	return ret;
 }
 
 int set_memory_encrypted(unsigned long addr, int numpages)
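
The locking scheme in set_memory.c (conversions take mem_enc_lock for read with a trylock, kexec takes it for write and never drops it) can be illustrated outside the kernel. What follows is only a sketch under stated assumptions: it replaces the kernel rwsem with a single C11 atomic counter, busy-waits where down_write() would sleep, and every name in it (conv_state, conv_read_trylock(), conv_read_unlock(), conv_write_trylock(), stop_conversion(), convert_page()) is hypothetical and exists only for illustration.

```c
#include <assert.h>
#include <errno.h>
#include <stdatomic.h>
#include <stdbool.h>

/*
 * Toy reader/writer state standing in for the kernel's mem_enc_lock:
 *   >= 0  number of conversions currently in flight (readers)
 *   -1    conversions stopped (the writer holds the lock for good)
 */
static atomic_int conv_state = 0;

/* Like down_read_trylock(): succeed unless a writer owns the lock. */
static bool conv_read_trylock(void)
{
	int cur = atomic_load(&conv_state);

	while (cur >= 0) {
		/* On failure, cur is reloaded and the loop re-checks it. */
		if (atomic_compare_exchange_weak(&conv_state, &cur, cur + 1))
			return true;
	}

	return false;
}

/* Like up_read(): must pair with a successful conv_read_trylock(). */
static void conv_read_unlock(void)
{
	int prev = atomic_fetch_sub(&conv_state, 1);

	assert(prev > 0);
}

/* Like down_write_trylock(): succeed only if no reader or writer is active. */
static bool conv_write_trylock(void)
{
	int expected = 0;

	return atomic_compare_exchange_strong(&conv_state, &expected, -1);
}

/*
 * Stop new conversions. The write side is intentionally never released.
 * With wait == false, failure means a conversion is racing with us.
 */
static bool stop_conversion(bool wait)
{
	if (!wait)
		return conv_write_trylock();

	while (!conv_write_trylock())
		;	/* busy-wait; the kernel sleeps in down_write() instead */

	return true;
}

/* Toy conversion: refuses with -EBUSY once conversions are stopped. */
static int convert_page(bool *shared, bool make_shared)
{
	if (!conv_read_trylock())
		return -EBUSY;

	*shared = make_shared;
	conv_read_unlock();

	return 0;
}
```

As in the patch, a conversion attempted after stop_conversion() has succeeded returns -EBUSY instead of blocking, and the non-waiting variant reports failure when a conversion is still in flight, which corresponds to the crash path's "Failed to stop shared<->private conversions" warning.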