[v19,007/130] x86/virt/tdx: Export SEAMCALL functions

Message ID	8f64043a6c393c017347bf8954d92b84b58603ec.1708933498.git.isaku.yamahata@intel.com (mailing list archive)
State	New, archived
Headers	show Received: from mgamail.intel.com (mgamail.intel.com [192.198.163.7]) (using TLSv1.2 with cipher ECDHE-RSA-AES256-GCM-SHA384 (256/256 bits)) (No client certificate requested) by smtp.subspace.kernel.org (Postfix) with ESMTPS id 064DA219FD; Mon, 26 Feb 2024 08:27:39 +0000 (UTC) From: isaku.yamahata@intel.com To: kvm@vger.kernel.org, linux-kernel@vger.kernel.org Cc: isaku.yamahata@intel.com, isaku.yamahata@gmail.com, Paolo Bonzini <pbonzini@redhat.com>, erdemaktas@google.com, Sean Christopherson <seanjc@google.com>, Sagi Shahar <sagis@google.com>, Kai Huang <kai.huang@intel.com>, chen.bo@intel.com, hang.yuan@intel.com, tina.zhang@intel.com Subject: [PATCH v19 007/130] x86/virt/tdx: Export SEAMCALL functions Date: Mon, 26 Feb 2024 00:25:09 -0800 Message-Id: <8f64043a6c393c017347bf8954d92b84b58603ec.1708933498.git.isaku.yamahata@intel.com> In-Reply-To: <cover.1708933498.git.isaku.yamahata@intel.com> References: <cover.1708933498.git.isaku.yamahata@intel.com> Precedence: bulk MIME-Version: 1.0 Content-Transfer-Encoding: 8bit
Series	[v19,001/130] x86/virt/tdx: Rename _offset to _member for TD_SYSINFO_MAP() macro \| expand [v19,001/130] x86/virt/tdx: Rename _offset to _member for TD_SYSINFO_MAP() macro [v19,002/130] x86/virt/tdx: Move TDMR metadata fields map table to local variable [v19,003/130] x86/virt/tdx: Unbind global metadata read with 'struct tdx_tdmr_sysinfo' [v19,004/130] x86/virt/tdx: Support global metadata read for all element sizes [v19,005/130] x86/virt/tdx: Export global metadata read infrastructure [v19,006/130] x86/virt/tdx: Export TDX KeyID information [v19,007/130] x86/virt/tdx: Export SEAMCALL functions [v19,008/130] x86/tdx: Warning with 32bit build shift-count-overflow [v19,009/130] KVM: x86: Add gmem hook for determining max NPT mapping level [v19,010/130] KVM: x86: Pass is_private to gmem hook of gmem_max_level [v19,011/130] KVM: Add new members to struct kvm_gfn_range to operate on [v19,012/130] KVM: x86/mmu: Pass around full 64-bit error code for the KVM page fault [v19,013/130] KVM: x86: Use PFERR_GUEST_ENC_MASK to indicate fault is private [v19,014/130] KVM: Add KVM vcpu ioctl to pre-populate guest memory [v19,015/130] KVM: Document KVM_MEMORY_MAPPING ioctl [v19,016/130] KVM: x86/mmu: Introduce kvm_mmu_map_tdp_page() for use by TDX [v19,017/130] KVM: x86: Implement kvm_arch_{, pre_}vcpu_memory_mapping() [v19,018/130] KVM: x86/mmu: Assume guest MMIOs are shared [v19,019/130] KVM: x86: Add is_vm_type_supported callback [v19,020/130] KVM: VMX: Move out vmx_x86_ops to 'main.c' to wrap VMX and TDX [v19,021/130] KVM: x86/vmx: initialize loaded_vmcss_on_cpu in vmx_init() [v19,022/130] KVM: x86/vmx: Refactor KVM VMX module init/exit functions [v19,023/130] KVM: TDX: Initialize the TDX module when loading the KVM intel kernel module [v19,024/130] KVM: TDX: Add placeholders for TDX VM/vcpu structure [v19,025/130] KVM: TDX: Make TDX VM type supported [v19,026/130,MARKER] The start of TDX KVM patch series: TDX architectural definitions [v19,027/130] KVM: TDX: Define TDX architectural definitions [v19,028/130] KVM: TDX: Add TDX "architectural" error codes [v19,029/130] KVM: TDX: Add C wrapper functions for SEAMCALLs to the TDX module [v19,030/130] KVM: TDX: Add helper functions to print TDX SEAMCALL error [v19,031/130,MARKER] The start of TDX KVM patch series: TD VM creation/destruction [v19,032/130] KVM: TDX: Add helper functions to allocate/free TDX private host key id [v19,033/130] KVM: TDX: Add helper function to read TDX metadata in array [v19,034/130] KVM: TDX: Get system-wide info about TDX module on initialization [v19,035/130] KVM: TDX: Add place holder for TDX VM specific mem_enc_op ioctl [v19,036/130] KVM: TDX: x86: Add ioctl to get TDX systemwide parameters [v19,037/130] KVM: TDX: Make KVM_CAP_MAX_VCPUS backend specific [v19,038/130] KVM: TDX: create/destroy VM structure [v19,039/130] KVM: TDX: initialize VM with TDX specific parameters [v19,040/130] KVM: TDX: Make pmu_intel.c ignore guest TD case [v19,041/130] KVM: TDX: Refuse to unplug the last cpu on the package [v19,042/130,MARKER] The start of TDX KVM patch series: TD vcpu creation/destruction [v19,043/130] KVM: TDX: create/free TDX vcpu structure [v19,044/130] KVM: TDX: Do TDX specific vcpu initialization [v19,045/130,MARKER] The start of TDX KVM patch series: KVM MMU GPA shared bits [v19,046/130] KVM: x86/mmu: Add address conversion functions for TDX shared bit of GPA [v19,047/130,MARKER] The start of TDX KVM patch series: KVM TDP refactoring for TDX [v19,048/130] KVM: Allow page-sized MMU caches to be initialized with custom 64-bit values [v19,049/130] KVM: x86/mmu: Replace hardcoded value 0 for the initial value for SPTE [v19,050/130] KVM: x86/mmu: Allow non-zero value for non-present SPTE and removed SPTE [v19,051/130] KVM: x86/mmu: Add Suppress VE bit to shadow_mmio_mask/shadow_present_mask [v19,052/130] KVM: x86/mmu: Track shadow MMIO value on a per-VM basis [v19,053/130] KVM: x86/mmu: Disallow fast page fault on private GPA [v19,054/130] KVM: VMX: Introduce test mode related to EPT violation VE [v19,055/130,MARKER] The start of TDX KVM patch series: KVM TDP MMU hooks [v19,056/130] KVM: x86/tdp_mmu: Init role member of struct kvm_mmu_page at allocation [v19,057/130] KVM: x86/mmu: Add a new is_private member for union kvm_mmu_page_role [v19,058/130] KVM: x86/mmu: Add a private pointer to struct kvm_mmu_page [v19,059/130] KVM: x86/tdp_mmu: Don't zap private pages for unsupported cases [v19,060/130] KVM: x86/tdp_mmu: Apply mmu notifier callback to only shared GPA [v19,061/130] KVM: x86/tdp_mmu: Sprinkle __must_check [v19,062/130] KVM: x86/tdp_mmu: Support TDX private mapping for TDP MMU [v19,063/130,MARKER] The start of TDX KVM patch series: TDX EPT violation [v19,064/130] KVM: x86/mmu: Do not enable page track for TD guest [v19,065/130] KVM: VMX: Split out guts of EPT violation to common/exposed function [v19,066/130] KVM: TDX: Add accessors VMX VMCS helpers [v19,067/130] KVM: TDX: Add load_mmu_pgd method for TDX [v19,068/130] KVM: TDX: Retry seamcall when TDX_OPERAND_BUSY with operand SEPT [v19,069/130] KVM: TDX: Require TDP MMU and mmio caching for TDX [v19,070/130] KVM: TDX: TDP MMU TDX support [v19,071/130] KVM: TDX: MTRR: implement get_mt_mask() for TDX [v19,072/130,MARKER] The start of TDX KVM patch series: TD finalization [v19,073/130] KVM: x86: Add hooks in kvm_arch_vcpu_memory_mapping() [v19,074/130] KVM: TDX: Create initial guest memory [v19,075/130] KVM: TDX: Extend memory measurement with initial guest memory [v19,076/130] KVM: TDX: Finalize VM initialization [v19,077/130,MARKER] The start of TDX KVM patch series: TD vcpu enter/exit [v19,078/130] KVM: TDX: Implement TDX vcpu enter/exit path [v19,079/130] KVM: TDX: vcpu_run: save/restore host state(host kernel gs) [v19,080/130] KVM: TDX: restore host xsave state when exit from the guest TD [v19,081/130] KVM: x86: Allow to update cached values in kvm_user_return_msrs w/o wrmsr [v19,082/130] KVM: TDX: restore user ret MSRs [v19,083/130] KVM: TDX: Add TSX_CTRL msr into uret_msrs list [v19,084/130,MARKER] The start of TDX KVM patch series: TD vcpu exits/interrupts/hypercalls [v19,085/130] KVM: TDX: Complete interrupts after tdexit [v19,086/130] KVM: TDX: restore debug store when TD exit [v19,087/130] KVM: TDX: handle vcpu migration over logical processor [v19,088/130] KVM: x86: Add a switch_db_regs flag to handle TDX's auto-switched behavior [v19,089/130] KVM: TDX: Add support for find pending IRQ in a protected local APIC [v19,090/130] KVM: x86: Assume timer IRQ was injected if APIC state is proteced [v19,091/130] KVM: TDX: remove use of struct vcpu_vmx from posted_interrupt.c [v19,092/130] KVM: TDX: Implement interrupt injection [v19,093/130] KVM: TDX: Implements vcpu request_immediate_exit [v19,094/130] KVM: TDX: Implement methods to inject NMI [v19,095/130] KVM: VMX: Modify NMI and INTR handlers to take intr_info as function argument [v19,096/130] KVM: VMX: Move NMI/exception handler to common helper [v19,097/130] KVM: x86: Split core of hypercall emulation to helper function [v19,098/130] KVM: TDX: Add a place holder to handle TDX VM exit [v19,099/130] KVM: TDX: Handle vmentry failure for INTEL TD guest [v19,100/130] KVM: TDX: handle EXIT_REASON_OTHER_SMI [v19,101/130] KVM: TDX: handle ept violation/misconfig exit [v19,102/130] KVM: TDX: handle EXCEPTION_NMI and EXTERNAL_INTERRUPT [v19,103/130] KVM: TDX: Handle EXIT_REASON_OTHER_SMI with MSMI [v19,104/130] KVM: TDX: Add a place holder for handler of TDX hypercalls (TDG.VP.VMCALL) [v19,105/130] KVM: TDX: handle KVM hypercall with TDG.VP.VMCALL [v19,106/130] KVM: TDX: Add KVM Exit for TDX TDG.VP.VMCALL [v19,107/130] KVM: TDX: Handle TDX PV CPUID hypercall [v19,108/130] KVM: TDX: Handle TDX PV HLT hypercall [v19,109/130] KVM: TDX: Handle TDX PV port io hypercall [v19,110/130] KVM: TDX: Handle TDX PV MMIO hypercall [v19,111/130] KVM: TDX: Implement callbacks for MSR operations for TDX [v19,112/130] KVM: TDX: Handle TDX PV rdmsr/wrmsr hypercall [v19,113/130] KVM: TDX: Handle MSR MTRRCap and MTRRDefType access [v19,114/130] KVM: TDX: Handle MSR IA32_FEAT_CTL MSR and IA32_MCG_EXT_CTL [v19,115/130] KVM: TDX: Handle TDG.VP.VMCALL<GetTdVmCallInfo> hypercall [v19,116/130] KVM: TDX: Silently discard SMI request [v19,117/130] KVM: TDX: Silently ignore INIT/SIPI [v19,118/130] KVM: TDX: Add methods to ignore accesses to CPU state [v19,119/130] KVM: TDX: Add methods to ignore guest instruction emulation [v19,120/130] KVM: TDX: Add a method to ignore dirty logging [v19,121/130] KVM: TDX: Add methods to ignore VMX preemption timer [v19,122/130] KVM: TDX: Add methods to ignore accesses to TSC [v19,123/130] KVM: TDX: Ignore setting up mce [v19,124/130] KVM: TDX: Add a method to ignore for TDX to ignore hypercall patch [v19,125/130] KVM: TDX: Add methods to ignore virtual apic related operation [v19,126/130] KVM: TDX: Inhibit APICv for TDX guest [v19,127/130] Documentation/virt/kvm: Document on Trust Domain Extensions(TDX) [v19,128/130] KVM: x86: design documentation on TDX support of x86 KVM TDP MMU [v19,129/130] RFC: KVM: x86: Add x86 callback to check cpuid [v19,130/130] RFC: KVM: x86, TDX: Add check for KVM_SET_CPUID2

Isaku Yamahata Feb. 26, 2024, 8:25 a.m. UTC

From: Kai Huang <kai.huang@intel.com>

KVM will need to make SEAMCALLs to create and run TDX guests.  Export
SEAMCALL functions for KVM to use.

Also add declaration of SEAMCALL functions to <asm/asm-prototypes.h> to
support CONFIG_MODVERSIONS=y.

Signed-off-by: Kai Huang <kai.huang@intel.com>
Signed-off-by: Isaku Yamahata <isaku.yamahata@intel.com>
---
 arch/x86/include/asm/asm-prototypes.h | 1 +
 arch/x86/virt/vmx/tdx/seamcall.S      | 4 ++++
 2 files changed, 5 insertions(+)

Huang, Kai March 15, 2024, 12:02 a.m. UTC | #1

On 26/02/2024 9:25 pm, isaku.yamahata@intel.com wrote:
> From: Kai Huang <kai.huang@intel.com>
> 
> KVM will need to make SEAMCALLs to create and run TDX guests.  Export
> SEAMCALL functions for KVM to use.
> 

Could you also list the reason that we want to expose __seamcall() 
directly, rather than wanting to put some higher level wrappers in the 
TDX host code, and export them?

For example, we can give a summary of the SEAMCALLs (e.g., how many in 
total, and roughly introduce them based on categories) that will be used 
by KVM, and clarify the reasons why we want to just export __seamcall().

E.g., we can say something like this:

TD;LR:

KVM roughly will need to use dozens of SEAMCALLs, and all these are 
logically related to creating and running TDX guests.  It makes more 
sense to just export __seamcall() and let KVM maintain these VM-related 
wrappers rather than having the TDX host code to provide wrappers for 
each SEAMCALL or higher-level abstraction.

Long version:

You give a detailed explanation of SEAMCALLs that will be used by KVM, 
and clarify logically it's better to manage these code in KVM.

Edgecombe, Rick P March 15, 2024, 1:17 a.m. UTC | #2

Dave, care to weigh in here?

Context (whether to export the generic __seamcall for KVM's use):
https://lore.kernel.org/lkml/e6e8f585-b718-4f53-88f6-89832a1e4b9f@intel.com/

On Fri, 2024-03-15 at 13:02 +1300, Huang, Kai wrote:
> KVM roughly will need to use dozens of SEAMCALLs, and all these are 
> logically related to creating and running TDX guests.  It makes more 
> sense to just export __seamcall() and let KVM maintain these VM-
> related 
> wrappers rather than having the TDX host code to provide wrappers for
> each SEAMCALL or higher-level abstraction.

The helpers being discussed are these:
https://lore.kernel.org/lkml/7cfd33d896fce7b49bcf4b7179d0ded22c06b8c2.1708933498.git.isaku.yamahata@intel.com/

I guess there are three options:
1. Export the low level seamcall function
2. Export a bunch of higher level helper functions
3. Duplicate __seamcall asm in KVM

Letting modules make unrestricted seamcalls is not ideal. Preventing
the compiler from inlining the small logic in the static inline helpers
is not ideal. Duplicating code is not ideal. Hmm.

I want to say 2 sounds the least worst of the three. But I'm not sure.
I'm not sure if x86 folks would like to police new seamcalls, or be
bothered by it, either.

Dave Hansen March 15, 2024, 1:33 a.m. UTC | #3

On 3/14/24 18:17, Edgecombe, Rick P wrote:
> I guess there are three options:
> 1. Export the low level seamcall function
> 2. Export a bunch of higher level helper functions
> 3. Duplicate __seamcall asm in KVM
> 
> Letting modules make unrestricted seamcalls is not ideal. Preventing
> the compiler from inlining the small logic in the static inline helpers
> is not ideal. Duplicating code is not ideal. Hmm.
> 
> I want to say 2 sounds the least worst of the three. But I'm not sure.
> I'm not sure if x86 folks would like to police new seamcalls, or be
> bothered by it, either.

#3 is the only objectively awful one. :)

In the end, we actually _want_ to have conversations about these things.
 There are going to be considerations about what functionality should be
in KVM or the core kernel.  We don't want KVM doing any calls that could
affect global TDX module state, for instance.

But I'd also defer to the KVM maintainers on this.  They're the ones
that have to play the symbol exporting game a lot more than I ever do.
If they cringe at the idea of adding 20 (or whatever) exports, then
that's a lot more important than the possibility of some other silly
module abusing the generic exported __seamcall.

Sean Christopherson March 15, 2024, 4:33 p.m. UTC | #4

On Thu, Mar 14, 2024, Dave Hansen wrote:
> On 3/14/24 18:17, Edgecombe, Rick P wrote:
> > I guess there are three options:
> > 1. Export the low level seamcall function
> > 2. Export a bunch of higher level helper functions
> > 3. Duplicate __seamcall asm in KVM
> > 
> > Letting modules make unrestricted seamcalls is not ideal. Preventing
> > the compiler from inlining the small logic in the static inline helpers
> > is not ideal. Duplicating code is not ideal. Hmm.
> > 
> > I want to say 2 sounds the least worst of the three. But I'm not sure.
> > I'm not sure if x86 folks would like to police new seamcalls, or be
> > bothered by it, either.
> 
> #3 is the only objectively awful one. :)
> 
> In the end, we actually _want_ to have conversations about these things.
>  There are going to be considerations about what functionality should be
> in KVM or the core kernel.  We don't want KVM doing any calls that could
> affect global TDX module state, for instance.

Heh, Like this one?

        static inline u64 tdh_sys_lp_shutdown(void)
        {
        	struct tdx_module_args in = {
        	};
        
        	return tdx_seamcall(TDH_SYS_LP_SHUTDOWN, &in, NULL);
        }

Which isn't actually used...

> But I'd also defer to the KVM maintainers on this.  They're the ones
> that have to play the symbol exporting game a lot more than I ever do.
> If they cringe at the idea of adding 20 (or whatever) exports, then
> that's a lot more important than the possibility of some other silly
> module abusing the generic exported __seamcall.

I don't care much about exports.  What I do care about is sane code, and while
the current code _looks_ pretty, it's actually quite insane.

I get why y'all put SEAMCALL in assembly subroutines; the macro shenanigans I
originally wrote years ago were their own brand of crazy, and dealing with GPRs
that can't be asm() constraints often results in brittle code.

But the tdx_module_args structure approach generates truly atrocious code.  Yes,
SEAMCALL is inherently slow, but that doesn't mean that we shouldn't at least try
to generate efficient code.  And it's not just efficiency that is lost, the
generated code ends up being much harder to read than it ought to be.

E.g. the seemingly simple

        static inline u64 tdh_mem_page_remove(hpa_t tdr, gpa_t gpa, int level,
        				      struct tdx_module_args *out)
        {
        	struct tdx_module_args in = {
        		.rcx = gpa | level,
        		.rdx = tdr,
        	};

        	return tdx_seamcall_sept(TDH_MEM_PAGE_REMOVE, &in, out);
        }

generates the below monstrosity with gcc-13.  And that's just one SEAMCALL wrapper,
*every* single one generates the same mess.  clang-16 is kinda sorta a little
better, as it at least inlines the helpers that have single callers.

So my feedback is to not worry about the exports, and instead focus on figuring
out a way to make the generated code less bloated and easier to read/debug.

Dump of assembler code for function tdh_mem_page_remove:
   0x0000000000032b20 <+0>:	push   %r15
   0x0000000000032b22 <+2>:	xor    %eax,%eax
   0x0000000000032b24 <+4>:	movabs $0x8000ff0000000006,%r15
   0x0000000000032b2e <+14>:	push   %r14
   0x0000000000032b30 <+16>:	mov    %rcx,%r14
   0x0000000000032b33 <+19>:	mov    $0xb,%ecx
   0x0000000000032b38 <+24>:	push   %r13
   0x0000000000032b3a <+26>:	movslq %edx,%r13
   0x0000000000032b3d <+29>:	push   %r12
   0x0000000000032b3f <+31>:	or     %rsi,%r13
   0x0000000000032b42 <+34>:	mov    $0x11,%r12d
   0x0000000000032b48 <+40>:	push   %rbp
   0x0000000000032b49 <+41>:	movabs $0x8000020300000000,%rbp
   0x0000000000032b53 <+51>:	push   %rbx
   0x0000000000032b54 <+52>:	sub    $0x70,%rsp
   0x0000000000032b58 <+56>:	mov    %rdi,(%rsp)
   0x0000000000032b5c <+60>:	lea    0x18(%rsp),%rdi
   0x0000000000032b61 <+65>:	rep stos %rax,%es:(%rdi)
   0x0000000000032b64 <+68>:	mov    (%rsp),%rax
   0x0000000000032b68 <+72>:	mov    %r13,(%r14)
   0x0000000000032b6b <+75>:	mov    $0xa,%ebx
   0x0000000000032b70 <+80>:	mov    %rax,0x8(%r14)
   0x0000000000032b74 <+84>:	mov    0x18(%rsp),%rax
   0x0000000000032b79 <+89>:	mov    %rax,0x10(%r14)
   0x0000000000032b7d <+93>:	mov    0x20(%rsp),%rax
   0x0000000000032b82 <+98>:	mov    %rax,0x18(%r14)
   0x0000000000032b86 <+102>:	mov    0x28(%rsp),%rax
   0x0000000000032b8b <+107>:	mov    %rax,0x20(%r14)
   0x0000000000032b8f <+111>:	mov    0x30(%rsp),%rax
   0x0000000000032b94 <+116>:	mov    %rax,0x28(%r14)
   0x0000000000032b98 <+120>:	mov    0x38(%rsp),%rax
   0x0000000000032b9d <+125>:	mov    %rax,0x30(%r14)
   0x0000000000032ba1 <+129>:	mov    0x40(%rsp),%rax
   0x0000000000032ba6 <+134>:	mov    %rax,0x38(%r14)
   0x0000000000032baa <+138>:	mov    0x48(%rsp),%rax
   0x0000000000032baf <+143>:	mov    %rax,0x40(%r14)
   0x0000000000032bb3 <+147>:	mov    0x50(%rsp),%rax
   0x0000000000032bb8 <+152>:	mov    %rax,0x48(%r14)
   0x0000000000032bbc <+156>:	mov    0x58(%rsp),%rax
   0x0000000000032bc1 <+161>:	mov    %rax,0x50(%r14)
   0x0000000000032bc5 <+165>:	mov    0x60(%rsp),%rax
   0x0000000000032bca <+170>:	mov    %rax,0x58(%r14)
   0x0000000000032bce <+174>:	mov    0x68(%rsp),%rax
   0x0000000000032bd3 <+179>:	mov    %rax,0x60(%r14)
   0x0000000000032bd7 <+183>:	mov    %r14,%rsi
   0x0000000000032bda <+186>:	mov    $0x1d,%edi
   0x0000000000032bdf <+191>:	call   0x32be4 <tdh_mem_page_remove+196>
   0x0000000000032be4 <+196>:	cmp    %rbp,%rax
   0x0000000000032be7 <+199>:	jne    0x32bfd <tdh_mem_page_remove+221>
   0x0000000000032be9 <+201>:	sub    $0x1,%ebx
   0x0000000000032bec <+204>:	jne    0x32bd7 <tdh_mem_page_remove+183>
   0x0000000000032bee <+206>:	add    $0x70,%rsp
   0x0000000000032bf2 <+210>:	pop    %rbx
   0x0000000000032bf3 <+211>:	pop    %rbp
   0x0000000000032bf4 <+212>:	pop    %r12
   0x0000000000032bf6 <+214>:	pop    %r13
   0x0000000000032bf8 <+216>:	pop    %r14
   0x0000000000032bfa <+218>:	pop    %r15
   0x0000000000032bfc <+220>:	ret
   0x0000000000032bfd <+221>:	cmp    %r15,%rax
   0x0000000000032c00 <+224>:	je     0x32c2a <tdh_mem_page_remove+266>
   0x0000000000032c02 <+226>:	movabs $0x8000020000000092,%rdx
   0x0000000000032c0c <+236>:	cmp    %rdx,%rax
   0x0000000000032c0f <+239>:	jne    0x32bee <tdh_mem_page_remove+206>
   0x0000000000032c11 <+241>:	sub    $0x1,%r12d
   0x0000000000032c15 <+245>:	jne    0x32b64 <tdh_mem_page_remove+68>
   0x0000000000032c1b <+251>:	add    $0x70,%rsp
   0x0000000000032c1f <+255>:	pop    %rbx
   0x0000000000032c20 <+256>:	pop    %rbp
   0x0000000000032c21 <+257>:	pop    %r12
   0x0000000000032c23 <+259>:	pop    %r13
   0x0000000000032c25 <+261>:	pop    %r14
   0x0000000000032c27 <+263>:	pop    %r15
   0x0000000000032c29 <+265>:	ret
   0x0000000000032c2a <+266>:	call   0x32c2f <tdh_mem_page_remove+271>
   0x0000000000032c2f <+271>:	xor    %eax,%eax
   0x0000000000032c31 <+273>:	add    $0x70,%rsp
   0x0000000000032c35 <+277>:	pop    %rbx
   0x0000000000032c36 <+278>:	pop    %rbp
   0x0000000000032c37 <+279>:	pop    %r12
   0x0000000000032c39 <+281>:	pop    %r13
   0x0000000000032c3b <+283>:	pop    %r14
   0x0000000000032c3d <+285>:	pop    %r15
   0x0000000000032c3f <+287>:	ret
End of assembler dump.

Sean Christopherson March 15, 2024, 5:46 p.m. UTC | #5

On Fri, Mar 15, 2024, Sean Christopherson wrote:
> So my feedback is to not worry about the exports, and instead focus on figuring
> out a way to make the generated code less bloated and easier to read/debug.

Oh, and please make it a collaborative, public effort.  I don't want to hear
crickets and then see v20 dropped with a completely new SEAMCALL scheme.

Edgecombe, Rick P March 15, 2024, 5:48 p.m. UTC | #6

On Fri, 2024-03-15 at 09:33 -0700, Sean Christopherson wrote:
> Heh, Like this one?
> 
>         static inline u64 tdh_sys_lp_shutdown(void)
>         {
>                 struct tdx_module_args in = {
>                 };
>         
>                 return tdx_seamcall(TDH_SYS_LP_SHUTDOWN, &in, NULL);
>         }
> 
> Which isn't actually used...

Looks like is was turned into a NOP in TDX 1.5. So will even forever be
dead code. I see one other that is unused. Thanks for pointing it out.

> 
> > But I'd also defer to the KVM maintainers on this.  They're the
> > ones
> > that have to play the symbol exporting game a lot more than I ever
> > do.
> > If they cringe at the idea of adding 20 (or whatever) exports, then
> > that's a lot more important than the possibility of some other
> > silly
> > module abusing the generic exported __seamcall.
> 
> I don't care much about exports.  What I do care about is sane code,
> and while
> the current code _looks_ pretty, it's actually quite insane.
> 
> I get why y'all put SEAMCALL in assembly subroutines; the macro
> shenanigans I
> originally wrote years ago were their own brand of crazy, and dealing
> with GPRs
> that can't be asm() constraints often results in brittle code.

I guess it must be this, for the initiated:
https://lore.kernel.org/lkml/25f0d2c2f73c20309a1b578cc5fc15f4fd6b9a13.1605232743.git.isaku.yamahata@intel.com/

> 
> But the tdx_module_args structure approach generates truly atrocious
> code.  Yes,
> SEAMCALL is inherently slow, but that doesn't mean that we shouldn't
> at least try
> to generate efficient code.  And it's not just efficiency that is
> lost, the
> generated code ends up being much harder to read than it ought to be.
> 
> 
[snip]
> 
> So my feedback is to not worry about the exports, and instead focus
> on figuring
> out a way to make the generated code less bloated and easier to
> read/debug.
> 

Thanks for the feedback both! It sounds like everyone is flexible on
the exports. As for the generated code, oof.

Kai, I see the solution has gone through some iterations already. First
the macro one linked above, then that was dropped pretty quick to
something that loses the asm constraints:
https://lore.kernel.org/lkml/e777bbbe10b1ec2c37d85dcca2e175fe3bc565ec.1625186503.git.isaku.yamahata@intel.com/

Then next the struct grew here, and here:
https://lore.kernel.org/linux-mm/20230628211132.GS38236@hirez.programming.kicks-ass.net/
https://lore.kernel.org/linux-mm/20230630102141.GA2534364@hirez.programming.kicks-ass.net/

Not sure I understand all of the constraints yet. Do you have any
ideas?

Edgecombe, Rick P March 15, 2024, 5:51 p.m. UTC | #7

On Fri, 2024-03-15 at 10:46 -0700, Sean Christopherson wrote:
> On Fri, Mar 15, 2024, Sean Christopherson wrote:
> > So my feedback is to not worry about the exports, and instead focus
> > on figuring
> > out a way to make the generated code less bloated and easier to
> > read/debug.
> 
> Oh, and please make it a collaborative, public effort.  I don't want
> to hear
> crickets and then see v20 dropped with a completely new SEAMCALL
> scheme.

And here we we're worrying that people might eventually grow tired of
us adding mails to v19 and we debate every detail in public. Will do.

Dave Hansen March 15, 2024, 6:28 p.m. UTC | #8

On 3/15/24 09:33, Sean Christopherson wrote:
>         static inline u64 tdh_mem_page_remove(hpa_t tdr, gpa_t gpa, int level,
>         				      struct tdx_module_args *out)
>         {
>         	struct tdx_module_args in = {
>         		.rcx = gpa | level,
>         		.rdx = tdr,
>         	};
> 
>         	return tdx_seamcall_sept(TDH_MEM_PAGE_REMOVE, &in, out);
>         }
> 
> generates the below monstrosity with gcc-13.  And that's just one SEAMCALL wrapper,
> *every* single one generates the same mess.  clang-16 is kinda sorta a little
> better, as it at least inlines the helpers that have single callers.

Yeah, that's really awful.

Is all the inlining making the compiler too ambitious?  Why is this all
inlined in the first place?

tdh_mem_page_remove() _should_ just be logically:

	* initialize tdx_module_args.  Move a few things into place on
	  the stack and zero the rest.
	* Put a pointer to tdx_module_args in a register
	* Put TDH_MEM_PAGE_REMOVE immediate in a register
	* Some register preservation, maybe
	* call
	* maybe some cleanup
	* return

Those logical things are *NOT* easy to spot in the disassembly.

Sean Christopherson March 15, 2024, 7:20 p.m. UTC | #9

On Fri, Mar 15, 2024, Rick P Edgecombe wrote:
> On Fri, 2024-03-15 at 10:46 -0700, Sean Christopherson wrote:
> > On Fri, Mar 15, 2024, Sean Christopherson wrote:
> > > So my feedback is to not worry about the exports, and instead focus on
> > > figuring out a way to make the generated code less bloated and easier to
> > > read/debug.
> > 
> > Oh, and please make it a collaborative, public effort.  I don't want to
> > hear crickets and then see v20 dropped with a completely new SEAMCALL
> > scheme.
> 
> And here we we're worrying that people might eventually grow tired of
> us adding mails to v19 and we debate every detail in public. Will do.

As a general rule, I _strongly_ prefer all review to be done on-list, in public.
Copy+pasting myself from another Intel series[*]

 : Correct, what I object to is Intel _requiring_ a Reviewed-by before posting.
 : 
 : And while I'm certainly not going to refuse patches that have been reviewed
 : internally, I _strongly_ prefer reviews be on-list so that they are public and
 : recorded.  Being able to go back and look at the history and evolution of patches
 : is valuable, and the discussion itself is often beneficial to non-participants,
 : e.g. people that are new-ish to KVM and/or aren't familiar with the feature being
 : enabled can often learn new things and avoid similar pitfalls of their own.

There are definitely situations where exceptions are warranted, e.g. if someone
is a first-time poster and/or wants a sanity check to make sure their idea isn't
completely crazy.  But even then, the internal review should only be very cursory.

In addition to the history being valuable, doing reviews in public minimizes the
probability of a developer being led astray, e.g. due to someone internally saying
do XYZ, and then upstream reviewers telling them to do something entirely different. 

As far as noise goes, look at it this way.  Every time a new TDX series is posted,
I get 130+ emails.  Y'all can do a _lot_ of public review and discussion before
you'll get anywhere near the point where it'd be noiser than spinning a new version
of the series.

[*] https://lore.kernel.org/all/Y+ZxLfCrcTQ6poYg@google.com

Sean Christopherson March 15, 2024, 7:38 p.m. UTC | #10

On Fri, Mar 15, 2024, Dave Hansen wrote:
> On 3/15/24 09:33, Sean Christopherson wrote:
> >         static inline u64 tdh_mem_page_remove(hpa_t tdr, gpa_t gpa, int level,
> >         				      struct tdx_module_args *out)
> >         {
> >         	struct tdx_module_args in = {
> >         		.rcx = gpa | level,
> >         		.rdx = tdr,
> >         	};
> > 
> >         	return tdx_seamcall_sept(TDH_MEM_PAGE_REMOVE, &in, out);
> >         }
> > 
> > generates the below monstrosity with gcc-13.  And that's just one SEAMCALL wrapper,
> > *every* single one generates the same mess.  clang-16 is kinda sorta a little
> > better, as it at least inlines the helpers that have single callers.
> 
> Yeah, that's really awful.
> 
> Is all the inlining making the compiler too ambitious?

No, whether or not the wrappers are inlined doesn't change anything.  gcc actually
doesn't inline any of these helpers.  More below.

> Why is this all inlined in the first place?

Likely because no one looked at the generated code.  The C code is super simple
and looks like it should be inlined.

And the very original code was macro heavy, i.e. relied on inlining to allow the
compiler to precisely set only the actual registers needed for the SEAMCALL.

> tdh_mem_page_remove() _should_ just be logically:
> 
> 	* initialize tdx_module_args.  Move a few things into place on
> 	  the stack and zero the rest.

The "zero the rest" is what generates the fugly code.  The underlying problem is
that the SEAMCALL assembly functions unpack _all_ registers from tdx_module_args.
As a result, tdx_module_args needs to be zeroed to avoid loading registers with
unitialized stack data.

E.g. before I looked at the assembly code, my initial thought to clean things
up by doing:

	struct tdx_module_args in;

	in.rcx = gpa | level;
	in.rdx = tdr;

but that would make one or more sanitizers (and maybe even the compiler itself)
more than a bit unhappy.

The struct is 72 bytes, which adds up to a lot of wasted effort since the majority
of SEAMCALLs only use a few of the 13 registers.

FWIW, the guest side of TDX is equally gross.  E.g. to do kvm_hypercall1(), which
without TDX is simply

   0xffffffff810eb4a2 <+82>:	48 c7 c0 8c 9a 01 00	mov    $0x19a8c,%rax
   0xffffffff810eb4a9 <+89>:	8b 1c 02           	mov    (%rdx,%rax,1),%ebx
   0xffffffff810eb4ac <+92>:	b8 0b 00 00 00     	mov    $0xb,%eax
   0xffffffff810eb4b1 <+97>:	0f 01 c1           	vmcall
   0xffffffff810eb4b4 <+100>:	5b                 	pop    %rbx
   0xffffffff810eb4b5 <+101>:	5d                 	pop    %rbp

the kernel blasts the unused params

   0xffffffff810ffdc1 <+97>:	bf 0b 00 00 00     	mov    $0xb,%edi
   0xffffffff810ffdc6 <+102>:	31 d2              	xor    %edx,%edx
   0xffffffff810ffdc8 <+104>:	31 c9              	xor    %ecx,%ecx
   0xffffffff810ffdca <+106>:	45 31 c0           	xor    %r8d,%r8d
   0xffffffff810ffdcd <+109>:	5b                 	pop    %rbx
   0xffffffff810ffdce <+110>:	41 5e              	pop    %r14
   0xffffffff810ffdd0 <+112>:	e9 bb 1b f0 ff     	jmp    0xffffffff81001990 <tdx_kvm_hypercall>

then loads and zeros a ton of memory (tdx_kvm_hypercall()):

   0xffffffff81001990 <+0>:	nopl   0x0(%rax,%rax,1)
   0xffffffff81001995 <+5>:	sub    $0x70,%rsp
   0xffffffff81001999 <+9>:	mov    %gs:0x28,%rax
   0xffffffff810019a2 <+18>:	mov    %rax,0x68(%rsp)
   0xffffffff810019a7 <+23>:	mov    %edi,%eax
   0xffffffff810019a9 <+25>:	movq   $0x0,0x18(%rsp)
   0xffffffff810019b2 <+34>:	movq   $0x0,0x10(%rsp)
   0xffffffff810019bb <+43>:	movq   $0x0,0x8(%rsp)
   0xffffffff810019c4 <+52>:	movq   $0x0,(%rsp)
   0xffffffff810019cc <+60>:	mov    %rax,0x20(%rsp)
   0xffffffff810019d1 <+65>:	mov    %rsi,0x28(%rsp)
   0xffffffff810019d6 <+70>:	mov    %rdx,0x30(%rsp)
   0xffffffff810019db <+75>:	mov    %rcx,0x38(%rsp)
   0xffffffff810019e0 <+80>:	mov    %r8,0x40(%rsp)
   0xffffffff810019e5 <+85>:	movq   $0x0,0x48(%rsp)
   0xffffffff810019ee <+94>:	movq   $0x0,0x50(%rsp)
   0xffffffff810019f7 <+103>:	movq   $0x0,0x58(%rsp)
   0xffffffff81001a00 <+112>:	movq   $0x0,0x60(%rsp)
   0xffffffff81001a09 <+121>:	mov    %rsp,%rdi
   0xffffffff81001a0c <+124>:	call   0xffffffff819f0a80 <__tdx_hypercall>
   0xffffffff81001a11 <+129>:	mov    %gs:0x28,%rcx
   0xffffffff81001a1a <+138>:	cmp    0x68(%rsp),%rcx
   0xffffffff81001a1f <+143>:	jne    0xffffffff81001a26 <tdx_kvm_hypercall+150>
   0xffffffff81001a21 <+145>:	add    $0x70,%rsp
   0xffffffff81001a25 <+149>:	ret

and then unpacks all of that memory back into registers, and reverses that last
part on the way back, (__tdcall_saved_ret()):

   0xffffffff819f0b10 <+0>:	mov    %rdi,%rax
   0xffffffff819f0b13 <+3>:	mov    (%rsi),%rcx
   0xffffffff819f0b16 <+6>:	mov    0x8(%rsi),%rdx
   0xffffffff819f0b1a <+10>:	mov    0x10(%rsi),%r8
   0xffffffff819f0b1e <+14>:	mov    0x18(%rsi),%r9
   0xffffffff819f0b22 <+18>:	mov    0x20(%rsi),%r10
   0xffffffff819f0b26 <+22>:	mov    0x28(%rsi),%r11
   0xffffffff819f0b2a <+26>:	push   %rbx
   0xffffffff819f0b2b <+27>:	push   %r12
   0xffffffff819f0b2d <+29>:	push   %r13
   0xffffffff819f0b2f <+31>:	push   %r14
   0xffffffff819f0b31 <+33>:	push   %r15
   0xffffffff819f0b33 <+35>:	mov    0x30(%rsi),%r12
   0xffffffff819f0b37 <+39>:	mov    0x38(%rsi),%r13
   0xffffffff819f0b3b <+43>:	mov    0x40(%rsi),%r14
   0xffffffff819f0b3f <+47>:	mov    0x48(%rsi),%r15
   0xffffffff819f0b43 <+51>:	mov    0x50(%rsi),%rbx
   0xffffffff819f0b47 <+55>:	push   %rsi
   0xffffffff819f0b48 <+56>:	mov    0x58(%rsi),%rdi
   0xffffffff819f0b4c <+60>:	mov    0x60(%rsi),%rsi
   0xffffffff819f0b50 <+64>:	tdcall
   0xffffffff819f0b54 <+68>:	push   %rax
   0xffffffff819f0b55 <+69>:	mov    0x8(%rsp),%rax
   0xffffffff819f0b5a <+74>:	mov    %rsi,0x60(%rax)
   0xffffffff819f0b5e <+78>:	pop    %rax
   0xffffffff819f0b5f <+79>:	pop    %rsi
   0xffffffff819f0b60 <+80>:	mov    %r12,0x30(%rsi)
   0xffffffff819f0b64 <+84>:	mov    %r13,0x38(%rsi)
   0xffffffff819f0b68 <+88>:	mov    %r14,0x40(%rsi)
   0xffffffff819f0b6c <+92>:	mov    %r15,0x48(%rsi)
   0xffffffff819f0b70 <+96>:	mov    %rbx,0x50(%rsi)
   0xffffffff819f0b74 <+100>:	mov    %rdi,0x58(%rsi)
   0xffffffff819f0b78 <+104>:	mov    %rcx,(%rsi)
   0xffffffff819f0b7b <+107>:	mov    %rdx,0x8(%rsi)
   0xffffffff819f0b7f <+111>:	mov    %r8,0x10(%rsi)
   0xffffffff819f0b83 <+115>:	mov    %r9,0x18(%rsi)
   0xffffffff819f0b87 <+119>:	mov    %r10,0x20(%rsi)
   0xffffffff819f0b8b <+123>:	mov    %r11,0x28(%rsi)
   0xffffffff819f0b8f <+127>:	xor    %ecx,%ecx
   0xffffffff819f0b91 <+129>:	xor    %edx,%edx
   0xffffffff819f0b93 <+131>:	xor    %r8d,%r8d
   0xffffffff819f0b96 <+134>:	xor    %r9d,%r9d
   0xffffffff819f0b99 <+137>:	xor    %r10d,%r10d
   0xffffffff819f0b9c <+140>:	xor    %r11d,%r11d
   0xffffffff819f0b9f <+143>:	xor    %r12d,%r12d
   0xffffffff819f0ba2 <+146>:	xor    %r13d,%r13d
   0xffffffff819f0ba5 <+149>:	xor    %r14d,%r14d
   0xffffffff819f0ba8 <+152>:	xor    %r15d,%r15d
   0xffffffff819f0bab <+155>:	xor    %ebx,%ebx
   0xffffffff819f0bad <+157>:	xor    %edi,%edi
   0xffffffff819f0baf <+159>:	pop    %r15
   0xffffffff819f0bb1 <+161>:	pop    %r14
   0xffffffff819f0bb3 <+163>:	pop    %r13
   0xffffffff819f0bb5 <+165>:	pop    %r12
   0xffffffff819f0bb7 <+167>:	pop    %rbx
   0xffffffff819f0bb8 <+168>:	ret

It's honestly quite amusing, because y'all took one what I see as one of the big
advantages of TDX over SEV (using registers instead of shared memory), and managed
to effectively turn it into a disadvantage.

Again, I completely understand the maintenance and robustness benefits, but IMO
the pendulum swung a bit too far in that direction.

Dave Hansen March 15, 2024, 11:53 p.m. UTC | #11

On 3/15/24 12:38, Sean Christopherson wrote:
>> tdh_mem_page_remove() _should_ just be logically:
>>
>> 	* initialize tdx_module_args.  Move a few things into place on
>> 	  the stack and zero the rest.
> The "zero the rest" is what generates the fugly code.  The underlying problem is
> that the SEAMCALL assembly functions unpack _all_ registers from tdx_module_args.
> As a result, tdx_module_args needs to be zeroed to avoid loading registers with
> unitialized stack data.

It's the "zero the rest" and also the copy:

> +	if (out) {
> +		*out = *in;
> +		ret = seamcall_ret(op, out);
> +	} else
> +		ret = seamcall(op, in);

Things get a wee bit nicer if you do an out-of-line mempcy() instead of
the structure copy.  But the really fun part is that 'out' is NULL and
the compiler *SHOULD* know it.  I'm not actually sure what trips it up.

In any case, I think it ends up generating code for both sides of the
if/else including the entirely superfluous copy.

The two nested while loops (one for TDX_RND_NO_ENTROPY and the other for
TDX_ERROR_SEPT_BUSY) also don't make for great code generation.

So, sure, the generated code here could be a better.  But there's a lot
more going on here than just shuffling gunk in and out of the 'struct
tdx_module_args', and there's a _lot_ more work to do for one of these
than for a plain old kvm_hypercall*().

It might make sense to separate out the "out" functionality into and
maybe to uninline _some_ of the helper levels.  But after that, there's
not a lot of low hanging fruit.

Huang, Kai March 20, 2024, 11:27 a.m. UTC | #12

On Fri, 2024-03-15 at 17:48 +0000, Edgecombe, Rick P wrote:
> On Fri, 2024-03-15 at 09:33 -0700, Sean Christopherson wrote:
> > Heh, Like this one?
> > 
> >         static inline u64 tdh_sys_lp_shutdown(void)
> >         {
> >                 struct tdx_module_args in = {
> >                 };
> >         
> >                 return tdx_seamcall(TDH_SYS_LP_SHUTDOWN, &in, NULL);
> >         }
> > 
> > Which isn't actually used...
> 
> Looks like is was turned into a NOP in TDX 1.5. So will even forever be
> dead code. I see one other that is unused. Thanks for pointing it out.
> 
> > 
> > > But I'd also defer to the KVM maintainers on this.  They're the
> > > ones
> > > that have to play the symbol exporting game a lot more than I ever
> > > do.
> > > If they cringe at the idea of adding 20 (or whatever) exports, then
> > > that's a lot more important than the possibility of some other
> > > silly
> > > module abusing the generic exported __seamcall.
> > 
> > I don't care much about exports.  What I do care about is sane code,
> > and while
> > the current code _looks_ pretty, it's actually quite insane.
> > 
> > I get why y'all put SEAMCALL in assembly subroutines; the macro
> > shenanigans I
> > originally wrote years ago were their own brand of crazy, and dealing
> > with GPRs
> > that can't be asm() constraints often results in brittle code.
> 
> I guess it must be this, for the initiated:
> https://lore.kernel.org/lkml/25f0d2c2f73c20309a1b578cc5fc15f4fd6b9a13.1605232743.git.isaku.yamahata@intel.com/

Hmm.. I lost memory of this.  :-(

> 
> > 
> > But the tdx_module_args structure approach generates truly atrocious
> > code.  Yes,
> > SEAMCALL is inherently slow, but that doesn't mean that we shouldn't
> > at least try
> > to generate efficient code.  And it's not just efficiency that is
> > lost, the
> > generated code ends up being much harder to read than it ought to be.
> > 
> > 
> [snip]
> > 
> > So my feedback is to not worry about the exports, and instead focus
> > on figuring
> > out a way to make the generated code less bloated and easier to
> > read/debug.
> > 
> 
> Thanks for the feedback both! It sounds like everyone is flexible on
> the exports. As for the generated code, oof.
> 
> Kai, I see the solution has gone through some iterations already. First
> the macro one linked above, then that was dropped pretty quick to
> something that loses the asm constraints:
> https://lore.kernel.org/lkml/e777bbbe10b1ec2c37d85dcca2e175fe3bc565ec.1625186503.git.isaku.yamahata@intel.com/

Sorry I forgot for what reason we changed from the (bunch of) macros to this.

> 
> Then next the struct grew here, and here:
> https://lore.kernel.org/linux-mm/20230628211132.GS38236@hirez.programming.kicks-ass.net/
> https://lore.kernel.org/linux-mm/20230630102141.GA2534364@hirez.programming.kicks-ass.net/
> 
> Not sure I understand all of the constraints yet. Do you have any
> ideas?

This was due to Peter's request to unify the TDCALL/SEAMCALL and TDVMCALL
assembly, which was done by this series:

https://lore.kernel.org/lkml/cover.1692096753.git.kai.huang@intel.com/

So when Peter requested this, the __seamcall() and __tdcall() (or
__tdx_module_call() before the above series) were already sharing one assembly
macro.  The TDVMCALL were using a different macro, though.  However the two
assembly macros used similar structure and similar code, so we tried to unify
them.

And another reason that we changed from ...

  u64 __seamcall(u64 fn, u64 rcx, u64 rdx, ..., struct tdx_module_args *out);

to  ...

  u64 __seamcall(u64, struct tdx_module_args *args);

... was the former doesn't extend.  

E.g., live migration related new SEAMCALLs use more registers as input.  It is
just insane to have so many individual regs as function argument.

And Peter wanted to use __seamcall() to cover TDH.VP.ENTER too, which
historically had been implemented in KVM using its own assembly, because
VP.ENTER basically uses all GPRs as input/output.

Huang, Kai March 20, 2024, 12:09 p.m. UTC | #13

> 
> then loads and zeros a ton of memory (tdx_kvm_hypercall()):
> 
> 
[...]

> 
> and then unpacks all of that memory back into registers, and reverses that last
> part on the way back, (__tdcall_saved_ret()):
> 
> 
[...]

> 
> It's honestly quite amusing, because y'all took one what I see as one of the big
> advantages of TDX over SEV (using registers instead of shared memory), and managed
> to effectively turn it into a disadvantage.
> 
> Again, I completely understand the maintenance and robustness benefits, but IMO
> the pendulum swung a bit too far in that direction.

Having to zero-and-init the structure and store output regs to the structure is
an unfortunately burden if we want to have a single API __seamcall() for all
SEAMCALL leafs.

To (precisely) avoid writing to the unnecessary structure members before and
after the SEAMCALL instruction, we need to use your old way to have bunch of
macros.  But we may end up with *a lot* macros due to needing to cover new
(e.g., live migration) SEAMCALLs.

Because essentially it's a game to implement wrappers for bunch of combinations
of each individual input/output regs.

I don't want to judge which way is better, but to be honest I think completely
switching to the old way (using bunch of macros) isn't a realistic option at
this stage.

However, I think we might be able to change ...

    u64 __seamcall(u64 fn, struct tdx_module_args *args);

... to

    u64 __seamcall(u64 fn, struct tdx_module_args *in,
			struct tdx_module_args *out);

.. so that the assembly can actually skip "storing output regs to the structure"
if the SEAMCALL doesn't really have any output regs except RAX.

I can try to do if you guys believe this should be done, and should be done
earlier than later, but I am not sure _ANY_ optimization around SEAMCALL will
have meaningful performance improvement.

Dave Hansen March 20, 2024, 3:07 p.m. UTC | #14

On 3/20/24 05:09, Huang, Kai wrote:
> I can try to do if you guys believe this should be done, and should be done
> earlier than later, but I am not sure _ANY_ optimization around SEAMCALL will
> have meaningful performance improvement.

I don't think Sean had performance concerns.

I think he was having a justifiably violent reaction to how much more
complicated the generated code is to do a SEAMCALL versus a good ol' KVM
hypercall.

"Complicated" in this case means lots more instructions and control
flow.  That's slower, sure, but the main impact is that when you go to
debug it, it's *MUCH* harder to debug the SEAMCALL entry assembly than a
KVM hypercall.

My takeaway from this, though, is that we are relying on the compiler
for a *LOT*.  There are also so many levels in the helpers that it's
hard to avoid silly things like two _separate_ retry loops.

We should probably be looking at the generated code a _bit_ more often
than never, and it's OK to tinker a _bit_ to make things out-of-line or
make sure that the compiler optimizes everything that we think that it
should.

Also remember that there are very fun tools out there that can make this
much easier than recompiling the kernel a billion times:

	https://godbolt.org/z/8ooE4d465

Huang, Kai March 20, 2024, 9 p.m. UTC | #15

On 21/03/2024 4:07 am, Dave Hansen wrote:
> On 3/20/24 05:09, Huang, Kai wrote:
>> I can try to do if you guys believe this should be done, and should be done
>> earlier than later, but I am not sure _ANY_ optimization around SEAMCALL will
>> have meaningful performance improvement.
> 
> I don't think Sean had performance concerns.
> 
> I think he was having a justifiably violent reaction to how much more
> complicated the generated code is to do a SEAMCALL versus a good ol' KVM
> hypercall.

Ah, I automatically linked "generating better code" to "in order to have 
better performance".  My bad.

> 
> "Complicated" in this case means lots more instructions and control
> flow.  That's slower, sure, but the main impact is that when you go to
> debug it, it's *MUCH* harder to debug the SEAMCALL entry assembly than a
> KVM hypercall.

[...]

> 
> My takeaway from this, though, is that we are relying on the compiler
> for a *LOT*.  There are also so many levels in the helpers that it's
> hard to avoid silly things like two _separate_ retry loops.
> 
> We should probably be looking at the generated code a _bit_ more often
> than never, and it's OK to tinker a _bit_ to make things out-of-line or
> make sure that the compiler optimizes everything that we think that it
> should.

Yeah, agreed.

> 
> Also remember that there are very fun tools out there that can make this
> much easier than recompiling the kernel a billion times:
> 
> 	https://godbolt.org/z/8ooE4d465

Ah, this is useful.  Thanks for sharing :-)

Sean Christopherson April 3, 2024, 9:36 p.m. UTC | #16

On Wed, Mar 20, 2024, Dave Hansen wrote:
> On 3/20/24 05:09, Huang, Kai wrote:
> > I can try to do if you guys believe this should be done, and should be done
> > earlier than later, but I am not sure _ANY_ optimization around SEAMCALL will
> > have meaningful performance improvement.
> 
> I don't think Sean had performance concerns.
> 
> I think he was having a justifiably violent reaction to how much more
> complicated the generated code is to do a SEAMCALL versus a good ol' KVM
> hypercall.

Yep.  The code essentially violates the principle of least surprise.  I genuinely
thought I was dumping the wrong function(s) when I first looked at the output.

Kirill A. Shutemov April 10, 2024, 12:49 p.m. UTC | #17

On Fri, Mar 15, 2024 at 09:33:20AM -0700, Sean Christopherson wrote:
> So my feedback is to not worry about the exports, and instead focus on figuring
> out a way to make the generated code less bloated and easier to read/debug.

I think it was mistake trying to centralize TDCALL/SEAMCALL calls into
few megawrappers. I think we can get better results by shifting leaf
function wrappers into assembly.

We are going to have more assembly, but it should produce better result.
Adding macros can help to write such wrapper and minimizer boilerplate.

Below is an example of how it can look like. It's not complete. I only
converted TDCALLs, but TDVMCALLs or SEAMCALLs. TDVMCALLs are going to be
more complex.

Any opinions? Is it something worth investing more time?

.set offset_rcx,	TDX_MODULE_rcx
.set offset_rdx,	TDX_MODULE_rdx
.set offset_r8,		TDX_MODULE_r8
.set offset_r9,		TDX_MODULE_r9
.set offset_r10,	TDX_MODULE_r10
.set offset_r11,	TDX_MODULE_r11

.macro save_output struct_reg regs:vararg
.irp reg,\regs
	movq	%\reg, offset_\reg(%\struct_reg)
.endr
.endm

.macro tdcall leaf
	movq	\leaf, %rax
	.byte	0x66,0x0f,0x01,0xcc
.endm

.macro tdcall_or_panic leaf
	tdcall	\leaf
	testq	%rax, %rax
	jnz	.Lpanic
.endm

SYM_FUNC_START(tdg_vm_rd)
	FRAME_BEGIN

	xorl	%ecx, %ecx
	movq	%rdi, %rdx

	tdcall_or_panic $TDG_VM_RD

	movq	%r8, %rax

	RET
	FRAME_END
SYM_FUNC_END(tdg_vm_rd)

SYM_FUNC_START(tdg_vm_wr)
	FRAME_BEGIN

	xorl	%ecx, %ecx
	movq	%rsi, %r8
	movq	%rdx, %r9
	movq	%rdi, %rdx

	tdcall_or_panic $TDG_VM_WR

	/* Old value */
	movq	%r8, %rax

	RET
	FRAME_END
SYM_FUNC_END(tdg_vm_wr)

SYM_FUNC_START(tdcs_ctls_set)
	FRAME_BEGIN

	movq	$TDCS_TD_CTLS, %rdx
	xorl	%ecx, %ecx
	movq	%rdi, %r8
	movq	%rdi, %r9

	tdcall $TDG_VM_WR

	testq	%rax, %rax
	setz	%al

	RET
	FRAME_END
SYM_FUNC_END(tdcs_ctls_set)

SYM_FUNC_START(tdg_sys_rd)
	FRAME_BEGIN

	xorl	%ecx, %ecx
	movq	%rdi, %rdx

	tdcall_or_panic $TDG_SYS_RD

	movq	%r8, %rax

	RET
	FRAME_END
SYM_FUNC_END(tdg_sys_rd)

SYM_FUNC_START(tdg_vp_veinfo_get)
	FRAME_BEGIN

	tdcall_or_panic $TDG_VP_VEINFO_GET

	save_output struct_reg=rdi regs=rcx,rdx,r8,r9,r10

	FRAME_END
	RET
SYM_FUNC_END(tdg_vp_veinfo_get)

SYM_FUNC_START(tdg_vp_info)
	FRAME_BEGIN

	tdcall_or_panic $TDG_VP_INFO

	save_output struct_reg=rdi regs=rcx,rdx,r8,r9,r10,r11

	FRAME_END
	RET
SYM_FUNC_END(tdg_vp_info)

SYM_FUNC_START(tdg_mem_page_accept)
	FRAME_BEGIN

	movq	%rdi, %rcx

	tdcall $TDG_MEM_PAGE_ACCEPT

	FRAME_END
	RET
SYM_FUNC_END(tdg_mem_page_accept)

SYM_FUNC_START(tdg_mr_report)
	FRAME_BEGIN

	movq	%rdx, %r8
	movq	%rdi, %rcx
	movq	%rsi, %rdx

	tdcall $TDG_MR_REPORT

	FRAME_END
	RET
SYM_FUNC_END(tdg_mr_report)

.Lpanic:
	ud2

Edgecombe, Rick P April 16, 2024, 7:45 p.m. UTC | #18

On Wed, 2024-04-10 at 15:49 +0300, Kirill A. Shutemov wrote:
> On Fri, Mar 15, 2024 at 09:33:20AM -0700, Sean Christopherson wrote:
> > So my feedback is to not worry about the exports, and instead focus on
> > figuring
> > out a way to make the generated code less bloated and easier to read/debug.
> 
> I think it was mistake trying to centralize TDCALL/SEAMCALL calls into
> few megawrappers. I think we can get better results by shifting leaf
> function wrappers into assembly.
> 
> We are going to have more assembly, but it should produce better result.
> Adding macros can help to write such wrapper and minimizer boilerplate.
> 
> Below is an example of how it can look like. It's not complete. I only
> converted TDCALLs, but TDVMCALLs or SEAMCALLs. TDVMCALLs are going to be
> more complex.
> 
> Any opinions? Is it something worth investing more time?

We discussed offline how implementing these for each TDVM/SEAMCALL increases the
chances of a bug in just one TDVM/SEAMCALL. Which could making debugging
problems more challenging. Kirill raised the possibility of some code generating
solution like cpufeatures.h, that could take a spec and generate correct calls.

So far no big wins have presented themselves. Kirill, do we think the path to
move the messy part out-of-line will not work?

Kirill A. Shutemov April 18, 2024, 2:16 p.m. UTC | #19

On Tue, Apr 16, 2024 at 07:45:18PM +0000, Edgecombe, Rick P wrote:
> On Wed, 2024-04-10 at 15:49 +0300, Kirill A. Shutemov wrote:
> > On Fri, Mar 15, 2024 at 09:33:20AM -0700, Sean Christopherson wrote:
> > > So my feedback is to not worry about the exports, and instead focus on
> > > figuring
> > > out a way to make the generated code less bloated and easier to read/debug.
> > 
> > I think it was mistake trying to centralize TDCALL/SEAMCALL calls into
> > few megawrappers. I think we can get better results by shifting leaf
> > function wrappers into assembly.
> > 
> > We are going to have more assembly, but it should produce better result.
> > Adding macros can help to write such wrapper and minimizer boilerplate.
> > 
> > Below is an example of how it can look like. It's not complete. I only
> > converted TDCALLs, but TDVMCALLs or SEAMCALLs. TDVMCALLs are going to be
> > more complex.
> > 
> > Any opinions? Is it something worth investing more time?
> 
> We discussed offline how implementing these for each TDVM/SEAMCALL increases the
> chances of a bug in just one TDVM/SEAMCALL. Which could making debugging
> problems more challenging. Kirill raised the possibility of some code generating
> solution like cpufeatures.h, that could take a spec and generate correct calls.
> 
> So far no big wins have presented themselves. Kirill, do we think the path to
> move the messy part out-of-line will not work?

I converted all TDCALL and TDVMCALL leafs to direct assembly wrappers.
Here's WIP branch: https://github.com/intel/tdx/commits/guest-tdx-asm/

I still need to clean it up and write commit messages and comments for all
wrappers.

Now I think it worth the shot.

Any feedback?

Sean Christopherson April 18, 2024, 6:26 p.m. UTC | #20

On Thu, Apr 18, 2024, kirill.shutemov@linux.intel.com wrote:
> On Tue, Apr 16, 2024 at 07:45:18PM +0000, Edgecombe, Rick P wrote:
> > On Wed, 2024-04-10 at 15:49 +0300, Kirill A. Shutemov wrote:
> > > On Fri, Mar 15, 2024 at 09:33:20AM -0700, Sean Christopherson wrote:
> > > > So my feedback is to not worry about the exports, and instead focus on
> > > > figuring
> > > > out a way to make the generated code less bloated and easier to read/debug.
> > > 
> > > I think it was mistake trying to centralize TDCALL/SEAMCALL calls into
> > > few megawrappers. I think we can get better results by shifting leaf
> > > function wrappers into assembly.
> > > 
> > > We are going to have more assembly, but it should produce better result.
> > > Adding macros can help to write such wrapper and minimizer boilerplate.
> > > 
> > > Below is an example of how it can look like. It's not complete. I only
> > > converted TDCALLs, but TDVMCALLs or SEAMCALLs. TDVMCALLs are going to be
> > > more complex.
> > > 
> > > Any opinions? Is it something worth investing more time?
> > 
> > We discussed offline how implementing these for each TDVM/SEAMCALL increases the
> > chances of a bug in just one TDVM/SEAMCALL. Which could making debugging
> > problems more challenging. Kirill raised the possibility of some code generating
> > solution like cpufeatures.h, that could take a spec and generate correct calls.
> > 
> > So far no big wins have presented themselves. Kirill, do we think the path to
> > move the messy part out-of-line will not work?
> 
> I converted all TDCALL and TDVMCALL leafs to direct assembly wrappers.
> Here's WIP branch: https://github.com/intel/tdx/commits/guest-tdx-asm/
> 
> I still need to clean it up and write commit messages and comments for all
> wrappers.
> 
> Now I think it worth the shot.
> 
> Any feedback?

I find it hard to review for correctness, and extremely susceptible to developer
error.  E.g. lots of copy+paste, and manual encoding of RCX to expose registers.
It also bleeds TDX ABI into C code, e.g.

	/*
	 * As per TDX GHCI CPUID ABI, r12-r15 registers contain contents of
	 * EAX, EBX, ECX, EDX registers after the CPUID instruction execution.
	 * So copy the register contents back to pt_regs.
	 */
	regs->ax = args.r12;
	regs->bx = args.r13;
	regs->cx = args.r14;
	regs->dx = args.r15;

Oh, and it requires input/output paramters, which is quite gross for C code *and*
for assembly code, e.g.

	u64 tdvmcall_map_gpa(u64 *gpa, u64 size);

and then the accompanying assembly code:

		FRAME_BEGIN

		save_regs r12,r13

		movq	(%rdi), %r12
		movq	%rsi, %r13

		movq	$(TDX_R10 | TDX_R11 | TDX_R12 | TDX_R13), %rcx

		tdvmcall	$TDVMCALL_MAP_GPA

		movq	%r11, (%rdi)

		restore_regs r13,r12

		FRAME_END
		RET

I think having one trampoline makes sense, e.g. to minimize the probability of
leaking register state to the VMM.  The part that I don't like, and which generates
awful code, is shoving register state into a memory structure.

The annoying part with the TDX ABI is that it heavily uses r8-r15, and asm()
constraints don't play nice with r8-15.  But that doesn't mean we can't use asm()
with macros, it just means we have to play games with registers.

Because REG-REG moves are super cheap, and ignoring the fatal error goofiness,
there are at most four inputs.  That means having a single trampoline take *all*
possible inputs is a non-issue.  And we can avoiding polluting the inline code if
we bury the register shuffling in the trampoline.

And if we use asm() wrappers to call the trampoline, then the trampoline doesn't
need to precisely follow the C calling convention.  I.e. the trampoline can return
with the outputs still in r12-r15, and let the asm() wrappers extract the outputs
they want.

As it stands, TDVMCALLs either have 0, 1, or 4 outputs.  I.e. we only need three
asm() wrappers.  We could get away with one wrapper, but then users of the wrappers
would need dummy variables for inputs *and* outputs, and the outputs get gross.

Completely untested, but this is what I'm thinking.  Side topic, I think making
"tdcall" a macro that takes a leaf is a mistake.  If/when an assembler learns what
tdcall is, we're going to have to rewrite all of that code.  And what a coincidence,
my suggestion needs a bare TDCALL!  :-)

Side topic #2, I don't think the trampoline needs a stack frame, its a leaf function.

Side topic #3, the ud2 to induce panic should be out-of-line.

Weird?  Yeah.  But at least we one need to document one weird calling convention,
and the ugliness is contained to three macros and a small assembly function.

.pushsection .noinstr.text, "ax"
SYM_FUNC_START(tdvmcall_trampoline)
	movq	$TDX_HYPERCALL_STANDARD, %r10
	movq	%rax, %r11
	movq	%rdi, %r12
	movq	%rsi, %r13
	movq	%rdx, %r14
	movq	%rcx, %r15

	movq	$(TDX_R10 | TDX_R11 | TDX_R12 | TDX_R13 | TDX_R14 | TDX_R15), %rcx

	tdcall

	testq	%rax, %rax
	jnz	.Lpanic

	ret

.Lpanic:
	ud2
SYM_FUNC_END(tdvmcall_trampoline)
.popsection


#define TDVMCALL(reason, in1, in2, in3, in4)				\
({									\
	long __ret;							\
									\
	asm(								\
		"call tdvmcall_trampoline\n\t"				\
		"mov %%r10, %0\n\t"					\
		: "=r" (__ret)						\
		: "D" (in1), "S"(in2), "d"(in3), "c" (in4)		\
		: "r12", "r13", "r14", "r15"				\
	);								\
	__ret;								\
})

#define TDVMCALL_1(reason, in1, in2, in3, in4, out1)			\
({									\
	long __ret;							\
									\
	asm(								\
		"call tdvmcall_trampoline\n\t"				\
		"mov %%r10, %0\n\t"					\
		"mov %%r12, %1\n\t"					\
		: "=r"(__ret) "=r" (out1)				\
		: "a"(reason), "D" (in1), "S"(in2), "d"(in3), "c" (in4)	\
		: "r12", "r13", "r14", "r15"				\
	);								\
	__ret;								\
})

#define TDVMCALL_4(reason, in1, in2, in3, in4, out1, out2, out3, out4)	\
({									\
	long __ret;							\
									\
	asm(								\
		"call tdvmcall_trampoline\n\t"				\
		"mov %%r10, %0\n\t"					\
		"mov %%r12, %1\n\t"					\
		"mov %%r13, %2\n\t"					\
		"mov %%r14, %3\n\t"					\
		"mov %%r15, %4\n\t"					\
		: "=r" (__ret),						\
		  "=r" (out1), "=r" (out2), "=r" (out3), "=r" (out4)	\
		: "a"(reason), "D" (in1), "S"(in2), "d"(in3), "c" (in4)	\
		  [reason] "i" (reason) 				\
		: "r12", "r13", "r14", "r15"				\
	);								\
	__ret;								\
})

static int handle_halt(struct ve_info *ve)
{
	if (TDVMCALL(EXIT_REASON_HALT, irqs_disabled(), 0, 0, 0))
		return -EIO;

	return ve_instr_len(ve);
}

void __cpuidle tdx_safe_halt(void)
{
	WARN_ONCE(TDVMCALL(EXIT_REASON_HALT, false, 0, 0, 0),
		  "HLT instruction emulation failed");
}

static int read_msr(struct pt_regs *regs, struct ve_info *ve)
{
	u64 val;

	if (TDVMCALL_1(EXIT_REASON_MSR_READ, regs->cx, 0, 0, 0, val))
		return -EIO;

	regs->ax = lower_32_bits(val);
	regs->dx = upper_32_bits(val);

	return ve_instr_len(ve);
}

static int write_msr(struct pt_regs *regs, struct ve_info *ve)
{
	u64 val = (u64)regs->dx << 32 | regs->ax;

	if (TDVMCALL(EXIT_REASON_MSR_WRITE, regs->cx, val, 0, 0))
		return -EIO;

	return ve_instr_len(ve);
}
static int handle_cpuid(struct pt_regs *regs, struct ve_info *ve)
{
	/*
	 * Only allow VMM to control range reserved for hypervisor
	 * communication.
	 *
	 * Return all-zeros for any CPUID outside the range. It matches CPU
	 * behaviour for non-supported leaf.
	 */
	if (regs->ax < 0x40000000 || regs->ax > 0x4FFFFFFF) {
		regs->ax = regs->bx = regs->cx = regs->dx = 0;
		return ve_instr_len(ve);
	}

	if (TDVMCALL_4(EXIT_REASON_CPUID, regs->ax, regs->cx, 0, 0,
		       regs->ax, regs->bx, regs->cx, regs->dx))
		return -EIO;

	return ve_instr_len(ve);
}

static bool mmio_read(int size, u64 gpa, u64 *val)
{
	*val = 0;
	return !TDVMCALL_1(EXIT_REASON_EPT_VIOLATION, size, EPT_READ, gpa, 0, val);
}

static bool mmio_write(int size, u64 gpa, u64 val)
{
	return !TDVMCALL(EXIT_REASON_EPT_VIOLATION, size, EPT_WRITE, gpa, val);
}

Kirill A. Shutemov April 19, 2024, 2:46 p.m. UTC | #21

On Thu, Apr 18, 2024 at 11:26:11AM -0700, Sean Christopherson wrote:
> On Thu, Apr 18, 2024, kirill.shutemov@linux.intel.com wrote:
> > On Tue, Apr 16, 2024 at 07:45:18PM +0000, Edgecombe, Rick P wrote:
> > > On Wed, 2024-04-10 at 15:49 +0300, Kirill A. Shutemov wrote:
> > > > On Fri, Mar 15, 2024 at 09:33:20AM -0700, Sean Christopherson wrote:
> > > > > So my feedback is to not worry about the exports, and instead focus on
> > > > > figuring
> > > > > out a way to make the generated code less bloated and easier to read/debug.
> > > > 
> > > > I think it was mistake trying to centralize TDCALL/SEAMCALL calls into
> > > > few megawrappers. I think we can get better results by shifting leaf
> > > > function wrappers into assembly.
> > > > 
> > > > We are going to have more assembly, but it should produce better result.
> > > > Adding macros can help to write such wrapper and minimizer boilerplate.
> > > > 
> > > > Below is an example of how it can look like. It's not complete. I only
> > > > converted TDCALLs, but TDVMCALLs or SEAMCALLs. TDVMCALLs are going to be
> > > > more complex.
> > > > 
> > > > Any opinions? Is it something worth investing more time?
> > > 
> > > We discussed offline how implementing these for each TDVM/SEAMCALL increases the
> > > chances of a bug in just one TDVM/SEAMCALL. Which could making debugging
> > > problems more challenging. Kirill raised the possibility of some code generating
> > > solution like cpufeatures.h, that could take a spec and generate correct calls.
> > > 
> > > So far no big wins have presented themselves. Kirill, do we think the path to
> > > move the messy part out-of-line will not work?
> > 
> > I converted all TDCALL and TDVMCALL leafs to direct assembly wrappers.
> > Here's WIP branch: https://github.com/intel/tdx/commits/guest-tdx-asm/
> > 
> > I still need to clean it up and write commit messages and comments for all
> > wrappers.
> > 
> > Now I think it worth the shot.
> > 
> > Any feedback?
> 
> I find it hard to review for correctness, and extremely susceptible to developer
> error.  E.g. lots of copy+paste, and manual encoding of RCX to expose registers.

Yes, I agree. The approach requires careful manual work and is error-prone.
I was planning to get around and stare at every wrapper to make sure it is correct.
This approach is not scalable, though.

> It also bleeds TDX ABI into C code, e.g.
> 
> 	/*
> 	 * As per TDX GHCI CPUID ABI, r12-r15 registers contain contents of
> 	 * EAX, EBX, ECX, EDX registers after the CPUID instruction execution.
> 	 * So copy the register contents back to pt_regs.
> 	 */
> 	regs->ax = args.r12;
> 	regs->bx = args.r13;
> 	regs->cx = args.r14;
> 	regs->dx = args.r15;
> 
> Oh, and it requires input/output paramters, which is quite gross for C code *and*
> for assembly code, e.g.
> 
> 	u64 tdvmcall_map_gpa(u64 *gpa, u64 size);
> 
> and then the accompanying assembly code:
> 
> 		FRAME_BEGIN
> 
> 		save_regs r12,r13
> 
> 		movq	(%rdi), %r12
> 		movq	%rsi, %r13
> 
> 		movq	$(TDX_R10 | TDX_R11 | TDX_R12 | TDX_R13), %rcx
> 
> 		tdvmcall	$TDVMCALL_MAP_GPA
> 
> 		movq	%r11, (%rdi)
> 
> 		restore_regs r13,r12
> 
> 		FRAME_END
> 		RET
> 
> I think having one trampoline makes sense, e.g. to minimize the probability of
> leaking register state to the VMM.  The part that I don't like, and which generates
> awful code, is shoving register state into a memory structure.

I don't think we can get away with single trampoline. We have outliers.

See TDG.VP.VMCALL<ReportFatalError> that uses pretty much all registers as
input. And I hope we wouldn't need TDG.VP.VMCALL<Instruction.PCONFIG> any
time soon. It uses all possible output registers.

But I guess we can make a *few* wrappers that covers all needed cases.

> The annoying part with the TDX ABI is that it heavily uses r8-r15, and asm()
> constraints don't play nice with r8-15.  But that doesn't mean we can't use asm()
> with macros, it just means we have to play games with registers.
> 
> Because REG-REG moves are super cheap, and ignoring the fatal error goofiness,
> there are at most four inputs.  That means having a single trampoline take *all*
> possible inputs is a non-issue.  And we can avoiding polluting the inline code if
> we bury the register shuffling in the trampoline.
> 
> And if we use asm() wrappers to call the trampoline, then the trampoline doesn't
> need to precisely follow the C calling convention.  I.e. the trampoline can return
> with the outputs still in r12-r15, and let the asm() wrappers extract the outputs
> they want.
> 
> As it stands, TDVMCALLs either have 0, 1, or 4 outputs.  I.e. we only need three
> asm() wrappers.  We could get away with one wrapper, but then users of the wrappers
> would need dummy variables for inputs *and* outputs, and the outputs get gross.
> 
> Completely untested, but this is what I'm thinking.  Side topic, I think making
> "tdcall" a macro that takes a leaf is a mistake.  If/when an assembler learns what
> tdcall is, we're going to have to rewrite all of that code.  And what a coincidence,
> my suggestion needs a bare TDCALL!  :-)

I guess rename the macro (if it still needed) to something like
tdcall_leaf or something.

> Side topic #2, I don't think the trampoline needs a stack frame, its a leaf function.

Hm. I guess.

> Side topic #3, the ud2 to induce panic should be out-of-line.

Yeah. I switched to the inline one while debugging one section mismatch
issue and forgot to switch back.

> Weird?  Yeah.  But at least we one need to document one weird calling convention,
> and the ugliness is contained to three macros and a small assembly function.

Okay, the approach is worth exploring. I can work on it.

You focuses here on TDVMCALL. What is your take on the rest of TDCALL?

> .pushsection .noinstr.text, "ax"
> SYM_FUNC_START(tdvmcall_trampoline)
> 	movq	$TDX_HYPERCALL_STANDARD, %r10
> 	movq	%rax, %r11
> 	movq	%rdi, %r12
> 	movq	%rsi, %r13
> 	movq	%rdx, %r14
> 	movq	%rcx, %r15
> 
> 	movq	$(TDX_R10 | TDX_R11 | TDX_R12 | TDX_R13 | TDX_R14 | TDX_R15), %rcx
> 
> 	tdcall
> 
> 	testq	%rax, %rax
> 	jnz	.Lpanic
> 
> 	ret
> 
> .Lpanic:
> 	ud2
> SYM_FUNC_END(tdvmcall_trampoline)
> .popsection
> 
> 
> #define TDVMCALL(reason, in1, in2, in3, in4)				\
> ({									\
> 	long __ret;							\
> 									\
> 	asm(								\
> 		"call tdvmcall_trampoline\n\t"				\
> 		"mov %%r10, %0\n\t"					\
> 		: "=r" (__ret)						\
> 		: "D" (in1), "S"(in2), "d"(in3), "c" (in4)		\
> 		: "r12", "r13", "r14", "r15"				\
> 	);								\
> 	__ret;								\
> })
> 
> #define TDVMCALL_1(reason, in1, in2, in3, in4, out1)			\
> ({									\
> 	long __ret;							\
> 									\
> 	asm(								\
> 		"call tdvmcall_trampoline\n\t"				\
> 		"mov %%r10, %0\n\t"					\
> 		"mov %%r12, %1\n\t"					\

It is r11, not r12.

> 		: "=r"(__ret) "=r" (out1)				\
> 		: "a"(reason), "D" (in1), "S"(in2), "d"(in3), "c" (in4)	\
> 		: "r12", "r13", "r14", "r15"				\
> 	);								\
> 	__ret;								\
> })
> 
> #define TDVMCALL_4(reason, in1, in2, in3, in4, out1, out2, out3, out4)	\
> ({									\
> 	long __ret;							\
> 									\
> 	asm(								\
> 		"call tdvmcall_trampoline\n\t"				\
> 		"mov %%r10, %0\n\t"					\
> 		"mov %%r12, %1\n\t"					\
> 		"mov %%r13, %2\n\t"					\
> 		"mov %%r14, %3\n\t"					\
> 		"mov %%r15, %4\n\t"					\
> 		: "=r" (__ret),						\
> 		  "=r" (out1), "=r" (out2), "=r" (out3), "=r" (out4)	\
> 		: "a"(reason), "D" (in1), "S"(in2), "d"(in3), "c" (in4)	\
> 		  [reason] "i" (reason) 				\
> 		: "r12", "r13", "r14", "r15"				\
> 	);								\
> 	__ret;								\
> })
> 
> static int handle_halt(struct ve_info *ve)
> {
> 	if (TDVMCALL(EXIT_REASON_HALT, irqs_disabled(), 0, 0, 0))
> 		return -EIO;
> 
> 	return ve_instr_len(ve);
> }
> 
> void __cpuidle tdx_safe_halt(void)
> {
> 	WARN_ONCE(TDVMCALL(EXIT_REASON_HALT, false, 0, 0, 0),
> 		  "HLT instruction emulation failed");
> }
> 
> static int read_msr(struct pt_regs *regs, struct ve_info *ve)
> {
> 	u64 val;
> 
> 	if (TDVMCALL_1(EXIT_REASON_MSR_READ, regs->cx, 0, 0, 0, val))
> 		return -EIO;
> 
> 	regs->ax = lower_32_bits(val);
> 	regs->dx = upper_32_bits(val);
> 
> 	return ve_instr_len(ve);
> }
> 
> static int write_msr(struct pt_regs *regs, struct ve_info *ve)
> {
> 	u64 val = (u64)regs->dx << 32 | regs->ax;
> 
> 	if (TDVMCALL(EXIT_REASON_MSR_WRITE, regs->cx, val, 0, 0))
> 		return -EIO;
> 
> 	return ve_instr_len(ve);
> }
> static int handle_cpuid(struct pt_regs *regs, struct ve_info *ve)
> {
> 	/*
> 	 * Only allow VMM to control range reserved for hypervisor
> 	 * communication.
> 	 *
> 	 * Return all-zeros for any CPUID outside the range. It matches CPU
> 	 * behaviour for non-supported leaf.
> 	 */
> 	if (regs->ax < 0x40000000 || regs->ax > 0x4FFFFFFF) {
> 		regs->ax = regs->bx = regs->cx = regs->dx = 0;
> 		return ve_instr_len(ve);
> 	}
> 
> 	if (TDVMCALL_4(EXIT_REASON_CPUID, regs->ax, regs->cx, 0, 0,
> 		       regs->ax, regs->bx, regs->cx, regs->dx))
> 		return -EIO;
> 
> 	return ve_instr_len(ve);
> }
> 
> static bool mmio_read(int size, u64 gpa, u64 *val)
> {
> 	*val = 0;
> 	return !TDVMCALL_1(EXIT_REASON_EPT_VIOLATION, size, EPT_READ, gpa, 0, val);
> }
> 
> static bool mmio_write(int size, u64 gpa, u64 val)
> {
> 	return !TDVMCALL(EXIT_REASON_EPT_VIOLATION, size, EPT_WRITE, gpa, val);
> }

Sean Christopherson April 19, 2024, 7:53 p.m. UTC | #22

On Fri, Apr 19, 2024, kirill.shutemov@linux.intel.com wrote:
> On Thu, Apr 18, 2024 at 11:26:11AM -0700, Sean Christopherson wrote:
> > On Thu, Apr 18, 2024, kirill.shutemov@linux.intel.com wrote:
> > I think having one trampoline makes sense, e.g. to minimize the probability of
> > leaking register state to the VMM.  The part that I don't like, and which generates
> > awful code, is shoving register state into a memory structure.
> 
> I don't think we can get away with single trampoline. We have outliers.

Yeah, I should have said "one trampoline for all of the not insane APIs" :-)

> See TDG.VP.VMCALL<ReportFatalError> that uses pretty much all registers as
> input. And I hope we wouldn't need TDG.VP.VMCALL<Instruction.PCONFIG> any
> time soon. It uses all possible output registers.

XMM: just say no.

> But I guess we can make a *few* wrappers that covers all needed cases.

Yeah.  I suspect two will suffice.  One for the calls that say at or below four
inputs, and one for the fat ones like ReportFatalError that use everything under
the sun.

For the latter one, marshalling data into registers via an in-memory structure
makes, especially if we move away from tdx_module_args, e.g. this is quite clean
and reasonable:

	union {
		/* Define register order according to the GHCI */
		struct { u64 r14, r15, rbx, rdi, rsi, r8, r9, rdx; };

		char str[64];
	} message;

	/* VMM assumes '\0' in byte 65, if the message took all 64 bytes */
	strtomem_pad(message.str, msg, '\0');

	/*
	 * This hypercall should never return and it is not safe
	 * to keep the guest running. Call it forever if it
	 * happens to return.
	 */
	while (1)
		tdx_fat_tdvmcall(&message);

> > Weird?  Yeah.  But at least we one need to document one weird calling convention,
> > and the ugliness is contained to three macros and a small assembly function.
> 
> Okay, the approach is worth exploring. I can work on it.
> 
> You focuses here on TDVMCALL. What is your take on the rest of TDCALL?

Not sure, haven't looked at them recently.  At a glance, something similar?  The
use of high registers instead of RDI and RSI is damn annoying :-/

Hmm, but it looks like there are enough simple TDCALLs that stay away from high
registers that open coding inline asm() is a viable (best?) approach.

RAX being the leaf and the return value is annoying, so maybe a simple macro to
make it easier to deal with that?  It won't allow for perfectly optimal code
generation, but forcing a MOV for a TDCALL isn't going to affect performance, and
it will make reading the code dead simple.

  #define tdcall_leaf(leaf) "mov $" leaf ", %%eax\n\t.byte 0x66,0x0f,0x01,0xcc\n\t"

Then PAGE_ACCEPT is simply:

	asm(tdcall_leaf(TDG_MEM_PAGE_ACCEPT)
	    : "=a"(ret),
            : "c"(start | page_size));
	if (ret)
		return 0;

And even the meanies that use R8 are reasonably easy to handle:

	asm("xor %%r8d, %%r8d\n\t"
	    tdcall_leaf(TDG_MR_REPORT)
	    : "=a"(ret)
	    : "c"(__pa(report)), "d"(__pa(data)));


and (though using names for the outputs, I just can't remember the syntax as I'm
typing this :-/)

	asm(tdcall_leaf(TDG_VM_RD)
	    "mov %%r8, %0\n\t"
	    : "=r"(value), "=a"(ret)
	    : "c"(0), "d"(no_idea_what_this_is));

	if (ret)
		<cry>

	return value;

Or, if you wanted to get fancy, use asm_goto_output() to bail on an error so that
you could optimize for using RAX as the return value, because *that's going to
make all the difference :-D

Edgecombe, Rick P April 19, 2024, 8:04 p.m. UTC | #23

On Fri, 2024-04-19 at 17:46 +0300, kirill.shutemov@linux.intel.com wrote:
> 
> > Side topic #3, the ud2 to induce panic should be out-of-line.
> 
> Yeah. I switched to the inline one while debugging one section mismatch
> issue and forgot to switch back.

Sorry, why do we need to panic?

Kirill A. Shutemov April 22, 2024, 11:46 a.m. UTC | #24

On Fri, Apr 19, 2024 at 08:04:26PM +0000, Edgecombe, Rick P wrote:
> On Fri, 2024-04-19 at 17:46 +0300, kirill.shutemov@linux.intel.com wrote:
> > 
> > > Side topic #3, the ud2 to induce panic should be out-of-line.
> > 
> > Yeah. I switched to the inline one while debugging one section mismatch
> > issue and forgot to switch back.
> 
> Sorry, why do we need to panic?

It panics in cases that should never occur if the TDX module is
functioning properly. For example, TDVMCALL itself should never fail,
although the leaf function could.

Edgecombe, Rick P April 22, 2024, 3:56 p.m. UTC | #25

On Mon, 2024-04-22 at 14:46 +0300, kirill.shutemov@linux.intel.com wrote:
> On Fri, Apr 19, 2024 at 08:04:26PM +0000, Edgecombe, Rick P wrote:
> > On Fri, 2024-04-19 at 17:46 +0300, kirill.shutemov@linux.intel.com wrote:
> > > 
> > > > Side topic #3, the ud2 to induce panic should be out-of-line.
> > > 
> > > Yeah. I switched to the inline one while debugging one section mismatch
> > > issue and forgot to switch back.
> > 
> > Sorry, why do we need to panic?
> 
> It panics in cases that should never occur if the TDX module is
> functioning properly. For example, TDVMCALL itself should never fail,
> although the leaf function could.

Panic should normally be for desperate situations when horrible things will
likely happen if we continue, right? Why are we adding a panic when we didn't
have one before? Is it a second change, or a side affect of the refactor?

Sean Christopherson April 22, 2024, 7:50 p.m. UTC | #26

On Mon, Apr 22, 2024, Rick P Edgecombe wrote:
> On Mon, 2024-04-22 at 14:46 +0300, kirill.shutemov@linux.intel.com wrote:
> > On Fri, Apr 19, 2024 at 08:04:26PM +0000, Edgecombe, Rick P wrote:
> > > On Fri, 2024-04-19 at 17:46 +0300, kirill.shutemov@linux.intel.com wrote:
> > > > 
> > > > > Side topic #3, the ud2 to induce panic should be out-of-line.
> > > > 
> > > > Yeah. I switched to the inline one while debugging one section mismatch
> > > > issue and forgot to switch back.
> > > 
> > > Sorry, why do we need to panic?
> > 
> > It panics in cases that should never occur if the TDX module is
> > functioning properly. For example, TDVMCALL itself should never fail,
> > although the leaf function could.
> 
> Panic should normally be for desperate situations when horrible things will
> likely happen if we continue, right? Why are we adding a panic when we didn't
> have one before? Is it a second change, or a side affect of the refactor?

The kernel already does panic() if TDCALL itself fails,

  static inline void tdcall(u64 fn, struct tdx_module_args *args)
  {
	if (__tdcall_ret(fn, args))
		panic("TDCALL %lld failed (Buggy TDX module!)\n", fn);
  }

  /* Called from __tdx_hypercall() for unrecoverable failure */
  noinstr void __noreturn __tdx_hypercall_failed(void)
  {
	instrumentation_begin();
	panic("TDVMCALL failed. TDX module bug?");
  }

it's just doesn in C code via panic(), not in asm via a bare ud2.

Edgecombe, Rick P April 23, 2024, 12:28 a.m. UTC | #27

On Mon, 2024-04-22 at 12:50 -0700, Sean Christopherson wrote:
> The kernel already does panic() if TDCALL itself fails,
> 
>   static inline void tdcall(u64 fn, struct tdx_module_args *args)
>   {
>         if (__tdcall_ret(fn, args))
>                 panic("TDCALL %lld failed (Buggy TDX module!)\n", fn);
>   }
> 
>   /* Called from __tdx_hypercall() for unrecoverable failure */
>   noinstr void __noreturn __tdx_hypercall_failed(void)
>   {
>         instrumentation_begin();
>         panic("TDVMCALL failed. TDX module bug?");
>   }
> 
> it's just doesn in C code via panic(), not in asm via a bare ud2.

Hmm, I didn't realize. It looks like today some calls do and some don't. I don't
mean to reopen old debates. Just surprised that these are able to bring down the
system. Which funnily enough connects back to the original issue of the patch:
whether they are safe to export for module use.

Kirill A. Shutemov April 25, 2024, 4:46 p.m. UTC | #28

On Fri, Apr 19, 2024 at 12:53:19PM -0700, Sean Christopherson wrote:
> > But I guess we can make a *few* wrappers that covers all needed cases.
> 
> Yeah.  I suspect two will suffice.  One for the calls that say at or below four
> inputs, and one for the fat ones like ReportFatalError that use everything under
> the sun.

I ended up with three helpers:

  - tdvmcall_trampoline() as you proposed.
  
  - tdvmcall_report_fatal_error() that does ReportFatalError specificly.
    Pointer to char array as an input. Never returns: no need to
    save/restore registers.

  - hv_tdx_hypercall(). Hyper-V annoyingly uses R8 and RDX as parameters :/

> Not sure, haven't looked at them recently.  At a glance, something similar?  The
> use of high registers instead of RDI and RSI is damn annoying :-/

I defined three helpers: TDCALL_0(), TDCALL_1() and TDCALL_5(). All takes 4
input arguments.

I've updated the WIP branch with the new version:

https://github.com/intel/tdx/commits/guest-tdx-asm/

Any feedback is welcome.

[v19,007/130] x86/virt/tdx: Export SEAMCALL functions

Commit Message

Comments

Patch