Message ID | 8f64043a6c393c017347bf8954d92b84b58603ec.1708933498.git.isaku.yamahata@intel.com (mailing list archive)
---|---
State | New, archived
Series | [v19,001/130] x86/virt/tdx: Rename _offset to _member for TD_SYSINFO_MAP() macro
On 26/02/2024 9:25 pm, isaku.yamahata@intel.com wrote:
> From: Kai Huang <kai.huang@intel.com>
>
> KVM will need to make SEAMCALLs to create and run TDX guests. Export
> SEAMCALL functions for KVM to use.
>

Could you also list the reason that we want to expose __seamcall() directly, rather than putting some higher-level wrappers in the TDX host code and exporting them?

For example, we can give a summary of the SEAMCALLs that will be used by KVM (e.g., how many in total, roughly introduced by category), and clarify the reasons why we want to just export __seamcall(). E.g., we can say something like this:

TL;DR: KVM roughly will need to use dozens of SEAMCALLs, and all these are logically related to creating and running TDX guests. It makes more sense to just export __seamcall() and let KVM maintain these VM-related wrappers rather than having the TDX host code to provide wrappers for each SEAMCALL or higher-level abstraction.

Long version: give a detailed explanation of the SEAMCALLs that will be used by KVM, and clarify that it's logically better to manage this code in KVM.
Dave, care to weigh in here? Context (whether to export the generic __seamcall for KVM's use): https://lore.kernel.org/lkml/e6e8f585-b718-4f53-88f6-89832a1e4b9f@intel.com/

On Fri, 2024-03-15 at 13:02 +1300, Huang, Kai wrote:
> KVM roughly will need to use dozens of SEAMCALLs, and all these are
> logically related to creating and running TDX guests. It makes more
> sense to just export __seamcall() and let KVM maintain these VM-related
> wrappers rather than having the TDX host code to provide wrappers for
> each SEAMCALL or higher-level abstraction.

The helpers being discussed are these: https://lore.kernel.org/lkml/7cfd33d896fce7b49bcf4b7179d0ded22c06b8c2.1708933498.git.isaku.yamahata@intel.com/

I guess there are three options:
1. Export the low level seamcall function
2. Export a bunch of higher level helper functions
3. Duplicate __seamcall asm in KVM

Letting modules make unrestricted seamcalls is not ideal. Preventing the compiler from inlining the small logic in the static inline helpers is not ideal. Duplicating code is not ideal. Hmm.

I want to say 2 sounds the least worst of the three. But I'm not sure. I'm not sure if x86 folks would like to police new seamcalls, or be bothered by it, either.
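For concreteness, a rough sketch of what options 1 and 2 boil down to (the example leaf, its register usage, and the EXPORT_SYMBOL_GPL choices are illustrative assumptions; __seamcall() and the tdh_* helper shape follow the links above):

	/* Option 1: export the low-level entry points, KVM keeps its own wrappers. */
	EXPORT_SYMBOL_GPL(__seamcall);
	EXPORT_SYMBOL_GPL(__seamcall_ret);

	/*
	 * Option 2 (sketch with an assumed example leaf): keep __seamcall()
	 * private to the TDX host code and export one out-of-line helper per
	 * SEAMCALL instead, which is where the lost inlining comes from.
	 */
	u64 tdh_mng_key_config(u64 tdr)
	{
		struct tdx_module_args args = {
			.rcx = tdr,
		};

		return __seamcall(TDH_MNG_KEY_CONFIG, &args);
	}
	EXPORT_SYMBOL_GPL(tdh_mng_key_config);

Option 2 trades a handful of extra exports for keeping raw __seamcall() access out of modules, which is the policing question raised above.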
On 3/14/24 18:17, Edgecombe, Rick P wrote: > I guess there are three options: > 1. Export the low level seamcall function > 2. Export a bunch of higher level helper functions > 3. Duplicate __seamcall asm in KVM > > Letting modules make unrestricted seamcalls is not ideal. Preventing > the compiler from inlining the small logic in the static inline helpers > is not ideal. Duplicating code is not ideal. Hmm. > > I want to say 2 sounds the least worst of the three. But I'm not sure. > I'm not sure if x86 folks would like to police new seamcalls, or be > bothered by it, either. #3 is the only objectively awful one. :) In the end, we actually _want_ to have conversations about these things. There are going to be considerations about what functionality should be in KVM or the core kernel. We don't want KVM doing any calls that could affect global TDX module state, for instance. But I'd also defer to the KVM maintainers on this. They're the ones that have to play the symbol exporting game a lot more than I ever do. If they cringe at the idea of adding 20 (or whatever) exports, then that's a lot more important than the possibility of some other silly module abusing the generic exported __seamcall.
On Thu, Mar 14, 2024, Dave Hansen wrote: > On 3/14/24 18:17, Edgecombe, Rick P wrote: > > I guess there are three options: > > 1. Export the low level seamcall function > > 2. Export a bunch of higher level helper functions > > 3. Duplicate __seamcall asm in KVM > > > > Letting modules make unrestricted seamcalls is not ideal. Preventing > > the compiler from inlining the small logic in the static inline helpers > > is not ideal. Duplicating code is not ideal. Hmm. > > > > I want to say 2 sounds the least worst of the three. But I'm not sure. > > I'm not sure if x86 folks would like to police new seamcalls, or be > > bothered by it, either. > > #3 is the only objectively awful one. :) > > In the end, we actually _want_ to have conversations about these things. > There are going to be considerations about what functionality should be > in KVM or the core kernel. We don't want KVM doing any calls that could > affect global TDX module state, for instance. Heh, Like this one? static inline u64 tdh_sys_lp_shutdown(void) { struct tdx_module_args in = { }; return tdx_seamcall(TDH_SYS_LP_SHUTDOWN, &in, NULL); } Which isn't actually used... > But I'd also defer to the KVM maintainers on this. They're the ones > that have to play the symbol exporting game a lot more than I ever do. > If they cringe at the idea of adding 20 (or whatever) exports, then > that's a lot more important than the possibility of some other silly > module abusing the generic exported __seamcall. I don't care much about exports. What I do care about is sane code, and while the current code _looks_ pretty, it's actually quite insane. I get why y'all put SEAMCALL in assembly subroutines; the macro shenanigans I originally wrote years ago were their own brand of crazy, and dealing with GPRs that can't be asm() constraints often results in brittle code. But the tdx_module_args structure approach generates truly atrocious code. Yes, SEAMCALL is inherently slow, but that doesn't mean that we shouldn't at least try to generate efficient code. And it's not just efficiency that is lost, the generated code ends up being much harder to read than it ought to be. E.g. the seemingly simple static inline u64 tdh_mem_page_remove(hpa_t tdr, gpa_t gpa, int level, struct tdx_module_args *out) { struct tdx_module_args in = { .rcx = gpa | level, .rdx = tdr, }; return tdx_seamcall_sept(TDH_MEM_PAGE_REMOVE, &in, out); } generates the below monstrosity with gcc-13. And that's just one SEAMCALL wrapper, *every* single one generates the same mess. clang-16 is kinda sorta a little better, as it at least inlines the helpers that have single callers. So my feedback is to not worry about the exports, and instead focus on figuring out a way to make the generated code less bloated and easier to read/debug. 
Dump of assembler code for function tdh_mem_page_remove: 0x0000000000032b20 <+0>: push %r15 0x0000000000032b22 <+2>: xor %eax,%eax 0x0000000000032b24 <+4>: movabs $0x8000ff0000000006,%r15 0x0000000000032b2e <+14>: push %r14 0x0000000000032b30 <+16>: mov %rcx,%r14 0x0000000000032b33 <+19>: mov $0xb,%ecx 0x0000000000032b38 <+24>: push %r13 0x0000000000032b3a <+26>: movslq %edx,%r13 0x0000000000032b3d <+29>: push %r12 0x0000000000032b3f <+31>: or %rsi,%r13 0x0000000000032b42 <+34>: mov $0x11,%r12d 0x0000000000032b48 <+40>: push %rbp 0x0000000000032b49 <+41>: movabs $0x8000020300000000,%rbp 0x0000000000032b53 <+51>: push %rbx 0x0000000000032b54 <+52>: sub $0x70,%rsp 0x0000000000032b58 <+56>: mov %rdi,(%rsp) 0x0000000000032b5c <+60>: lea 0x18(%rsp),%rdi 0x0000000000032b61 <+65>: rep stos %rax,%es:(%rdi) 0x0000000000032b64 <+68>: mov (%rsp),%rax 0x0000000000032b68 <+72>: mov %r13,(%r14) 0x0000000000032b6b <+75>: mov $0xa,%ebx 0x0000000000032b70 <+80>: mov %rax,0x8(%r14) 0x0000000000032b74 <+84>: mov 0x18(%rsp),%rax 0x0000000000032b79 <+89>: mov %rax,0x10(%r14) 0x0000000000032b7d <+93>: mov 0x20(%rsp),%rax 0x0000000000032b82 <+98>: mov %rax,0x18(%r14) 0x0000000000032b86 <+102>: mov 0x28(%rsp),%rax 0x0000000000032b8b <+107>: mov %rax,0x20(%r14) 0x0000000000032b8f <+111>: mov 0x30(%rsp),%rax 0x0000000000032b94 <+116>: mov %rax,0x28(%r14) 0x0000000000032b98 <+120>: mov 0x38(%rsp),%rax 0x0000000000032b9d <+125>: mov %rax,0x30(%r14) 0x0000000000032ba1 <+129>: mov 0x40(%rsp),%rax 0x0000000000032ba6 <+134>: mov %rax,0x38(%r14) 0x0000000000032baa <+138>: mov 0x48(%rsp),%rax 0x0000000000032baf <+143>: mov %rax,0x40(%r14) 0x0000000000032bb3 <+147>: mov 0x50(%rsp),%rax 0x0000000000032bb8 <+152>: mov %rax,0x48(%r14) 0x0000000000032bbc <+156>: mov 0x58(%rsp),%rax 0x0000000000032bc1 <+161>: mov %rax,0x50(%r14) 0x0000000000032bc5 <+165>: mov 0x60(%rsp),%rax 0x0000000000032bca <+170>: mov %rax,0x58(%r14) 0x0000000000032bce <+174>: mov 0x68(%rsp),%rax 0x0000000000032bd3 <+179>: mov %rax,0x60(%r14) 0x0000000000032bd7 <+183>: mov %r14,%rsi 0x0000000000032bda <+186>: mov $0x1d,%edi 0x0000000000032bdf <+191>: call 0x32be4 <tdh_mem_page_remove+196> 0x0000000000032be4 <+196>: cmp %rbp,%rax 0x0000000000032be7 <+199>: jne 0x32bfd <tdh_mem_page_remove+221> 0x0000000000032be9 <+201>: sub $0x1,%ebx 0x0000000000032bec <+204>: jne 0x32bd7 <tdh_mem_page_remove+183> 0x0000000000032bee <+206>: add $0x70,%rsp 0x0000000000032bf2 <+210>: pop %rbx 0x0000000000032bf3 <+211>: pop %rbp 0x0000000000032bf4 <+212>: pop %r12 0x0000000000032bf6 <+214>: pop %r13 0x0000000000032bf8 <+216>: pop %r14 0x0000000000032bfa <+218>: pop %r15 0x0000000000032bfc <+220>: ret 0x0000000000032bfd <+221>: cmp %r15,%rax 0x0000000000032c00 <+224>: je 0x32c2a <tdh_mem_page_remove+266> 0x0000000000032c02 <+226>: movabs $0x8000020000000092,%rdx 0x0000000000032c0c <+236>: cmp %rdx,%rax 0x0000000000032c0f <+239>: jne 0x32bee <tdh_mem_page_remove+206> 0x0000000000032c11 <+241>: sub $0x1,%r12d 0x0000000000032c15 <+245>: jne 0x32b64 <tdh_mem_page_remove+68> 0x0000000000032c1b <+251>: add $0x70,%rsp 0x0000000000032c1f <+255>: pop %rbx 0x0000000000032c20 <+256>: pop %rbp 0x0000000000032c21 <+257>: pop %r12 0x0000000000032c23 <+259>: pop %r13 0x0000000000032c25 <+261>: pop %r14 0x0000000000032c27 <+263>: pop %r15 0x0000000000032c29 <+265>: ret 0x0000000000032c2a <+266>: call 0x32c2f <tdh_mem_page_remove+271> 0x0000000000032c2f <+271>: xor %eax,%eax 0x0000000000032c31 <+273>: add $0x70,%rsp 0x0000000000032c35 <+277>: pop %rbx 0x0000000000032c36 <+278>: pop %rbp 
0x0000000000032c37 <+279>: pop %r12 0x0000000000032c39 <+281>: pop %r13 0x0000000000032c3b <+283>: pop %r14 0x0000000000032c3d <+285>: pop %r15 0x0000000000032c3f <+287>: ret End of assembler dump.
On Fri, Mar 15, 2024, Sean Christopherson wrote: > So my feedback is to not worry about the exports, and instead focus on figuring > out a way to make the generated code less bloated and easier to read/debug. Oh, and please make it a collaborative, public effort. I don't want to hear crickets and then see v20 dropped with a completely new SEAMCALL scheme.
On Fri, 2024-03-15 at 09:33 -0700, Sean Christopherson wrote: > Heh, Like this one? > > static inline u64 tdh_sys_lp_shutdown(void) > { > struct tdx_module_args in = { > }; > > return tdx_seamcall(TDH_SYS_LP_SHUTDOWN, &in, NULL); > } > > Which isn't actually used... Looks like is was turned into a NOP in TDX 1.5. So will even forever be dead code. I see one other that is unused. Thanks for pointing it out. > > > But I'd also defer to the KVM maintainers on this. They're the > > ones > > that have to play the symbol exporting game a lot more than I ever > > do. > > If they cringe at the idea of adding 20 (or whatever) exports, then > > that's a lot more important than the possibility of some other > > silly > > module abusing the generic exported __seamcall. > > I don't care much about exports. What I do care about is sane code, > and while > the current code _looks_ pretty, it's actually quite insane. > > I get why y'all put SEAMCALL in assembly subroutines; the macro > shenanigans I > originally wrote years ago were their own brand of crazy, and dealing > with GPRs > that can't be asm() constraints often results in brittle code. I guess it must be this, for the initiated: https://lore.kernel.org/lkml/25f0d2c2f73c20309a1b578cc5fc15f4fd6b9a13.1605232743.git.isaku.yamahata@intel.com/ > > But the tdx_module_args structure approach generates truly atrocious > code. Yes, > SEAMCALL is inherently slow, but that doesn't mean that we shouldn't > at least try > to generate efficient code. And it's not just efficiency that is > lost, the > generated code ends up being much harder to read than it ought to be. > > [snip] > > So my feedback is to not worry about the exports, and instead focus > on figuring > out a way to make the generated code less bloated and easier to > read/debug. > Thanks for the feedback both! It sounds like everyone is flexible on the exports. As for the generated code, oof. Kai, I see the solution has gone through some iterations already. First the macro one linked above, then that was dropped pretty quick to something that loses the asm constraints: https://lore.kernel.org/lkml/e777bbbe10b1ec2c37d85dcca2e175fe3bc565ec.1625186503.git.isaku.yamahata@intel.com/ Then next the struct grew here, and here: https://lore.kernel.org/linux-mm/20230628211132.GS38236@hirez.programming.kicks-ass.net/ https://lore.kernel.org/linux-mm/20230630102141.GA2534364@hirez.programming.kicks-ass.net/ Not sure I understand all of the constraints yet. Do you have any ideas?
On Fri, 2024-03-15 at 10:46 -0700, Sean Christopherson wrote: > On Fri, Mar 15, 2024, Sean Christopherson wrote: > > So my feedback is to not worry about the exports, and instead focus > > on figuring > > out a way to make the generated code less bloated and easier to > > read/debug. > > Oh, and please make it a collaborative, public effort. I don't want > to hear > crickets and then see v20 dropped with a completely new SEAMCALL > scheme. And here we we're worrying that people might eventually grow tired of us adding mails to v19 and we debate every detail in public. Will do.
On 3/15/24 09:33, Sean Christopherson wrote:
> static inline u64 tdh_mem_page_remove(hpa_t tdr, gpa_t gpa, int level,
>                                       struct tdx_module_args *out)
> {
>         struct tdx_module_args in = {
>                 .rcx = gpa | level,
>                 .rdx = tdr,
>         };
>
>         return tdx_seamcall_sept(TDH_MEM_PAGE_REMOVE, &in, out);
> }
>
> generates the below monstrosity with gcc-13. And that's just one SEAMCALL wrapper,
> *every* single one generates the same mess. clang-16 is kinda sorta a little
> better, as it at least inlines the helpers that have single callers.

Yeah, that's really awful.

Is all the inlining making the compiler too ambitious? Why is this all inlined in the first place? tdh_mem_page_remove() _should_ just be logically:

 * initialize tdx_module_args. Move a few things into place on the stack and zero the rest.
 * Put a pointer to tdx_module_args in a register
 * Put TDH_MEM_PAGE_REMOVE immediate in a register
 * Some register preservation, maybe
 * call
 * maybe some cleanup
 * return

Those logical things are *NOT* easy to spot in the disassembly.
On Fri, Mar 15, 2024, Rick P Edgecombe wrote: > On Fri, 2024-03-15 at 10:46 -0700, Sean Christopherson wrote: > > On Fri, Mar 15, 2024, Sean Christopherson wrote: > > > So my feedback is to not worry about the exports, and instead focus on > > > figuring out a way to make the generated code less bloated and easier to > > > read/debug. > > > > Oh, and please make it a collaborative, public effort. I don't want to > > hear crickets and then see v20 dropped with a completely new SEAMCALL > > scheme. > > And here we we're worrying that people might eventually grow tired of > us adding mails to v19 and we debate every detail in public. Will do. As a general rule, I _strongly_ prefer all review to be done on-list, in public. Copy+pasting myself from another Intel series[*] : Correct, what I object to is Intel _requiring_ a Reviewed-by before posting. : : And while I'm certainly not going to refuse patches that have been reviewed : internally, I _strongly_ prefer reviews be on-list so that they are public and : recorded. Being able to go back and look at the history and evolution of patches : is valuable, and the discussion itself is often beneficial to non-participants, : e.g. people that are new-ish to KVM and/or aren't familiar with the feature being : enabled can often learn new things and avoid similar pitfalls of their own. There are definitely situations where exceptions are warranted, e.g. if someone is a first-time poster and/or wants a sanity check to make sure their idea isn't completely crazy. But even then, the internal review should only be very cursory. In addition to the history being valuable, doing reviews in public minimizes the probability of a developer being led astray, e.g. due to someone internally saying do XYZ, and then upstream reviewers telling them to do something entirely different. As far as noise goes, look at it this way. Every time a new TDX series is posted, I get 130+ emails. Y'all can do a _lot_ of public review and discussion before you'll get anywhere near the point where it'd be noiser than spinning a new version of the series. [*] https://lore.kernel.org/all/Y+ZxLfCrcTQ6poYg@google.com
On Fri, Mar 15, 2024, Dave Hansen wrote:
> On 3/15/24 09:33, Sean Christopherson wrote:
> > static inline u64 tdh_mem_page_remove(hpa_t tdr, gpa_t gpa, int level,
> >                                       struct tdx_module_args *out)
> > {
> >         struct tdx_module_args in = {
> >                 .rcx = gpa | level,
> >                 .rdx = tdr,
> >         };
> >
> >         return tdx_seamcall_sept(TDH_MEM_PAGE_REMOVE, &in, out);
> > }
> >
> > generates the below monstrosity with gcc-13. And that's just one SEAMCALL wrapper,
> > *every* single one generates the same mess. clang-16 is kinda sorta a little
> > better, as it at least inlines the helpers that have single callers.
>
> Yeah, that's really awful.
>
> Is all the inlining making the compiler too ambitious?

No, whether or not the wrappers are inlined doesn't change anything. gcc actually doesn't inline any of these helpers. More below.

> Why is this all inlined in the first place?

Likely because no one looked at the generated code. The C code is super simple and looks like it should be inlined. And the very original code was macro heavy, i.e. relied on inlining to allow the compiler to precisely set only the actual registers needed for the SEAMCALL.

> tdh_mem_page_remove() _should_ just be logically:
>
> * initialize tdx_module_args. Move a few things into place on
> the stack and zero the rest.

The "zero the rest" is what generates the fugly code. The underlying problem is that the SEAMCALL assembly functions unpack _all_ registers from tdx_module_args. As a result, tdx_module_args needs to be zeroed to avoid loading registers with uninitialized stack data. E.g. before I looked at the assembly code, my initial thought was to clean things up by doing:

	struct tdx_module_args in;

	in.rcx = gpa | level;
	in.rdx = tdr;

but that would make one or more sanitizers (and maybe even the compiler itself) more than a bit unhappy. The struct is 72 bytes, which adds up to a lot of wasted effort since the majority of SEAMCALLs only use a few of the 13 registers.

FWIW, the guest side of TDX is equally gross. E.g.
to do kvm_hypercall1(), which without TDX is simply 0xffffffff810eb4a2 <+82>: 48 c7 c0 8c 9a 01 00 mov $0x19a8c,%rax 0xffffffff810eb4a9 <+89>: 8b 1c 02 mov (%rdx,%rax,1),%ebx 0xffffffff810eb4ac <+92>: b8 0b 00 00 00 mov $0xb,%eax 0xffffffff810eb4b1 <+97>: 0f 01 c1 vmcall 0xffffffff810eb4b4 <+100>: 5b pop %rbx 0xffffffff810eb4b5 <+101>: 5d pop %rbp the kernel blasts the unused params 0xffffffff810ffdc1 <+97>: bf 0b 00 00 00 mov $0xb,%edi 0xffffffff810ffdc6 <+102>: 31 d2 xor %edx,%edx 0xffffffff810ffdc8 <+104>: 31 c9 xor %ecx,%ecx 0xffffffff810ffdca <+106>: 45 31 c0 xor %r8d,%r8d 0xffffffff810ffdcd <+109>: 5b pop %rbx 0xffffffff810ffdce <+110>: 41 5e pop %r14 0xffffffff810ffdd0 <+112>: e9 bb 1b f0 ff jmp 0xffffffff81001990 <tdx_kvm_hypercall> then loads and zeros a ton of memory (tdx_kvm_hypercall()): 0xffffffff81001990 <+0>: nopl 0x0(%rax,%rax,1) 0xffffffff81001995 <+5>: sub $0x70,%rsp 0xffffffff81001999 <+9>: mov %gs:0x28,%rax 0xffffffff810019a2 <+18>: mov %rax,0x68(%rsp) 0xffffffff810019a7 <+23>: mov %edi,%eax 0xffffffff810019a9 <+25>: movq $0x0,0x18(%rsp) 0xffffffff810019b2 <+34>: movq $0x0,0x10(%rsp) 0xffffffff810019bb <+43>: movq $0x0,0x8(%rsp) 0xffffffff810019c4 <+52>: movq $0x0,(%rsp) 0xffffffff810019cc <+60>: mov %rax,0x20(%rsp) 0xffffffff810019d1 <+65>: mov %rsi,0x28(%rsp) 0xffffffff810019d6 <+70>: mov %rdx,0x30(%rsp) 0xffffffff810019db <+75>: mov %rcx,0x38(%rsp) 0xffffffff810019e0 <+80>: mov %r8,0x40(%rsp) 0xffffffff810019e5 <+85>: movq $0x0,0x48(%rsp) 0xffffffff810019ee <+94>: movq $0x0,0x50(%rsp) 0xffffffff810019f7 <+103>: movq $0x0,0x58(%rsp) 0xffffffff81001a00 <+112>: movq $0x0,0x60(%rsp) 0xffffffff81001a09 <+121>: mov %rsp,%rdi 0xffffffff81001a0c <+124>: call 0xffffffff819f0a80 <__tdx_hypercall> 0xffffffff81001a11 <+129>: mov %gs:0x28,%rcx 0xffffffff81001a1a <+138>: cmp 0x68(%rsp),%rcx 0xffffffff81001a1f <+143>: jne 0xffffffff81001a26 <tdx_kvm_hypercall+150> 0xffffffff81001a21 <+145>: add $0x70,%rsp 0xffffffff81001a25 <+149>: ret and then unpacks all of that memory back into registers, and reverses that last part on the way back, (__tdcall_saved_ret()): 0xffffffff819f0b10 <+0>: mov %rdi,%rax 0xffffffff819f0b13 <+3>: mov (%rsi),%rcx 0xffffffff819f0b16 <+6>: mov 0x8(%rsi),%rdx 0xffffffff819f0b1a <+10>: mov 0x10(%rsi),%r8 0xffffffff819f0b1e <+14>: mov 0x18(%rsi),%r9 0xffffffff819f0b22 <+18>: mov 0x20(%rsi),%r10 0xffffffff819f0b26 <+22>: mov 0x28(%rsi),%r11 0xffffffff819f0b2a <+26>: push %rbx 0xffffffff819f0b2b <+27>: push %r12 0xffffffff819f0b2d <+29>: push %r13 0xffffffff819f0b2f <+31>: push %r14 0xffffffff819f0b31 <+33>: push %r15 0xffffffff819f0b33 <+35>: mov 0x30(%rsi),%r12 0xffffffff819f0b37 <+39>: mov 0x38(%rsi),%r13 0xffffffff819f0b3b <+43>: mov 0x40(%rsi),%r14 0xffffffff819f0b3f <+47>: mov 0x48(%rsi),%r15 0xffffffff819f0b43 <+51>: mov 0x50(%rsi),%rbx 0xffffffff819f0b47 <+55>: push %rsi 0xffffffff819f0b48 <+56>: mov 0x58(%rsi),%rdi 0xffffffff819f0b4c <+60>: mov 0x60(%rsi),%rsi 0xffffffff819f0b50 <+64>: tdcall 0xffffffff819f0b54 <+68>: push %rax 0xffffffff819f0b55 <+69>: mov 0x8(%rsp),%rax 0xffffffff819f0b5a <+74>: mov %rsi,0x60(%rax) 0xffffffff819f0b5e <+78>: pop %rax 0xffffffff819f0b5f <+79>: pop %rsi 0xffffffff819f0b60 <+80>: mov %r12,0x30(%rsi) 0xffffffff819f0b64 <+84>: mov %r13,0x38(%rsi) 0xffffffff819f0b68 <+88>: mov %r14,0x40(%rsi) 0xffffffff819f0b6c <+92>: mov %r15,0x48(%rsi) 0xffffffff819f0b70 <+96>: mov %rbx,0x50(%rsi) 0xffffffff819f0b74 <+100>: mov %rdi,0x58(%rsi) 0xffffffff819f0b78 <+104>: mov %rcx,(%rsi) 0xffffffff819f0b7b <+107>: mov %rdx,0x8(%rsi) 
0xffffffff819f0b7f <+111>: mov %r8,0x10(%rsi) 0xffffffff819f0b83 <+115>: mov %r9,0x18(%rsi) 0xffffffff819f0b87 <+119>: mov %r10,0x20(%rsi) 0xffffffff819f0b8b <+123>: mov %r11,0x28(%rsi) 0xffffffff819f0b8f <+127>: xor %ecx,%ecx 0xffffffff819f0b91 <+129>: xor %edx,%edx 0xffffffff819f0b93 <+131>: xor %r8d,%r8d 0xffffffff819f0b96 <+134>: xor %r9d,%r9d 0xffffffff819f0b99 <+137>: xor %r10d,%r10d 0xffffffff819f0b9c <+140>: xor %r11d,%r11d 0xffffffff819f0b9f <+143>: xor %r12d,%r12d 0xffffffff819f0ba2 <+146>: xor %r13d,%r13d 0xffffffff819f0ba5 <+149>: xor %r14d,%r14d 0xffffffff819f0ba8 <+152>: xor %r15d,%r15d 0xffffffff819f0bab <+155>: xor %ebx,%ebx 0xffffffff819f0bad <+157>: xor %edi,%edi 0xffffffff819f0baf <+159>: pop %r15 0xffffffff819f0bb1 <+161>: pop %r14 0xffffffff819f0bb3 <+163>: pop %r13 0xffffffff819f0bb5 <+165>: pop %r12 0xffffffff819f0bb7 <+167>: pop %rbx 0xffffffff819f0bb8 <+168>: ret It's honestly quite amusing, because y'all took one what I see as one of the big advantages of TDX over SEV (using registers instead of shared memory), and managed to effectively turn it into a disadvantage. Again, I completely understand the maintenance and robustness benefits, but IMO the pendulum swung a bit too far in that direction.
On 3/15/24 12:38, Sean Christopherson wrote:
>> tdh_mem_page_remove() _should_ just be logically:
>>
>> * initialize tdx_module_args. Move a few things into place on
>> the stack and zero the rest.
> The "zero the rest" is what generates the fugly code. The underlying problem is
> that the SEAMCALL assembly functions unpack _all_ registers from tdx_module_args.
> As a result, tdx_module_args needs to be zeroed to avoid loading registers with
> uninitialized stack data.

It's the "zero the rest" and also the copy:

> +	if (out) {
> +		*out = *in;
> +		ret = seamcall_ret(op, out);
> +	} else
> +		ret = seamcall(op, in);

Things get a wee bit nicer if you do an out-of-line memcpy() instead of the structure copy. But the really fun part is that 'out' is NULL and the compiler *SHOULD* know it. I'm not actually sure what trips it up. In any case, I think it ends up generating code for both sides of the if/else including the entirely superfluous copy.

The two nested while loops (one for TDX_RND_NO_ENTROPY and the other for TDX_ERROR_SEPT_BUSY) also don't make for great code generation.

So, sure, the generated code here could be better. But there's a lot more going on here than just shuffling gunk in and out of the 'struct tdx_module_args', and there's a _lot_ more work to do for one of these than for a plain old kvm_hypercall*().

It might make sense to separate out the "out" functionality, and maybe to uninline _some_ of the helper levels. But after that, there's not a lot of low hanging fruit.
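A rough sketch of that direction, with assumed names (there is no tdx_seamcall_busy_retry() in the kernel; __seamcall()/__seamcall_ret() and the error codes come from the discussion above, and the bounded retry counts of the real helpers are omitted): pull the optional "out" copy and both retries into a single out-of-line helper, so each static inline wrapper only fills in its inputs and there is exactly one retry loop.

	/*
	 * Sketch only, not the kernel's implementation: one out-of-line helper
	 * that combines the "out" handling with the TDX_RND_NO_ENTROPY /
	 * TDX_ERROR_SEPT_BUSY retries that currently live in separate inline
	 * layers.  Retry limits are omitted for brevity.
	 */
	static noinline u64 tdx_seamcall_busy_retry(u64 op, struct tdx_module_args *in,
						    struct tdx_module_args *out)
	{
		u64 ret;

		do {
			if (out) {
				/* Only callers that want outputs pay for the copy. */
				*out = *in;
				ret = __seamcall_ret(op, out);
			} else {
				ret = __seamcall(op, in);
			}
		} while (ret == TDX_RND_NO_ENTROPY || ret == TDX_ERROR_SEPT_BUSY);

		return ret;
	}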
On Fri, 2024-03-15 at 17:48 +0000, Edgecombe, Rick P wrote: > On Fri, 2024-03-15 at 09:33 -0700, Sean Christopherson wrote: > > Heh, Like this one? > > > > static inline u64 tdh_sys_lp_shutdown(void) > > { > > struct tdx_module_args in = { > > }; > > > > return tdx_seamcall(TDH_SYS_LP_SHUTDOWN, &in, NULL); > > } > > > > Which isn't actually used... > > Looks like is was turned into a NOP in TDX 1.5. So will even forever be > dead code. I see one other that is unused. Thanks for pointing it out. > > > > > > But I'd also defer to the KVM maintainers on this. They're the > > > ones > > > that have to play the symbol exporting game a lot more than I ever > > > do. > > > If they cringe at the idea of adding 20 (or whatever) exports, then > > > that's a lot more important than the possibility of some other > > > silly > > > module abusing the generic exported __seamcall. > > > > I don't care much about exports. What I do care about is sane code, > > and while > > the current code _looks_ pretty, it's actually quite insane. > > > > I get why y'all put SEAMCALL in assembly subroutines; the macro > > shenanigans I > > originally wrote years ago were their own brand of crazy, and dealing > > with GPRs > > that can't be asm() constraints often results in brittle code. > > I guess it must be this, for the initiated: > https://lore.kernel.org/lkml/25f0d2c2f73c20309a1b578cc5fc15f4fd6b9a13.1605232743.git.isaku.yamahata@intel.com/ Hmm.. I lost memory of this. :-( > > > > > But the tdx_module_args structure approach generates truly atrocious > > code. Yes, > > SEAMCALL is inherently slow, but that doesn't mean that we shouldn't > > at least try > > to generate efficient code. And it's not just efficiency that is > > lost, the > > generated code ends up being much harder to read than it ought to be. > > > > > [snip] > > > > So my feedback is to not worry about the exports, and instead focus > > on figuring > > out a way to make the generated code less bloated and easier to > > read/debug. > > > > Thanks for the feedback both! It sounds like everyone is flexible on > the exports. As for the generated code, oof. > > Kai, I see the solution has gone through some iterations already. First > the macro one linked above, then that was dropped pretty quick to > something that loses the asm constraints: > https://lore.kernel.org/lkml/e777bbbe10b1ec2c37d85dcca2e175fe3bc565ec.1625186503.git.isaku.yamahata@intel.com/ Sorry I forgot for what reason we changed from the (bunch of) macros to this. > > Then next the struct grew here, and here: > https://lore.kernel.org/linux-mm/20230628211132.GS38236@hirez.programming.kicks-ass.net/ > https://lore.kernel.org/linux-mm/20230630102141.GA2534364@hirez.programming.kicks-ass.net/ > > Not sure I understand all of the constraints yet. Do you have any > ideas? This was due to Peter's request to unify the TDCALL/SEAMCALL and TDVMCALL assembly, which was done by this series: https://lore.kernel.org/lkml/cover.1692096753.git.kai.huang@intel.com/ So when Peter requested this, the __seamcall() and __tdcall() (or __tdx_module_call() before the above series) were already sharing one assembly macro. The TDVMCALL were using a different macro, though. However the two assembly macros used similar structure and similar code, so we tried to unify them. And another reason that we changed from ... u64 __seamcall(u64 fn, u64 rcx, u64 rdx, ..., struct tdx_module_args *out); to ... u64 __seamcall(u64, struct tdx_module_args *args); ... was the former doesn't extend. 
E.g., the new live-migration-related SEAMCALLs use more registers as input. It is just insane to have so many individual regs as function arguments. And Peter wanted to use __seamcall() to cover TDH.VP.ENTER too, which historically had been implemented in KVM using its own assembly, because VP.ENTER basically uses all GPRs as input/output.
> > then loads and zeros a ton of memory (tdx_kvm_hypercall()): > > [...] > > and then unpacks all of that memory back into registers, and reverses that last > part on the way back, (__tdcall_saved_ret()): > > [...] > > It's honestly quite amusing, because y'all took one what I see as one of the big > advantages of TDX over SEV (using registers instead of shared memory), and managed > to effectively turn it into a disadvantage. > > Again, I completely understand the maintenance and robustness benefits, but IMO > the pendulum swung a bit too far in that direction.

Having to zero-and-init the structure and store output regs to the structure is an unfortunate burden if we want to have a single API __seamcall() for all SEAMCALL leafs.

To (precisely) avoid writing to the unnecessary structure members before and after the SEAMCALL instruction, we need to use your old way to have a bunch of macros. But we may end up with *a lot* of macros due to needing to cover new (e.g., live migration) SEAMCALLs. Because essentially it's a game to implement wrappers for a bunch of combinations of each individual input/output regs.

I don't want to judge which way is better, but to be honest I think completely switching to the old way (using a bunch of macros) isn't a realistic option at this stage.

However, I think we might be able to change ...

	u64 __seamcall(u64 fn, struct tdx_module_args *args);

... to ...

	u64 __seamcall(u64 fn, struct tdx_module_args *in, struct tdx_module_args *out);

... so that the assembly can actually skip "storing output regs to the structure" if the SEAMCALL doesn't really have any output regs except RAX.

I can try to do if you guys believe this should be done, and should be done earlier than later, but I am not sure _ANY_ optimization around SEAMCALL will have meaningful performance improvement.
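To make that concrete, a minimal sketch of how a no-output leaf could look with the split prototype (the wrapper and the TDH.MEM.TRACK register usage are illustrative assumptions, not code from this series):

	/*
	 * Sketch only: with the proposed split prototype, a leaf such as
	 * TDH.MEM.TRACK, which only takes the TDR PA in RCX and returns a
	 * status code in RAX, passes a NULL "out".  The assembly can then
	 * skip storing the whole register set back to memory.
	 */
	static inline u64 tdh_mem_track(u64 tdr)
	{
		struct tdx_module_args in = {
			.rcx = tdr,
		};

		return __seamcall(TDH_MEM_TRACK, &in, NULL);
	}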
On 3/20/24 05:09, Huang, Kai wrote: > I can try to do if you guys believe this should be done, and should be done > earlier than later, but I am not sure _ANY_ optimization around SEAMCALL will > have meaningful performance improvement. I don't think Sean had performance concerns. I think he was having a justifiably violent reaction to how much more complicated the generated code is to do a SEAMCALL versus a good ol' KVM hypercall. "Complicated" in this case means lots more instructions and control flow. That's slower, sure, but the main impact is that when you go to debug it, it's *MUCH* harder to debug the SEAMCALL entry assembly than a KVM hypercall. My takeaway from this, though, is that we are relying on the compiler for a *LOT*. There are also so many levels in the helpers that it's hard to avoid silly things like two _separate_ retry loops. We should probably be looking at the generated code a _bit_ more often than never, and it's OK to tinker a _bit_ to make things out-of-line or make sure that the compiler optimizes everything that we think that it should. Also remember that there are very fun tools out there that can make this much easier than recompiling the kernel a billion times: https://godbolt.org/z/8ooE4d465
On 21/03/2024 4:07 am, Dave Hansen wrote: > On 3/20/24 05:09, Huang, Kai wrote: >> I can try to do if you guys believe this should be done, and should be done >> earlier than later, but I am not sure _ANY_ optimization around SEAMCALL will >> have meaningful performance improvement. > > I don't think Sean had performance concerns. > > I think he was having a justifiably violent reaction to how much more > complicated the generated code is to do a SEAMCALL versus a good ol' KVM > hypercall. Ah, I automatically linked "generating better code" to "in order to have better performance". My bad. > > "Complicated" in this case means lots more instructions and control > flow. That's slower, sure, but the main impact is that when you go to > debug it, it's *MUCH* harder to debug the SEAMCALL entry assembly than a > KVM hypercall. [...] > > My takeaway from this, though, is that we are relying on the compiler > for a *LOT*. There are also so many levels in the helpers that it's > hard to avoid silly things like two _separate_ retry loops. > > We should probably be looking at the generated code a _bit_ more often > than never, and it's OK to tinker a _bit_ to make things out-of-line or > make sure that the compiler optimizes everything that we think that it > should. Yeah, agreed. > > Also remember that there are very fun tools out there that can make this > much easier than recompiling the kernel a billion times: > > https://godbolt.org/z/8ooE4d465 Ah, this is useful. Thanks for sharing :-)
On Wed, Mar 20, 2024, Dave Hansen wrote: > On 3/20/24 05:09, Huang, Kai wrote: > > I can try to do if you guys believe this should be done, and should be done > > earlier than later, but I am not sure _ANY_ optimization around SEAMCALL will > > have meaningful performance improvement. > > I don't think Sean had performance concerns. > > I think he was having a justifiably violent reaction to how much more > complicated the generated code is to do a SEAMCALL versus a good ol' KVM > hypercall. Yep. The code essentially violates the principle of least surprise. I genuinely thought I was dumping the wrong function(s) when I first looked at the output.
On Fri, Mar 15, 2024 at 09:33:20AM -0700, Sean Christopherson wrote:
> So my feedback is to not worry about the exports, and instead focus on figuring
> out a way to make the generated code less bloated and easier to read/debug.

I think it was mistake trying to centralize TDCALL/SEAMCALL calls into few megawrappers. I think we can get better results by shifting leaf function wrappers into assembly.

We are going to have more assembly, but it should produce better result. Adding macros can help to write such wrapper and minimizer boilerplate.

Below is an example of how it can look like. It's not complete. I only converted TDCALLs, but TDVMCALLs or SEAMCALLs. TDVMCALLs are going to be more complex.

Any opinions? Is it something worth investing more time?

	.set offset_rcx, TDX_MODULE_rcx
	.set offset_rdx, TDX_MODULE_rdx
	.set offset_r8, TDX_MODULE_r8
	.set offset_r9, TDX_MODULE_r9
	.set offset_r10, TDX_MODULE_r10
	.set offset_r11, TDX_MODULE_r11

	.macro save_output struct_reg regs:vararg
	.irp reg,\regs
	movq %\reg, offset_\reg(%\struct_reg)
	.endr
	.endm

	.macro tdcall leaf
	movq \leaf, %rax
	.byte 0x66,0x0f,0x01,0xcc
	.endm

	.macro tdcall_or_panic leaf
	tdcall \leaf
	testq %rax, %rax
	jnz .Lpanic
	.endm

	SYM_FUNC_START(tdg_vm_rd)
		FRAME_BEGIN
		xorl %ecx, %ecx
		movq %rdi, %rdx
		tdcall_or_panic $TDG_VM_RD
		movq %r8, %rax
		RET
		FRAME_END
	SYM_FUNC_END(tdg_vm_rd)

	SYM_FUNC_START(tdg_vm_wr)
		FRAME_BEGIN
		xorl %ecx, %ecx
		movq %rsi, %r8
		movq %rdx, %r9
		movq %rdi, %rdx
		tdcall_or_panic $TDG_VM_WR
		/* Old value */
		movq %r8, %rax
		RET
		FRAME_END
	SYM_FUNC_END(tdg_vm_wr)

	SYM_FUNC_START(tdcs_ctls_set)
		FRAME_BEGIN
		movq $TDCS_TD_CTLS, %rdx
		xorl %ecx, %ecx
		movq %rdi, %r8
		movq %rdi, %r9
		tdcall $TDG_VM_WR
		testq %rax, %rax
		setz %al
		RET
		FRAME_END
	SYM_FUNC_END(tdcs_ctls_set)

	SYM_FUNC_START(tdg_sys_rd)
		FRAME_BEGIN
		xorl %ecx, %ecx
		movq %rdi, %rdx
		tdcall_or_panic $TDG_SYS_RD
		movq %r8, %rax
		RET
		FRAME_END
	SYM_FUNC_END(tdg_sys_rd)

	SYM_FUNC_START(tdg_vp_veinfo_get)
		FRAME_BEGIN
		tdcall_or_panic $TDG_VP_VEINFO_GET
		save_output struct_reg=rdi regs=rcx,rdx,r8,r9,r10
		FRAME_END
		RET
	SYM_FUNC_END(tdg_vp_veinfo_get)

	SYM_FUNC_START(tdg_vp_info)
		FRAME_BEGIN
		tdcall_or_panic $TDG_VP_INFO
		save_output struct_reg=rdi regs=rcx,rdx,r8,r9,r10,r11
		FRAME_END
		RET
	SYM_FUNC_END(tdg_vp_info)

	SYM_FUNC_START(tdg_mem_page_accept)
		FRAME_BEGIN
		movq %rdi, %rcx
		tdcall $TDG_MEM_PAGE_ACCEPT
		FRAME_END
		RET
	SYM_FUNC_END(tdg_mem_page_accept)

	SYM_FUNC_START(tdg_mr_report)
		FRAME_BEGIN
		movq %rdx, %r8
		movq %rdi, %rcx
		movq %rsi, %rdx
		tdcall $TDG_MR_REPORT
		FRAME_END
		RET
	SYM_FUNC_END(tdg_mr_report)

	.Lpanic:
		ud2
On Wed, 2024-04-10 at 15:49 +0300, Kirill A. Shutemov wrote: > On Fri, Mar 15, 2024 at 09:33:20AM -0700, Sean Christopherson wrote: > > So my feedback is to not worry about the exports, and instead focus on > > figuring > > out a way to make the generated code less bloated and easier to read/debug. > > I think it was mistake trying to centralize TDCALL/SEAMCALL calls into > few megawrappers. I think we can get better results by shifting leaf > function wrappers into assembly. > > We are going to have more assembly, but it should produce better result. > Adding macros can help to write such wrapper and minimizer boilerplate. > > Below is an example of how it can look like. It's not complete. I only > converted TDCALLs, but TDVMCALLs or SEAMCALLs. TDVMCALLs are going to be > more complex. > > Any opinions? Is it something worth investing more time? We discussed offline how implementing these for each TDVM/SEAMCALL increases the chances of a bug in just one TDVM/SEAMCALL. Which could making debugging problems more challenging. Kirill raised the possibility of some code generating solution like cpufeatures.h, that could take a spec and generate correct calls. So far no big wins have presented themselves. Kirill, do we think the path to move the messy part out-of-line will not work?
On Tue, Apr 16, 2024 at 07:45:18PM +0000, Edgecombe, Rick P wrote: > On Wed, 2024-04-10 at 15:49 +0300, Kirill A. Shutemov wrote: > > On Fri, Mar 15, 2024 at 09:33:20AM -0700, Sean Christopherson wrote: > > > So my feedback is to not worry about the exports, and instead focus on > > > figuring > > > out a way to make the generated code less bloated and easier to read/debug. > > > > I think it was mistake trying to centralize TDCALL/SEAMCALL calls into > > few megawrappers. I think we can get better results by shifting leaf > > function wrappers into assembly. > > > > We are going to have more assembly, but it should produce better result. > > Adding macros can help to write such wrapper and minimizer boilerplate. > > > > Below is an example of how it can look like. It's not complete. I only > > converted TDCALLs, but TDVMCALLs or SEAMCALLs. TDVMCALLs are going to be > > more complex. > > > > Any opinions? Is it something worth investing more time? > > We discussed offline how implementing these for each TDVM/SEAMCALL increases the > chances of a bug in just one TDVM/SEAMCALL. Which could making debugging > problems more challenging. Kirill raised the possibility of some code generating > solution like cpufeatures.h, that could take a spec and generate correct calls. > > So far no big wins have presented themselves. Kirill, do we think the path to > move the messy part out-of-line will not work? I converted all TDCALL and TDVMCALL leafs to direct assembly wrappers. Here's WIP branch: https://github.com/intel/tdx/commits/guest-tdx-asm/ I still need to clean it up and write commit messages and comments for all wrappers. Now I think it worth the shot. Any feedback?
On Thu, Apr 18, 2024, kirill.shutemov@linux.intel.com wrote: > On Tue, Apr 16, 2024 at 07:45:18PM +0000, Edgecombe, Rick P wrote: > > On Wed, 2024-04-10 at 15:49 +0300, Kirill A. Shutemov wrote: > > > On Fri, Mar 15, 2024 at 09:33:20AM -0700, Sean Christopherson wrote: > > > > So my feedback is to not worry about the exports, and instead focus on > > > > figuring > > > > out a way to make the generated code less bloated and easier to read/debug. > > > > > > I think it was mistake trying to centralize TDCALL/SEAMCALL calls into > > > few megawrappers. I think we can get better results by shifting leaf > > > function wrappers into assembly. > > > > > > We are going to have more assembly, but it should produce better result. > > > Adding macros can help to write such wrapper and minimizer boilerplate. > > > > > > Below is an example of how it can look like. It's not complete. I only > > > converted TDCALLs, but TDVMCALLs or SEAMCALLs. TDVMCALLs are going to be > > > more complex. > > > > > > Any opinions? Is it something worth investing more time? > > > > We discussed offline how implementing these for each TDVM/SEAMCALL increases the > > chances of a bug in just one TDVM/SEAMCALL. Which could making debugging > > problems more challenging. Kirill raised the possibility of some code generating > > solution like cpufeatures.h, that could take a spec and generate correct calls. > > > > So far no big wins have presented themselves. Kirill, do we think the path to > > move the messy part out-of-line will not work? > > I converted all TDCALL and TDVMCALL leafs to direct assembly wrappers. > Here's WIP branch: https://github.com/intel/tdx/commits/guest-tdx-asm/ > > I still need to clean it up and write commit messages and comments for all > wrappers. > > Now I think it worth the shot. > > Any feedback? I find it hard to review for correctness, and extremely susceptible to developer error. E.g. lots of copy+paste, and manual encoding of RCX to expose registers. It also bleeds TDX ABI into C code, e.g. /* * As per TDX GHCI CPUID ABI, r12-r15 registers contain contents of * EAX, EBX, ECX, EDX registers after the CPUID instruction execution. * So copy the register contents back to pt_regs. */ regs->ax = args.r12; regs->bx = args.r13; regs->cx = args.r14; regs->dx = args.r15; Oh, and it requires input/output paramters, which is quite gross for C code *and* for assembly code, e.g. u64 tdvmcall_map_gpa(u64 *gpa, u64 size); and then the accompanying assembly code: FRAME_BEGIN save_regs r12,r13 movq (%rdi), %r12 movq %rsi, %r13 movq $(TDX_R10 | TDX_R11 | TDX_R12 | TDX_R13), %rcx tdvmcall $TDVMCALL_MAP_GPA movq %r11, (%rdi) restore_regs r13,r12 FRAME_END RET I think having one trampoline makes sense, e.g. to minimize the probability of leaking register state to the VMM. The part that I don't like, and which generates awful code, is shoving register state into a memory structure. The annoying part with the TDX ABI is that it heavily uses r8-r15, and asm() constraints don't play nice with r8-15. But that doesn't mean we can't use asm() with macros, it just means we have to play games with registers. Because REG-REG moves are super cheap, and ignoring the fatal error goofiness, there are at most four inputs. That means having a single trampoline take *all* possible inputs is a non-issue. And we can avoiding polluting the inline code if we bury the register shuffling in the trampoline. 
And if we use asm() wrappers to call the trampoline, then the trampoline doesn't need to precisely follow the C calling convention. I.e. the trampoline can return with the outputs still in r12-r15, and let the asm() wrappers extract the outputs they want. As it stands, TDVMCALLs either have 0, 1, or 4 outputs. I.e. we only need three asm() wrappers. We could get away with one wrapper, but then users of the wrappers would need dummy variables for inputs *and* outputs, and the outputs get gross. Completely untested, but this is what I'm thinking. Side topic, I think making "tdcall" a macro that takes a leaf is a mistake. If/when an assembler learns what tdcall is, we're going to have to rewrite all of that code. And what a coincidence, my suggestion needs a bare TDCALL! :-) Side topic #2, I don't think the trampoline needs a stack frame, its a leaf function. Side topic #3, the ud2 to induce panic should be out-of-line. Weird? Yeah. But at least we one need to document one weird calling convention, and the ugliness is contained to three macros and a small assembly function. .pushsection .noinstr.text, "ax" SYM_FUNC_START(tdvmcall_trampoline) movq $TDX_HYPERCALL_STANDARD, %r10 movq %rax, %r11 movq %rdi, %r12 movq %rsi, %r13 movq %rdx, %r14 movq %rcx, %r15 movq $(TDX_R10 | TDX_R11 | TDX_R12 | TDX_R13 | TDX_R14 | TDX_R15), %rcx tdcall testq %rax, %rax jnz .Lpanic ret .Lpanic: ud2 SYM_FUNC_END(tdvmcall_trampoline) .popsection #define TDVMCALL(reason, in1, in2, in3, in4) \ ({ \ long __ret; \ \ asm( \ "call tdvmcall_trampoline\n\t" \ "mov %%r10, %0\n\t" \ : "=r" (__ret) \ : "D" (in1), "S"(in2), "d"(in3), "c" (in4) \ : "r12", "r13", "r14", "r15" \ ); \ __ret; \ }) #define TDVMCALL_1(reason, in1, in2, in3, in4, out1) \ ({ \ long __ret; \ \ asm( \ "call tdvmcall_trampoline\n\t" \ "mov %%r10, %0\n\t" \ "mov %%r12, %1\n\t" \ : "=r"(__ret) "=r" (out1) \ : "a"(reason), "D" (in1), "S"(in2), "d"(in3), "c" (in4) \ : "r12", "r13", "r14", "r15" \ ); \ __ret; \ }) #define TDVMCALL_4(reason, in1, in2, in3, in4, out1, out2, out3, out4) \ ({ \ long __ret; \ \ asm( \ "call tdvmcall_trampoline\n\t" \ "mov %%r10, %0\n\t" \ "mov %%r12, %1\n\t" \ "mov %%r13, %2\n\t" \ "mov %%r14, %3\n\t" \ "mov %%r15, %4\n\t" \ : "=r" (__ret), \ "=r" (out1), "=r" (out2), "=r" (out3), "=r" (out4) \ : "a"(reason), "D" (in1), "S"(in2), "d"(in3), "c" (in4) \ [reason] "i" (reason) \ : "r12", "r13", "r14", "r15" \ ); \ __ret; \ }) static int handle_halt(struct ve_info *ve) { if (TDVMCALL(EXIT_REASON_HALT, irqs_disabled(), 0, 0, 0)) return -EIO; return ve_instr_len(ve); } void __cpuidle tdx_safe_halt(void) { WARN_ONCE(TDVMCALL(EXIT_REASON_HALT, false, 0, 0, 0), "HLT instruction emulation failed"); } static int read_msr(struct pt_regs *regs, struct ve_info *ve) { u64 val; if (TDVMCALL_1(EXIT_REASON_MSR_READ, regs->cx, 0, 0, 0, val)) return -EIO; regs->ax = lower_32_bits(val); regs->dx = upper_32_bits(val); return ve_instr_len(ve); } static int write_msr(struct pt_regs *regs, struct ve_info *ve) { u64 val = (u64)regs->dx << 32 | regs->ax; if (TDVMCALL(EXIT_REASON_MSR_WRITE, regs->cx, val, 0, 0)) return -EIO; return ve_instr_len(ve); } static int handle_cpuid(struct pt_regs *regs, struct ve_info *ve) { /* * Only allow VMM to control range reserved for hypervisor * communication. * * Return all-zeros for any CPUID outside the range. It matches CPU * behaviour for non-supported leaf. 
*/ if (regs->ax < 0x40000000 || regs->ax > 0x4FFFFFFF) { regs->ax = regs->bx = regs->cx = regs->dx = 0; return ve_instr_len(ve); } if (TDVMCALL_4(EXIT_REASON_CPUID, regs->ax, regs->cx, 0, 0, regs->ax, regs->bx, regs->cx, regs->dx)) return -EIO; return ve_instr_len(ve); } static bool mmio_read(int size, u64 gpa, u64 *val) { *val = 0; return !TDVMCALL_1(EXIT_REASON_EPT_VIOLATION, size, EPT_READ, gpa, 0, val); } static bool mmio_write(int size, u64 gpa, u64 val) { return !TDVMCALL(EXIT_REASON_EPT_VIOLATION, size, EPT_WRITE, gpa, val); }
On Thu, Apr 18, 2024 at 11:26:11AM -0700, Sean Christopherson wrote: > On Thu, Apr 18, 2024, kirill.shutemov@linux.intel.com wrote: > > On Tue, Apr 16, 2024 at 07:45:18PM +0000, Edgecombe, Rick P wrote: > > > On Wed, 2024-04-10 at 15:49 +0300, Kirill A. Shutemov wrote: > > > > On Fri, Mar 15, 2024 at 09:33:20AM -0700, Sean Christopherson wrote: > > > > > So my feedback is to not worry about the exports, and instead focus on > > > > > figuring > > > > > out a way to make the generated code less bloated and easier to read/debug. > > > > > > > > I think it was mistake trying to centralize TDCALL/SEAMCALL calls into > > > > few megawrappers. I think we can get better results by shifting leaf > > > > function wrappers into assembly. > > > > > > > > We are going to have more assembly, but it should produce better result. > > > > Adding macros can help to write such wrapper and minimizer boilerplate. > > > > > > > > Below is an example of how it can look like. It's not complete. I only > > > > converted TDCALLs, but TDVMCALLs or SEAMCALLs. TDVMCALLs are going to be > > > > more complex. > > > > > > > > Any opinions? Is it something worth investing more time? > > > > > > We discussed offline how implementing these for each TDVM/SEAMCALL increases the > > > chances of a bug in just one TDVM/SEAMCALL. Which could making debugging > > > problems more challenging. Kirill raised the possibility of some code generating > > > solution like cpufeatures.h, that could take a spec and generate correct calls. > > > > > > So far no big wins have presented themselves. Kirill, do we think the path to > > > move the messy part out-of-line will not work? > > > > I converted all TDCALL and TDVMCALL leafs to direct assembly wrappers. > > Here's WIP branch: https://github.com/intel/tdx/commits/guest-tdx-asm/ > > > > I still need to clean it up and write commit messages and comments for all > > wrappers. > > > > Now I think it worth the shot. > > > > Any feedback? > > I find it hard to review for correctness, and extremely susceptible to developer > error. E.g. lots of copy+paste, and manual encoding of RCX to expose registers. Yes, I agree. The approach requires careful manual work and is error-prone. I was planning to get around and stare at every wrapper to make sure it is correct. This approach is not scalable, though. > It also bleeds TDX ABI into C code, e.g. > > /* > * As per TDX GHCI CPUID ABI, r12-r15 registers contain contents of > * EAX, EBX, ECX, EDX registers after the CPUID instruction execution. > * So copy the register contents back to pt_regs. > */ > regs->ax = args.r12; > regs->bx = args.r13; > regs->cx = args.r14; > regs->dx = args.r15; > > Oh, and it requires input/output paramters, which is quite gross for C code *and* > for assembly code, e.g. > > u64 tdvmcall_map_gpa(u64 *gpa, u64 size); > > and then the accompanying assembly code: > > FRAME_BEGIN > > save_regs r12,r13 > > movq (%rdi), %r12 > movq %rsi, %r13 > > movq $(TDX_R10 | TDX_R11 | TDX_R12 | TDX_R13), %rcx > > tdvmcall $TDVMCALL_MAP_GPA > > movq %r11, (%rdi) > > restore_regs r13,r12 > > FRAME_END > RET > > I think having one trampoline makes sense, e.g. to minimize the probability of > leaking register state to the VMM. The part that I don't like, and which generates > awful code, is shoving register state into a memory structure. I don't think we can get away with single trampoline. We have outliers. See TDG.VP.VMCALL<ReportFatalError> that uses pretty much all registers as input. 
And I hope we wouldn't need TDG.VP.VMCALL<Instruction.PCONFIG> any time soon. It uses all possible output registers. But I guess we can make a *few* wrappers that covers all needed cases. > The annoying part with the TDX ABI is that it heavily uses r8-r15, and asm() > constraints don't play nice with r8-15. But that doesn't mean we can't use asm() > with macros, it just means we have to play games with registers. > > Because REG-REG moves are super cheap, and ignoring the fatal error goofiness, > there are at most four inputs. That means having a single trampoline take *all* > possible inputs is a non-issue. And we can avoiding polluting the inline code if > we bury the register shuffling in the trampoline. > > And if we use asm() wrappers to call the trampoline, then the trampoline doesn't > need to precisely follow the C calling convention. I.e. the trampoline can return > with the outputs still in r12-r15, and let the asm() wrappers extract the outputs > they want. > > As it stands, TDVMCALLs either have 0, 1, or 4 outputs. I.e. we only need three > asm() wrappers. We could get away with one wrapper, but then users of the wrappers > would need dummy variables for inputs *and* outputs, and the outputs get gross. > > Completely untested, but this is what I'm thinking. Side topic, I think making > "tdcall" a macro that takes a leaf is a mistake. If/when an assembler learns what > tdcall is, we're going to have to rewrite all of that code. And what a coincidence, > my suggestion needs a bare TDCALL! :-) I guess rename the macro (if it still needed) to something like tdcall_leaf or something. > Side topic #2, I don't think the trampoline needs a stack frame, its a leaf function. Hm. I guess. > Side topic #3, the ud2 to induce panic should be out-of-line. Yeah. I switched to the inline one while debugging one section mismatch issue and forgot to switch back. > Weird? Yeah. But at least we one need to document one weird calling convention, > and the ugliness is contained to three macros and a small assembly function. Okay, the approach is worth exploring. I can work on it. You focuses here on TDVMCALL. What is your take on the rest of TDCALL? > .pushsection .noinstr.text, "ax" > SYM_FUNC_START(tdvmcall_trampoline) > movq $TDX_HYPERCALL_STANDARD, %r10 > movq %rax, %r11 > movq %rdi, %r12 > movq %rsi, %r13 > movq %rdx, %r14 > movq %rcx, %r15 > > movq $(TDX_R10 | TDX_R11 | TDX_R12 | TDX_R13 | TDX_R14 | TDX_R15), %rcx > > tdcall > > testq %rax, %rax > jnz .Lpanic > > ret > > .Lpanic: > ud2 > SYM_FUNC_END(tdvmcall_trampoline) > .popsection > > > #define TDVMCALL(reason, in1, in2, in3, in4) \ > ({ \ > long __ret; \ > \ > asm( \ > "call tdvmcall_trampoline\n\t" \ > "mov %%r10, %0\n\t" \ > : "=r" (__ret) \ > : "D" (in1), "S"(in2), "d"(in3), "c" (in4) \ > : "r12", "r13", "r14", "r15" \ > ); \ > __ret; \ > }) > > #define TDVMCALL_1(reason, in1, in2, in3, in4, out1) \ > ({ \ > long __ret; \ > \ > asm( \ > "call tdvmcall_trampoline\n\t" \ > "mov %%r10, %0\n\t" \ > "mov %%r12, %1\n\t" \ It is r11, not r12. 
> : "=r"(__ret) "=r" (out1) \ > : "a"(reason), "D" (in1), "S"(in2), "d"(in3), "c" (in4) \ > : "r12", "r13", "r14", "r15" \ > ); \ > __ret; \ > }) > > #define TDVMCALL_4(reason, in1, in2, in3, in4, out1, out2, out3, out4) \ > ({ \ > long __ret; \ > \ > asm( \ > "call tdvmcall_trampoline\n\t" \ > "mov %%r10, %0\n\t" \ > "mov %%r12, %1\n\t" \ > "mov %%r13, %2\n\t" \ > "mov %%r14, %3\n\t" \ > "mov %%r15, %4\n\t" \ > : "=r" (__ret), \ > "=r" (out1), "=r" (out2), "=r" (out3), "=r" (out4) \ > : "a"(reason), "D" (in1), "S"(in2), "d"(in3), "c" (in4) \ > [reason] "i" (reason) \ > : "r12", "r13", "r14", "r15" \ > ); \ > __ret; \ > }) > > static int handle_halt(struct ve_info *ve) > { > if (TDVMCALL(EXIT_REASON_HALT, irqs_disabled(), 0, 0, 0)) > return -EIO; > > return ve_instr_len(ve); > } > > void __cpuidle tdx_safe_halt(void) > { > WARN_ONCE(TDVMCALL(EXIT_REASON_HALT, false, 0, 0, 0), > "HLT instruction emulation failed"); > } > > static int read_msr(struct pt_regs *regs, struct ve_info *ve) > { > u64 val; > > if (TDVMCALL_1(EXIT_REASON_MSR_READ, regs->cx, 0, 0, 0, val)) > return -EIO; > > regs->ax = lower_32_bits(val); > regs->dx = upper_32_bits(val); > > return ve_instr_len(ve); > } > > static int write_msr(struct pt_regs *regs, struct ve_info *ve) > { > u64 val = (u64)regs->dx << 32 | regs->ax; > > if (TDVMCALL(EXIT_REASON_MSR_WRITE, regs->cx, val, 0, 0)) > return -EIO; > > return ve_instr_len(ve); > } > static int handle_cpuid(struct pt_regs *regs, struct ve_info *ve) > { > /* > * Only allow VMM to control range reserved for hypervisor > * communication. > * > * Return all-zeros for any CPUID outside the range. It matches CPU > * behaviour for non-supported leaf. > */ > if (regs->ax < 0x40000000 || regs->ax > 0x4FFFFFFF) { > regs->ax = regs->bx = regs->cx = regs->dx = 0; > return ve_instr_len(ve); > } > > if (TDVMCALL_4(EXIT_REASON_CPUID, regs->ax, regs->cx, 0, 0, > regs->ax, regs->bx, regs->cx, regs->dx)) > return -EIO; > > return ve_instr_len(ve); > } > > static bool mmio_read(int size, u64 gpa, u64 *val) > { > *val = 0; > return !TDVMCALL_1(EXIT_REASON_EPT_VIOLATION, size, EPT_READ, gpa, 0, val); > } > > static bool mmio_write(int size, u64 gpa, u64 val) > { > return !TDVMCALL(EXIT_REASON_EPT_VIOLATION, size, EPT_WRITE, gpa, val); > }
On Fri, Apr 19, 2024, kirill.shutemov@linux.intel.com wrote:
> On Thu, Apr 18, 2024 at 11:26:11AM -0700, Sean Christopherson wrote:
> > On Thu, Apr 18, 2024, kirill.shutemov@linux.intel.com wrote:
> > I think having one trampoline makes sense, e.g. to minimize the probability of
> > leaking register state to the VMM. The part that I don't like, and which generates
> > awful code, is shoving register state into a memory structure.
>
> I don't think we can get away with single trampoline. We have outliers.

Yeah, I should have said "one trampoline for all of the not insane APIs" :-)

> See TDG.VP.VMCALL<ReportFatalError> that uses pretty much all registers as
> input. And I hope we wouldn't need TDG.VP.VMCALL<Instruction.PCONFIG> any
> time soon. It uses all possible output registers.

XMM: just say no.

> But I guess we can make a *few* wrappers that covers all needed cases.

Yeah. I suspect two will suffice. One for the calls that stay at or below four inputs, and one for the fat ones like ReportFatalError that use everything under the sun.

For the latter one, marshalling data into registers via an in-memory structure makes sense, especially if we move away from tdx_module_args, e.g. this is quite clean and reasonable:

	union {
		/* Define register order according to the GHCI */
		struct { u64 r14, r15, rbx, rdi, rsi, r8, r9, rdx; };

		char str[64];
	} message;

	/* VMM assumes '\0' in byte 65, if the message took all 64 bytes */
	strtomem_pad(message.str, msg, '\0');

	/*
	 * This hypercall should never return and it is not safe
	 * to keep the guest running. Call it forever if it
	 * happens to return.
	 */
	while (1)
		tdx_fat_tdvmcall(&message);

> > Weird? Yeah. But at least we one need to document one weird calling convention,
> > and the ugliness is contained to three macros and a small assembly function.
>
> Okay, the approach is worth exploring. I can work on it.
>
> You focuses here on TDVMCALL. What is your take on the rest of TDCALL?

Not sure, haven't looked at them recently. At a glance, something similar? The use of high registers instead of RDI and RSI is damn annoying :-/

Hmm, but it looks like there are enough simple TDCALLs that stay away from high registers that open coding inline asm() is a viable (best?) approach. RAX being the leaf and the return value is annoying, so maybe a simple macro to make it easier to deal with that? It won't allow for perfectly optimal code generation, but forcing a MOV for a TDCALL isn't going to affect performance, and it will make reading the code dead simple.

	#define tdcall_leaf(leaf) "mov $" leaf ", %%eax\n\t.byte 0x66,0x0f,0x01,0xcc\n\t"

Then PAGE_ACCEPT is simply:

	asm(tdcall_leaf(TDG_MEM_PAGE_ACCEPT)
	    : "=a"(ret)
	    : "c"(start | page_size));
	if (ret)
		return 0;

And even the meanies that use R8 are reasonably easy to handle:

	asm("xor %%r8d, %%r8d\n\t"
	    tdcall_leaf(TDG_MR_REPORT)
	    : "=a"(ret)
	    : "c"(__pa(report)), "d"(__pa(data)));

and (though using names for the outputs, I just can't remember the syntax as I'm typing this :-/)

	asm(tdcall_leaf(TDG_VM_RD)
	    "mov %%r8, %0\n\t"
	    : "=r"(value), "=a"(ret)
	    : "c"(0), "d"(no_idea_what_this_is));
	if (ret)
		<cry>

	return value;

Or, if you wanted to get fancy, use asm_goto_output() to bail on an error so that you could optimize for using RAX as the return value, because *that's going to make all the difference :-D
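Since the thread leaves the named-output syntax hanging, here is what that last example looks like with symbolic operand names (a sketch only: the leaf constant, the RCX/RDX usage, and the conservative clobber list mirror the examples in this thread and are assumptions, not verified against the spec):

	/*
	 * Sketch of GCC's named asm operands for the TDG.VM.RD example above:
	 * [val] names an output operand, %[val] refers to it in the template,
	 * and the "0" matching constraint puts the leaf number in RAX.
	 * TDG_VM_RD is assumed to be the leaf number macro; RCX/RDX are marked
	 * read-write and R8-R11 clobbered as a conservative assumption.
	 */
	static inline u64 tdg_vm_rd_sketch(u64 field, u64 *value)
	{
		u64 ret, rcx = 0, rdx = field;

		asm volatile(".byte 0x66, 0x0f, 0x01, 0xcc\n\t"	/* TDCALL */
			     "mov %%r8, %[val]"
			     : "=a" (ret), [val] "=r" (*value),
			       "+c" (rcx), "+d" (rdx)
			     : "0" (TDG_VM_RD)
			     : "r8", "r9", "r10", "r11");
		return ret;
	}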
On Fri, 2024-04-19 at 17:46 +0300, kirill.shutemov@linux.intel.com wrote: > > > Side topic #3, the ud2 to induce panic should be out-of-line. > > Yeah. I switched to the inline one while debugging one section mismatch > issue and forgot to switch back. Sorry, why do we need to panic?
On Fri, Apr 19, 2024 at 08:04:26PM +0000, Edgecombe, Rick P wrote: > On Fri, 2024-04-19 at 17:46 +0300, kirill.shutemov@linux.intel.com wrote: > > > > > Side topic #3, the ud2 to induce panic should be out-of-line. > > > > Yeah. I switched to the inline one while debugging one section mismatch > > issue and forgot to switch back. > > Sorry, why do we need to panic? It panics in cases that should never occur if the TDX module is functioning properly. For example, TDVMCALL itself should never fail, although the leaf function could.
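For reference, the split between the two failure levels matches the current
mainline hypercall path. The sketch below paraphrases it (not verbatim, and it
omits details such as setting the RCX expose-registers mask): a non-zero RAX
from the TDCALL instruction means the TDVMCALL transition itself broke, which is
treated as fatal, while the status of the requested leaf comes back in R10 for
the caller to handle.

	/*
	 * Paraphrase of the mainline __tdx_hypercall() flow, for
	 * illustration only (the real helper also sets args->rcx to the
	 * expose-registers mask before the call).
	 */
	static __always_inline u64 tdx_hypercall_sketch(struct tdx_module_args *args)
	{
		/* TDVMCALL mechanism failure: "can't happen", treated as fatal */
		if (__tdcall_saved_ret(TDG_VP_VMCALL, args))
			__tdx_hypercall_failed();

		/* The leaf's own status is a normal outcome; the caller handles it */
		return args->r10;
	}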
On Mon, 2024-04-22 at 14:46 +0300, kirill.shutemov@linux.intel.com wrote:
> On Fri, Apr 19, 2024 at 08:04:26PM +0000, Edgecombe, Rick P wrote:
> > On Fri, 2024-04-19 at 17:46 +0300, kirill.shutemov@linux.intel.com wrote:
> > >
> > > > Side topic #3, the ud2 to induce panic should be out-of-line.
> > >
> > > Yeah. I switched to the inline one while debugging one section mismatch
> > > issue and forgot to switch back.
> >
> > Sorry, why do we need to panic?
>
> It panics in cases that should never occur if the TDX module is
> functioning properly. For example, TDVMCALL itself should never fail,
> although the leaf function could.

Panic should normally be for desperate situations when horrible things will
likely happen if we continue, right? Why are we adding a panic when we didn't
have one before? Is it a second change, or a side effect of the refactor?
On Mon, Apr 22, 2024, Rick P Edgecombe wrote:
> On Mon, 2024-04-22 at 14:46 +0300, kirill.shutemov@linux.intel.com wrote:
> > On Fri, Apr 19, 2024 at 08:04:26PM +0000, Edgecombe, Rick P wrote:
> > > On Fri, 2024-04-19 at 17:46 +0300, kirill.shutemov@linux.intel.com wrote:
> > > >
> > > > > Side topic #3, the ud2 to induce panic should be out-of-line.
> > > >
> > > > Yeah. I switched to the inline one while debugging one section mismatch
> > > > issue and forgot to switch back.
> > >
> > > Sorry, why do we need to panic?
> >
> > It panics in cases that should never occur if the TDX module is
> > functioning properly. For example, TDVMCALL itself should never fail,
> > although the leaf function could.
>
> Panic should normally be for desperate situations when horrible things will
> likely happen if we continue, right? Why are we adding a panic when we didn't
> have one before? Is it a second change, or a side effect of the refactor?

The kernel already does panic() if TDCALL itself fails,

	static inline void tdcall(u64 fn, struct tdx_module_args *args)
	{
		if (__tdcall_ret(fn, args))
			panic("TDCALL %lld failed (Buggy TDX module!)\n", fn);
	}

	/* Called from __tdx_hypercall() for unrecoverable failure */
	noinstr void __noreturn __tdx_hypercall_failed(void)
	{
		instrumentation_begin();
		panic("TDVMCALL failed. TDX module bug?");
	}

it's just done in C code via panic(), not in asm via a bare ud2.
On Mon, 2024-04-22 at 12:50 -0700, Sean Christopherson wrote:
> The kernel already does panic() if TDCALL itself fails,
>
> 	static inline void tdcall(u64 fn, struct tdx_module_args *args)
> 	{
> 		if (__tdcall_ret(fn, args))
> 			panic("TDCALL %lld failed (Buggy TDX module!)\n", fn);
> 	}
>
> 	/* Called from __tdx_hypercall() for unrecoverable failure */
> 	noinstr void __noreturn __tdx_hypercall_failed(void)
> 	{
> 		instrumentation_begin();
> 		panic("TDVMCALL failed. TDX module bug?");
> 	}
>
> it's just done in C code via panic(), not in asm via a bare ud2.

Hmm, I didn't realize. It looks like today some calls do and some don't. I don't
mean to reopen old debates. Just surprised that these are able to bring down the
system. Which funnily enough connects back to the original issue of the patch:
whether they are safe to export for module use.
On Fri, Apr 19, 2024 at 12:53:19PM -0700, Sean Christopherson wrote:
> > But I guess we can make a *few* wrappers that cover all needed cases.
>
> Yeah. I suspect two will suffice. One for the calls that stay at or below four
> inputs, and one for the fat ones like ReportFatalError that use everything under
> the sun.

I ended up with three helpers:

 - tdvmcall_trampoline() as you proposed.

 - tdvmcall_report_fatal_error() that does ReportFatalError specifically.
   Pointer to char array as an input. Never returns: no need to save/restore
   registers.

 - hv_tdx_hypercall(). Hyper-V annoyingly uses R8 and RDX as parameters :/

> Not sure, haven't looked at them recently. At a glance, something similar? The
> use of high registers instead of RDI and RSI is damn annoying :-/

I defined three helpers: TDCALL_0(), TDCALL_1() and TDCALL_5(). All take 4
input arguments.

I've updated the WIP branch with the new version:

https://github.com/intel/tdx/commits/guest-tdx-asm/

Any feedback is welcome.
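As a rough illustration of how the ReportFatalError helper described above might
be consumed, see the sketch below. The caller name and the exact signature are
guesses based on the description (pointer to a char array, never returns), not
code from the WIP branch.

	/*
	 * Hypothetical caller of tdvmcall_report_fatal_error(): the
	 * helper marshals the 64-byte message into R14..RDX itself, so
	 * the caller only pads the string out to the full buffer.
	 */
	void __noreturn tdx_report_fatal(const char *msg)
	{
		char message[64];

		strtomem_pad(message, msg, '\0');

		/*
		 * The helper is described as never returning; loop just
		 * in case the VMM hands control back anyway.
		 */
		while (1)
			tdvmcall_report_fatal_error(message);
	}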
diff --git a/arch/x86/include/asm/asm-prototypes.h b/arch/x86/include/asm/asm-prototypes.h index b1a98fa38828..0ec572ad75f1 100644 --- a/arch/x86/include/asm/asm-prototypes.h +++ b/arch/x86/include/asm/asm-prototypes.h @@ -13,6 +13,7 @@ #include <asm/preempt.h> #include <asm/asm.h> #include <asm/gsseg.h> +#include <asm/tdx.h> #ifndef CONFIG_X86_CMPXCHG64 extern void cmpxchg8b_emu(void); diff --git a/arch/x86/virt/vmx/tdx/seamcall.S b/arch/x86/virt/vmx/tdx/seamcall.S index 5b1f2286aea9..e32cf82ed47e 100644 --- a/arch/x86/virt/vmx/tdx/seamcall.S +++ b/arch/x86/virt/vmx/tdx/seamcall.S @@ -1,5 +1,6 @@ /* SPDX-License-Identifier: GPL-2.0 */ #include <linux/linkage.h> +#include <linux/export.h> #include <asm/frame.h> #include "tdxcall.S" @@ -21,6 +22,7 @@ SYM_FUNC_START(__seamcall) TDX_MODULE_CALL host=1 SYM_FUNC_END(__seamcall) +EXPORT_SYMBOL_GPL(__seamcall); /* * __seamcall_ret() - Host-side interface functions to SEAM software @@ -40,6 +42,7 @@ SYM_FUNC_END(__seamcall) SYM_FUNC_START(__seamcall_ret) TDX_MODULE_CALL host=1 ret=1 SYM_FUNC_END(__seamcall_ret) +EXPORT_SYMBOL_GPL(__seamcall_ret); /* * __seamcall_saved_ret() - Host-side interface functions to SEAM software @@ -59,3 +62,4 @@ SYM_FUNC_END(__seamcall_ret) SYM_FUNC_START(__seamcall_saved_ret) TDX_MODULE_CALL host=1 ret=1 saved=1 SYM_FUNC_END(__seamcall_saved_ret) +EXPORT_SYMBOL_GPL(__seamcall_saved_ret);
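For completeness, with __seamcall() exported a module-side wrapper reduces to a
thin inline along these lines. This is a sketch: the leaf constant
TDH_MNG_KEY_CONFIG, the TDR-address-in-RCX convention, and the hpa_t type come
from the posted KVM TDX series, not from this patch.

	/*
	 * Sketch of a KVM-side SEAMCALL wrapper built on the exported
	 * __seamcall(); the leaf number and argument layout mirror the
	 * wrappers discussed earlier in the thread.
	 */
	static inline u64 tdh_mng_key_config(hpa_t tdr)
	{
		struct tdx_module_args args = {
			.rcx = tdr,
		};

		return __seamcall(TDH_MNG_KEY_CONFIG, &args);
	}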