Message ID | 20230221163655.920289-2-mizhang@google.com (mailing list archive) |
---|---|
State | New, archived |
Headers | show |
Series | Overhauling amx_test | expand |
Mingwei! On Tue, Feb 21 2023 at 16:36, Mingwei Zhang wrote: > Avoid getting xstate address of init_fpstate if fpstate contains the xstate > component. Since XTILEDATA (bit 18) was turned off in xinit, when KVM calls > __raw_xsave_addr(xinit, 18), it triggers a warning as follows. This does not parse. Aside of that it gets the ordering of the changelog wrong. It explain first what the patch is doing by repeating the way too long subject line and then tries to give some explanation about the problem. KVM does not call __raw_xsave_addr() and the problem is completely independent of KVM. > __raw_xsave_addr() is an internal function that assume caller does the > checking, ie., all function arguments should be checked before calling. > So, instead of removing the WARNING, add checks in > __copy_xstate_to_uabi_buf(). I don't see checks added. The patch open codes copy_feature() and makes the __raw_xsave_addr() invocations conditional. So something like this: Subject: x86/fpu/xstate: Prevent false-positive warning in __copy_xstate_to_uabi_buf() __copy_xstate_to_uabi_buf() copies either from the tasks XSAVE buffer or from init_fpstate into the ptrace buffer. Dynamic features, like XTILEDATA, have an all zeroes init state and are not saved in init_fpstate, which means the corresponding bit is not set in the xfeatures bitmap of the init_fpstate header. But __copy_xstate_to_uabi_buf() retrieves addresses for both the tasks xstate and init_fpstate unconditionally via __raw_xsave_addr(). So if the tasks XSAVE buffer has a dynamic feature set, then the address retrieval for init_fpstate triggers the warning in __raw_xsave_addr() which checks the feature bit in the init_fpstate header. Prevent this by open coding copy_feature() and making the address retrieval conditional on the tasks XSAVE header bit. So the order here is (in paragraphs): 1) Explain the context 2) Explain whats wrong 3) Explain the consequence 4) Explain the solution briefly It does not even need a backtrace, because the backtrace does not give any relevant information which is not in the text above already. The actual callchain is completely irrelevant for desribing this issue. If you really want to add a backtrace, then please trim it by removing the irrelevant information. See https://www.kernel.org/doc/html/latest/process/submitting-patches.html#backtraces for further information. So for this case this would be: WARNING: CPU: 35 PID: 15304 at arch/x86/kernel/fpu/xstate.c:934 __raw_xsave_addr+0xc8/0xe0 Call Trace: <TASK> __copy_xstate_to_uabi_buf+0x3cb/0x520 fpu_copy_guest_fpstate_to_uabi+0x29/0x50 But even fpu_copy_guest_fpstate_to_uabi() is already useless because __copy_xstate_to_uabi_buf() has multiple callers which all have the very same problem and they are very simple to find. Backtraces are useful to illustrate a hard to understand callchain, but having ~40 lines of backtrace which only contains two relevant lines is just a distraction. See? > Fixes: e84ba47e313d ("x86/fpu: Hook up PKRU into ptrace()") The problem did not exist at this point in time because dynamic xfeatures did not exist, neither in hardware nor in kernel support. Even if dynamic features would have existed then the commit would not be the one which introduced the problem, All the commit does is to move back then valid code into a conditional code path. It fixes: 471f0aa7fa64 ("x86/fpu: Fix copy_xstate_to_uabi() to copy init states correctly") which attempted to fix exactly the problem you are seeing, but only addressed half of it. The underlying problem was introduced with 2308ee57d93d ("x86/fpu/amx: Enable the AMX feature in 64-bit mode") Random fixes tags are not really helpful. > @@ -1151,10 +1152,11 @@ void __copy_xstate_to_uabi_buf(struct membuf to, struct fpstate *fpstate, > pkru.pkru = pkru_val; > membuf_write(&to, &pkru, sizeof(pkru)); > } else { > - copy_feature(header.xfeatures & BIT_ULL(i), &to, > - __raw_xsave_addr(xsave, i), > - __raw_xsave_addr(xinit, i), > - xstate_sizes[i]); Please add a comment here to explain why this is open coded and does not use copy_feature(). Something like: /* * Open code copy_feature() to prevent retrieval * of a dynamic features address from xinit, which * is invalid because xinit does not contain the * all zeros init states of dynamic features and * emits a warning. */ > + xsave_addr = (header.xfeatures & BIT_ULL(i)) ? > + __raw_xsave_addr(xsave, i) : > + __raw_xsave_addr(xinit, i); > + > + membuf_write(&to, xsave_addr, xstate_sizes[i]); Other than that. Nice detective work! Thanks, tglx
On 2/21/2023 8:36 AM, Mingwei Zhang wrote: > Avoid getting xstate address of init_fpstate if fpstate contains the xstate > component. Since XTILEDATA (bit 18) was turned off in xinit, when KVM calls > __raw_xsave_addr(xinit, 18), it triggers a warning as follows. > > __raw_xsave_addr() is an internal function that assume caller does the > checking, ie., all function arguments should be checked before calling. > So, instead of removing the WARNING, add checks in > __copy_xstate_to_uabi_buf(). > <snip> > @@ -1151,10 +1152,11 @@ void __copy_xstate_to_uabi_buf(struct membuf to, struct fpstate *fpstate, > pkru.pkru = pkru_val; > membuf_write(&to, &pkru, sizeof(pkru)); > } else { > - copy_feature(header.xfeatures & BIT_ULL(i), &to, > - __raw_xsave_addr(xsave, i), > - __raw_xsave_addr(xinit, i), > - xstate_sizes[i]); > + xsave_addr = (header.xfeatures & BIT_ULL(i)) ? > + __raw_xsave_addr(xsave, i) : > + __raw_xsave_addr(xinit, i); > + > + membuf_write(&to, xsave_addr, xstate_sizes[i]); > } > /* > * Keep track of the last copied state in the non-compacted So this hunk is under for_each_extended_xfeature(i, mask) -- it skips the copy routine if mask[i] == 0; instead, it fills zeros. We have this [1]: if (fpu_state_size_dynamic()) mask &= (header.xfeatures | xinit->header.xcomp_bv); If header.xfeatures[18] = 0 then mask[18] = 0 because xinit->header.xcomp_bv[18] = 0. Then, it won't hit that code. So, I'm confused about the problem that you described here. Can you elaborate on your test case a bit? Let me try to reproduce the issue on my end. Thanks, Chang [1] https://git.kernel.org/pub/scm/linux/kernel/git/torvalds/linux.git/tree/arch/x86/kernel/fpu/xstate.c#n1134
On Tue, Feb 21 2023 at 19:05, Chang S. Bae wrote: > On 2/21/2023 8:36 AM, Mingwei Zhang wrote: >> @@ -1151,10 +1152,11 @@ void __copy_xstate_to_uabi_buf(struct membuf to, struct fpstate *fpstate, >> pkru.pkru = pkru_val; >> membuf_write(&to, &pkru, sizeof(pkru)); >> } else { >> - copy_feature(header.xfeatures & BIT_ULL(i), &to, >> - __raw_xsave_addr(xsave, i), >> - __raw_xsave_addr(xinit, i), >> - xstate_sizes[i]); >> + xsave_addr = (header.xfeatures & BIT_ULL(i)) ? >> + __raw_xsave_addr(xsave, i) : >> + __raw_xsave_addr(xinit, i); >> + >> + membuf_write(&to, xsave_addr, xstate_sizes[i]); >> } >> /* >> * Keep track of the last copied state in the non-compacted > > So this hunk is under for_each_extended_xfeature(i, mask) -- it skips > the copy routine if mask[i] == 0; instead, it fills zeros. > > We have this [1]: > > if (fpu_state_size_dynamic()) > mask &= (header.xfeatures | xinit->header.xcomp_bv); > > If header.xfeatures[18] = 0 then mask[18] = 0 because > xinit->header.xcomp_bv[18] = 0. Then, it won't hit that code. So, I'm > confused about the problem that you described here. Read the suggested changelog I wrote in my reply to Mingwei. TLDR: xsave.header.xfeatures[18] = 1 xinit.header.xfeatures[18] = 0 -> mask[18] = 1 -> __raw_xsave_addr(xsave, 18) <- Success -> __raw_xsave_addr(xinit, 18) <- WARN Thanks, tglx
> > We have this [1]: > > > > if (fpu_state_size_dynamic()) > > mask &= (header.xfeatures | xinit->header.xcomp_bv); > > > > If header.xfeatures[18] = 0 then mask[18] = 0 because > > xinit->header.xcomp_bv[18] = 0. Then, it won't hit that code. So, I'm > > confused about the problem that you described here. > > Read the suggested changelog I wrote in my reply to Mingwei. > > TLDR: > > xsave.header.xfeatures[18] = 1 > xinit.header.xfeatures[18] = 0 > -> mask[18] = 1 > -> __raw_xsave_addr(xsave, 18) <- Success > -> __raw_xsave_addr(xinit, 18) <- WARN > > Thanks, > > tglx Hi Thomas, Thanks for the review and I will provide the next version separately from the series, since this one is independent from the rest. Chang: to reproduce this issue, you can simply run the amx_test in the kvm selftest directory. Thanks. -Mingwei
On 2/22/2023 10:40 AM, Mingwei Zhang wrote: >>> We have this [1]: >>> >>> if (fpu_state_size_dynamic()) >>> mask &= (header.xfeatures | xinit->header.xcomp_bv); >>> >>> If header.xfeatures[18] = 0 then mask[18] = 0 because >>> xinit->header.xcomp_bv[18] = 0. Then, it won't hit that code. So, I'm >>> confused about the problem that you described here. >> >> Read the suggested changelog I wrote in my reply to Mingwei. >> >> TLDR: >> >> xsave.header.xfeatures[18] = 1 >> xinit.header.xfeatures[18] = 0 >> -> mask[18] = 1 >> -> __raw_xsave_addr(xsave, 18) <- Success >> -> __raw_xsave_addr(xinit, 18) <- WARN Oh, sigh.. This should be caught last time. Hmm, then since we store init state for legacy ones [1], unless it is too aggressive, perhaps the loop can be simplified like this: diff --git a/arch/x86/kernel/fpu/xstate.c b/arch/x86/kernel/fpu/xstate.c index 714166cc25f2..2dac6f5f3ade 100644 --- a/arch/x86/kernel/fpu/xstate.c +++ b/arch/x86/kernel/fpu/xstate.c @@ -1118,21 +1118,13 @@ void __copy_xstate_to_uabi_buf(struct membuf to, struct fpstate *fpstate, zerofrom = offsetof(struct xregs_state, extended_state_area); /* - * The ptrace buffer is in non-compacted XSAVE format. In - * non-compacted format disabled features still occupy state space, - * but there is no state to copy from in the compacted - * init_fpstate. The gap tracking will zero these states. + * Indicate which states to copy from fpstate. When not present in + * fpstate, those extended states are either initialized or + * disabled. They are also known to have an all zeros init state. + * Thus, remove them from 'mask' to zero those features in the user + * buffer instead of retrieving them from init_fpstate. */ - mask = fpstate->user_xfeatures; - - /* - * Dynamic features are not present in init_fpstate. When they are - * in an all zeros init state, remove those from 'mask' to zero - * those features in the user buffer instead of retrieving them - * from init_fpstate. - */ - if (fpu_state_size_dynamic()) - mask &= (header.xfeatures | xinit->header.xcomp_bv); + mask = header.xfeatures; for_each_extended_xfeature(i, mask) { /* @@ -1151,9 +1143,8 @@ void __copy_xstate_to_uabi_buf(struct membuf to, struct fpstate *fpstate, pkru.pkru = pkru_val; membuf_write(&to, &pkru, sizeof(pkru)); } else { - copy_feature(header.xfeatures & BIT_ULL(i), &to, + membuf_write(&to, __raw_xsave_addr(xsave, i), - __raw_xsave_addr(xinit, i), xstate_sizes[i]); } /* > Chang: to reproduce this issue, you can simply run the amx_test in the > kvm selftest directory. Yeah, I was able to reproduce it with this ptrace test: diff --git a/tools/testing/selftests/x86/amx.c b/tools/testing/selftests/x86/amx.c index 625e42901237..ae02bc81846d 100644 --- a/tools/testing/selftests/x86/amx.c +++ b/tools/testing/selftests/x86/amx.c @@ -14,8 +14,10 @@ #include <sys/auxv.h> #include <sys/mman.h> #include <sys/shm.h> +#include <sys/ptrace.h> #include <sys/syscall.h> #include <sys/wait.h> +#include <sys/uio.h> #include "../kselftest.h" /* For __cpuid_count() */ @@ -826,6 +828,76 @@ static void test_context_switch(void) free(finfo); } +/* Ptrace test */ + +static bool inject_tiledata(pid_t target) +{ + struct xsave_buffer *xbuf; + struct iovec iov; + + xbuf = alloc_xbuf(); + if (!xbuf) + fatal_error("unable to allocate XSAVE buffer"); + + load_rand_tiledata(xbuf); + + memcpy(&stashed_xsave->bytes[xtiledata.xbuf_offset], + &xbuf->bytes[xtiledata.xbuf_offset], + xtiledata.size); + + iov.iov_base = xbuf; + iov.iov_len = xbuf_size; + + if (ptrace(PTRACE_SETREGSET, target, (uint32_t)NT_X86_XSTATE, &iov)) + fatal_error("PTRACE_SETREGSET"); + + if (ptrace(PTRACE_GETREGSET, target, (uint32_t)NT_X86_XSTATE, &iov)) + err(1, "PTRACE_GETREGSET"); + + if (!memcmp(&stashed_xsave->bytes[xtiledata.xbuf_offset], + &xbuf->bytes[xtiledata.xbuf_offset], + xtiledata.size)) + return true; + else + return false; +} + +static void test_ptrace(void) +{ + pid_t child; + int status; + + child = fork(); + if (child < 0) { + err(1, "fork"); + } else if (!child) { + if (ptrace(PTRACE_TRACEME, 0, NULL, NULL)) + err(1, "PTRACE_TRACEME"); + + /* Use the state to expand the kernel buffer */ + load_rand_tiledata(stashed_xsave); + + raise(SIGTRAP); + _exit(0); + } + + do { + wait(&status); + } while (WSTOPSIG(status) != SIGTRAP); + + printf("\tInject tile data via ptrace()\n"); + + if (inject_tiledata(child)) + printf("[OK]\tTile data was written on ptracee.\n"); + else + printf("[FAIL]\tTile data was not written on ptracee.\n"); + + ptrace(PTRACE_DETACH, child, NULL, NULL); + wait(&status); + if (!WIFEXITED(status) || WEXITSTATUS(status)) + err(1, "ptrace test"); +} + int main(void) { /* Check hardware availability at first */ @@ -846,6 +918,8 @@ int main(void) ctxtswtest_config.num_threads = 5; test_context_switch(); + test_ptrace(); + clearhandler(SIGILL); free_stashed_xsave(); Thanks, Chang [1] https://git.kernel.org/pub/scm/linux/kernel/git/torvalds/linux.git/tree/arch/x86/kernel/fpu/xstate.c#n386
On Wed, Feb 22, 2023, Chang S. Bae wrote: > On 2/22/2023 10:40 AM, Mingwei Zhang wrote: > > > > We have this [1]: > > > > > > > > if (fpu_state_size_dynamic()) > > > > mask &= (header.xfeatures | xinit->header.xcomp_bv); > > > > > > > > If header.xfeatures[18] = 0 then mask[18] = 0 because > > > > xinit->header.xcomp_bv[18] = 0. Then, it won't hit that code. So, I'm > > > > confused about the problem that you described here. > > > > > > Read the suggested changelog I wrote in my reply to Mingwei. > > > > > > TLDR: > > > > > > xsave.header.xfeatures[18] = 1 > > > xinit.header.xfeatures[18] = 0 > > > -> mask[18] = 1 > > > -> __raw_xsave_addr(xsave, 18) <- Success > > > -> __raw_xsave_addr(xinit, 18) <- WARN > > Oh, sigh.. This should be caught last time. > > Hmm, then since we store init state for legacy ones [1], unless it is too > aggressive, perhaps the loop can be simplified like this: > > diff --git a/arch/x86/kernel/fpu/xstate.c b/arch/x86/kernel/fpu/xstate.c > index 714166cc25f2..2dac6f5f3ade 100644 > --- a/arch/x86/kernel/fpu/xstate.c > +++ b/arch/x86/kernel/fpu/xstate.c > @@ -1118,21 +1118,13 @@ void __copy_xstate_to_uabi_buf(struct membuf to, > struct fpstate *fpstate, > zerofrom = offsetof(struct xregs_state, extended_state_area); > > /* > - * The ptrace buffer is in non-compacted XSAVE format. In > - * non-compacted format disabled features still occupy state space, > - * but there is no state to copy from in the compacted > - * init_fpstate. The gap tracking will zero these states. > + * Indicate which states to copy from fpstate. When not present in > + * fpstate, those extended states are either initialized or > + * disabled. They are also known to have an all zeros init state. > + * Thus, remove them from 'mask' to zero those features in the user > + * buffer instead of retrieving them from init_fpstate. > */ > - mask = fpstate->user_xfeatures; Do we need to change this line and the comments? I don't see any of these was relevant to this issue. The original code semantic is to traverse all user_xfeatures, if it is available in fpstate, copy it from there; otherwise, copy it from init_fpstate. We do not assume the component in init_fpstate (but not in fpstate) are all zeros, do we? If it is safe to assume that, then it might be ok. But at least in this patch, I want to keep the original semantics as is without the assumption. > - > - /* > - * Dynamic features are not present in init_fpstate. When they are > - * in an all zeros init state, remove those from 'mask' to zero > - * those features in the user buffer instead of retrieving them > - * from init_fpstate. > - */ > - if (fpu_state_size_dynamic()) > - mask &= (header.xfeatures | xinit->header.xcomp_bv); > + mask = header.xfeatures; Same here. Let's not adding this optimization in this patch. > > for_each_extended_xfeature(i, mask) { > /* > @@ -1151,9 +1143,8 @@ void __copy_xstate_to_uabi_buf(struct membuf to, > struct fpstate *fpstate, > pkru.pkru = pkru_val; > membuf_write(&to, &pkru, sizeof(pkru)); > } else { > - copy_feature(header.xfeatures & BIT_ULL(i), &to, > + membuf_write(&to, > __raw_xsave_addr(xsave, i), > - __raw_xsave_addr(xinit, i), > xstate_sizes[i]); > } > /* > > > Chang: to reproduce this issue, you can simply run the amx_test in the > > kvm selftest directory. > > Yeah, I was able to reproduce it with this ptrace test: > > diff --git a/tools/testing/selftests/x86/amx.c > b/tools/testing/selftests/x86/amx.c > index 625e42901237..ae02bc81846d 100644 > --- a/tools/testing/selftests/x86/amx.c > +++ b/tools/testing/selftests/x86/amx.c > @@ -14,8 +14,10 @@ > #include <sys/auxv.h> > #include <sys/mman.h> > #include <sys/shm.h> > +#include <sys/ptrace.h> > #include <sys/syscall.h> > #include <sys/wait.h> > +#include <sys/uio.h> > > #include "../kselftest.h" /* For __cpuid_count() */ > > @@ -826,6 +828,76 @@ static void test_context_switch(void) > free(finfo); > } > > +/* Ptrace test */ > + > +static bool inject_tiledata(pid_t target) > +{ > + struct xsave_buffer *xbuf; > + struct iovec iov; > + > + xbuf = alloc_xbuf(); > + if (!xbuf) > + fatal_error("unable to allocate XSAVE buffer"); > + > + load_rand_tiledata(xbuf); > + > + memcpy(&stashed_xsave->bytes[xtiledata.xbuf_offset], > + &xbuf->bytes[xtiledata.xbuf_offset], > + xtiledata.size); > + > + iov.iov_base = xbuf; > + iov.iov_len = xbuf_size; > + > + if (ptrace(PTRACE_SETREGSET, target, (uint32_t)NT_X86_XSTATE, &iov)) > + fatal_error("PTRACE_SETREGSET"); > + > + if (ptrace(PTRACE_GETREGSET, target, (uint32_t)NT_X86_XSTATE, &iov)) > + err(1, "PTRACE_GETREGSET"); > + > + if (!memcmp(&stashed_xsave->bytes[xtiledata.xbuf_offset], > + &xbuf->bytes[xtiledata.xbuf_offset], > + xtiledata.size)) > + return true; > + else > + return false; > +} > + > +static void test_ptrace(void) > +{ > + pid_t child; > + int status; > + > + child = fork(); > + if (child < 0) { > + err(1, "fork"); > + } else if (!child) { > + if (ptrace(PTRACE_TRACEME, 0, NULL, NULL)) > + err(1, "PTRACE_TRACEME"); > + > + /* Use the state to expand the kernel buffer */ > + load_rand_tiledata(stashed_xsave); > + > + raise(SIGTRAP); > + _exit(0); > + } > + > + do { > + wait(&status); > + } while (WSTOPSIG(status) != SIGTRAP); > + > + printf("\tInject tile data via ptrace()\n"); > + > + if (inject_tiledata(child)) > + printf("[OK]\tTile data was written on ptracee.\n"); > + else > + printf("[FAIL]\tTile data was not written on ptracee.\n"); > + > + ptrace(PTRACE_DETACH, child, NULL, NULL); > + wait(&status); > + if (!WIFEXITED(status) || WEXITSTATUS(status)) > + err(1, "ptrace test"); > +} > + > int main(void) > { > /* Check hardware availability at first */ > @@ -846,6 +918,8 @@ int main(void) > ctxtswtest_config.num_threads = 5; > test_context_switch(); > > + test_ptrace(); > + > clearhandler(SIGILL); > free_stashed_xsave(); > > Thanks, > Chang > > [1] https://git.kernel.org/pub/scm/linux/kernel/git/torvalds/linux.git/tree/arch/x86/kernel/fpu/xstate.c#n386 > Nice one. Yeah both ptrace and KVM are calling this function so the above code would also be enough to trigger the bug. Thanks. -Mingwei
On 2/24/2023 3:56 PM, Mingwei Zhang wrote: > On Wed, Feb 22, 2023, Chang S. Bae wrote: >> >> /* >> - * The ptrace buffer is in non-compacted XSAVE format. In >> - * non-compacted format disabled features still occupy state space, >> - * but there is no state to copy from in the compacted >> - * init_fpstate. The gap tracking will zero these states. >> + * Indicate which states to copy from fpstate. When not present in >> + * fpstate, those extended states are either initialized or >> + * disabled. They are also known to have an all zeros init state. >> + * Thus, remove them from 'mask' to zero those features in the user >> + * buffer instead of retrieving them from init_fpstate. >> */ >> - mask = fpstate->user_xfeatures; > > Do we need to change this line and the comments? I don't see any of > these was relevant to this issue. The original code semantic is to > traverse all user_xfeatures, if it is available in fpstate, copy it from > there; otherwise, copy it from init_fpstate. We do not assume the > component in init_fpstate (but not in fpstate) are all zeros, do we? If > it is safe to assume that, then it might be ok. But at least in this > patch, I want to keep the original semantics as is without the > assumption. Here it has [1]: * * XSAVE could be used, but that would require to reshuffle the * data when XSAVEC/S is available because XSAVEC/S uses xstate * compaction. But doing so is a pointless exercise because most * components have an all zeros init state except for the legacy * ones (FP and SSE). Those can be saved with FXSAVE into the * legacy area. Adding new features requires to ensure that init * state is all zeroes or if not to add the necessary handling * here. */ fxsave(&init_fpstate.regs.fxsave); Thus, init_fpstate has zeros for those extended states. Then, copying from init_fpstate is the same as membuf_zero() by the gap tracking. But, we have two ways to do the same thing here. So I think it works that simply copying the state from fpstate only for those present there, then letting the gap tracking zero out for the rest of the userspace buffer for features that are either disabled or initialized. Then, we can remove accessing init_fpstate in the copy loop and which is the source of the problem. So I think this line change is relevant and also makes the code simple. I guess I'm fine if you don't want to do this. Then, let me follow up with something like this at first. Something like yours could be a fallback option for other good reasons, otherwise. Thanks, Chang [1] https://git.kernel.org/pub/scm/linux/kernel/git/torvalds/linux.git/tree/arch/x86/kernel/fpu/xstate.c#n386
On Fri, Feb 24, 2023, Chang S. Bae wrote: > On 2/24/2023 3:56 PM, Mingwei Zhang wrote: > > On Wed, Feb 22, 2023, Chang S. Bae wrote: > > > > > > /* > > > - * The ptrace buffer is in non-compacted XSAVE format. In > > > - * non-compacted format disabled features still occupy state space, > > > - * but there is no state to copy from in the compacted > > > - * init_fpstate. The gap tracking will zero these states. > > > + * Indicate which states to copy from fpstate. When not present in > > > + * fpstate, those extended states are either initialized or > > > + * disabled. They are also known to have an all zeros init state. > > > + * Thus, remove them from 'mask' to zero those features in the user > > > + * buffer instead of retrieving them from init_fpstate. > > > */ > > > - mask = fpstate->user_xfeatures; > > > > Do we need to change this line and the comments? I don't see any of > > these was relevant to this issue. The original code semantic is to > > traverse all user_xfeatures, if it is available in fpstate, copy it from > > there; otherwise, copy it from init_fpstate. We do not assume the > > component in init_fpstate (but not in fpstate) are all zeros, do we? If > > it is safe to assume that, then it might be ok. But at least in this > > patch, I want to keep the original semantics as is without the > > assumption. > > Here it has [1]: > > * > * XSAVE could be used, but that would require to reshuffle the > * data when XSAVEC/S is available because XSAVEC/S uses xstate > * compaction. But doing so is a pointless exercise because most > * components have an all zeros init state except for the legacy > * ones (FP and SSE). Those can be saved with FXSAVE into the > * legacy area. Adding new features requires to ensure that init > * state is all zeroes or if not to add the necessary handling > * here. > */ > fxsave(&init_fpstate.regs.fxsave); ah, I see. > > Thus, init_fpstate has zeros for those extended states. Then, copying from > init_fpstate is the same as membuf_zero() by the gap tracking. But, we have > two ways to do the same thing here. > > So I think it works that simply copying the state from fpstate only for > those present there, then letting the gap tracking zero out for the rest of > the userspace buffer for features that are either disabled or initialized. > > Then, we can remove accessing init_fpstate in the copy loop and which is the > source of the problem. So I think this line change is relevant and also > makes the code simple. > > I guess I'm fine if you don't want to do this. Then, let me follow up with > something like this at first. Something like yours could be a fallback > option for other good reasons, otherwise. hmm. I see. But this is still because of the software implementation. What if there is a new hardware component that requires a non-zero init state. For instance, in the past, we had PKRU component, whose init value is 0x555...54. Of course, that is a bad example because now we kick it out of the XSAVE/XRSTOR and special handling that, but there is no guarantee that in the future we will never need a non-zero init state. So, I will send out my fix and let you, Thomas and potentially other folks to decide what is the best option. Overall, I get your point. Thanks -Mingwei > > Thanks, > Chang > > [1] https://git.kernel.org/pub/scm/linux/kernel/git/torvalds/linux.git/tree/arch/x86/kernel/fpu/xstate.c#n386 > >
On 2/24/2023 5:09 PM, Mingwei Zhang wrote: > On Fri, Feb 24, 2023, Chang S. Bae wrote: >> On 2/24/2023 3:56 PM, Mingwei Zhang wrote: >>> On Wed, Feb 22, 2023, Chang S. Bae wrote: >>>> >>>> /* >>>> - * The ptrace buffer is in non-compacted XSAVE format. In >>>> - * non-compacted format disabled features still occupy state space, >>>> - * but there is no state to copy from in the compacted >>>> - * init_fpstate. The gap tracking will zero these states. >>>> + * Indicate which states to copy from fpstate. When not present in >>>> + * fpstate, those extended states are either initialized or >>>> + * disabled. They are also known to have an all zeros init state. >>>> + * Thus, remove them from 'mask' to zero those features in the user >>>> + * buffer instead of retrieving them from init_fpstate. >>>> */ >>>> - mask = fpstate->user_xfeatures; >>> >>> Do we need to change this line and the comments? I don't see any of >>> these was relevant to this issue. The original code semantic is to >>> traverse all user_xfeatures, if it is available in fpstate, copy it from >>> there; otherwise, copy it from init_fpstate. We do not assume the >>> component in init_fpstate (but not in fpstate) are all zeros, do we? If >>> it is safe to assume that, then it might be ok. But at least in this >>> patch, I want to keep the original semantics as is without the >>> assumption. >> >> Here it has [1]: >> >> * >> * XSAVE could be used, but that would require to reshuffle the >> * data when XSAVEC/S is available because XSAVEC/S uses xstate >> * compaction. But doing so is a pointless exercise because most >> * components have an all zeros init state except for the legacy >> * ones (FP and SSE). Those can be saved with FXSAVE into the >> * legacy area. Adding new features requires to ensure that init >> * state is all zeroes or if not to add the necessary handling >> * here. >> */ >> fxsave(&init_fpstate.regs.fxsave); > > ah, I see. >> >> Thus, init_fpstate has zeros for those extended states. Then, copying from >> init_fpstate is the same as membuf_zero() by the gap tracking. But, we have >> two ways to do the same thing here. >> >> So I think it works that simply copying the state from fpstate only for >> those present there, then letting the gap tracking zero out for the rest of >> the userspace buffer for features that are either disabled or initialized. >> >> Then, we can remove accessing init_fpstate in the copy loop and which is the >> source of the problem. So I think this line change is relevant and also >> makes the code simple. >> >> I guess I'm fine if you don't want to do this. Then, let me follow up with >> something like this at first. Something like yours could be a fallback >> option for other good reasons, otherwise. > > hmm. I see. But this is still because of the software implementation. > What if there is a new hardware component that requires a non-zero init > state. > > For instance, in the past, we had PKRU component, whose init value is > 0x555...54. Of course, that is a bad example because now we kick it out > of the XSAVE/XRSTOR and special handling that, but there is no guarantee > that in the future we will never need a non-zero init state. Yeah, then that feature enabling has to update [1] to record the special init value there. So one who writes that code has to responsibly check what has to be adjusted like for other feature enabling in general. Now we have established the dynamic states which are also assumed to have an all-zeros init state. Anything new state located after that should be a dynamic state because of the sigaltstack. Of course, with that though, I don't know whether we will see anything with a non-zero init state in the future or not. > So, I will send out my fix and let you, Thomas and potentially other > folks to decide what is the best option. Overall, I get your point. Let's coordinate this. We shouldn't throw multiple versions at the same time. (Let me follow up with you.) Thanks, Chang
diff --git a/arch/x86/kernel/fpu/xstate.c b/arch/x86/kernel/fpu/xstate.c index 714166cc25f2..5cc1426c3800 100644 --- a/arch/x86/kernel/fpu/xstate.c +++ b/arch/x86/kernel/fpu/xstate.c @@ -1063,6 +1063,7 @@ void __copy_xstate_to_uabi_buf(struct membuf to, struct fpstate *fpstate, struct xregs_state *xsave = &fpstate->regs.xsave; struct xstate_header header; unsigned int zerofrom; + void *xsave_addr; u64 mask; int i; @@ -1151,10 +1152,11 @@ void __copy_xstate_to_uabi_buf(struct membuf to, struct fpstate *fpstate, pkru.pkru = pkru_val; membuf_write(&to, &pkru, sizeof(pkru)); } else { - copy_feature(header.xfeatures & BIT_ULL(i), &to, - __raw_xsave_addr(xsave, i), - __raw_xsave_addr(xinit, i), - xstate_sizes[i]); + xsave_addr = (header.xfeatures & BIT_ULL(i)) ? + __raw_xsave_addr(xsave, i) : + __raw_xsave_addr(xinit, i); + + membuf_write(&to, xsave_addr, xstate_sizes[i]); } /* * Keep track of the last copied state in the non-compacted
Avoid getting xstate address of init_fpstate if fpstate contains the xstate component. Since XTILEDATA (bit 18) was turned off in xinit, when KVM calls __raw_xsave_addr(xinit, 18), it triggers a warning as follows. __raw_xsave_addr() is an internal function that assume caller does the checking, ie., all function arguments should be checked before calling. So, instead of removing the WARNING, add checks in __copy_xstate_to_uabi_buf(). [ 168.814082] ------------[ cut here ]------------ [ 168.814083] WARNING: CPU: 35 PID: 15304 at arch/x86/kernel/fpu/xstate.c:934 __raw_xsave_addr+0xc8/0xe0 [ 168.814088] Modules linked in: kvm_intel dummy bridge stp llc cdc_ncm cdc_eem cdc_ether usbnet mii ehci_pci ehci_hcd vfat fat cdc_acm xhci_pci xhci_hcd idpf(O) [ 168.814100] CPU: 35 PID: 15304 Comm: amx_test Tainted: G S O 6.2.0-smp-DEV #6 [ 168.814103] RIP: 0010:__raw_xsave_addr+0xc8/0xe0 [ 168.814105] Code: 83 f9 40 72 b0 eb 10 48 63 ca 44 8b 04 8d 60 13 1e 82 eb 03 41 89 f8 44 89 c1 48 01 c8 48 83 c4 08 5d c3 cc 0f 0b 31 c0 eb f3 <0f> 0b 48 c7 c7 c7 28 11 82 e8 da 30 b0 00 31 c0 eb e1 66 0f 1f 44 [ 168.814106] RSP: 0018:ff110020ef79bc90 EFLAGS: 00010246 [ 168.814108] RAX: ffffffff821e0340 RBX: 0000000000000012 RCX: 0000000000000012 [ 168.814109] RDX: 0000000000000012 RSI: 80000000000206e7 RDI: 0000000000040000 [ 168.814110] RBP: ff110020ef79bc98 R08: 0000000000000a00 R09: 0000000000000012 [ 168.814112] R10: 0000000000000012 R11: 0000000000000004 R12: ffa00000089f2a40 [ 168.814113] R13: 0000001200000000 R14: 0000000000000012 R15: ff110020ef288b00 [ 168.814114] FS: 00007f1812761300(0000) GS:ff11003fff4c0000(0000) knlGS:0000000000000000 [ 168.814116] CS: 0010 DS: 0000 ES: 0000 CR0: 0000000080050033 [ 168.814117] CR2: 00007f1812555008 CR3: 0000002093a80002 CR4: 0000000000373ee0 [ 168.814118] DR0: 0000000000000000 DR1: 0000000000000000 DR2: 0000000000000000 [ 168.814119] DR3: 0000000000000000 DR6: 00000000fffe07f0 DR7: 0000000000000400 [ 168.814120] Call Trace: [ 168.814121] <TASK> [ 168.814122] __copy_xstate_to_uabi_buf+0x3cb/0x520 [ 168.814125] fpu_copy_guest_fpstate_to_uabi+0x29/0x50 [ 168.814127] kvm_arch_vcpu_ioctl+0x9f7/0xee0 [ 168.814130] ? __kmem_cache_free+0x16b/0x220 [ 168.814133] kvm_vcpu_ioctl+0x47c/0x5a0 [ 168.814136] __se_sys_ioctl+0x77/0xc0 [ 168.814138] __x64_sys_ioctl+0x1d/0x20 [ 168.814139] do_syscall_64+0x3d/0x80 [ 168.814142] entry_SYSCALL_64_after_hwframe+0x63/0xcd [ 168.814146] RIP: 0033:0x7f1812892c87 [ 168.814148] Code: 5d c3 cc 48 8b 05 39 1d 07 00 64 c7 00 26 00 00 00 48 c7 c0 ff ff ff ff c3 cc cc cc cc cc cc cc cc cc cc b8 10 00 00 00 0f 05 <48> 3d 01 f0 ff ff 73 01 c3 48 8b 0d 09 1d 07 00 f7 d8 64 89 01 48 [ 168.814149] RSP: 002b:00007ffc4cebf538 EFLAGS: 00000246 ORIG_RAX: 0000000000000010 [ 168.814151] RAX: ffffffffffffffda RBX: 00007f1812761280 RCX: 00007f1812892c87 [ 168.814152] RDX: 00000000004dcda0 RSI: 000000009000aecf RDI: 0000000000000007 [ 168.814153] RBP: 0000000000002b00 R08: 00000000004d5010 R09: 0000000000002710 [ 168.814154] R10: 00007f1812906980 R11: 0000000000000246 R12: 00000000004d8110 [ 168.814155] R13: 0000000000000004 R14: 00000000004d78b0 R15: 0000000000000004 [ 168.814156] </TASK> [ 168.814157] ---[ end trace 0000000000000000 ]--- Fixes: e84ba47e313d ("x86/fpu: Hook up PKRU into ptrace()") Signed-off-by: Mingwei Zhang <mizhang@google.com> --- arch/x86/kernel/fpu/xstate.c | 10 ++++++---- 1 file changed, 6 insertions(+), 4 deletions(-)