S3 resume issue in xstate_init

Message ID	YRsQArpQcpLB/Q5h@mail-itl (mailing list archive)
State	New, archived
Headers	show Return-Path: <SRS0=kHjl=NI=lists.xenproject.org=xen-devel-bounces@kernel.org> DMARC-Filter: OpenDMARC Filter v1.4.1 mail.kernel.org 4B99560EFE Errors-To: xen-devel-bounces@lists.xenproject.org Precedence: list Sender: "Xen-devel" <xen-devel-bounces@lists.xenproject.org> Date: Tue, 17 Aug 2021 03:25:21 +0200 From: Marek =?utf-8?q?Marczykowski-G=C3=B3recki?= <marmarek@invisiblethingslab.com> To: xen-devel <xen-devel@lists.xenproject.org> Subject: S3 resume issue in xstate_init Message-ID: <YRsQArpQcpLB/Q5h@mail-itl> MIME-Version: 1.0 Content-Type: multipart/signed; micalg=pgp-sha256; protocol="application/pgp-signature"; boundary="7VhbuY6gWKyUon2r" Content-Disposition: inline
Series	S3 resume issue in xstate_init \| expand S3 resume issue in xstate_init

Marek Marczykowski-Górecki Aug. 17, 2021, 1:25 a.m. UTC

Hi,

I've got another S3 issue:

(XEN) Preparing system for ACPI S3 state.
(XEN) Disabling non-boot CPUs ...
(XEN) Broke affinity for IRQ1, new: ffff
(XEN) Broke affinity for IRQ16, new: ffff
(XEN) Broke affinity for IRQ9, new: ffff
(XEN) Broke affinity for IRQ139, new: ffff
(XEN) Broke affinity for IRQ8, new: ffff
(XEN) Broke affinity for IRQ14, new: ffff
(XEN) Broke affinity for IRQ20, new: ffff
(XEN) Broke affinity for IRQ137, new: ffff
(XEN) Broke affinity for IRQ138, new: ffff
(XEN) Entering ACPI S3 state.
(XEN) mce_intel.c:773: MCA Capability: firstbank 0, extended MCE MSR 0, BCAST, CMCI
(XEN) CPU0 CMCI LVT vector (0xf1) already installed
(XEN) Finishing wakeup from ACPI S3 state.
(XEN) microcode: CPU0 updated from revision 0xca to 0xea, date = 2021-01-05
(XEN) xstate: size: 0x440 (uncompressed 0x440) and states: 0x1f
(XEN) Enabling non-boot CPUs  ...
(XEN) xstate: size: 0x440 (uncompressed 0x240) and states: 0x1f
(XEN) Xen BUG at xstate.c:673
(XEN) ----[ Xen-4.16-unstable  x86_64  debug=y  Not tainted ]----
(XEN) CPU:    1
(XEN) RIP:    e008:[<ffff82d040350ee4>] xstate_init+0x24b/0x2ff
(XEN) RFLAGS: 0000000000010087   CONTEXT: hypervisor
(XEN) rax: 0000000000000240   rbx: 000000000000001f   rcx: 0000000000000440
(XEN) rdx: 0000000000000001   rsi: 000000000000000a   rdi: 000000000000001f
(XEN) rbp: ffff83025dc9fd38   rsp: ffff83025dc9fd20   r8:  0000000000000001
(XEN) r9:  ffff83025dc9fc88   r10: 0000000000000001   r11: 0000000000000001
(XEN) r12: ffff83025dc9fd80   r13: 000000000000001f   r14: 0000000000000001
(XEN) r15: 0000000000000000   cr0: 000000008005003b   cr4: 00000000003526e0
(XEN) cr3: 0000000049656000   cr2: 0000000000000000
(XEN) fsb: 0000000000000000   gsb: 0000000000000000   gss: 0000000000000000
(XEN) ds: 0000   es: 0000   fs: 0000   gs: 0000   ss: 0000   cs: e008
(XEN) Xen code around <ffff82d040350ee4> (xstate_init+0x24b/0x2ff):
(XEN)  ff e9 a2 00 00 00 0f 0b <0f> 0b 89 f8 89 f1 0f a2 89 f2 4c 8b 0d cb b4 0f
(XEN) Xen stack trace from rsp=ffff83025dc9fd20:
(XEN)    0000000000000240 ffff83025dc9fd80 0000000000000001 ffff83025dc9fd70
(XEN)    ffff82d04027e7a1 000000004035a7f1 7ffafbbf01100800 00000000bfebfbff
(XEN)    0000000000000001 00000000000000c8 ffff83025dc9feb8 ffff82d0402e43ce
(XEN)    000000160a9e0106 bfebfbff80000008 2c1008007ffaf3bf 0000000f00000121
(XEN)    00000000029c6fbf 0000000000000100 000000009c002e00 02afcd7f00000000
(XEN)    756e654700000000 6c65746e49656e69 65746e4904b21920 726f43202952286c
(XEN)    376920294d542865 432048303537382d 322e322040205550 000000007a484730
(XEN)    ffff830000000000 ffff83025dc9fe18 00002400402e8e0b 000000085dc9fe30
(XEN)    00000002402e9f21 0000000000000001 ffffffff00000000 ffff82d0402e0040
(XEN)    00000000003526e0 ffff83025dc9fe68 ffff82d04027bd15 0000000000000001
(XEN)    ffff8302590a0000 0000000000000000 00000000000000c8 0000000000000001
(XEN)    0000000000000001 ffff83025dc9feb8 ffff82d0402e32b7 0000000000000001
(XEN)    0000000000000001 00000000000000c8 0000000000000001 ffff83025dc9fee8
(XEN)    ffff82d04030e401 0000000000000001 0000000000000000 0000000000000000
(XEN)    0000000000000000 0000000000000000 ffff82d040200122 0800002000000002
(XEN)    0100000400010000 0000002000000000 2000000000100000 0000001000000000
(XEN)    2000000000000000 0000000029000000 0000008000000000 00110000a0000000
(XEN)    8000000080000000 4000000000000008 0000100000000000 0200000040000080
(XEN)    0004000000000000 0000010000000002 0400002030000000 0000000060000000
(XEN)    0400001000010000 0000000010000000 0000004010000000 0000000000000000
(XEN) Xen call trace:
(XEN)    [<ffff82d040350ee4>] R xstate_init+0x24b/0x2ff
(XEN)    [<ffff82d04027e7a1>] F identify_cpu+0x318/0x4af
(XEN)    [<ffff82d0402e43ce>] F recheck_cpu_features+0x1f/0x72
(XEN)    [<ffff82d04030e401>] F start_secondary+0x255/0x38a
(XEN)    [<ffff82d040200122>] F __high_start+0x82/0x91
(XEN) 
(XEN) 
(XEN) ****************************************
(XEN) Panic on CPU 1:
(XEN) Xen BUG at xstate.c:673
(XEN) ****************************************
(XEN) 
(XEN) Reboot in five seconds...

This is with added debug patch:


As can be seen above - the xsave size differs between BSP and other
CPU(s) - likely because of (not) loaded ucode update there.
I guess it's a matter of moving ucode loading somewhere else, right?

Marek Marczykowski-Górecki Aug. 17, 2021, 11:02 a.m. UTC | #1

On Tue, Aug 17, 2021 at 03:25:21AM +0200, Marek Marczykowski-Górecki wrote:
> Hi,
> 
> I've got another S3 issue:
> 
> (XEN) Preparing system for ACPI S3 state.
> (XEN) Disabling non-boot CPUs ...
> (XEN) Broke affinity for IRQ1, new: ffff
> (XEN) Broke affinity for IRQ16, new: ffff
> (XEN) Broke affinity for IRQ9, new: ffff
> (XEN) Broke affinity for IRQ139, new: ffff
> (XEN) Broke affinity for IRQ8, new: ffff
> (XEN) Broke affinity for IRQ14, new: ffff
> (XEN) Broke affinity for IRQ20, new: ffff
> (XEN) Broke affinity for IRQ137, new: ffff
> (XEN) Broke affinity for IRQ138, new: ffff
> (XEN) Entering ACPI S3 state.
> (XEN) mce_intel.c:773: MCA Capability: firstbank 0, extended MCE MSR 0, BCAST, CMCI
> (XEN) CPU0 CMCI LVT vector (0xf1) already installed
> (XEN) Finishing wakeup from ACPI S3 state.
> (XEN) microcode: CPU0 updated from revision 0xca to 0xea, date = 2021-01-05
> (XEN) xstate: size: 0x440 (uncompressed 0x440) and states: 0x1f
> (XEN) Enabling non-boot CPUs  ...
> (XEN) xstate: size: 0x440 (uncompressed 0x240) and states: 0x1f
> (XEN) Xen BUG at xstate.c:673
> (XEN) ----[ Xen-4.16-unstable  x86_64  debug=y  Not tainted ]----
> (XEN) CPU:    1
> (XEN) RIP:    e008:[<ffff82d040350ee4>] xstate_init+0x24b/0x2ff
> (XEN) RFLAGS: 0000000000010087   CONTEXT: hypervisor
> (XEN) rax: 0000000000000240   rbx: 000000000000001f   rcx: 0000000000000440
> (XEN) rdx: 0000000000000001   rsi: 000000000000000a   rdi: 000000000000001f
> (XEN) rbp: ffff83025dc9fd38   rsp: ffff83025dc9fd20   r8:  0000000000000001
> (XEN) r9:  ffff83025dc9fc88   r10: 0000000000000001   r11: 0000000000000001
> (XEN) r12: ffff83025dc9fd80   r13: 000000000000001f   r14: 0000000000000001
> (XEN) r15: 0000000000000000   cr0: 000000008005003b   cr4: 00000000003526e0
> (XEN) cr3: 0000000049656000   cr2: 0000000000000000
> (XEN) fsb: 0000000000000000   gsb: 0000000000000000   gss: 0000000000000000
> (XEN) ds: 0000   es: 0000   fs: 0000   gs: 0000   ss: 0000   cs: e008
> (XEN) Xen code around <ffff82d040350ee4> (xstate_init+0x24b/0x2ff):
> (XEN)  ff e9 a2 00 00 00 0f 0b <0f> 0b 89 f8 89 f1 0f a2 89 f2 4c 8b 0d cb b4 0f
> (XEN) Xen stack trace from rsp=ffff83025dc9fd20:
> (XEN)    0000000000000240 ffff83025dc9fd80 0000000000000001 ffff83025dc9fd70
> (XEN)    ffff82d04027e7a1 000000004035a7f1 7ffafbbf01100800 00000000bfebfbff
> (XEN)    0000000000000001 00000000000000c8 ffff83025dc9feb8 ffff82d0402e43ce
> (XEN)    000000160a9e0106 bfebfbff80000008 2c1008007ffaf3bf 0000000f00000121
> (XEN)    00000000029c6fbf 0000000000000100 000000009c002e00 02afcd7f00000000
> (XEN)    756e654700000000 6c65746e49656e69 65746e4904b21920 726f43202952286c
> (XEN)    376920294d542865 432048303537382d 322e322040205550 000000007a484730
> (XEN)    ffff830000000000 ffff83025dc9fe18 00002400402e8e0b 000000085dc9fe30
> (XEN)    00000002402e9f21 0000000000000001 ffffffff00000000 ffff82d0402e0040
> (XEN)    00000000003526e0 ffff83025dc9fe68 ffff82d04027bd15 0000000000000001
> (XEN)    ffff8302590a0000 0000000000000000 00000000000000c8 0000000000000001
> (XEN)    0000000000000001 ffff83025dc9feb8 ffff82d0402e32b7 0000000000000001
> (XEN)    0000000000000001 00000000000000c8 0000000000000001 ffff83025dc9fee8
> (XEN)    ffff82d04030e401 0000000000000001 0000000000000000 0000000000000000
> (XEN)    0000000000000000 0000000000000000 ffff82d040200122 0800002000000002
> (XEN)    0100000400010000 0000002000000000 2000000000100000 0000001000000000
> (XEN)    2000000000000000 0000000029000000 0000008000000000 00110000a0000000
> (XEN)    8000000080000000 4000000000000008 0000100000000000 0200000040000080
> (XEN)    0004000000000000 0000010000000002 0400002030000000 0000000060000000
> (XEN)    0400001000010000 0000000010000000 0000004010000000 0000000000000000
> (XEN) Xen call trace:
> (XEN)    [<ffff82d040350ee4>] R xstate_init+0x24b/0x2ff
> (XEN)    [<ffff82d04027e7a1>] F identify_cpu+0x318/0x4af
> (XEN)    [<ffff82d0402e43ce>] F recheck_cpu_features+0x1f/0x72
> (XEN)    [<ffff82d04030e401>] F start_secondary+0x255/0x38a
> (XEN)    [<ffff82d040200122>] F __high_start+0x82/0x91
> (XEN) 
> (XEN) 
> (XEN) ****************************************
> (XEN) Panic on CPU 1:
> (XEN) Xen BUG at xstate.c:673
> (XEN) ****************************************
> (XEN) 
> (XEN) Reboot in five seconds...
> 
> This is with added debug patch:
> 
> diff --git a/xen/arch/x86/xstate.c b/xen/arch/x86/xstate.c
> index 6aaf9a2f1546..7873a21b356a 100644
> --- a/xen/arch/x86/xstate.c
> +++ b/xen/arch/x86/xstate.c
> @@ -668,6 +668,8 @@ void xstate_init(struct cpuinfo_x86 *c)
>      else
>      {
>          BUG_ON(xfeature_mask != feature_mask);
> +        printk("xstate: size: %#x (uncompressed %#x) and states: %#"PRIx64"\n",
> +               xsave_cntxt_size, hw_uncompressed_size(feature_mask), feature_mask);
>          BUG_ON(xsave_cntxt_size != hw_uncompressed_size(feature_mask));
>      }
>  
> 
> As can be seen above - the xsave size differs between BSP and other
> CPU(s) - likely because of (not) loaded ucode update there.
> I guess it's a matter of moving ucode loading somewhere else, right?


Few more data points:

1. The CPU is i7-8750H (family 6, model 158, stepping 10).
2. I do have "smt=off" on the Xen cmdline, if that matters.

I've tried the same without letting Xen load the ucode update (so,
staying at 0xca) and got the same effect. So, I think it isn't about
ucode...

Andrew Cooper Aug. 17, 2021, 11:14 a.m. UTC | #2

On 17/08/2021 12:02, Marek Marczykowski-Górecki wrote:
> On Tue, Aug 17, 2021 at 03:25:21AM +0200, Marek Marczykowski-Górecki wrote:
>> Hi,
>>
>> I've got another S3 issue:
>>
>> (XEN) Preparing system for ACPI S3 state.
>> (XEN) Disabling non-boot CPUs ...
>> (XEN) Broke affinity for IRQ1, new: ffff
>> (XEN) Broke affinity for IRQ16, new: ffff
>> (XEN) Broke affinity for IRQ9, new: ffff
>> (XEN) Broke affinity for IRQ139, new: ffff
>> (XEN) Broke affinity for IRQ8, new: ffff
>> (XEN) Broke affinity for IRQ14, new: ffff
>> (XEN) Broke affinity for IRQ20, new: ffff
>> (XEN) Broke affinity for IRQ137, new: ffff
>> (XEN) Broke affinity for IRQ138, new: ffff
>> (XEN) Entering ACPI S3 state.
>> (XEN) mce_intel.c:773: MCA Capability: firstbank 0, extended MCE MSR 0, BCAST, CMCI
>> (XEN) CPU0 CMCI LVT vector (0xf1) already installed
>> (XEN) Finishing wakeup from ACPI S3 state.
>> (XEN) microcode: CPU0 updated from revision 0xca to 0xea, date = 2021-01-05
>> (XEN) xstate: size: 0x440 (uncompressed 0x440) and states: 0x1f
>> (XEN) Enabling non-boot CPUs  ...
>> (XEN) xstate: size: 0x440 (uncompressed 0x240) and states: 0x1f
>> (XEN) Xen BUG at xstate.c:673
>> (XEN) ----[ Xen-4.16-unstable  x86_64  debug=y  Not tainted ]----
>> (XEN) CPU:    1
>> (XEN) RIP:    e008:[<ffff82d040350ee4>] xstate_init+0x24b/0x2ff
>> (XEN) RFLAGS: 0000000000010087   CONTEXT: hypervisor
>> (XEN) rax: 0000000000000240   rbx: 000000000000001f   rcx: 0000000000000440
>> (XEN) rdx: 0000000000000001   rsi: 000000000000000a   rdi: 000000000000001f
>> (XEN) rbp: ffff83025dc9fd38   rsp: ffff83025dc9fd20   r8:  0000000000000001
>> (XEN) r9:  ffff83025dc9fc88   r10: 0000000000000001   r11: 0000000000000001
>> (XEN) r12: ffff83025dc9fd80   r13: 000000000000001f   r14: 0000000000000001
>> (XEN) r15: 0000000000000000   cr0: 000000008005003b   cr4: 00000000003526e0
>> (XEN) cr3: 0000000049656000   cr2: 0000000000000000
>> (XEN) fsb: 0000000000000000   gsb: 0000000000000000   gss: 0000000000000000
>> (XEN) ds: 0000   es: 0000   fs: 0000   gs: 0000   ss: 0000   cs: e008
>> (XEN) Xen code around <ffff82d040350ee4> (xstate_init+0x24b/0x2ff):
>> (XEN)  ff e9 a2 00 00 00 0f 0b <0f> 0b 89 f8 89 f1 0f a2 89 f2 4c 8b 0d cb b4 0f
>> (XEN) Xen stack trace from rsp=ffff83025dc9fd20:
>> (XEN)    0000000000000240 ffff83025dc9fd80 0000000000000001 ffff83025dc9fd70
>> (XEN)    ffff82d04027e7a1 000000004035a7f1 7ffafbbf01100800 00000000bfebfbff
>> (XEN)    0000000000000001 00000000000000c8 ffff83025dc9feb8 ffff82d0402e43ce
>> (XEN)    000000160a9e0106 bfebfbff80000008 2c1008007ffaf3bf 0000000f00000121
>> (XEN)    00000000029c6fbf 0000000000000100 000000009c002e00 02afcd7f00000000
>> (XEN)    756e654700000000 6c65746e49656e69 65746e4904b21920 726f43202952286c
>> (XEN)    376920294d542865 432048303537382d 322e322040205550 000000007a484730
>> (XEN)    ffff830000000000 ffff83025dc9fe18 00002400402e8e0b 000000085dc9fe30
>> (XEN)    00000002402e9f21 0000000000000001 ffffffff00000000 ffff82d0402e0040
>> (XEN)    00000000003526e0 ffff83025dc9fe68 ffff82d04027bd15 0000000000000001
>> (XEN)    ffff8302590a0000 0000000000000000 00000000000000c8 0000000000000001
>> (XEN)    0000000000000001 ffff83025dc9feb8 ffff82d0402e32b7 0000000000000001
>> (XEN)    0000000000000001 00000000000000c8 0000000000000001 ffff83025dc9fee8
>> (XEN)    ffff82d04030e401 0000000000000001 0000000000000000 0000000000000000
>> (XEN)    0000000000000000 0000000000000000 ffff82d040200122 0800002000000002
>> (XEN)    0100000400010000 0000002000000000 2000000000100000 0000001000000000
>> (XEN)    2000000000000000 0000000029000000 0000008000000000 00110000a0000000
>> (XEN)    8000000080000000 4000000000000008 0000100000000000 0200000040000080
>> (XEN)    0004000000000000 0000010000000002 0400002030000000 0000000060000000
>> (XEN)    0400001000010000 0000000010000000 0000004010000000 0000000000000000
>> (XEN) Xen call trace:
>> (XEN)    [<ffff82d040350ee4>] R xstate_init+0x24b/0x2ff
>> (XEN)    [<ffff82d04027e7a1>] F identify_cpu+0x318/0x4af
>> (XEN)    [<ffff82d0402e43ce>] F recheck_cpu_features+0x1f/0x72
>> (XEN)    [<ffff82d04030e401>] F start_secondary+0x255/0x38a
>> (XEN)    [<ffff82d040200122>] F __high_start+0x82/0x91
>> (XEN) 
>> (XEN) 
>> (XEN) ****************************************
>> (XEN) Panic on CPU 1:
>> (XEN) Xen BUG at xstate.c:673
>> (XEN) ****************************************
>> (XEN) 
>> (XEN) Reboot in five seconds...
>>
>> This is with added debug patch:
>>
>> diff --git a/xen/arch/x86/xstate.c b/xen/arch/x86/xstate.c
>> index 6aaf9a2f1546..7873a21b356a 100644
>> --- a/xen/arch/x86/xstate.c
>> +++ b/xen/arch/x86/xstate.c
>> @@ -668,6 +668,8 @@ void xstate_init(struct cpuinfo_x86 *c)
>>      else
>>      {
>>          BUG_ON(xfeature_mask != feature_mask);
>> +        printk("xstate: size: %#x (uncompressed %#x) and states: %#"PRIx64"\n",
>> +               xsave_cntxt_size, hw_uncompressed_size(feature_mask), feature_mask);
>>          BUG_ON(xsave_cntxt_size != hw_uncompressed_size(feature_mask));
>>      }
>>  
>>
>> As can be seen above - the xsave size differs between BSP and other
>> CPU(s) - likely because of (not) loaded ucode update there.
>> I guess it's a matter of moving ucode loading somewhere else, right?
>
> Few more data points:
>
> 1. The CPU is i7-8750H (family 6, model 158, stepping 10).
> 2. I do have "smt=off" on the Xen cmdline, if that matters.

As a datapoint, it would be interesting to confirm what the behaviour is
with SMT enabled.

I'd expect it not to make a difference, because smt=off is a purely Xen
construct and doesn't change the hardware configuration.

>
> I've tried the same without letting Xen load the ucode update (so,
> staying at 0xca) and got the same effect. So, I think it isn't about
> ucode...

Any chance of a full boot log?

This is bizzare.  Looking through start_secondary(), we've got an
ordering error between updating microcode and checking for dropped
features, but again I don't think this would be relevant here.

I suspect this is going to take some more custom debugging logic.

~Andrew

Marek Marczykowski-Górecki Aug. 17, 2021, 11:44 a.m. UTC | #3

On Tue, Aug 17, 2021 at 12:14:36PM +0100, Andrew Cooper wrote:
> On 17/08/2021 12:02, Marek Marczykowski-Górecki wrote:
> > On Tue, Aug 17, 2021 at 03:25:21AM +0200, Marek Marczykowski-Górecki wrote:
> >> Hi,
> >>
> >> I've got another S3 issue:
> >>
> >> (XEN) Preparing system for ACPI S3 state.
> >> (XEN) Disabling non-boot CPUs ...
> >> (XEN) Broke affinity for IRQ1, new: ffff
> >> (XEN) Broke affinity for IRQ16, new: ffff
> >> (XEN) Broke affinity for IRQ9, new: ffff
> >> (XEN) Broke affinity for IRQ139, new: ffff
> >> (XEN) Broke affinity for IRQ8, new: ffff
> >> (XEN) Broke affinity for IRQ14, new: ffff
> >> (XEN) Broke affinity for IRQ20, new: ffff
> >> (XEN) Broke affinity for IRQ137, new: ffff
> >> (XEN) Broke affinity for IRQ138, new: ffff
> >> (XEN) Entering ACPI S3 state.
> >> (XEN) mce_intel.c:773: MCA Capability: firstbank 0, extended MCE MSR 0, BCAST, CMCI
> >> (XEN) CPU0 CMCI LVT vector (0xf1) already installed
> >> (XEN) Finishing wakeup from ACPI S3 state.
> >> (XEN) microcode: CPU0 updated from revision 0xca to 0xea, date = 2021-01-05
> >> (XEN) xstate: size: 0x440 (uncompressed 0x440) and states: 0x1f
> >> (XEN) Enabling non-boot CPUs  ...
> >> (XEN) xstate: size: 0x440 (uncompressed 0x240) and states: 0x1f
> >> (XEN) Xen BUG at xstate.c:673
> >> (XEN) ----[ Xen-4.16-unstable  x86_64  debug=y  Not tainted ]----
> >> (XEN) CPU:    1
> >> (XEN) RIP:    e008:[<ffff82d040350ee4>] xstate_init+0x24b/0x2ff
> >> (XEN) RFLAGS: 0000000000010087   CONTEXT: hypervisor
> >> (XEN) rax: 0000000000000240   rbx: 000000000000001f   rcx: 0000000000000440
> >> (XEN) rdx: 0000000000000001   rsi: 000000000000000a   rdi: 000000000000001f
> >> (XEN) rbp: ffff83025dc9fd38   rsp: ffff83025dc9fd20   r8:  0000000000000001
> >> (XEN) r9:  ffff83025dc9fc88   r10: 0000000000000001   r11: 0000000000000001
> >> (XEN) r12: ffff83025dc9fd80   r13: 000000000000001f   r14: 0000000000000001
> >> (XEN) r15: 0000000000000000   cr0: 000000008005003b   cr4: 00000000003526e0
> >> (XEN) cr3: 0000000049656000   cr2: 0000000000000000
> >> (XEN) fsb: 0000000000000000   gsb: 0000000000000000   gss: 0000000000000000
> >> (XEN) ds: 0000   es: 0000   fs: 0000   gs: 0000   ss: 0000   cs: e008
> >> (XEN) Xen code around <ffff82d040350ee4> (xstate_init+0x24b/0x2ff):
> >> (XEN)  ff e9 a2 00 00 00 0f 0b <0f> 0b 89 f8 89 f1 0f a2 89 f2 4c 8b 0d cb b4 0f
> >> (XEN) Xen stack trace from rsp=ffff83025dc9fd20:
> >> (XEN)    0000000000000240 ffff83025dc9fd80 0000000000000001 ffff83025dc9fd70
> >> (XEN)    ffff82d04027e7a1 000000004035a7f1 7ffafbbf01100800 00000000bfebfbff
> >> (XEN)    0000000000000001 00000000000000c8 ffff83025dc9feb8 ffff82d0402e43ce
> >> (XEN)    000000160a9e0106 bfebfbff80000008 2c1008007ffaf3bf 0000000f00000121
> >> (XEN)    00000000029c6fbf 0000000000000100 000000009c002e00 02afcd7f00000000
> >> (XEN)    756e654700000000 6c65746e49656e69 65746e4904b21920 726f43202952286c
> >> (XEN)    376920294d542865 432048303537382d 322e322040205550 000000007a484730
> >> (XEN)    ffff830000000000 ffff83025dc9fe18 00002400402e8e0b 000000085dc9fe30
> >> (XEN)    00000002402e9f21 0000000000000001 ffffffff00000000 ffff82d0402e0040
> >> (XEN)    00000000003526e0 ffff83025dc9fe68 ffff82d04027bd15 0000000000000001
> >> (XEN)    ffff8302590a0000 0000000000000000 00000000000000c8 0000000000000001
> >> (XEN)    0000000000000001 ffff83025dc9feb8 ffff82d0402e32b7 0000000000000001
> >> (XEN)    0000000000000001 00000000000000c8 0000000000000001 ffff83025dc9fee8
> >> (XEN)    ffff82d04030e401 0000000000000001 0000000000000000 0000000000000000
> >> (XEN)    0000000000000000 0000000000000000 ffff82d040200122 0800002000000002
> >> (XEN)    0100000400010000 0000002000000000 2000000000100000 0000001000000000
> >> (XEN)    2000000000000000 0000000029000000 0000008000000000 00110000a0000000
> >> (XEN)    8000000080000000 4000000000000008 0000100000000000 0200000040000080
> >> (XEN)    0004000000000000 0000010000000002 0400002030000000 0000000060000000
> >> (XEN)    0400001000010000 0000000010000000 0000004010000000 0000000000000000
> >> (XEN) Xen call trace:
> >> (XEN)    [<ffff82d040350ee4>] R xstate_init+0x24b/0x2ff
> >> (XEN)    [<ffff82d04027e7a1>] F identify_cpu+0x318/0x4af
> >> (XEN)    [<ffff82d0402e43ce>] F recheck_cpu_features+0x1f/0x72
> >> (XEN)    [<ffff82d04030e401>] F start_secondary+0x255/0x38a
> >> (XEN)    [<ffff82d040200122>] F __high_start+0x82/0x91
> >> (XEN) 
> >> (XEN) 
> >> (XEN) ****************************************
> >> (XEN) Panic on CPU 1:
> >> (XEN) Xen BUG at xstate.c:673
> >> (XEN) ****************************************
> >> (XEN) 
> >> (XEN) Reboot in five seconds...
> >>
> >> This is with added debug patch:
> >>
> >> diff --git a/xen/arch/x86/xstate.c b/xen/arch/x86/xstate.c
> >> index 6aaf9a2f1546..7873a21b356a 100644
> >> --- a/xen/arch/x86/xstate.c
> >> +++ b/xen/arch/x86/xstate.c
> >> @@ -668,6 +668,8 @@ void xstate_init(struct cpuinfo_x86 *c)
> >>      else
> >>      {
> >>          BUG_ON(xfeature_mask != feature_mask);
> >> +        printk("xstate: size: %#x (uncompressed %#x) and states: %#"PRIx64"\n",
> >> +               xsave_cntxt_size, hw_uncompressed_size(feature_mask), feature_mask);
> >>          BUG_ON(xsave_cntxt_size != hw_uncompressed_size(feature_mask));
> >>      }
> >>  
> >>
> >> As can be seen above - the xsave size differs between BSP and other
> >> CPU(s) - likely because of (not) loaded ucode update there.
> >> I guess it's a matter of moving ucode loading somewhere else, right?
> >
> > Few more data points:
> >
> > 1. The CPU is i7-8750H (family 6, model 158, stepping 10).
> > 2. I do have "smt=off" on the Xen cmdline, if that matters.
> 
> As a datapoint, it would be interesting to confirm what the behaviour is
> with SMT enabled.
> 
> I'd expect it not to make a difference, because smt=off is a purely Xen
> construct and doesn't change the hardware configuration.

Uhm, changing to smt=on actually _did_ change it. Now it doesn't crash!

Let me add CPU number to the above printk - is smp_processor_id() the
thing I want?
With that, I get:
https://gist.github.com/marmarek/ae604a1e5cf49639a1eec9e220c037ca
Note that at boot all CPUs reports 0x440 (but only later are parked).
Maybe resume path for the parked CPUs is missing some step?

> > I've tried the same without letting Xen load the ucode update (so,
> > staying at 0xca) and got the same effect. So, I think it isn't about
> > ucode...
> 
> Any chance of a full boot log?

No problem, see above :)

> This is bizzare.  Looking through start_secondary(), we've got an
> ordering error between updating microcode and checking for dropped
> features, but again I don't think this would be relevant here.
> 
> I suspect this is going to take some more custom debugging logic.

Hints welcome ;)
I can easily test any Xen patches there. 

PS I'm pretty happy with my Xen debug setup there - I boot via PXE,
which allows me quickly iterate (build and test with just one reboot,
not two), and then collect crash message via kexec (sadly this laptop
refuses to boot with anything non-SSD plugged into M.2 slot :/). One
thing that could use improvement, is extracting console messages from
memory dump - `crash` doesn't work for me, and with `gdb` I get
`conring` of all zero-es (likely invalid address?). So, I'm using
`strings` :/
I should make some writeup about this setup ;)

Jan Beulich Aug. 17, 2021, 12:21 p.m. UTC | #4

On 17.08.2021 13:44, Marek Marczykowski-Górecki wrote:
> On Tue, Aug 17, 2021 at 12:14:36PM +0100, Andrew Cooper wrote:
>> On 17/08/2021 12:02, Marek Marczykowski-Górecki wrote:
>>> On Tue, Aug 17, 2021 at 03:25:21AM +0200, Marek Marczykowski-Górecki wrote:
>>>> Hi,
>>>>
>>>> I've got another S3 issue:
>>>>
>>>> (XEN) Preparing system for ACPI S3 state.
>>>> (XEN) Disabling non-boot CPUs ...
>>>> (XEN) Broke affinity for IRQ1, new: ffff
>>>> (XEN) Broke affinity for IRQ16, new: ffff
>>>> (XEN) Broke affinity for IRQ9, new: ffff
>>>> (XEN) Broke affinity for IRQ139, new: ffff
>>>> (XEN) Broke affinity for IRQ8, new: ffff
>>>> (XEN) Broke affinity for IRQ14, new: ffff
>>>> (XEN) Broke affinity for IRQ20, new: ffff
>>>> (XEN) Broke affinity for IRQ137, new: ffff
>>>> (XEN) Broke affinity for IRQ138, new: ffff
>>>> (XEN) Entering ACPI S3 state.
>>>> (XEN) mce_intel.c:773: MCA Capability: firstbank 0, extended MCE MSR 0, BCAST, CMCI
>>>> (XEN) CPU0 CMCI LVT vector (0xf1) already installed
>>>> (XEN) Finishing wakeup from ACPI S3 state.
>>>> (XEN) microcode: CPU0 updated from revision 0xca to 0xea, date = 2021-01-05
>>>> (XEN) xstate: size: 0x440 (uncompressed 0x440) and states: 0x1f
>>>> (XEN) Enabling non-boot CPUs  ...
>>>> (XEN) xstate: size: 0x440 (uncompressed 0x240) and states: 0x1f
>>>> (XEN) Xen BUG at xstate.c:673
>>>> (XEN) ----[ Xen-4.16-unstable  x86_64  debug=y  Not tainted ]----
>>>> (XEN) CPU:    1
>>>> (XEN) RIP:    e008:[<ffff82d040350ee4>] xstate_init+0x24b/0x2ff
>>>> (XEN) RFLAGS: 0000000000010087   CONTEXT: hypervisor
>>>> (XEN) rax: 0000000000000240   rbx: 000000000000001f   rcx: 0000000000000440
>>>> (XEN) rdx: 0000000000000001   rsi: 000000000000000a   rdi: 000000000000001f
>>>> (XEN) rbp: ffff83025dc9fd38   rsp: ffff83025dc9fd20   r8:  0000000000000001
>>>> (XEN) r9:  ffff83025dc9fc88   r10: 0000000000000001   r11: 0000000000000001
>>>> (XEN) r12: ffff83025dc9fd80   r13: 000000000000001f   r14: 0000000000000001
>>>> (XEN) r15: 0000000000000000   cr0: 000000008005003b   cr4: 00000000003526e0
>>>> (XEN) cr3: 0000000049656000   cr2: 0000000000000000
>>>> (XEN) fsb: 0000000000000000   gsb: 0000000000000000   gss: 0000000000000000
>>>> (XEN) ds: 0000   es: 0000   fs: 0000   gs: 0000   ss: 0000   cs: e008
>>>> (XEN) Xen code around <ffff82d040350ee4> (xstate_init+0x24b/0x2ff):
>>>> (XEN)  ff e9 a2 00 00 00 0f 0b <0f> 0b 89 f8 89 f1 0f a2 89 f2 4c 8b 0d cb b4 0f
>>>> (XEN) Xen stack trace from rsp=ffff83025dc9fd20:
>>>> (XEN)    0000000000000240 ffff83025dc9fd80 0000000000000001 ffff83025dc9fd70
>>>> (XEN)    ffff82d04027e7a1 000000004035a7f1 7ffafbbf01100800 00000000bfebfbff
>>>> (XEN)    0000000000000001 00000000000000c8 ffff83025dc9feb8 ffff82d0402e43ce
>>>> (XEN)    000000160a9e0106 bfebfbff80000008 2c1008007ffaf3bf 0000000f00000121
>>>> (XEN)    00000000029c6fbf 0000000000000100 000000009c002e00 02afcd7f00000000
>>>> (XEN)    756e654700000000 6c65746e49656e69 65746e4904b21920 726f43202952286c
>>>> (XEN)    376920294d542865 432048303537382d 322e322040205550 000000007a484730
>>>> (XEN)    ffff830000000000 ffff83025dc9fe18 00002400402e8e0b 000000085dc9fe30
>>>> (XEN)    00000002402e9f21 0000000000000001 ffffffff00000000 ffff82d0402e0040
>>>> (XEN)    00000000003526e0 ffff83025dc9fe68 ffff82d04027bd15 0000000000000001
>>>> (XEN)    ffff8302590a0000 0000000000000000 00000000000000c8 0000000000000001
>>>> (XEN)    0000000000000001 ffff83025dc9feb8 ffff82d0402e32b7 0000000000000001
>>>> (XEN)    0000000000000001 00000000000000c8 0000000000000001 ffff83025dc9fee8
>>>> (XEN)    ffff82d04030e401 0000000000000001 0000000000000000 0000000000000000
>>>> (XEN)    0000000000000000 0000000000000000 ffff82d040200122 0800002000000002
>>>> (XEN)    0100000400010000 0000002000000000 2000000000100000 0000001000000000
>>>> (XEN)    2000000000000000 0000000029000000 0000008000000000 00110000a0000000
>>>> (XEN)    8000000080000000 4000000000000008 0000100000000000 0200000040000080
>>>> (XEN)    0004000000000000 0000010000000002 0400002030000000 0000000060000000
>>>> (XEN)    0400001000010000 0000000010000000 0000004010000000 0000000000000000
>>>> (XEN) Xen call trace:
>>>> (XEN)    [<ffff82d040350ee4>] R xstate_init+0x24b/0x2ff
>>>> (XEN)    [<ffff82d04027e7a1>] F identify_cpu+0x318/0x4af
>>>> (XEN)    [<ffff82d0402e43ce>] F recheck_cpu_features+0x1f/0x72
>>>> (XEN)    [<ffff82d04030e401>] F start_secondary+0x255/0x38a
>>>> (XEN)    [<ffff82d040200122>] F __high_start+0x82/0x91
>>>> (XEN) 
>>>> (XEN) 
>>>> (XEN) ****************************************
>>>> (XEN) Panic on CPU 1:
>>>> (XEN) Xen BUG at xstate.c:673
>>>> (XEN) ****************************************
>>>> (XEN) 
>>>> (XEN) Reboot in five seconds...
>>>>
>>>> This is with added debug patch:
>>>>
>>>> diff --git a/xen/arch/x86/xstate.c b/xen/arch/x86/xstate.c
>>>> index 6aaf9a2f1546..7873a21b356a 100644
>>>> --- a/xen/arch/x86/xstate.c
>>>> +++ b/xen/arch/x86/xstate.c
>>>> @@ -668,6 +668,8 @@ void xstate_init(struct cpuinfo_x86 *c)
>>>>      else
>>>>      {
>>>>          BUG_ON(xfeature_mask != feature_mask);
>>>> +        printk("xstate: size: %#x (uncompressed %#x) and states: %#"PRIx64"\n",
>>>> +               xsave_cntxt_size, hw_uncompressed_size(feature_mask), feature_mask);
>>>>          BUG_ON(xsave_cntxt_size != hw_uncompressed_size(feature_mask));
>>>>      }
>>>>  
>>>>
>>>> As can be seen above - the xsave size differs between BSP and other
>>>> CPU(s) - likely because of (not) loaded ucode update there.
>>>> I guess it's a matter of moving ucode loading somewhere else, right?
>>>
>>> Few more data points:
>>>
>>> 1. The CPU is i7-8750H (family 6, model 158, stepping 10).
>>> 2. I do have "smt=off" on the Xen cmdline, if that matters.
>>
>> As a datapoint, it would be interesting to confirm what the behaviour is
>> with SMT enabled.
>>
>> I'd expect it not to make a difference, because smt=off is a purely Xen
>> construct and doesn't change the hardware configuration.
> 
> Uhm, changing to smt=on actually _did_ change it. Now it doesn't crash!
> 
> Let me add CPU number to the above printk - is smp_processor_id() the
> thing I want?
> With that, I get:
> https://gist.github.com/marmarek/ae604a1e5cf49639a1eec9e220c037ca
> Note that at boot all CPUs reports 0x440 (but only later are parked).

And for a feature mask of 0x1f only 0x440 can possibly be correct.
I'm kind of guessing that set_xcr0() mistakenly skips the actual XCR0
write, due to the cached value matching the to-be-written one, yet
the cache having gone stale across S3. I think this is to be expected
for previously parked CPUs, as those don't have their per-CPU data
de-allocated (and hence also not re-setup, and thus also not starting
out as zero). I guess an easy fix would be to write 0 to
this_cpu(xcr0) directly early in xstate_init(), maybe in an "else"
to the early "if ( bsp )".

I'm not sure though this would be a good fix longer term, as there
might easily be other similar issues elsewhere. IOW we may need to
see whether per-CPU data initialization wouldn't want changing.

Jan

Jan Beulich Aug. 17, 2021, 12:49 p.m. UTC | #5

On 17.08.2021 14:21, Jan Beulich wrote:
> On 17.08.2021 13:44, Marek Marczykowski-Górecki wrote:
>> On Tue, Aug 17, 2021 at 12:14:36PM +0100, Andrew Cooper wrote:
>>> On 17/08/2021 12:02, Marek Marczykowski-Górecki wrote:
>>>> On Tue, Aug 17, 2021 at 03:25:21AM +0200, Marek Marczykowski-Górecki wrote:
>>>>> Hi,
>>>>>
>>>>> I've got another S3 issue:
>>>>>
>>>>> (XEN) Preparing system for ACPI S3 state.
>>>>> (XEN) Disabling non-boot CPUs ...
>>>>> (XEN) Broke affinity for IRQ1, new: ffff
>>>>> (XEN) Broke affinity for IRQ16, new: ffff
>>>>> (XEN) Broke affinity for IRQ9, new: ffff
>>>>> (XEN) Broke affinity for IRQ139, new: ffff
>>>>> (XEN) Broke affinity for IRQ8, new: ffff
>>>>> (XEN) Broke affinity for IRQ14, new: ffff
>>>>> (XEN) Broke affinity for IRQ20, new: ffff
>>>>> (XEN) Broke affinity for IRQ137, new: ffff
>>>>> (XEN) Broke affinity for IRQ138, new: ffff
>>>>> (XEN) Entering ACPI S3 state.
>>>>> (XEN) mce_intel.c:773: MCA Capability: firstbank 0, extended MCE MSR 0, BCAST, CMCI
>>>>> (XEN) CPU0 CMCI LVT vector (0xf1) already installed
>>>>> (XEN) Finishing wakeup from ACPI S3 state.
>>>>> (XEN) microcode: CPU0 updated from revision 0xca to 0xea, date = 2021-01-05
>>>>> (XEN) xstate: size: 0x440 (uncompressed 0x440) and states: 0x1f
>>>>> (XEN) Enabling non-boot CPUs  ...
>>>>> (XEN) xstate: size: 0x440 (uncompressed 0x240) and states: 0x1f
>>>>> (XEN) Xen BUG at xstate.c:673
>>>>> (XEN) ----[ Xen-4.16-unstable  x86_64  debug=y  Not tainted ]----
>>>>> (XEN) CPU:    1
>>>>> (XEN) RIP:    e008:[<ffff82d040350ee4>] xstate_init+0x24b/0x2ff
>>>>> (XEN) RFLAGS: 0000000000010087   CONTEXT: hypervisor
>>>>> (XEN) rax: 0000000000000240   rbx: 000000000000001f   rcx: 0000000000000440
>>>>> (XEN) rdx: 0000000000000001   rsi: 000000000000000a   rdi: 000000000000001f
>>>>> (XEN) rbp: ffff83025dc9fd38   rsp: ffff83025dc9fd20   r8:  0000000000000001
>>>>> (XEN) r9:  ffff83025dc9fc88   r10: 0000000000000001   r11: 0000000000000001
>>>>> (XEN) r12: ffff83025dc9fd80   r13: 000000000000001f   r14: 0000000000000001
>>>>> (XEN) r15: 0000000000000000   cr0: 000000008005003b   cr4: 00000000003526e0
>>>>> (XEN) cr3: 0000000049656000   cr2: 0000000000000000
>>>>> (XEN) fsb: 0000000000000000   gsb: 0000000000000000   gss: 0000000000000000
>>>>> (XEN) ds: 0000   es: 0000   fs: 0000   gs: 0000   ss: 0000   cs: e008
>>>>> (XEN) Xen code around <ffff82d040350ee4> (xstate_init+0x24b/0x2ff):
>>>>> (XEN)  ff e9 a2 00 00 00 0f 0b <0f> 0b 89 f8 89 f1 0f a2 89 f2 4c 8b 0d cb b4 0f
>>>>> (XEN) Xen stack trace from rsp=ffff83025dc9fd20:
>>>>> (XEN)    0000000000000240 ffff83025dc9fd80 0000000000000001 ffff83025dc9fd70
>>>>> (XEN)    ffff82d04027e7a1 000000004035a7f1 7ffafbbf01100800 00000000bfebfbff
>>>>> (XEN)    0000000000000001 00000000000000c8 ffff83025dc9feb8 ffff82d0402e43ce
>>>>> (XEN)    000000160a9e0106 bfebfbff80000008 2c1008007ffaf3bf 0000000f00000121
>>>>> (XEN)    00000000029c6fbf 0000000000000100 000000009c002e00 02afcd7f00000000
>>>>> (XEN)    756e654700000000 6c65746e49656e69 65746e4904b21920 726f43202952286c
>>>>> (XEN)    376920294d542865 432048303537382d 322e322040205550 000000007a484730
>>>>> (XEN)    ffff830000000000 ffff83025dc9fe18 00002400402e8e0b 000000085dc9fe30
>>>>> (XEN)    00000002402e9f21 0000000000000001 ffffffff00000000 ffff82d0402e0040
>>>>> (XEN)    00000000003526e0 ffff83025dc9fe68 ffff82d04027bd15 0000000000000001
>>>>> (XEN)    ffff8302590a0000 0000000000000000 00000000000000c8 0000000000000001
>>>>> (XEN)    0000000000000001 ffff83025dc9feb8 ffff82d0402e32b7 0000000000000001
>>>>> (XEN)    0000000000000001 00000000000000c8 0000000000000001 ffff83025dc9fee8
>>>>> (XEN)    ffff82d04030e401 0000000000000001 0000000000000000 0000000000000000
>>>>> (XEN)    0000000000000000 0000000000000000 ffff82d040200122 0800002000000002
>>>>> (XEN)    0100000400010000 0000002000000000 2000000000100000 0000001000000000
>>>>> (XEN)    2000000000000000 0000000029000000 0000008000000000 00110000a0000000
>>>>> (XEN)    8000000080000000 4000000000000008 0000100000000000 0200000040000080
>>>>> (XEN)    0004000000000000 0000010000000002 0400002030000000 0000000060000000
>>>>> (XEN)    0400001000010000 0000000010000000 0000004010000000 0000000000000000
>>>>> (XEN) Xen call trace:
>>>>> (XEN)    [<ffff82d040350ee4>] R xstate_init+0x24b/0x2ff
>>>>> (XEN)    [<ffff82d04027e7a1>] F identify_cpu+0x318/0x4af
>>>>> (XEN)    [<ffff82d0402e43ce>] F recheck_cpu_features+0x1f/0x72
>>>>> (XEN)    [<ffff82d04030e401>] F start_secondary+0x255/0x38a
>>>>> (XEN)    [<ffff82d040200122>] F __high_start+0x82/0x91
>>>>> (XEN) 
>>>>> (XEN) 
>>>>> (XEN) ****************************************
>>>>> (XEN) Panic on CPU 1:
>>>>> (XEN) Xen BUG at xstate.c:673
>>>>> (XEN) ****************************************
>>>>> (XEN) 
>>>>> (XEN) Reboot in five seconds...
>>>>>
>>>>> This is with added debug patch:
>>>>>
>>>>> diff --git a/xen/arch/x86/xstate.c b/xen/arch/x86/xstate.c
>>>>> index 6aaf9a2f1546..7873a21b356a 100644
>>>>> --- a/xen/arch/x86/xstate.c
>>>>> +++ b/xen/arch/x86/xstate.c
>>>>> @@ -668,6 +668,8 @@ void xstate_init(struct cpuinfo_x86 *c)
>>>>>      else
>>>>>      {
>>>>>          BUG_ON(xfeature_mask != feature_mask);
>>>>> +        printk("xstate: size: %#x (uncompressed %#x) and states: %#"PRIx64"\n",
>>>>> +               xsave_cntxt_size, hw_uncompressed_size(feature_mask), feature_mask);
>>>>>          BUG_ON(xsave_cntxt_size != hw_uncompressed_size(feature_mask));
>>>>>      }
>>>>>  
>>>>>
>>>>> As can be seen above - the xsave size differs between BSP and other
>>>>> CPU(s) - likely because of (not) loaded ucode update there.
>>>>> I guess it's a matter of moving ucode loading somewhere else, right?
>>>>
>>>> Few more data points:
>>>>
>>>> 1. The CPU is i7-8750H (family 6, model 158, stepping 10).
>>>> 2. I do have "smt=off" on the Xen cmdline, if that matters.
>>>
>>> As a datapoint, it would be interesting to confirm what the behaviour is
>>> with SMT enabled.
>>>
>>> I'd expect it not to make a difference, because smt=off is a purely Xen
>>> construct and doesn't change the hardware configuration.
>>
>> Uhm, changing to smt=on actually _did_ change it. Now it doesn't crash!
>>
>> Let me add CPU number to the above printk - is smp_processor_id() the
>> thing I want?
>> With that, I get:
>> https://gist.github.com/marmarek/ae604a1e5cf49639a1eec9e220c037ca
>> Note that at boot all CPUs reports 0x440 (but only later are parked).
> 
> And for a feature mask of 0x1f only 0x440 can possibly be correct.
> I'm kind of guessing that set_xcr0() mistakenly skips the actual XCR0
> write, due to the cached value matching the to-be-written one, yet
> the cache having gone stale across S3. I think this is to be expected
> for previously parked CPUs, as those don't have their per-CPU data
> de-allocated (and hence also not re-setup, and thus also not starting
> out as zero). I guess an easy fix would be to write 0 to
> this_cpu(xcr0) directly early in xstate_init(), maybe in an "else"
> to the early "if ( bsp )".

What I can't spot though is where CPU0 would restore the cached value
to XCR0 (or update the cached value from XCR0, or clobber the cached
value to force a hardware register write). CPU0 is, after all, never
expecting its per-CPU data to get cleared (proven e.g. by the EFER
restore first thing after setting system_state to SYS_STATE_resume).
IOW I would have expected CPU0 to suffer the same issue, or it is
mere luck that there the cached value matches what xstate_init()
gets entered with.

As to other similar issues - MSR_IA32_XSS looks to also be (latently)
affected (latently because right now it won't ever end up non-zero).

Jan

Andrew Cooper Aug. 17, 2021, 12:53 p.m. UTC | #6

On 17/08/2021 13:21, Jan Beulich wrote:
> On 17.08.2021 13:44, Marek Marczykowski-Górecki wrote:
>> On Tue, Aug 17, 2021 at 12:14:36PM +0100, Andrew Cooper wrote:
>>> On 17/08/2021 12:02, Marek Marczykowski-Górecki wrote:
>>>> On Tue, Aug 17, 2021 at 03:25:21AM +0200, Marek Marczykowski-Górecki wrote:
>>>>> Hi,
>>>>>
>>>>> I've got another S3 issue:
>>>>>
>>>>> (XEN) Preparing system for ACPI S3 state.
>>>>> (XEN) Disabling non-boot CPUs ...
>>>>> (XEN) Broke affinity for IRQ1, new: ffff
>>>>> (XEN) Broke affinity for IRQ16, new: ffff
>>>>> (XEN) Broke affinity for IRQ9, new: ffff
>>>>> (XEN) Broke affinity for IRQ139, new: ffff
>>>>> (XEN) Broke affinity for IRQ8, new: ffff
>>>>> (XEN) Broke affinity for IRQ14, new: ffff
>>>>> (XEN) Broke affinity for IRQ20, new: ffff
>>>>> (XEN) Broke affinity for IRQ137, new: ffff
>>>>> (XEN) Broke affinity for IRQ138, new: ffff
>>>>> (XEN) Entering ACPI S3 state.
>>>>> (XEN) mce_intel.c:773: MCA Capability: firstbank 0, extended MCE MSR 0, BCAST, CMCI
>>>>> (XEN) CPU0 CMCI LVT vector (0xf1) already installed
>>>>> (XEN) Finishing wakeup from ACPI S3 state.
>>>>> (XEN) microcode: CPU0 updated from revision 0xca to 0xea, date = 2021-01-05
>>>>> (XEN) xstate: size: 0x440 (uncompressed 0x440) and states: 0x1f
>>>>> (XEN) Enabling non-boot CPUs  ...
>>>>> (XEN) xstate: size: 0x440 (uncompressed 0x240) and states: 0x1f
>>>>> (XEN) Xen BUG at xstate.c:673
>>>>> (XEN) ----[ Xen-4.16-unstable  x86_64  debug=y  Not tainted ]----
>>>>> (XEN) CPU:    1
>>>>> (XEN) RIP:    e008:[<ffff82d040350ee4>] xstate_init+0x24b/0x2ff
>>>>> (XEN) RFLAGS: 0000000000010087   CONTEXT: hypervisor
>>>>> (XEN) rax: 0000000000000240   rbx: 000000000000001f   rcx: 0000000000000440
>>>>> (XEN) rdx: 0000000000000001   rsi: 000000000000000a   rdi: 000000000000001f
>>>>> (XEN) rbp: ffff83025dc9fd38   rsp: ffff83025dc9fd20   r8:  0000000000000001
>>>>> (XEN) r9:  ffff83025dc9fc88   r10: 0000000000000001   r11: 0000000000000001
>>>>> (XEN) r12: ffff83025dc9fd80   r13: 000000000000001f   r14: 0000000000000001
>>>>> (XEN) r15: 0000000000000000   cr0: 000000008005003b   cr4: 00000000003526e0
>>>>> (XEN) cr3: 0000000049656000   cr2: 0000000000000000
>>>>> (XEN) fsb: 0000000000000000   gsb: 0000000000000000   gss: 0000000000000000
>>>>> (XEN) ds: 0000   es: 0000   fs: 0000   gs: 0000   ss: 0000   cs: e008
>>>>> (XEN) Xen code around <ffff82d040350ee4> (xstate_init+0x24b/0x2ff):
>>>>> (XEN)  ff e9 a2 00 00 00 0f 0b <0f> 0b 89 f8 89 f1 0f a2 89 f2 4c 8b 0d cb b4 0f
>>>>> (XEN) Xen stack trace from rsp=ffff83025dc9fd20:
>>>>> (XEN)    0000000000000240 ffff83025dc9fd80 0000000000000001 ffff83025dc9fd70
>>>>> (XEN)    ffff82d04027e7a1 000000004035a7f1 7ffafbbf01100800 00000000bfebfbff
>>>>> (XEN)    0000000000000001 00000000000000c8 ffff83025dc9feb8 ffff82d0402e43ce
>>>>> (XEN)    000000160a9e0106 bfebfbff80000008 2c1008007ffaf3bf 0000000f00000121
>>>>> (XEN)    00000000029c6fbf 0000000000000100 000000009c002e00 02afcd7f00000000
>>>>> (XEN)    756e654700000000 6c65746e49656e69 65746e4904b21920 726f43202952286c
>>>>> (XEN)    376920294d542865 432048303537382d 322e322040205550 000000007a484730
>>>>> (XEN)    ffff830000000000 ffff83025dc9fe18 00002400402e8e0b 000000085dc9fe30
>>>>> (XEN)    00000002402e9f21 0000000000000001 ffffffff00000000 ffff82d0402e0040
>>>>> (XEN)    00000000003526e0 ffff83025dc9fe68 ffff82d04027bd15 0000000000000001
>>>>> (XEN)    ffff8302590a0000 0000000000000000 00000000000000c8 0000000000000001
>>>>> (XEN)    0000000000000001 ffff83025dc9feb8 ffff82d0402e32b7 0000000000000001
>>>>> (XEN)    0000000000000001 00000000000000c8 0000000000000001 ffff83025dc9fee8
>>>>> (XEN)    ffff82d04030e401 0000000000000001 0000000000000000 0000000000000000
>>>>> (XEN)    0000000000000000 0000000000000000 ffff82d040200122 0800002000000002
>>>>> (XEN)    0100000400010000 0000002000000000 2000000000100000 0000001000000000
>>>>> (XEN)    2000000000000000 0000000029000000 0000008000000000 00110000a0000000
>>>>> (XEN)    8000000080000000 4000000000000008 0000100000000000 0200000040000080
>>>>> (XEN)    0004000000000000 0000010000000002 0400002030000000 0000000060000000
>>>>> (XEN)    0400001000010000 0000000010000000 0000004010000000 0000000000000000
>>>>> (XEN) Xen call trace:
>>>>> (XEN)    [<ffff82d040350ee4>] R xstate_init+0x24b/0x2ff
>>>>> (XEN)    [<ffff82d04027e7a1>] F identify_cpu+0x318/0x4af
>>>>> (XEN)    [<ffff82d0402e43ce>] F recheck_cpu_features+0x1f/0x72
>>>>> (XEN)    [<ffff82d04030e401>] F start_secondary+0x255/0x38a
>>>>> (XEN)    [<ffff82d040200122>] F __high_start+0x82/0x91
>>>>> (XEN) 
>>>>> (XEN) 
>>>>> (XEN) ****************************************
>>>>> (XEN) Panic on CPU 1:
>>>>> (XEN) Xen BUG at xstate.c:673
>>>>> (XEN) ****************************************
>>>>> (XEN) 
>>>>> (XEN) Reboot in five seconds...
>>>>>
>>>>> This is with added debug patch:
>>>>>
>>>>> diff --git a/xen/arch/x86/xstate.c b/xen/arch/x86/xstate.c
>>>>> index 6aaf9a2f1546..7873a21b356a 100644
>>>>> --- a/xen/arch/x86/xstate.c
>>>>> +++ b/xen/arch/x86/xstate.c
>>>>> @@ -668,6 +668,8 @@ void xstate_init(struct cpuinfo_x86 *c)
>>>>>      else
>>>>>      {
>>>>>          BUG_ON(xfeature_mask != feature_mask);
>>>>> +        printk("xstate: size: %#x (uncompressed %#x) and states: %#"PRIx64"\n",
>>>>> +               xsave_cntxt_size, hw_uncompressed_size(feature_mask), feature_mask);
>>>>>          BUG_ON(xsave_cntxt_size != hw_uncompressed_size(feature_mask));
>>>>>      }
>>>>>  
>>>>>
>>>>> As can be seen above - the xsave size differs between BSP and other
>>>>> CPU(s) - likely because of (not) loaded ucode update there.
>>>>> I guess it's a matter of moving ucode loading somewhere else, right?
>>>> Few more data points:
>>>>
>>>> 1. The CPU is i7-8750H (family 6, model 158, stepping 10).
>>>> 2. I do have "smt=off" on the Xen cmdline, if that matters.
>>> As a datapoint, it would be interesting to confirm what the behaviour is
>>> with SMT enabled.
>>>
>>> I'd expect it not to make a difference, because smt=off is a purely Xen
>>> construct and doesn't change the hardware configuration.
>> Uhm, changing to smt=on actually _did_ change it. Now it doesn't crash!
>>
>> Let me add CPU number to the above printk - is smp_processor_id() the
>> thing I want?
>> With that, I get:
>> https://nam04.safelinks.protection.outlook.com/?url=https%3A%2F%2Fgist.github.com%2Fmarmarek%2Fae604a1e5cf49639a1eec9e220c037ca&amp;data=04%7C01%7CAndrew.Cooper3%40citrix.com%7Cf7c03eb0fa0341ac60bf08d96179964f%7C335836de42ef43a2b145348c2ee9ca5b%7C0%7C0%7C637647997122346495%7CUnknown%7CTWFpbGZsb3d8eyJWIjoiMC4wLjAwMDAiLCJQIjoiV2luMzIiLCJBTiI6Ik1haWwiLCJXVCI6Mn0%3D%7C1000&amp;sdata=YZg2191dKLlppcbdCBwTnXuk08Vq37pcHOftyjrGO3Q%3D&amp;reserved=0
>> Note that at boot all CPUs reports 0x440 (but only later are parked).
> And for a feature mask of 0x1f only 0x440 can possibly be correct.

As a complete tangent, we really want an mpx=off option (or extend
cpuid=no-mpx to clobber bnd{regs,csr} too).  For systems not wanting to
use MPX in the first place - and I'd expect QubesOS falls into this
category - this will increase performance by not partitioning the
physical register file, as well as reducing the xstate size for client
parts.

> I'm kind of guessing that set_xcr0() mistakenly skips the actual XCR0
> write, due to the cached value matching the to-be-written one, yet
> the cache having gone stale across S3.

This is a rats nest, and area for concern, but ...

>  I think this is to be expected
> for previously parked CPUs, as those don't have their per-CPU data
> de-allocated (and hence also not re-setup, and thus also not starting
> out as zero).

... we don't deallocate regular APs either, so I don't see why the smt=
setting would make a difference in this case.

To be clear - I think your guess about set_xcr0() skipping the write is
correct, because 0x240 is correct for xcr0=X87, but I don't see why smt=
makes a difference.

>  I guess an easy fix would be to write 0 to
> this_cpu(xcr0) directly early in xstate_init(), maybe in an "else"
> to the early "if ( bsp )".
>
> I'm not sure though this would be a good fix longer term, as there
> might easily be other similar issues elsewhere. IOW we may need to
> see whether per-CPU data initialization wouldn't want changing.

We've got other registers too, like MSR_TSC_AUX, but I don't think we
want to be doing anything as drastic as changing how the initialisation
works.

The S3 path needs to explicitly write every register we do lazy context
switching of.

~Andrew

Andrew Cooper Aug. 17, 2021, 1:06 p.m. UTC | #7

On 17/08/2021 13:53, Andrew Cooper wrote:
> On 17/08/2021 13:21, Jan Beulich wrote:
>> On 17.08.2021 13:44, Marek Marczykowski-Górecki wrote:
>>> On Tue, Aug 17, 2021 at 12:14:36PM +0100, Andrew Cooper wrote:
>>>> On 17/08/2021 12:02, Marek Marczykowski-Górecki wrote:
>>>>> On Tue, Aug 17, 2021 at 03:25:21AM +0200, Marek Marczykowski-Górecki wrote:
>>>>>> Hi,
>>>>>>
>>>>>> I've got another S3 issue:
>>>>>>
>>>>>> (XEN) Preparing system for ACPI S3 state.
>>>>>> (XEN) Disabling non-boot CPUs ...
>>>>>> (XEN) Broke affinity for IRQ1, new: ffff
>>>>>> (XEN) Broke affinity for IRQ16, new: ffff
>>>>>> (XEN) Broke affinity for IRQ9, new: ffff
>>>>>> (XEN) Broke affinity for IRQ139, new: ffff
>>>>>> (XEN) Broke affinity for IRQ8, new: ffff
>>>>>> (XEN) Broke affinity for IRQ14, new: ffff
>>>>>> (XEN) Broke affinity for IRQ20, new: ffff
>>>>>> (XEN) Broke affinity for IRQ137, new: ffff
>>>>>> (XEN) Broke affinity for IRQ138, new: ffff
>>>>>> (XEN) Entering ACPI S3 state.
>>>>>> (XEN) mce_intel.c:773: MCA Capability: firstbank 0, extended MCE MSR 0, BCAST, CMCI
>>>>>> (XEN) CPU0 CMCI LVT vector (0xf1) already installed
>>>>>> (XEN) Finishing wakeup from ACPI S3 state.
>>>>>> (XEN) microcode: CPU0 updated from revision 0xca to 0xea, date = 2021-01-05
>>>>>> (XEN) xstate: size: 0x440 (uncompressed 0x440) and states: 0x1f
>>>>>> (XEN) Enabling non-boot CPUs  ...
>>>>>> (XEN) xstate: size: 0x440 (uncompressed 0x240) and states: 0x1f
>>>>>> (XEN) Xen BUG at xstate.c:673
>>>>>> (XEN) ----[ Xen-4.16-unstable  x86_64  debug=y  Not tainted ]----
>>>>>> (XEN) CPU:    1
>>>>>> (XEN) RIP:    e008:[<ffff82d040350ee4>] xstate_init+0x24b/0x2ff
>>>>>> (XEN) RFLAGS: 0000000000010087   CONTEXT: hypervisor
>>>>>> (XEN) rax: 0000000000000240   rbx: 000000000000001f   rcx: 0000000000000440
>>>>>> (XEN) rdx: 0000000000000001   rsi: 000000000000000a   rdi: 000000000000001f
>>>>>> (XEN) rbp: ffff83025dc9fd38   rsp: ffff83025dc9fd20   r8:  0000000000000001
>>>>>> (XEN) r9:  ffff83025dc9fc88   r10: 0000000000000001   r11: 0000000000000001
>>>>>> (XEN) r12: ffff83025dc9fd80   r13: 000000000000001f   r14: 0000000000000001
>>>>>> (XEN) r15: 0000000000000000   cr0: 000000008005003b   cr4: 00000000003526e0
>>>>>> (XEN) cr3: 0000000049656000   cr2: 0000000000000000
>>>>>> (XEN) fsb: 0000000000000000   gsb: 0000000000000000   gss: 0000000000000000
>>>>>> (XEN) ds: 0000   es: 0000   fs: 0000   gs: 0000   ss: 0000   cs: e008
>>>>>> (XEN) Xen code around <ffff82d040350ee4> (xstate_init+0x24b/0x2ff):
>>>>>> (XEN)  ff e9 a2 00 00 00 0f 0b <0f> 0b 89 f8 89 f1 0f a2 89 f2 4c 8b 0d cb b4 0f
>>>>>> (XEN) Xen stack trace from rsp=ffff83025dc9fd20:
>>>>>> (XEN)    0000000000000240 ffff83025dc9fd80 0000000000000001 ffff83025dc9fd70
>>>>>> (XEN)    ffff82d04027e7a1 000000004035a7f1 7ffafbbf01100800 00000000bfebfbff
>>>>>> (XEN)    0000000000000001 00000000000000c8 ffff83025dc9feb8 ffff82d0402e43ce
>>>>>> (XEN)    000000160a9e0106 bfebfbff80000008 2c1008007ffaf3bf 0000000f00000121
>>>>>> (XEN)    00000000029c6fbf 0000000000000100 000000009c002e00 02afcd7f00000000
>>>>>> (XEN)    756e654700000000 6c65746e49656e69 65746e4904b21920 726f43202952286c
>>>>>> (XEN)    376920294d542865 432048303537382d 322e322040205550 000000007a484730
>>>>>> (XEN)    ffff830000000000 ffff83025dc9fe18 00002400402e8e0b 000000085dc9fe30
>>>>>> (XEN)    00000002402e9f21 0000000000000001 ffffffff00000000 ffff82d0402e0040
>>>>>> (XEN)    00000000003526e0 ffff83025dc9fe68 ffff82d04027bd15 0000000000000001
>>>>>> (XEN)    ffff8302590a0000 0000000000000000 00000000000000c8 0000000000000001
>>>>>> (XEN)    0000000000000001 ffff83025dc9feb8 ffff82d0402e32b7 0000000000000001
>>>>>> (XEN)    0000000000000001 00000000000000c8 0000000000000001 ffff83025dc9fee8
>>>>>> (XEN)    ffff82d04030e401 0000000000000001 0000000000000000 0000000000000000
>>>>>> (XEN)    0000000000000000 0000000000000000 ffff82d040200122 0800002000000002
>>>>>> (XEN)    0100000400010000 0000002000000000 2000000000100000 0000001000000000
>>>>>> (XEN)    2000000000000000 0000000029000000 0000008000000000 00110000a0000000
>>>>>> (XEN)    8000000080000000 4000000000000008 0000100000000000 0200000040000080
>>>>>> (XEN)    0004000000000000 0000010000000002 0400002030000000 0000000060000000
>>>>>> (XEN)    0400001000010000 0000000010000000 0000004010000000 0000000000000000
>>>>>> (XEN) Xen call trace:
>>>>>> (XEN)    [<ffff82d040350ee4>] R xstate_init+0x24b/0x2ff
>>>>>> (XEN)    [<ffff82d04027e7a1>] F identify_cpu+0x318/0x4af
>>>>>> (XEN)    [<ffff82d0402e43ce>] F recheck_cpu_features+0x1f/0x72
>>>>>> (XEN)    [<ffff82d04030e401>] F start_secondary+0x255/0x38a
>>>>>> (XEN)    [<ffff82d040200122>] F __high_start+0x82/0x91
>>>>>> (XEN) 
>>>>>> (XEN) 
>>>>>> (XEN) ****************************************
>>>>>> (XEN) Panic on CPU 1:
>>>>>> (XEN) Xen BUG at xstate.c:673
>>>>>> (XEN) ****************************************
>>>>>> (XEN) 
>>>>>> (XEN) Reboot in five seconds...
>>>>>>
>>>>>> This is with added debug patch:
>>>>>>
>>>>>> diff --git a/xen/arch/x86/xstate.c b/xen/arch/x86/xstate.c
>>>>>> index 6aaf9a2f1546..7873a21b356a 100644
>>>>>> --- a/xen/arch/x86/xstate.c
>>>>>> +++ b/xen/arch/x86/xstate.c
>>>>>> @@ -668,6 +668,8 @@ void xstate_init(struct cpuinfo_x86 *c)
>>>>>>      else
>>>>>>      {
>>>>>>          BUG_ON(xfeature_mask != feature_mask);
>>>>>> +        printk("xstate: size: %#x (uncompressed %#x) and states: %#"PRIx64"\n",
>>>>>> +               xsave_cntxt_size, hw_uncompressed_size(feature_mask), feature_mask);
>>>>>>          BUG_ON(xsave_cntxt_size != hw_uncompressed_size(feature_mask));
>>>>>>      }
>>>>>>  
>>>>>>
>>>>>> As can be seen above - the xsave size differs between BSP and other
>>>>>> CPU(s) - likely because of (not) loaded ucode update there.
>>>>>> I guess it's a matter of moving ucode loading somewhere else, right?
>>>>> Few more data points:
>>>>>
>>>>> 1. The CPU is i7-8750H (family 6, model 158, stepping 10).
>>>>> 2. I do have "smt=off" on the Xen cmdline, if that matters.
>>>> As a datapoint, it would be interesting to confirm what the behaviour is
>>>> with SMT enabled.
>>>>
>>>> I'd expect it not to make a difference, because smt=off is a purely Xen
>>>> construct and doesn't change the hardware configuration.
>>> Uhm, changing to smt=on actually _did_ change it. Now it doesn't crash!
>>>
>>> Let me add CPU number to the above printk - is smp_processor_id() the
>>> thing I want?
>>> With that, I get:
>>> https://nam04.safelinks.protection.outlook.com/?url=https%3A%2F%2Fgist.github.com%2Fmarmarek%2Fae604a1e5cf49639a1eec9e220c037ca&amp;data=04%7C01%7CAndrew.Cooper3%40citrix.com%7C9bcffe997fb34fe8b89408d9617e2986%7C335836de42ef43a2b145348c2ee9ca5b%7C0%7C0%7C637648016779334734%7CUnknown%7CTWFpbGZsb3d8eyJWIjoiMC4wLjAwMDAiLCJQIjoiV2luMzIiLCJBTiI6Ik1haWwiLCJXVCI6Mn0%3D%7C1000&amp;sdata=xK014jSmKmoKdw%2BiqvsX7TLpLswzf7uzCc04xM1C8co%3D&amp;reserved=0
>>> Note that at boot all CPUs reports 0x440 (but only later are parked).
>> And for a feature mask of 0x1f only 0x440 can possibly be correct.
> As a complete tangent, we really want an mpx=off option (or extend
> cpuid=no-mpx to clobber bnd{regs,csr} too).  For systems not wanting to
> use MPX in the first place - and I'd expect QubesOS falls into this
> category - this will increase performance by not partitioning the
> physical register file, as well as reducing the xstate size for client
> parts.
>
>> I'm kind of guessing that set_xcr0() mistakenly skips the actual XCR0
>> write, due to the cached value matching the to-be-written one, yet
>> the cache having gone stale across S3.
> This is a rats nest, and area for concern, but ...
>
>>  I think this is to be expected
>> for previously parked CPUs, as those don't have their per-CPU data
>> de-allocated (and hence also not re-setup, and thus also not starting
>> out as zero).
> ... we don't deallocate regular APs either, so I don't see why the smt=
> setting would make a difference in this case.
>
> To be clear - I think your guess about set_xcr0() skipping the write is
> correct, because 0x240 is correct for xcr0=X87, but I don't see why smt=
> makes a difference.
>
>>  I guess an easy fix would be to write 0 to
>> this_cpu(xcr0) directly early in xstate_init(), maybe in an "else"
>> to the early "if ( bsp )".
>>
>> I'm not sure though this would be a good fix longer term, as there
>> might easily be other similar issues elsewhere. IOW we may need to
>> see whether per-CPU data initialization wouldn't want changing.
> We've got other registers too, like MSR_TSC_AUX, but I don't think we
> want to be doing anything as drastic as changing how the initialisation
> works.
>
> The S3 path needs to explicitly write every register we do lazy context
> switching of.

Actually no - that's a dumb suggestion because the APs don't know
better, and we don't want for_each_cpu() loops running from the BSP.

Perhaps we want the cpu_down() logic to explicitly invalidate their
lazily cached values?

~Andrew

Jan Beulich Aug. 17, 2021, 1:09 p.m. UTC | #8

On 17.08.2021 14:21, Jan Beulich wrote:
> On 17.08.2021 13:44, Marek Marczykowski-Górecki wrote:
>> On Tue, Aug 17, 2021 at 12:14:36PM +0100, Andrew Cooper wrote:
>>> On 17/08/2021 12:02, Marek Marczykowski-Górecki wrote:
>>>> On Tue, Aug 17, 2021 at 03:25:21AM +0200, Marek Marczykowski-Górecki wrote:
>>>>> Hi,
>>>>>
>>>>> I've got another S3 issue:
>>>>>
>>>>> (XEN) Preparing system for ACPI S3 state.
>>>>> (XEN) Disabling non-boot CPUs ...
>>>>> (XEN) Broke affinity for IRQ1, new: ffff
>>>>> (XEN) Broke affinity for IRQ16, new: ffff
>>>>> (XEN) Broke affinity for IRQ9, new: ffff
>>>>> (XEN) Broke affinity for IRQ139, new: ffff
>>>>> (XEN) Broke affinity for IRQ8, new: ffff
>>>>> (XEN) Broke affinity for IRQ14, new: ffff
>>>>> (XEN) Broke affinity for IRQ20, new: ffff
>>>>> (XEN) Broke affinity for IRQ137, new: ffff
>>>>> (XEN) Broke affinity for IRQ138, new: ffff
>>>>> (XEN) Entering ACPI S3 state.
>>>>> (XEN) mce_intel.c:773: MCA Capability: firstbank 0, extended MCE MSR 0, BCAST, CMCI
>>>>> (XEN) CPU0 CMCI LVT vector (0xf1) already installed
>>>>> (XEN) Finishing wakeup from ACPI S3 state.
>>>>> (XEN) microcode: CPU0 updated from revision 0xca to 0xea, date = 2021-01-05
>>>>> (XEN) xstate: size: 0x440 (uncompressed 0x440) and states: 0x1f
>>>>> (XEN) Enabling non-boot CPUs  ...
>>>>> (XEN) xstate: size: 0x440 (uncompressed 0x240) and states: 0x1f
>>>>> (XEN) Xen BUG at xstate.c:673
>>>>> (XEN) ----[ Xen-4.16-unstable  x86_64  debug=y  Not tainted ]----
>>>>> (XEN) CPU:    1
>>>>> (XEN) RIP:    e008:[<ffff82d040350ee4>] xstate_init+0x24b/0x2ff
>>>>> (XEN) RFLAGS: 0000000000010087   CONTEXT: hypervisor
>>>>> (XEN) rax: 0000000000000240   rbx: 000000000000001f   rcx: 0000000000000440
>>>>> (XEN) rdx: 0000000000000001   rsi: 000000000000000a   rdi: 000000000000001f
>>>>> (XEN) rbp: ffff83025dc9fd38   rsp: ffff83025dc9fd20   r8:  0000000000000001
>>>>> (XEN) r9:  ffff83025dc9fc88   r10: 0000000000000001   r11: 0000000000000001
>>>>> (XEN) r12: ffff83025dc9fd80   r13: 000000000000001f   r14: 0000000000000001
>>>>> (XEN) r15: 0000000000000000   cr0: 000000008005003b   cr4: 00000000003526e0
>>>>> (XEN) cr3: 0000000049656000   cr2: 0000000000000000
>>>>> (XEN) fsb: 0000000000000000   gsb: 0000000000000000   gss: 0000000000000000
>>>>> (XEN) ds: 0000   es: 0000   fs: 0000   gs: 0000   ss: 0000   cs: e008
>>>>> (XEN) Xen code around <ffff82d040350ee4> (xstate_init+0x24b/0x2ff):
>>>>> (XEN)  ff e9 a2 00 00 00 0f 0b <0f> 0b 89 f8 89 f1 0f a2 89 f2 4c 8b 0d cb b4 0f
>>>>> (XEN) Xen stack trace from rsp=ffff83025dc9fd20:
>>>>> (XEN)    0000000000000240 ffff83025dc9fd80 0000000000000001 ffff83025dc9fd70
>>>>> (XEN)    ffff82d04027e7a1 000000004035a7f1 7ffafbbf01100800 00000000bfebfbff
>>>>> (XEN)    0000000000000001 00000000000000c8 ffff83025dc9feb8 ffff82d0402e43ce
>>>>> (XEN)    000000160a9e0106 bfebfbff80000008 2c1008007ffaf3bf 0000000f00000121
>>>>> (XEN)    00000000029c6fbf 0000000000000100 000000009c002e00 02afcd7f00000000
>>>>> (XEN)    756e654700000000 6c65746e49656e69 65746e4904b21920 726f43202952286c
>>>>> (XEN)    376920294d542865 432048303537382d 322e322040205550 000000007a484730
>>>>> (XEN)    ffff830000000000 ffff83025dc9fe18 00002400402e8e0b 000000085dc9fe30
>>>>> (XEN)    00000002402e9f21 0000000000000001 ffffffff00000000 ffff82d0402e0040
>>>>> (XEN)    00000000003526e0 ffff83025dc9fe68 ffff82d04027bd15 0000000000000001
>>>>> (XEN)    ffff8302590a0000 0000000000000000 00000000000000c8 0000000000000001
>>>>> (XEN)    0000000000000001 ffff83025dc9feb8 ffff82d0402e32b7 0000000000000001
>>>>> (XEN)    0000000000000001 00000000000000c8 0000000000000001 ffff83025dc9fee8
>>>>> (XEN)    ffff82d04030e401 0000000000000001 0000000000000000 0000000000000000
>>>>> (XEN)    0000000000000000 0000000000000000 ffff82d040200122 0800002000000002
>>>>> (XEN)    0100000400010000 0000002000000000 2000000000100000 0000001000000000
>>>>> (XEN)    2000000000000000 0000000029000000 0000008000000000 00110000a0000000
>>>>> (XEN)    8000000080000000 4000000000000008 0000100000000000 0200000040000080
>>>>> (XEN)    0004000000000000 0000010000000002 0400002030000000 0000000060000000
>>>>> (XEN)    0400001000010000 0000000010000000 0000004010000000 0000000000000000
>>>>> (XEN) Xen call trace:
>>>>> (XEN)    [<ffff82d040350ee4>] R xstate_init+0x24b/0x2ff
>>>>> (XEN)    [<ffff82d04027e7a1>] F identify_cpu+0x318/0x4af
>>>>> (XEN)    [<ffff82d0402e43ce>] F recheck_cpu_features+0x1f/0x72
>>>>> (XEN)    [<ffff82d04030e401>] F start_secondary+0x255/0x38a
>>>>> (XEN)    [<ffff82d040200122>] F __high_start+0x82/0x91
>>>>> (XEN) 
>>>>> (XEN) 
>>>>> (XEN) ****************************************
>>>>> (XEN) Panic on CPU 1:
>>>>> (XEN) Xen BUG at xstate.c:673
>>>>> (XEN) ****************************************
>>>>> (XEN) 
>>>>> (XEN) Reboot in five seconds...
>>>>>
>>>>> This is with added debug patch:
>>>>>
>>>>> diff --git a/xen/arch/x86/xstate.c b/xen/arch/x86/xstate.c
>>>>> index 6aaf9a2f1546..7873a21b356a 100644
>>>>> --- a/xen/arch/x86/xstate.c
>>>>> +++ b/xen/arch/x86/xstate.c
>>>>> @@ -668,6 +668,8 @@ void xstate_init(struct cpuinfo_x86 *c)
>>>>>      else
>>>>>      {
>>>>>          BUG_ON(xfeature_mask != feature_mask);
>>>>> +        printk("xstate: size: %#x (uncompressed %#x) and states: %#"PRIx64"\n",
>>>>> +               xsave_cntxt_size, hw_uncompressed_size(feature_mask), feature_mask);
>>>>>          BUG_ON(xsave_cntxt_size != hw_uncompressed_size(feature_mask));
>>>>>      }
>>>>>  
>>>>>
>>>>> As can be seen above - the xsave size differs between BSP and other
>>>>> CPU(s) - likely because of (not) loaded ucode update there.
>>>>> I guess it's a matter of moving ucode loading somewhere else, right?
>>>>
>>>> Few more data points:
>>>>
>>>> 1. The CPU is i7-8750H (family 6, model 158, stepping 10).
>>>> 2. I do have "smt=off" on the Xen cmdline, if that matters.
>>>
>>> As a datapoint, it would be interesting to confirm what the behaviour is
>>> with SMT enabled.
>>>
>>> I'd expect it not to make a difference, because smt=off is a purely Xen
>>> construct and doesn't change the hardware configuration.
>>
>> Uhm, changing to smt=on actually _did_ change it. Now it doesn't crash!
>>
>> Let me add CPU number to the above printk - is smp_processor_id() the
>> thing I want?
>> With that, I get:
>> https://gist.github.com/marmarek/ae604a1e5cf49639a1eec9e220c037ca
>> Note that at boot all CPUs reports 0x440 (but only later are parked).
> 
> And for a feature mask of 0x1f only 0x440 can possibly be correct.
> I'm kind of guessing that set_xcr0() mistakenly skips the actual XCR0
> write, due to the cached value matching the to-be-written one, yet
> the cache having gone stale across S3. I think this is to be expected
> for previously parked CPUs, as those don't have their per-CPU data
> de-allocated (and hence also not re-setup, and thus also not starting
> out as zero). I guess an easy fix would be to write 0 to
> this_cpu(xcr0) directly early in xstate_init(), maybe in an "else"
> to the early "if ( bsp )".
> 
> I'm not sure though this would be a good fix longer term, as there
> might easily be other similar issues elsewhere. IOW we may need to
> see whether per-CPU data initialization wouldn't want changing.

See Jürgen's aca2a985a55a ("xen: don't free percpu areas during suspend"),
which was an (extended) part of his core scheduling work. I'm afraid this
rules out zeroing per-CPU areas when they're already allocated. In the
description he says "OTOH a component not registering itself for cpu
down/up and expecting to see a zeroed percpu variable after suspend/resume
is kind of broken already," which in retrospect I don't fully agree with
anymore: It's been virtually forever (before that change) that per-CPU
data started out all zero.

Then again this makes me wonder why you're observing different behavior
with "smt=0" vs "smt=1": APs' per-CPU data never gets freed during
suspend, so a stale earlier cached XCR0 value would always be in place.
Perhaps the connection is more indirect, and things largely depend on
what ended up running last (before S3) on the respective CPUs?

Jan

Jan Beulich Aug. 17, 2021, 1:21 p.m. UTC | #9

On 17.08.2021 15:06, Andrew Cooper wrote:
> On 17/08/2021 13:53, Andrew Cooper wrote:
>> On 17/08/2021 13:21, Jan Beulich wrote:
>>> I'm kind of guessing that set_xcr0() mistakenly skips the actual XCR0
>>> write, due to the cached value matching the to-be-written one, yet
>>> the cache having gone stale across S3.
>> This is a rats nest, and area for concern, but ...
>>
>>>  I think this is to be expected
>>> for previously parked CPUs, as those don't have their per-CPU data
>>> de-allocated (and hence also not re-setup, and thus also not starting
>>> out as zero).
>> ... we don't deallocate regular APs either, so I don't see why the smt=
>> setting would make a difference in this case.
>>
>> To be clear - I think your guess about set_xcr0() skipping the write is
>> correct, because 0x240 is correct for xcr0=X87, but I don't see why smt=
>> makes a difference.

Right - as per my later reply to Marek I don't see an immediate reason
anymore either. I could see an indirect reason via different scheduler
decisions, affecting what ran last on the respective CPUs.

>>>  I guess an easy fix would be to write 0 to
>>> this_cpu(xcr0) directly early in xstate_init(), maybe in an "else"
>>> to the early "if ( bsp )".
>>>
>>> I'm not sure though this would be a good fix longer term, as there
>>> might easily be other similar issues elsewhere. IOW we may need to
>>> see whether per-CPU data initialization wouldn't want changing.
>> We've got other registers too, like MSR_TSC_AUX, but I don't think we
>> want to be doing anything as drastic as changing how the initialisation
>> works.
>>
>> The S3 path needs to explicitly write every register we do lazy context
>> switching of.
> 
> Actually no - that's a dumb suggestion because the APs don't know
> better, and we don't want for_each_cpu() loops running from the BSP.
> 
> Perhaps we want the cpu_down() logic to explicitly invalidate their
> lazily cached values?

I'd rather do this on the cpu_up() path (no point clobbering what may
get further clobbered, and then perhaps not to a value of our liking),
yet then we can really avoid doing this from a notifier and instead do
it early enough in xstate_init() (taking care of XSS at the same time).

Jan

Andrew Cooper Aug. 17, 2021, 1:29 p.m. UTC | #10

On 17/08/2021 14:21, Jan Beulich wrote:
> On 17.08.2021 15:06, Andrew Cooper wrote:
>> On 17/08/2021 13:53, Andrew Cooper wrote:
>>> On 17/08/2021 13:21, Jan Beulich wrote:
>>>> I'm kind of guessing that set_xcr0() mistakenly skips the actual XCR0
>>>> write, due to the cached value matching the to-be-written one, yet
>>>> the cache having gone stale across S3.
>>> This is a rats nest, and area for concern, but ...
>>>
>>>>  I think this is to be expected
>>>> for previously parked CPUs, as those don't have their per-CPU data
>>>> de-allocated (and hence also not re-setup, and thus also not starting
>>>> out as zero).
>>> ... we don't deallocate regular APs either, so I don't see why the smt=
>>> setting would make a difference in this case.
>>>
>>> To be clear - I think your guess about set_xcr0() skipping the write is
>>> correct, because 0x240 is correct for xcr0=X87, but I don't see why smt=
>>> makes a difference.
> Right - as per my later reply to Marek I don't see an immediate reason
> anymore either. I could see an indirect reason via different scheduler
> decisions, affecting what ran last on the respective CPUs.

Modern Linux has stripped all MPX support, so won't set
%xcr0.bnd{reg,csr} in the first place, and will differ from Xen's
default setting.

Therefore, I suppose we have a real difference in Xen's idea of the
lazily-cached value depending on whether we're in half or full idle context.

>>>>  I guess an easy fix would be to write 0 to
>>>> this_cpu(xcr0) directly early in xstate_init(), maybe in an "else"
>>>> to the early "if ( bsp )".
>>>>
>>>> I'm not sure though this would be a good fix longer term, as there
>>>> might easily be other similar issues elsewhere. IOW we may need to
>>>> see whether per-CPU data initialization wouldn't want changing.
>>> We've got other registers too, like MSR_TSC_AUX, but I don't think we
>>> want to be doing anything as drastic as changing how the initialisation
>>> works.
>>>
>>> The S3 path needs to explicitly write every register we do lazy context
>>> switching of.
>> Actually no - that's a dumb suggestion because the APs don't know
>> better, and we don't want for_each_cpu() loops running from the BSP.
>>
>> Perhaps we want the cpu_down() logic to explicitly invalidate their
>> lazily cached values?
> I'd rather do this on the cpu_up() path (no point clobbering what may
> get further clobbered, and then perhaps not to a value of our liking),
> yet then we can really avoid doing this from a notifier and instead do
> it early enough in xstate_init() (taking care of XSS at the same time).

What we actually want to do is read the hardware register into the
cached location.  %xcr0 is possibly the only lazy register we also do
extra sanity checks on.

~Andrew

Marek Marczykowski-Górecki Aug. 17, 2021, 1:48 p.m. UTC | #11

On Tue, Aug 17, 2021 at 02:29:20PM +0100, Andrew Cooper wrote:
> On 17/08/2021 14:21, Jan Beulich wrote:
> > On 17.08.2021 15:06, Andrew Cooper wrote:
> >> On 17/08/2021 13:53, Andrew Cooper wrote:
> >>> On 17/08/2021 13:21, Jan Beulich wrote:
> >>>> I'm kind of guessing that set_xcr0() mistakenly skips the actual XCR0
> >>>> write, due to the cached value matching the to-be-written one, yet
> >>>> the cache having gone stale across S3.
> >>> This is a rats nest, and area for concern, but ...
> >>>
> >>>>  I think this is to be expected
> >>>> for previously parked CPUs, as those don't have their per-CPU data
> >>>> de-allocated (and hence also not re-setup, and thus also not starting
> >>>> out as zero).
> >>> ... we don't deallocate regular APs either, so I don't see why the smt=
> >>> setting would make a difference in this case.
> >>>
> >>> To be clear - I think your guess about set_xcr0() skipping the write is
> >>> correct, because 0x240 is correct for xcr0=X87, but I don't see why smt=
> >>> makes a difference.
> > Right - as per my later reply to Marek I don't see an immediate reason
> > anymore either. I could see an indirect reason via different scheduler
> > decisions, affecting what ran last on the respective CPUs.
> 
> Modern Linux has stripped all MPX support, so won't set
> %xcr0.bnd{reg,csr} in the first place, and will differ from Xen's
> default setting.
> 
> Therefore, I suppose we have a real difference in Xen's idea of the
> lazily-cached value depending on whether we're in half or full idle context.

:(

> >>>>  I guess an easy fix would be to write 0 to
> >>>> this_cpu(xcr0) directly early in xstate_init(), maybe in an "else"
> >>>> to the early "if ( bsp )".
> >>>>
> >>>> I'm not sure though this would be a good fix longer term, as there
> >>>> might easily be other similar issues elsewhere. IOW we may need to
> >>>> see whether per-CPU data initialization wouldn't want changing.
> >>> We've got other registers too, like MSR_TSC_AUX, but I don't think we
> >>> want to be doing anything as drastic as changing how the initialisation
> >>> works.
> >>>
> >>> The S3 path needs to explicitly write every register we do lazy context
> >>> switching of.
> >> Actually no - that's a dumb suggestion because the APs don't know
> >> better, and we don't want for_each_cpu() loops running from the BSP.
> >>
> >> Perhaps we want the cpu_down() logic to explicitly invalidate their
> >> lazily cached values?
> > I'd rather do this on the cpu_up() path (no point clobbering what may
> > get further clobbered, and then perhaps not to a value of our liking),
> > yet then we can really avoid doing this from a notifier and instead do
> > it early enough in xstate_init() (taking care of XSS at the same time).

Funny you mention notifiers. Apparently cpufreq driver does use it to
initialize things. And fails to do so:

(XEN) Finishing wakeup from ACPI S3 state.
(XEN) CPU0: xstate: size: 0x440 (uncompressed 0x440) and states: 0x1f
(XEN) Enabling non-boot CPUs  ...
(XEN) CPU1: xstate: size: 0x440 (uncompressed 0x440) and states: 0x1f
(XEN) ----[ Xen-4.16-unstable  x86_64  debug=y  Not tainted ]----
(XEN) CPU:    0
(XEN) RIP:    e008:[<ffff82d04024ad2b>] vcpu_runstate_get+0x153/0x244
(XEN) RFLAGS: 0000000000010246   CONTEXT: hypervisor
(XEN) rax: 0000000000000000   rbx: ffff830049667c50   rcx: 0000000000000001
(XEN) rdx: 000000321d74d000   rsi: ffff830049667c50   rdi: ffff83025dcc0000
(XEN) rbp: ffff830049667c40   rsp: ffff830049667c10   r8:  ffff83020511a820
(XEN) r9:  ffff82d04057ef78   r10: 0180000000000000   r11: 8000000000000000
(XEN) r12: ffff83025dcc0000   r13: ffff830205118c60   r14: 0000000000000001
(XEN) r15: 0000000000000010   cr0: 000000008005003b   cr4: 00000000003526e0
(XEN) cr3: 0000000049656000   cr2: 0000000000000028
(XEN) fsb: 0000000000000000   gsb: 0000000000000000   gss: 0000000000000000
(XEN) ds: 0000   es: 0000   fs: 0000   gs: 0000   ss: 0000   cs: e008
(XEN) Xen code around <ffff82d04024ad2b> (vcpu_runstate_get+0x153/0x244):
(XEN)  48 8b 14 ca 48 8b 04 02 <4c> 8b 70 28 e9 01 ff ff ff 4c 8d 3d dd 64 32 00
(XEN) Xen stack trace from rsp=ffff830049667c10:
(XEN)    0000000000000180 ffff83025dcbd410 ffff83020511bf30 ffff830205118c60
(XEN)    0000000000000001 0000000000000010 ffff830049667c80 ffff82d04024ae73
(XEN)    0000000000000000 0000000000000000 0000000000000000 0000000000000000
(XEN)    0000000000000000 0000000000000000 ffff830049667cb8 ffff82d0402560a9
(XEN)    ffff830205118320 0000000000000001 ffff83020511bf30 ffff83025dc7a6f0
(XEN)    0000000000000000 ffff830049667d58 ffff82d040254cb1 00000001402e9f74
(XEN)    0000000000000000 ffff830049667d10 ffff82d040224eda 000000000025dc81
(XEN)    000000321d74d000 ffff82d040571278 0000000000000001 ffff830049667d28
(XEN)    ffff82d040228b44 ffff82d0403102cf 0000000000000000 ffff82d0402283a4
(XEN)    ffff82d040459688 ffff82d040459680 ffff82d040459240 0000000000000004
(XEN)    0000000000000000 ffff830049667d68 ffff82d04025510e ffff830049667db0
(XEN)    ffff82d040221ba4 0000000000000000 0000000000000001 0000000000000001
(XEN)    0000000000000000 ffff830049667e00 0000000000000001 ffff82d04058a5c0
(XEN)    ffff830049667dc8 ffff82d040203867 0000000000000001 ffff830049667df0
(XEN)    ffff82d040203c51 ffff82d040459400 0000000000000001 0000000000000010
(XEN)    ffff830049667e20 ffff82d040203e26 ffff830049667ef8 0000000000000000
(XEN)    0000000000000003 0000000000000200 ffff830049667e50 ffff82d040270bac
(XEN)    ffff83020116a640 ffff830258ff6000 0000000000000000 0000000000000000
(XEN)    ffff830049667e70 ffff82d0402056aa ffff830258ff61b8 ffff82d0405701b0
(XEN)    ffff830049667e88 ffff82d04022963c ffff82d0405701a0 ffff830049667eb8
(XEN) Xen call trace:
(XEN)    [<ffff82d04024ad2b>] R vcpu_runstate_get+0x153/0x244
(XEN)    [<ffff82d04024ae73>] F get_cpu_idle_time+0x57/0x59
(XEN)    [<ffff82d0402560a9>] F cpufreq_statistic_init+0x191/0x210
(XEN)    [<ffff82d040254cb1>] F cpufreq_add_cpu+0x3cc/0x5bb
(XEN)    [<ffff82d04025510e>] F cpufreq.c#cpu_callback+0x27/0x32
(XEN)    [<ffff82d040221ba4>] F notifier_call_chain+0x6c/0x96
(XEN)    [<ffff82d040203867>] F cpu.c#cpu_notifier_call_chain+0x1b/0x36
(XEN)    [<ffff82d040203c51>] F cpu_up+0xaf/0xc8
(XEN)    [<ffff82d040203e26>] F enable_nonboot_cpus+0x6b/0x1f8
(XEN)    [<ffff82d040270bac>] F power.c#enter_state_helper+0x152/0x60a
(XEN)    [<ffff82d0402056aa>] F domain.c#continue_hypercall_tasklet_handler+0x4c/0xb9
(XEN)    [<ffff82d04022963c>] F tasklet.c#do_tasklet_work+0x76/0xac
(XEN)    [<ffff82d040229920>] F do_tasklet+0x58/0x8a
(XEN)    [<ffff82d0402e6607>] F domain.c#idle_loop+0x74/0xdd
(XEN) 
(XEN) Pagetable walk from 0000000000000028:
(XEN)  L4[0x000] = 000000025dce1063 ffffffffffffffff
(XEN)  L3[0x000] = 000000025dce0063 ffffffffffffffff
(XEN)  L2[0x000] = 000000025dcdf063 ffffffffffffffff
(XEN)  L1[0x000] = 0000000000000000 ffffffffffffffff
(XEN) 
(XEN) ****************************************
(XEN) Panic on CPU 0:
(XEN) FATAL PAGE FAULT
(XEN) [error_code=0000]
(XEN) Faulting linear address: 0000000000000028
(XEN) ****************************************

This is after adding brutal `this_cpu(xcr0) = 0` in xstate_init().

> What we actually want to do is read the hardware register into the
> cached location.  %xcr0 is possibly the only lazy register we also do
> extra sanity checks on.

Yes, better load the actual XCR0 value into cache, instead of 0
(although in this very case, it will get immediately overwritten).
I've added similar cache init for XSS - and this one should be safe-ish
- get_msr_xss() is not used anywhere.

Andrew Cooper Aug. 17, 2021, 1:57 p.m. UTC | #12

On 17/08/2021 14:48, Marek Marczykowski-Górecki wrote:
> I've added similar cache init for XSS - and this one should be safe-ish
> - get_msr_xss() is not used anywhere.

XSS, and compressed xsave state handing in general, is in a dire state. 
What exists in the tree currently is a lot of dead code.

I started working to fix it, as it is a hard prerequisite to supporting
CET for guests, but as usual has been pre-empted by more important things.

~Andrew

Jan Beulich Aug. 17, 2021, 1:59 p.m. UTC | #13

On 17.08.2021 15:29, Andrew Cooper wrote:
> On 17/08/2021 14:21, Jan Beulich wrote:
>> On 17.08.2021 15:06, Andrew Cooper wrote:
>>> On 17/08/2021 13:53, Andrew Cooper wrote:
>>>> On 17/08/2021 13:21, Jan Beulich wrote:
>>>>>  I guess an easy fix would be to write 0 to
>>>>> this_cpu(xcr0) directly early in xstate_init(), maybe in an "else"
>>>>> to the early "if ( bsp )".
>>>>>
>>>>> I'm not sure though this would be a good fix longer term, as there
>>>>> might easily be other similar issues elsewhere. IOW we may need to
>>>>> see whether per-CPU data initialization wouldn't want changing.
>>>> We've got other registers too, like MSR_TSC_AUX, but I don't think we
>>>> want to be doing anything as drastic as changing how the initialisation
>>>> works.
>>>>
>>>> The S3 path needs to explicitly write every register we do lazy context
>>>> switching of.
>>> Actually no - that's a dumb suggestion because the APs don't know
>>> better, and we don't want for_each_cpu() loops running from the BSP.
>>>
>>> Perhaps we want the cpu_down() logic to explicitly invalidate their
>>> lazily cached values?
>> I'd rather do this on the cpu_up() path (no point clobbering what may
>> get further clobbered, and then perhaps not to a value of our liking),
>> yet then we can really avoid doing this from a notifier and instead do
>> it early enough in xstate_init() (taking care of XSS at the same time).
> 
> What we actually want to do is read the hardware register into the
> cached location.

Would you mind explaining why? I did consider doing so, but didn't
see the point when the first thing we do is write the register with
xfeature_mask (which can't be zero, and which hence would always get
written through to the register when the cached value is zero).

>  %xcr0 is possibly the only lazy register we also do
> extra sanity checks on.

I don't think we do?

(All of this is independent of my "x86/xstate: replace xsave_cntxt_size
and drop XCNTXT_MASK" I think, despite its reworking of the logic, and
which I'm still hoping to get reviewed at some point.)

Jan

Jan Beulich Aug. 17, 2021, 2:04 p.m. UTC | #14

On 17.08.2021 15:48, Marek Marczykowski-Górecki wrote:
> On Tue, Aug 17, 2021 at 02:29:20PM +0100, Andrew Cooper wrote:
>> On 17/08/2021 14:21, Jan Beulich wrote:
>>> On 17.08.2021 15:06, Andrew Cooper wrote:
>>>> Perhaps we want the cpu_down() logic to explicitly invalidate their
>>>> lazily cached values?
>>> I'd rather do this on the cpu_up() path (no point clobbering what may
>>> get further clobbered, and then perhaps not to a value of our liking),
>>> yet then we can really avoid doing this from a notifier and instead do
>>> it early enough in xstate_init() (taking care of XSS at the same time).
> 
> Funny you mention notifiers. Apparently cpufreq driver does use it to
> initialize things. And fails to do so:
> 
> (XEN) Finishing wakeup from ACPI S3 state.
> (XEN) CPU0: xstate: size: 0x440 (uncompressed 0x440) and states: 0x1f
> (XEN) Enabling non-boot CPUs  ...
> (XEN) CPU1: xstate: size: 0x440 (uncompressed 0x440) and states: 0x1f
> (XEN) ----[ Xen-4.16-unstable  x86_64  debug=y  Not tainted ]----
> (XEN) CPU:    0
> (XEN) RIP:    e008:[<ffff82d04024ad2b>] vcpu_runstate_get+0x153/0x244
> (XEN) RFLAGS: 0000000000010246   CONTEXT: hypervisor
> (XEN) rax: 0000000000000000   rbx: ffff830049667c50   rcx: 0000000000000001
> (XEN) rdx: 000000321d74d000   rsi: ffff830049667c50   rdi: ffff83025dcc0000
> (XEN) rbp: ffff830049667c40   rsp: ffff830049667c10   r8:  ffff83020511a820
> (XEN) r9:  ffff82d04057ef78   r10: 0180000000000000   r11: 8000000000000000
> (XEN) r12: ffff83025dcc0000   r13: ffff830205118c60   r14: 0000000000000001
> (XEN) r15: 0000000000000010   cr0: 000000008005003b   cr4: 00000000003526e0
> (XEN) cr3: 0000000049656000   cr2: 0000000000000028
> (XEN) fsb: 0000000000000000   gsb: 0000000000000000   gss: 0000000000000000
> (XEN) ds: 0000   es: 0000   fs: 0000   gs: 0000   ss: 0000   cs: e008
> (XEN) Xen code around <ffff82d04024ad2b> (vcpu_runstate_get+0x153/0x244):
> (XEN)  48 8b 14 ca 48 8b 04 02 <4c> 8b 70 28 e9 01 ff ff ff 4c 8d 3d dd 64 32 00
> (XEN) Xen stack trace from rsp=ffff830049667c10:
> (XEN)    0000000000000180 ffff83025dcbd410 ffff83020511bf30 ffff830205118c60
> (XEN)    0000000000000001 0000000000000010 ffff830049667c80 ffff82d04024ae73
> (XEN)    0000000000000000 0000000000000000 0000000000000000 0000000000000000
> (XEN)    0000000000000000 0000000000000000 ffff830049667cb8 ffff82d0402560a9
> (XEN)    ffff830205118320 0000000000000001 ffff83020511bf30 ffff83025dc7a6f0
> (XEN)    0000000000000000 ffff830049667d58 ffff82d040254cb1 00000001402e9f74
> (XEN)    0000000000000000 ffff830049667d10 ffff82d040224eda 000000000025dc81
> (XEN)    000000321d74d000 ffff82d040571278 0000000000000001 ffff830049667d28
> (XEN)    ffff82d040228b44 ffff82d0403102cf 0000000000000000 ffff82d0402283a4
> (XEN)    ffff82d040459688 ffff82d040459680 ffff82d040459240 0000000000000004
> (XEN)    0000000000000000 ffff830049667d68 ffff82d04025510e ffff830049667db0
> (XEN)    ffff82d040221ba4 0000000000000000 0000000000000001 0000000000000001
> (XEN)    0000000000000000 ffff830049667e00 0000000000000001 ffff82d04058a5c0
> (XEN)    ffff830049667dc8 ffff82d040203867 0000000000000001 ffff830049667df0
> (XEN)    ffff82d040203c51 ffff82d040459400 0000000000000001 0000000000000010
> (XEN)    ffff830049667e20 ffff82d040203e26 ffff830049667ef8 0000000000000000
> (XEN)    0000000000000003 0000000000000200 ffff830049667e50 ffff82d040270bac
> (XEN)    ffff83020116a640 ffff830258ff6000 0000000000000000 0000000000000000
> (XEN)    ffff830049667e70 ffff82d0402056aa ffff830258ff61b8 ffff82d0405701b0
> (XEN)    ffff830049667e88 ffff82d04022963c ffff82d0405701a0 ffff830049667eb8
> (XEN) Xen call trace:
> (XEN)    [<ffff82d04024ad2b>] R vcpu_runstate_get+0x153/0x244
> (XEN)    [<ffff82d04024ae73>] F get_cpu_idle_time+0x57/0x59
> (XEN)    [<ffff82d0402560a9>] F cpufreq_statistic_init+0x191/0x210
> (XEN)    [<ffff82d040254cb1>] F cpufreq_add_cpu+0x3cc/0x5bb
> (XEN)    [<ffff82d04025510e>] F cpufreq.c#cpu_callback+0x27/0x32
> (XEN)    [<ffff82d040221ba4>] F notifier_call_chain+0x6c/0x96
> (XEN)    [<ffff82d040203867>] F cpu.c#cpu_notifier_call_chain+0x1b/0x36
> (XEN)    [<ffff82d040203c51>] F cpu_up+0xaf/0xc8
> (XEN)    [<ffff82d040203e26>] F enable_nonboot_cpus+0x6b/0x1f8
> (XEN)    [<ffff82d040270bac>] F power.c#enter_state_helper+0x152/0x60a
> (XEN)    [<ffff82d0402056aa>] F domain.c#continue_hypercall_tasklet_handler+0x4c/0xb9
> (XEN)    [<ffff82d04022963c>] F tasklet.c#do_tasklet_work+0x76/0xac
> (XEN)    [<ffff82d040229920>] F do_tasklet+0x58/0x8a
> (XEN)    [<ffff82d0402e6607>] F domain.c#idle_loop+0x74/0xdd
> (XEN) 
> (XEN) Pagetable walk from 0000000000000028:
> (XEN)  L4[0x000] = 000000025dce1063 ffffffffffffffff
> (XEN)  L3[0x000] = 000000025dce0063 ffffffffffffffff
> (XEN)  L2[0x000] = 000000025dcdf063 ffffffffffffffff
> (XEN)  L1[0x000] = 0000000000000000 ffffffffffffffff
> (XEN) 
> (XEN) ****************************************
> (XEN) Panic on CPU 0:
> (XEN) FATAL PAGE FAULT
> (XEN) [error_code=0000]
> (XEN) Faulting linear address: 0000000000000028
> (XEN) ****************************************
> 
> This is after adding brutal `this_cpu(xcr0) = 0` in xstate_init().

And presumably again only with "smt=0"? In any event, for us to not mix
things, may I ask that you start a new thread for this further issue?

Jan

S3 resume issue in xstate_init

Commit Message

Comments

Patch