diff mbox series

[XEN] x86/ACPI: Ignore entries with invalid APIC IDs when parsing MADT

Message ID 7f158a54548456daba9f2e105d099d2e5e2c2f38.1691399031.git.simon@invisiblethingslab.com (mailing list archive)
State New, archived
Headers show
Series [XEN] x86/ACPI: Ignore entries with invalid APIC IDs when parsing MADT | expand

Commit Message

Simon Gaiser Aug. 7, 2023, 9:38 a.m. UTC
It seems some firmwares put dummy entries in the ACPI MADT table for non
existing processors. On my NUC11TNHi5 those have the invalid APIC ID
0xff. Linux already has code to handle those cases both in
acpi_parse_lapic [1] as well as in acpi_parse_x2apic [2]. So add the
same check to Xen.

Note that on some older (2nd gen Core i) laptop of mine I also saw dummy
entries with a valid APIC ID. Linux would still ignore those because
they have !ACPI_MADT_ENABLED && !ACPI_MADT_ONLINE_CAPABLE. But in Xen
this check is only active for madt_revision >= 5. But since this version
check seems to be intentionally I leave that alone.

Link: https://git.kernel.org/pub/scm/linux/kernel/git/torvalds/linux.git/commit/?id=f3bf1dbe64b62a2058dd1944c00990df203e8e7a # [1]
Link: https://git.kernel.org/pub/scm/linux/kernel/git/torvalds/linux.git/commit/?id=10daf10ab154e31237a8c07242be3063fb6a9bf4 # [2]
Signed-off-by: Simon Gaiser <simon@invisiblethingslab.com>
---
 xen/arch/x86/acpi/boot.c | 14 ++++++++++----
 1 file changed, 10 insertions(+), 4 deletions(-)

Comments

Andrew Cooper Aug. 7, 2023, 10:01 a.m. UTC | #1
On 07/08/2023 10:38 am, Simon Gaiser wrote:
> It seems some firmwares put dummy entries in the ACPI MADT table for non
> existing processors. On my NUC11TNHi5 those have the invalid APIC ID
> 0xff. Linux already has code to handle those cases both in
> acpi_parse_lapic [1] as well as in acpi_parse_x2apic [2]. So add the
> same check to Xen.
>
> Note that on some older (2nd gen Core i) laptop of mine I also saw dummy
> entries with a valid APIC ID. Linux would still ignore those because
> they have !ACPI_MADT_ENABLED && !ACPI_MADT_ONLINE_CAPABLE. But in Xen
> this check is only active for madt_revision >= 5. But since this version
> check seems to be intentionally I leave that alone.

I recall there being a discussion over this, ultimately with the version
check being removed.  IIRC it was a defacto standard used for a long
time prior to actually getting written into the ACPI spec, so does exist
in practice in older MADTs.

Otherwise LGTM.  I'd suggest dropping this paragraph as it's not related
to the change.  It will also help if we do decide to follow up and drop
the MADT version check.

>
> Link: https://git.kernel.org/pub/scm/linux/kernel/git/torvalds/linux.git/commit/?id=f3bf1dbe64b62a2058dd1944c00990df203e8e7a # [1]
> Link: https://git.kernel.org/pub/scm/linux/kernel/git/torvalds/linux.git/commit/?id=10daf10ab154e31237a8c07242be3063fb6a9bf4 # [2]

https://git.kernel.org/torvalds/c/$SHA

Somewhat less verbose. https://korg.docs.kernel.org/git-url-shorteners.html

> Signed-off-by: Simon Gaiser <simon@invisiblethingslab.com>
> ---
>  xen/arch/x86/acpi/boot.c | 14 ++++++++++----
>  1 file changed, 10 insertions(+), 4 deletions(-)
>
> diff --git a/xen/arch/x86/acpi/boot.c b/xen/arch/x86/acpi/boot.c
> index 54b72d716b..4a62822fa9 100644
> --- a/xen/arch/x86/acpi/boot.c
> +++ b/xen/arch/x86/acpi/boot.c
> @@ -87,14 +87,17 @@ acpi_parse_x2apic(struct acpi_subtable_header *header, const unsigned long end)
>  	if (BAD_MADT_ENTRY(processor, end))
>  		return -EINVAL;
>  
> +	/* Ignore entries with invalid apicid */

x2apic ID.

~Andrew
Andrew Cooper Aug. 7, 2023, 10:06 a.m. UTC | #2
On 07/08/2023 11:01 am, Andrew Cooper wrote:
> On 07/08/2023 10:38 am, Simon Gaiser wrote:
>> It seems some firmwares put dummy entries in the ACPI MADT table for non
>> existing processors. On my NUC11TNHi5 those have the invalid APIC ID
>> 0xff. Linux already has code to handle those cases both in
>> acpi_parse_lapic [1] as well as in acpi_parse_x2apic [2]. So add the
>> same check to Xen.
>>
>> Note that on some older (2nd gen Core i) laptop of mine I also saw dummy
>> entries with a valid APIC ID. Linux would still ignore those because
>> they have !ACPI_MADT_ENABLED && !ACPI_MADT_ONLINE_CAPABLE. But in Xen
>> this check is only active for madt_revision >= 5. But since this version
>> check seems to be intentionally I leave that alone.
> I recall there being a discussion over this, ultimately with the version
> check being removed.  IIRC it was a defacto standard used for a long
> time prior to actually getting written into the ACPI spec, so does exist
> in practice in older MADTs.
>
> Otherwise LGTM.  I'd suggest dropping this paragraph as it's not related
> to the change.  It will also help if we do decide to follow up and drop
> the MADT version check.
>
>> Link: https://git.kernel.org/pub/scm/linux/kernel/git/torvalds/linux.git/commit/?id=f3bf1dbe64b62a2058dd1944c00990df203e8e7a # [1]
>> Link: https://git.kernel.org/pub/scm/linux/kernel/git/torvalds/linux.git/commit/?id=10daf10ab154e31237a8c07242be3063fb6a9bf4 # [2]
> https://git.kernel.org/torvalds/c/$SHA
>
> Somewhat less verbose. https://korg.docs.kernel.org/git-url-shorteners.html
>
>> Signed-off-by: Simon Gaiser <simon@invisiblethingslab.com>
>> ---
>>  xen/arch/x86/acpi/boot.c | 14 ++++++++++----
>>  1 file changed, 10 insertions(+), 4 deletions(-)
>>
>> diff --git a/xen/arch/x86/acpi/boot.c b/xen/arch/x86/acpi/boot.c
>> index 54b72d716b..4a62822fa9 100644
>> --- a/xen/arch/x86/acpi/boot.c
>> +++ b/xen/arch/x86/acpi/boot.c
>> @@ -87,14 +87,17 @@ acpi_parse_x2apic(struct acpi_subtable_header *header, const unsigned long end)
>>  	if (BAD_MADT_ENTRY(processor, end))
>>  		return -EINVAL;
>>  
>> +	/* Ignore entries with invalid apicid */
> x2apic ID.

Oh, and I forgot to say.  I'm happy to fix all of this up on commit if
you're happy.

~Andrew
Jan Beulich Aug. 7, 2023, 10:13 a.m. UTC | #3
On 07.08.2023 11:38, Simon Gaiser wrote:
> It seems some firmwares put dummy entries in the ACPI MADT table for non
> existing processors. On my NUC11TNHi5 those have the invalid APIC ID
> 0xff. Linux already has code to handle those cases both in
> acpi_parse_lapic [1] as well as in acpi_parse_x2apic [2]. So add the
> same check to Xen.

I'm afraid it doesn't become clear to me what problem you're trying to
solve.

> --- a/xen/arch/x86/acpi/boot.c
> +++ b/xen/arch/x86/acpi/boot.c
> @@ -87,14 +87,17 @@ acpi_parse_x2apic(struct acpi_subtable_header *header, const unsigned long end)
>  	if (BAD_MADT_ENTRY(processor, end))
>  		return -EINVAL;
>  
> +	/* Ignore entries with invalid apicid */
> +	if (processor->local_apic_id == 0xffffffff)
> +		return 0;
> +
>  	/* Don't register processors that cannot be onlined. */
>  	if (madt_revision >= 5 &&
>  	    !(processor->lapic_flags & ACPI_MADT_ENABLED) &&
>  	    !(processor->lapic_flags & ACPI_MADT_ONLINE_CAPABLE))
>  		return 0;
>  
> -	if ((processor->lapic_flags & ACPI_MADT_ENABLED) ||
> -	    processor->local_apic_id != 0xffffffff || opt_cpu_info) {
> +	if ((processor->lapic_flags & ACPI_MADT_ENABLED) || opt_cpu_info) {
>  		acpi_table_print_madt_entry(header);
>  		log = true;
>  	}

In particular you're now suppressing log messages which may be relevant.

The one issue that I'm aware of (and that I use a local hack to deal
with; see bottom) is excess verbosity.

Jan

--- unstable.orig/xen/arch/x86/mpparse.c
+++ unstable/xen/arch/x86/mpparse.c
@@ -809,8 +809,13 @@ int mp_register_lapic(u32 id, bool enabl
 	};
 	
 	if (MAX_APICS <= id) {
-		printk(KERN_WARNING "Processor #%d invalid (max %d)\n",
-			id, MAX_APICS);
+		static u32 max_warn = -1;
+
+		if (id <= max_warn) {
+			printk(KERN_WARNING "Processor #%d invalid (max %d)\n",
+			       id, MAX_APICS);
+			max_warn = id - 1;
+		}
 		return -EINVAL;
 	}
Simon Gaiser Aug. 7, 2023, 10:17 a.m. UTC | #4
Andrew Cooper:
> On 07/08/2023 11:01 am, Andrew Cooper wrote:
>> On 07/08/2023 10:38 am, Simon Gaiser wrote:
>>> It seems some firmwares put dummy entries in the ACPI MADT table for non
>>> existing processors. On my NUC11TNHi5 those have the invalid APIC ID
>>> 0xff. Linux already has code to handle those cases both in
>>> acpi_parse_lapic [1] as well as in acpi_parse_x2apic [2]. So add the
>>> same check to Xen.
>>>
>>> Note that on some older (2nd gen Core i) laptop of mine I also saw dummy
>>> entries with a valid APIC ID. Linux would still ignore those because
>>> they have !ACPI_MADT_ENABLED && !ACPI_MADT_ONLINE_CAPABLE. But in Xen
>>> this check is only active for madt_revision >= 5. But since this version
>>> check seems to be intentionally I leave that alone.
>> I recall there being a discussion over this, ultimately with the version
>> check being removed.  IIRC it was a defacto standard used for a long
>> time prior to actually getting written into the ACPI spec, so does exist
>> in practice in older MADTs.
>>
>> Otherwise LGTM.  I'd suggest dropping this paragraph as it's not related
>> to the change.  It will also help if we do decide to follow up and drop
>> the MADT version check.

It's in so far related as that I know this doesn't cover all cases I
actually wanted to address and I want to mention this. But I can see why
you might not want to have this in this commit message. Feel free to
drop.

>>> Link: https://git.kernel.org/pub/scm/linux/kernel/git/torvalds/linux.git/commit/?id=f3bf1dbe64b62a2058dd1944c00990df203e8e7a # [1]
>>> Link: https://git.kernel.org/pub/scm/linux/kernel/git/torvalds/linux.git/commit/?id=10daf10ab154e31237a8c07242be3063fb6a9bf4 # [2]
>> https://git.kernel.org/torvalds/c/$SHA
>>
>> Somewhat less verbose. https://korg.docs.kernel.org/git-url-shorteners.html

Ah, handy.

>>> Signed-off-by: Simon Gaiser <simon@invisiblethingslab.com>
>>> ---
>>>  xen/arch/x86/acpi/boot.c | 14 ++++++++++----
>>>  1 file changed, 10 insertions(+), 4 deletions(-)
>>>
>>> diff --git a/xen/arch/x86/acpi/boot.c b/xen/arch/x86/acpi/boot.c
>>> index 54b72d716b..4a62822fa9 100644
>>> --- a/xen/arch/x86/acpi/boot.c
>>> +++ b/xen/arch/x86/acpi/boot.c
>>> @@ -87,14 +87,17 @@ acpi_parse_x2apic(struct acpi_subtable_header *header, const unsigned long end)
>>>  	if (BAD_MADT_ENTRY(processor, end))
>>>  		return -EINVAL;
>>>  
>>> +	/* Ignore entries with invalid apicid */
>> x2apic ID.
> 
> Oh, and I forgot to say.  I'm happy to fix all of this up on commit if
> you're happy.

Sure, go ahead. Thanks!
Simon Gaiser Aug. 7, 2023, 12:55 p.m. UTC | #5
Jan Beulich:
> On 07.08.2023 11:38, Simon Gaiser wrote:
>> It seems some firmwares put dummy entries in the ACPI MADT table for non
>> existing processors. On my NUC11TNHi5 those have the invalid APIC ID
>> 0xff. Linux already has code to handle those cases both in
>> acpi_parse_lapic [1] as well as in acpi_parse_x2apic [2]. So add the
>> same check to Xen.
> 
> I'm afraid it doesn't become clear to me what problem you're trying to
> solve.

I want Xen to not think there are possible CPUs that actually never can
be there.

Without ignoring those dummy entries Xen thinks my NUC has 2 sockets and
that there are 8 logical CPUs that are currently disabled but could be
hotplugged.

I'm moderately sure that soldering in another CPU is not supported, even
less so at runtime ;]

>> --- a/xen/arch/x86/acpi/boot.c
>> +++ b/xen/arch/x86/acpi/boot.c
>> @@ -87,14 +87,17 @@ acpi_parse_x2apic(struct acpi_subtable_header *header, const unsigned long end)
>>  	if (BAD_MADT_ENTRY(processor, end))
>>  		return -EINVAL;
>>  
>> +	/* Ignore entries with invalid apicid */
>> +	if (processor->local_apic_id == 0xffffffff)
>> +		return 0;
>> +
>>  	/* Don't register processors that cannot be onlined. */
>>  	if (madt_revision >= 5 &&
>>  	    !(processor->lapic_flags & ACPI_MADT_ENABLED) &&
>>  	    !(processor->lapic_flags & ACPI_MADT_ONLINE_CAPABLE))
>>  		return 0;
>>  
>> -	if ((processor->lapic_flags & ACPI_MADT_ENABLED) ||
>> -	    processor->local_apic_id != 0xffffffff || opt_cpu_info) {
>> +	if ((processor->lapic_flags & ACPI_MADT_ENABLED) || opt_cpu_info) {
>>  		acpi_table_print_madt_entry(header);
>>  		log = true;
>>  	}
> 
> In particular you're now suppressing log messages which may be relevant.

I intentionally mirrored the behavior of the check directly below.
Unlike the the version in Linux the existing code didn't log ignored
entries. So I did the same for the entries with an invalid ID.

> The one issue that I'm aware of (and that I use a local hack to deal
> with; see bottom) is excess verbosity.
> 
> Jan
> 
> --- unstable.orig/xen/arch/x86/mpparse.c
> +++ unstable/xen/arch/x86/mpparse.c
> @@ -809,8 +809,13 @@ int mp_register_lapic(u32 id, bool enabl
>  	};
>  	
>  	if (MAX_APICS <= id) {
> -		printk(KERN_WARNING "Processor #%d invalid (max %d)\n",
> -			id, MAX_APICS);
> +		static u32 max_warn = -1;
> +
> +		if (id <= max_warn) {
> +			printk(KERN_WARNING "Processor #%d invalid (max %d)\n",
> +			       id, MAX_APICS);
> +			max_warn = id - 1;
> +		}
>  		return -EINVAL;
>  	}
>
Jan Beulich Aug. 7, 2023, 1:17 p.m. UTC | #6
On 07.08.2023 14:55, Simon Gaiser wrote:
> Jan Beulich:
>> On 07.08.2023 11:38, Simon Gaiser wrote:
>>> It seems some firmwares put dummy entries in the ACPI MADT table for non
>>> existing processors. On my NUC11TNHi5 those have the invalid APIC ID
>>> 0xff. Linux already has code to handle those cases both in
>>> acpi_parse_lapic [1] as well as in acpi_parse_x2apic [2]. So add the
>>> same check to Xen.
>>
>> I'm afraid it doesn't become clear to me what problem you're trying to
>> solve.
> 
> I want Xen to not think there are possible CPUs that actually never can
> be there.

Did you try using "maxcpus=" on the command line? If that doesn't work
well enough (perhaps because of causing undesirable log messages), maybe
we need some way to say "no CPU hotplug" on the command line.

> Without ignoring those dummy entries Xen thinks my NUC has 2 sockets and
> that there are 8 logical CPUs that are currently disabled but could be
> hotplugged.

Yet it's exactly this which ACPI is telling us here (with some vagueness,
which isn't easy to get around; see below).

> I'm moderately sure that soldering in another CPU is not supported, even
> less so at runtime ;]

On your system. What about others, which are hotplug-capable?

>>> --- a/xen/arch/x86/acpi/boot.c
>>> +++ b/xen/arch/x86/acpi/boot.c
>>> @@ -87,14 +87,17 @@ acpi_parse_x2apic(struct acpi_subtable_header *header, const unsigned long end)
>>>  	if (BAD_MADT_ENTRY(processor, end))
>>>  		return -EINVAL;
>>>  
>>> +	/* Ignore entries with invalid apicid */
>>> +	if (processor->local_apic_id == 0xffffffff)
>>> +		return 0;
>>> +
>>>  	/* Don't register processors that cannot be onlined. */
>>>  	if (madt_revision >= 5 &&
>>>  	    !(processor->lapic_flags & ACPI_MADT_ENABLED) &&
>>>  	    !(processor->lapic_flags & ACPI_MADT_ONLINE_CAPABLE))
>>>  		return 0;
>>>  
>>> -	if ((processor->lapic_flags & ACPI_MADT_ENABLED) ||
>>> -	    processor->local_apic_id != 0xffffffff || opt_cpu_info) {
>>> +	if ((processor->lapic_flags & ACPI_MADT_ENABLED) || opt_cpu_info) {
>>>  		acpi_table_print_madt_entry(header);
>>>  		log = true;
>>>  	}
>>
>> In particular you're now suppressing log messages which may be relevant.
> 
> I intentionally mirrored the behavior of the check directly below.
> Unlike the the version in Linux the existing code didn't log ignored
> entries. So I did the same for the entries with an invalid ID.

I'm afraid I can't bring in line the check you add early in the function
with what is "directly below" [here]. I'm only inferring the "here" from
the placement of your reply. Maybe instead you meant the rev >= 5 one,
in which case I'm afraid the two are too different to compare: That
rev >= 5 one tells us that the entry isn't going to be hotpluggable. The
APIC ID check you add makes no such guarantees. That's why the new flag
was actually added in v5.

Jan
Andrew Cooper Aug. 7, 2023, 2:04 p.m. UTC | #7
On 07/08/2023 2:17 pm, Jan Beulich wrote:
> On 07.08.2023 14:55, Simon Gaiser wrote:
>> Jan Beulich:
>>> On 07.08.2023 11:38, Simon Gaiser wrote:
>>>> It seems some firmwares put dummy entries in the ACPI MADT table for non
>>>> existing processors. On my NUC11TNHi5 those have the invalid APIC ID
>>>> 0xff. Linux already has code to handle those cases both in
>>>> acpi_parse_lapic [1] as well as in acpi_parse_x2apic [2]. So add the
>>>> same check to Xen.
>>> I'm afraid it doesn't become clear to me what problem you're trying to
>>> solve.
>> I want Xen to not think there are possible CPUs that actually never can
>> be there.
> Did you try using "maxcpus=" on the command line? If that doesn't work
> well enough (perhaps because of causing undesirable log messages), maybe
> we need some way to say "no CPU hotplug" on the command line.

The ACPI tables are broken, and Xen's parsing of them is wrong.

And irrespective - it's unreasonable to have users work around bugs in
Xen by adding more command line.

>
>> Without ignoring those dummy entries Xen thinks my NUC has 2 sockets and
>> that there are 8 logical CPUs that are currently disabled but could be
>> hotplugged.
> Yet it's exactly this which ACPI is telling us here (with some vagueness,
> which isn't easy to get around; see below).
>
>> I'm moderately sure that soldering in another CPU is not supported, even
>> less so at runtime ;]
> On your system. What about others, which are hotplug-capable?

It is required that all APIC ID are stated *ahead of time*.

Entries with 0xFF and 0xFFFFFFFF at boot are entirely invalid.

Furthermore, given debugging that I just did with Thomas, it's very
clear that noone has tried really hotplugging CPUs under Xen.  The
layering violation is rather hilarious to watch.

~Andrew
Jan Beulich Aug. 7, 2023, 2:18 p.m. UTC | #8
On 07.08.2023 16:04, Andrew Cooper wrote:
> On 07/08/2023 2:17 pm, Jan Beulich wrote:
>> On 07.08.2023 14:55, Simon Gaiser wrote:
>>> Jan Beulich:
>>>> On 07.08.2023 11:38, Simon Gaiser wrote:
>>>>> It seems some firmwares put dummy entries in the ACPI MADT table for non
>>>>> existing processors. On my NUC11TNHi5 those have the invalid APIC ID
>>>>> 0xff. Linux already has code to handle those cases both in
>>>>> acpi_parse_lapic [1] as well as in acpi_parse_x2apic [2]. So add the
>>>>> same check to Xen.
>>>> I'm afraid it doesn't become clear to me what problem you're trying to
>>>> solve.
>>> I want Xen to not think there are possible CPUs that actually never can
>>> be there.
>> Did you try using "maxcpus=" on the command line? If that doesn't work
>> well enough (perhaps because of causing undesirable log messages), maybe
>> we need some way to say "no CPU hotplug" on the command line.
> 
> The ACPI tables are broken, and Xen's parsing of them is wrong.
> 
> And irrespective - it's unreasonable to have users work around bugs in
> Xen by adding more command line.

Well, considering how rare CPU hotplug appears to be, such a new option
could default to "no hotplug".

>>> Without ignoring those dummy entries Xen thinks my NUC has 2 sockets and
>>> that there are 8 logical CPUs that are currently disabled but could be
>>> hotplugged.
>> Yet it's exactly this which ACPI is telling us here (with some vagueness,
>> which isn't easy to get around; see below).
>>
>>> I'm moderately sure that soldering in another CPU is not supported, even
>>> less so at runtime ;]
>> On your system. What about others, which are hotplug-capable?
> 
> It is required that all APIC ID are stated *ahead of time*.

Would you mind pointing me at where this is stated? Aiui MADT entries
can be modified dynamically (we do so ourselves in the DSDT we supply
HVM guests with), and as I further understand this may be required
also for the APIC ID fields because firmware may not know how APIC IDs
are to be arranged for a new socket before being populated (by perhaps
a not entirely identical CPU, and this may not even need to be mixed
steppings).

Jan
Simon Gaiser Aug. 7, 2023, 2:45 p.m. UTC | #9
Andrew Cooper:
> On 07/08/2023 10:38 am, Simon Gaiser wrote:
>> It seems some firmwares put dummy entries in the ACPI MADT table for non
>> existing processors. On my NUC11TNHi5 those have the invalid APIC ID
>> 0xff. Linux already has code to handle those cases both in
>> acpi_parse_lapic [1] as well as in acpi_parse_x2apic [2]. So add the
>> same check to Xen.
>>
>> Note that on some older (2nd gen Core i) laptop of mine I also saw dummy
>> entries with a valid APIC ID. Linux would still ignore those because
>> they have !ACPI_MADT_ENABLED && !ACPI_MADT_ONLINE_CAPABLE. But in Xen
>> this check is only active for madt_revision >= 5. But since this version
>> check seems to be intentionally I leave that alone.
> 
> I recall there being a discussion over this, ultimately with the version
> check being removed.  IIRC it was a defacto standard used for a long
> time prior to actually getting written into the ACPI spec, so does exist
> in practice in older MADTs.

So I noticed that the check in Linux is actually slightly different than
I thought. Since [3] it always considers the CPU usable if
ACPI_MADT_ENABLED is set. Otherwise it consider it only usable if
MADT revision >= 5 and ACPI_MADT_ONLINE_CAPABLE is set.

So I checked what the ACPI spec says:

Up to 6.2 Errata B [6] it only defines ACPI_MADT_ENABLE as:

    If zero, this processor is unusable, and the operating system
    support will not attempt to use it.

And the bit that later will be ACPI_MADT_ONLINE_CAPABLE is reserved with
"Must be zero".

6.3 [7] Then adds ACPI_MADT_ONLINE_CAPABLE and changes the meaning of
ACPI_MADT_ENABLE:

    Enabled
        If this bit is set the processor is ready for use. If this bit
        is clear and the Online Capable bit is set, system hardware
        supports enabling this processor during OS runtime. If this bit
        is clear and the Online Capable bit is also clear, this
        processor is unusable, and OSPM shall ignore the contents of the
        Processor Local APIC Structure.

    Online Capbable
        The information conveyed by this bit depends on the value of the
        Enabled bit. If the Enabled bit is set, this bit is reserved and
        must be zero. Otherwise, if this this bit is set, system
        hardware supports enabling this processor during OS runtime.

So with confirming firmwares it should be safe change the simply ignore
the entry if !ACPI_MADT_ENABLED && !ACPI_MADT_ONLINE_CAPABLE

We can also do it like Linux and ignore ACPI_MADT_ONLINE_CAPABLE
completely if revision < 5.

Note that the revision was already increased to 5 before 6.3.

ACPI spec version    MADT revision
                  
6.2 [4]              4
6.2 Errata A [5]     45 (typo I guess)
6.2 Errata B         5
6.3                  5

[3]: https://git.kernel.org/torvalds/c/e2869bd7af608c343988429ceb1c2fe99644a01f
[4]: http://www.uefi.org/sites/default/files/resources/ACPI_6_2.pdf
[5]: http://www.uefi.org/sites/default/files/resources/ACPI%206_2_A_Sept29.pdf
[6]: https://uefi.org/sites/default/files/resources/ACPI_6_2_B_final_Jan30.pdf
[7]: https://uefi.org/sites/default/files/resources/ACPI_6_3_May16.pdf

> Otherwise LGTM.  I'd suggest dropping this paragraph as it's not related
> to the change.  It will also help if we do decide to follow up and drop
> the MADT version check.
> 
>>
>> Link: https://git.kernel.org/pub/scm/linux/kernel/git/torvalds/linux.git/commit/?id=f3bf1dbe64b62a2058dd1944c00990df203e8e7a # [1]
>> Link: https://git.kernel.org/pub/scm/linux/kernel/git/torvalds/linux.git/commit/?id=10daf10ab154e31237a8c07242be3063fb6a9bf4 # [2]
> 
> https://git.kernel.org/torvalds/c/$SHA
> 
> Somewhat less verbose. https://korg.docs.kernel.org/git-url-shorteners.html
> 
>> Signed-off-by: Simon Gaiser <simon@invisiblethingslab.com>
>> ---
>>  xen/arch/x86/acpi/boot.c | 14 ++++++++++----
>>  1 file changed, 10 insertions(+), 4 deletions(-)
>>
>> diff --git a/xen/arch/x86/acpi/boot.c b/xen/arch/x86/acpi/boot.c
>> index 54b72d716b..4a62822fa9 100644
>> --- a/xen/arch/x86/acpi/boot.c
>> +++ b/xen/arch/x86/acpi/boot.c
>> @@ -87,14 +87,17 @@ acpi_parse_x2apic(struct acpi_subtable_header *header, const unsigned long end)
>>  	if (BAD_MADT_ENTRY(processor, end))
>>  		return -EINVAL;
>>  
>> +	/* Ignore entries with invalid apicid */
> 
> x2apic ID.
> 
> ~Andrew
Simon Gaiser Aug. 7, 2023, 3:05 p.m. UTC | #10
Jan Beulich:
> On 07.08.2023 14:55, Simon Gaiser wrote:
>> Jan Beulich:
>>> On 07.08.2023 11:38, Simon Gaiser wrote:
>>>> It seems some firmwares put dummy entries in the ACPI MADT table for non
>>>> existing processors. On my NUC11TNHi5 those have the invalid APIC ID
>>>> 0xff. Linux already has code to handle those cases both in
>>>> acpi_parse_lapic [1] as well as in acpi_parse_x2apic [2]. So add the
>>>> same check to Xen.
>>>
>>> I'm afraid it doesn't become clear to me what problem you're trying to
>>> solve.
>>
>> I want Xen to not think there are possible CPUs that actually never can
>> be there.
> 
> Did you try using "maxcpus=" on the command line? If that doesn't work
> well enough

Yes. Then Xen says, as expected "SMP: Allowing 8 CPUs (0 hotplug CPUs)",
but disabled_cpus is still 8 and nr_sockets therefore calculated as 2.

> (perhaps because of causing undesirable log messages),

I don't mind some verbose log messages.

> maybe we need some way to say "no CPU hotplug" on the command line.

That indeed might make sense, but I'm not sure this is what I want here,
see below.

>> Without ignoring those dummy entries Xen thinks my NUC has 2 sockets and
>> that there are 8 logical CPUs that are currently disabled but could be
>> hotplugged.
> 
> Yet it's exactly this which ACPI is telling us here (with some vagueness,
> which isn't easy to get around; see below).
> 
>> I'm moderately sure that soldering in another CPU is not supported, even
>> less so at runtime ;]
> 
> On your system. What about others, which are hotplug-capable?

Would those be listed with an invalid APIC ID in their entries? Then I
understand your complaint.

PS: based on your reply to Andrew, you think that this might be the case.

>>>> --- a/xen/arch/x86/acpi/boot.c
>>>> +++ b/xen/arch/x86/acpi/boot.c
>>>> @@ -87,14 +87,17 @@ acpi_parse_x2apic(struct acpi_subtable_header *header, const unsigned long end)
>>>>  	if (BAD_MADT_ENTRY(processor, end))
>>>>  		return -EINVAL;
>>>>  
>>>> +	/* Ignore entries with invalid apicid */
>>>> +	if (processor->local_apic_id == 0xffffffff)
>>>> +		return 0;
>>>> +
>>>>  	/* Don't register processors that cannot be onlined. */
>>>>  	if (madt_revision >= 5 &&
>>>>  	    !(processor->lapic_flags & ACPI_MADT_ENABLED) &&
>>>>  	    !(processor->lapic_flags & ACPI_MADT_ONLINE_CAPABLE))
>>>>  		return 0;
>>>>  
>>>> -	if ((processor->lapic_flags & ACPI_MADT_ENABLED) ||
>>>> -	    processor->local_apic_id != 0xffffffff || opt_cpu_info) {
>>>> +	if ((processor->lapic_flags & ACPI_MADT_ENABLED) || opt_cpu_info) {
>>>>  		acpi_table_print_madt_entry(header);
>>>>  		log = true;
>>>>  	}
>>>
>>> In particular you're now suppressing log messages which may be relevant.
>>
>> I intentionally mirrored the behavior of the check directly below.
>> Unlike the the version in Linux the existing code didn't log ignored
>> entries. So I did the same for the entries with an invalid ID.
> 
> I'm afraid I can't bring in line the check you add early in the function
> with what is "directly below" [here]. I'm only inferring the "here" from
> the placement of your reply. Maybe instead you meant the rev >= 5 one,

Yes exactly. "directly below" was meant relative to the if I added.

> in which case I'm afraid the two are too different to compare: That
> rev >= 5 one tells us that the entry isn't going to be hotpluggable. The
> APIC ID check you add makes no such guarantees.

As mentioned above, if that's the case I see the problem.

> That's why the new flag was actually added in v5.

See other branch of this thread:
https://lore.kernel.org/xen-devel/80bae614-052e-0f90-cf13-0e5e4ed1a5cd@invisiblethingslab.com/
So we might cover at least my case regardless how invalid APIC IDs
should be interpreted.
Roger Pau Monne Aug. 23, 2023, 9:13 a.m. UTC | #11
On Mon, Aug 07, 2023 at 03:17:18PM +0200, Jan Beulich wrote:
> On 07.08.2023 14:55, Simon Gaiser wrote:
> > Jan Beulich:
> >> On 07.08.2023 11:38, Simon Gaiser wrote:
> >>> It seems some firmwares put dummy entries in the ACPI MADT table for non
> >>> existing processors. On my NUC11TNHi5 those have the invalid APIC ID
> >>> 0xff. Linux already has code to handle those cases both in
> >>> acpi_parse_lapic [1] as well as in acpi_parse_x2apic [2]. So add the
> >>> same check to Xen.
> >>
> >> I'm afraid it doesn't become clear to me what problem you're trying to
> >> solve.
> > 
> > I want Xen to not think there are possible CPUs that actually never can
> > be there.
> 
> Did you try using "maxcpus=" on the command line? If that doesn't work
> well enough (perhaps because of causing undesirable log messages), maybe
> we need some way to say "no CPU hotplug" on the command line.

A Kconfig option to disable CPU hotplug at build time would also be
interesting IMO.

Thanks, Roger.
Andrew Cooper Aug. 23, 2023, 9:21 a.m. UTC | #12
On 07/08/2023 3:18 pm, Jan Beulich wrote:
> On 07.08.2023 16:04, Andrew Cooper wrote:
>> On 07/08/2023 2:17 pm, Jan Beulich wrote:
>>> On 07.08.2023 14:55, Simon Gaiser wrote:
>>>> Jan Beulich:
>>>>> On 07.08.2023 11:38, Simon Gaiser wrote:
>>>>>> It seems some firmwares put dummy entries in the ACPI MADT table for non
>>>>>> existing processors. On my NUC11TNHi5 those have the invalid APIC ID
>>>>>> 0xff. Linux already has code to handle those cases both in
>>>>>> acpi_parse_lapic [1] as well as in acpi_parse_x2apic [2]. So add the
>>>>>> same check to Xen.
>>>>> I'm afraid it doesn't become clear to me what problem you're trying to
>>>>> solve.
>>>> I want Xen to not think there are possible CPUs that actually never can
>>>> be there.
>>> Did you try using "maxcpus=" on the command line? If that doesn't work
>>> well enough (perhaps because of causing undesirable log messages), maybe
>>> we need some way to say "no CPU hotplug" on the command line.
>> The ACPI tables are broken, and Xen's parsing of them is wrong.
>>
>> And irrespective - it's unreasonable to have users work around bugs in
>> Xen by adding more command line.
> Well, considering how rare CPU hotplug appears to be, such a new option
> could default to "no hotplug".

... or Xen could not be buggy and think there's a chance of hotplug on
the 99% of servers where there isn't.

>
>>>> Without ignoring those dummy entries Xen thinks my NUC has 2 sockets and
>>>> that there are 8 logical CPUs that are currently disabled but could be
>>>> hotplugged.
>>> Yet it's exactly this which ACPI is telling us here (with some vagueness,
>>> which isn't easy to get around; see below).
>>>
>>>> I'm moderately sure that soldering in another CPU is not supported, even
>>>> less so at runtime ;]
>>> On your system. What about others, which are hotplug-capable?
>> It is required that all APIC ID are stated *ahead of time*.
> Would you mind pointing me at where this is stated?

In the spec, exactly where you'd expect to find them...

"OSPM does not expect the information provided in this table to be
updated if the processor information changes during the lifespan of an
OS boot."

Which is wordsmithing for "Some firmware was found to be modifying them
and this was deemed to be illegal under the spec".

~Andrew
Andrew Cooper Aug. 23, 2023, 9:32 a.m. UTC | #13
On 07/08/2023 3:45 pm, Simon Gaiser wrote:
> Andrew Cooper:
>> On 07/08/2023 10:38 am, Simon Gaiser wrote:
>>> It seems some firmwares put dummy entries in the ACPI MADT table for non
>>> existing processors. On my NUC11TNHi5 those have the invalid APIC ID
>>> 0xff. Linux already has code to handle those cases both in
>>> acpi_parse_lapic [1] as well as in acpi_parse_x2apic [2]. So add the
>>> same check to Xen.
>>>
>>> Note that on some older (2nd gen Core i) laptop of mine I also saw dummy
>>> entries with a valid APIC ID. Linux would still ignore those because
>>> they have !ACPI_MADT_ENABLED && !ACPI_MADT_ONLINE_CAPABLE. But in Xen
>>> this check is only active for madt_revision >= 5. But since this version
>>> check seems to be intentionally I leave that alone.
>> I recall there being a discussion over this, ultimately with the version
>> check being removed.  IIRC it was a defacto standard used for a long
>> time prior to actually getting written into the ACPI spec, so does exist
>> in practice in older MADTs.
> So I noticed that the check in Linux is actually slightly different than
> I thought. Since [3] it always considers the CPU usable if
> ACPI_MADT_ENABLED is set. Otherwise it consider it only usable if
> MADT revision >= 5 and ACPI_MADT_ONLINE_CAPABLE is set.
>
> So I checked what the ACPI spec says:
>
> Up to 6.2 Errata B [6] it only defines ACPI_MADT_ENABLE as:
>
>     If zero, this processor is unusable, and the operating system
>     support will not attempt to use it.
>
> And the bit that later will be ACPI_MADT_ONLINE_CAPABLE is reserved with
> "Must be zero".
>
> 6.3 [7] Then adds ACPI_MADT_ONLINE_CAPABLE and changes the meaning of
> ACPI_MADT_ENABLE:
>
>     Enabled
>         If this bit is set the processor is ready for use. If this bit
>         is clear and the Online Capable bit is set, system hardware
>         supports enabling this processor during OS runtime. If this bit
>         is clear and the Online Capable bit is also clear, this
>         processor is unusable, and OSPM shall ignore the contents of the
>         Processor Local APIC Structure.
>
>     Online Capbable
>         The information conveyed by this bit depends on the value of the
>         Enabled bit. If the Enabled bit is set, this bit is reserved and
>         must be zero. Otherwise, if this this bit is set, system
>         hardware supports enabling this processor during OS runtime.
>
> So with confirming firmwares it should be safe change the simply ignore
> the entry if !ACPI_MADT_ENABLED && !ACPI_MADT_ONLINE_CAPABLE
>
> We can also do it like Linux and ignore ACPI_MADT_ONLINE_CAPABLE
> completely if revision < 5.
>
> Note that the revision was already increased to 5 before 6.3.
>
> ACPI spec version    MADT revision
>                   
> 6.2 [4]              4
> 6.2 Errata A [5]     45 (typo I guess)
> 6.2 Errata B         5
> 6.3                  5
>
> [3]: https://git.kernel.org/torvalds/c/e2869bd7af608c343988429ceb1c2fe99644a01f
> [4]: http://www.uefi.org/sites/default/files/resources/ACPI_6_2.pdf
> [5]: http://www.uefi.org/sites/default/files/resources/ACPI%206_2_A_Sept29.pdf
> [6]: https://uefi.org/sites/default/files/resources/ACPI_6_2_B_final_Jan30.pdf
> [7]: https://uefi.org/sites/default/files/resources/ACPI_6_3_May16.pdf

Honestly, the reserved must be zero means there's no need for a version
check at all.  That bit will not be set even in older MADT revisions.

That said, it's likely easier to retain the version check than to set up
a quirks infrastructure for buggy older MADTs.

~Andrew
Jan Beulich Aug. 23, 2023, 12:56 p.m. UTC | #14
On 23.08.2023 11:21, Andrew Cooper wrote:
> On 07/08/2023 3:18 pm, Jan Beulich wrote:
>> On 07.08.2023 16:04, Andrew Cooper wrote:
>>> On 07/08/2023 2:17 pm, Jan Beulich wrote:
>>>> On 07.08.2023 14:55, Simon Gaiser wrote:
>>>>> Jan Beulich:
>>>>>> On 07.08.2023 11:38, Simon Gaiser wrote:
>>>>>>> It seems some firmwares put dummy entries in the ACPI MADT table for non
>>>>>>> existing processors. On my NUC11TNHi5 those have the invalid APIC ID
>>>>>>> 0xff. Linux already has code to handle those cases both in
>>>>>>> acpi_parse_lapic [1] as well as in acpi_parse_x2apic [2]. So add the
>>>>>>> same check to Xen.
>>>>>> I'm afraid it doesn't become clear to me what problem you're trying to
>>>>>> solve.
>>>>> I want Xen to not think there are possible CPUs that actually never can
>>>>> be there.
>>>> Did you try using "maxcpus=" on the command line? If that doesn't work
>>>> well enough (perhaps because of causing undesirable log messages), maybe
>>>> we need some way to say "no CPU hotplug" on the command line.
>>> The ACPI tables are broken, and Xen's parsing of them is wrong.
>>>
>>> And irrespective - it's unreasonable to have users work around bugs in
>>> Xen by adding more command line.
>> Well, considering how rare CPU hotplug appears to be, such a new option
>> could default to "no hotplug".
> 
> ... or Xen could not be buggy and think there's a chance of hotplug on
> the 99% of servers where there isn't.

Why do you say "buggy" when there's no clear cut indication of whether
hotplug is possible, even not all so old ACPI versions?

>>>>> Without ignoring those dummy entries Xen thinks my NUC has 2 sockets and
>>>>> that there are 8 logical CPUs that are currently disabled but could be
>>>>> hotplugged.
>>>> Yet it's exactly this which ACPI is telling us here (with some vagueness,
>>>> which isn't easy to get around; see below).
>>>>
>>>>> I'm moderately sure that soldering in another CPU is not supported, even
>>>>> less so at runtime ;]
>>>> On your system. What about others, which are hotplug-capable?
>>> It is required that all APIC ID are stated *ahead of time*.
>> Would you mind pointing me at where this is stated?
> 
> In the spec, exactly where you'd expect to find them...
> 
> "OSPM does not expect the information provided in this table to be
> updated if the processor information changes during the lifespan of an
> OS boot."

I don't think this tells us anything about the ID not possibly changing.
It merely tells us that OSPM is not expected to parse this table again
(IOW firmware updating just this table isn't going to be enough). IDs
possibly changing is expressed by (a) the "if the processor information
changes", and (b) the next sentence, forbidding them to change while the
system is asleep: "While in the sleeping state, logical processors must
not be added or removed, nor can their ... ID or ... Flags change."

> Which is wordsmithing for "Some firmware was found to be modifying them
> and this was deemed to be illegal under the spec".

That's your reading of it; I certainly don't infer such from that
sentence.

Jan
Thomas Gleixner Aug. 27, 2023, 3:44 p.m. UTC | #15
On Wed, Aug 23 2023 at 14:56, Jan Beulich wrote:
> On 23.08.2023 11:21, Andrew Cooper wrote:
>> In the spec, exactly where you'd expect to find them...
>> 
>> "OSPM does not expect the information provided in this table to be
>> updated if the processor information changes during the lifespan of an
>> OS boot."
>
> I don't think this tells us anything about the ID not possibly changing.
> It merely tells us that OSPM is not expected to parse this table again
> (IOW firmware updating just this table isn't going to be enough). IDs
> possibly changing is expressed by (a) the "if the processor information
> changes", and (b) the next sentence, forbidding them to change while the
> system is asleep: "While in the sleeping state, logical processors must
> not be added or removed, nor can their ... ID or ... Flags change."
>
>> Which is wordsmithing for "Some firmware was found to be modifying them
>> and this was deemed to be illegal under the spec".
>
> That's your reading of it; I certainly don't infer such from that
> sentence.

The APIC/X2APIC description of MADT specifies flags:

Enabled        	If this bit is set the processor is ready for use. If
		this bit is clear and the Online Capable bit is set,
		system hardware supports enabling this processor during
		OS runtime. If this bit is clear and the Online Capable
		bit is also clear, this processor is unusable, and OSPM
		shall ignore the contents of the Processor Local APIC
		Structure.

Online Capable	The information conveyed by this bit depends on the
		value of the Enabled bit. If the Enabled bit is set,
		this bit is reserved and must be zero. Otherwise, if
		this this bit is set, system hardware supports enabling
		this processor during OS runtime.

This is also related to SRAT which defines the proximity of memory to
processors at boot time with a similar set of flags.

Also 8.4 says:

  Each processor in the system must be declared in the ACPI namespace in
  the \_SB scope. .... A Device definition for a processor is declared
  using the ACPI0007 hardware identifier (HID). Processor configuration
  information is provided exclusively by objects in the processor
  device’s object list.

  When the platform uses the APIC interrupt model, UID object values
  under a processor device are used to associate processor devices with
  entries in the MADT.


MADT is the authoritative table for processor enumeration, whether
present or not. This is required because that's the only way to size
resources, which depend on the possible maximum topology.

Otherwise you'd end up with a CPU hotplugged which is outside of the
resource space allocated during init.

Thanks,

        tglx
Roger Pau Monne Aug. 29, 2023, 2:25 p.m. UTC | #16
On Sun, Aug 27, 2023 at 05:44:15PM +0200, Thomas Gleixner wrote:
> On Wed, Aug 23 2023 at 14:56, Jan Beulich wrote:
> > On 23.08.2023 11:21, Andrew Cooper wrote:
> >> In the spec, exactly where you'd expect to find them...
> >> 
> >> "OSPM does not expect the information provided in this table to be
> >> updated if the processor information changes during the lifespan of an
> >> OS boot."
> >
> > I don't think this tells us anything about the ID not possibly changing.
> > It merely tells us that OSPM is not expected to parse this table again
> > (IOW firmware updating just this table isn't going to be enough). IDs
> > possibly changing is expressed by (a) the "if the processor information
> > changes", and (b) the next sentence, forbidding them to change while the
> > system is asleep: "While in the sleeping state, logical processors must
> > not be added or removed, nor can their ... ID or ... Flags change."
> >
> >> Which is wordsmithing for "Some firmware was found to be modifying them
> >> and this was deemed to be illegal under the spec".
> >
> > That's your reading of it; I certainly don't infer such from that
> > sentence.
> 
> The APIC/X2APIC description of MADT specifies flags:
> 
> Enabled        	If this bit is set the processor is ready for use. If
> 		this bit is clear and the Online Capable bit is set,
> 		system hardware supports enabling this processor during
> 		OS runtime. If this bit is clear and the Online Capable
> 		bit is also clear, this processor is unusable, and OSPM
> 		shall ignore the contents of the Processor Local APIC
> 		Structure.
> 
> Online Capable	The information conveyed by this bit depends on the
> 		value of the Enabled bit. If the Enabled bit is set,
> 		this bit is reserved and must be zero. Otherwise, if
> 		this this bit is set, system hardware supports enabling
> 		this processor during OS runtime.

Sadly this flag is only present starting with MADT v5.

> This is also related to SRAT which defines the proximity of memory to
> processors at boot time with a similar set of flags.
> 
> Also 8.4 says:
> 
>   Each processor in the system must be declared in the ACPI namespace in
>   the \_SB scope. .... A Device definition for a processor is declared
>   using the ACPI0007 hardware identifier (HID). Processor configuration
>   information is provided exclusively by objects in the processor
>   device’s object list.
> 
>   When the platform uses the APIC interrupt model, UID object values
>   under a processor device are used to associate processor devices with
>   entries in the MADT.
> 
> 
> MADT is the authoritative table for processor enumeration, whether
> present or not. This is required because that's the only way to size
> resources, which depend on the possible maximum topology.
> 
> Otherwise you'd end up with a CPU hotplugged which is outside of the
> resource space allocated during init.

It's my understating that ACPI UIDs 0xff and 0xfffffff for local ACPI
and x2APIC structures respectively are invalid, as that's the
broadcast value used by the local (x2)APIC NMI Structures.

I think Jan's point (if I understood correctly) is that Processor or
Device objects can have a _MAT method that returns updated MADT
entries, and some ACPI implementations might modify the original
entries on the MADT and return them from that method when CPU
hotplug takes place.  The spec notes that "OSPM does not expect the
information provided in this table to be updated if the processor
information changes during the lifespan of an OS boot." so that the
MADT doesn't need to be updated when CPU hotplug happens, but I don't
see that sentence as preventing the MADT to be updated if CPU hotplug
takes place, it's just not required.

I don't see anywhere in the spec that states that APIC IDs 0xff and
0xffffffff are invalid and entries using those IDs should be ignored,
but I do think that any system that supports CPU hotplug better has
those IDs defined since boot.  Also it seems vendors have started
relying on using 0xff and 0xffffffff APIC IDs to signal non-present
CPUs, and Linux has been ignoring such entries for quite some time
already  without reported issues.  Overall I'm inclined to say we
should take this change, as it avoids flagging a lot of systems as CPU
hotplug capable when they are not.

Acked-by: Roger Pau Monné <roger.pau@citrix.com>

I think Andrew had some minor comments.

I wonder whether the check should be done after the entry has been
printed if `opt_cpu_info` is set, but seeing as the ONLINE_CAPABLE is
also done before possibly printing the entry I guess it's fine.

Thanks, Roger.
Thomas Gleixner Aug. 29, 2023, 10:54 p.m. UTC | #17
On Tue, Aug 29 2023 at 16:25, Roger Pau Monné wrote:
> On Sun, Aug 27, 2023 at 05:44:15PM +0200, Thomas Gleixner wrote:
>> The APIC/X2APIC description of MADT specifies flags:
>> 
>> Enabled        	If this bit is set the processor is ready for use. If
>> 		this bit is clear and the Online Capable bit is set,
>> 		system hardware supports enabling this processor during
>> 		OS runtime. If this bit is clear and the Online Capable
>> 		bit is also clear, this processor is unusable, and OSPM
>> 		shall ignore the contents of the Processor Local APIC
>> 		Structure.
>> 
>> Online Capable	The information conveyed by this bit depends on the
>> 		value of the Enabled bit. If the Enabled bit is set,
>> 		this bit is reserved and must be zero. Otherwise, if
>> 		this this bit is set, system hardware supports enabling
>> 		this processor during OS runtime.
>
> Sadly this flag is only present starting with MADT v5.

Correct. The difference between pre v5 MADT and v5+ is that the latter
allows the OS to make more informed decisions, but the lack of this flag
does not make the claim that randomly assigning APIC IDs after the
initial parsing is a valid approach any more correct. Why?

Simply because the other relationships vs. processor UIDs and SRAT/SLIT
are not magically going away due to the lack of that flag.

>> Otherwise you'd end up with a CPU hotplugged which is outside of the
>> resource space allocated during init.
>
> It's my understating that ACPI UIDs 0xff and 0xfffffff for local ACPI
> and x2APIC structures respectively are invalid, as that's the
> broadcast value used by the local (x2)APIC NMI Structures.

Correct. These IDs are invalid independent of any flag value. 

> I think Jan's point (if I understood correctly) is that Processor or
> Device objects can have a _MAT method that returns updated MADT
> entries, and some ACPI implementations might modify the original
> entries on the MADT and return them from that method when CPU
> hotplug takes place.  The spec notes that "OSPM does not expect the
> information provided in this table to be updated if the processor
> information changes during the lifespan of an OS boot." so that the
> MADT doesn't need to be updated when CPU hotplug happens, but I don't
> see that sentence as preventing the MADT to be updated if CPU hotplug
> takes place, it's just not required.

Right. But if you read carefully what I wrote then you will figure out
that any randomly made up APIC ID post MADT enumeration cannot work.

> I don't see anywhere in the spec that states that APIC IDs 0xff and
> 0xffffffff are invalid and entries using those IDs should be ignored,
> but I do think that any system that supports CPU hotplug better has
> those IDs defined since boot.  Also it seems vendors have started
> relying on using 0xff and 0xffffffff APIC IDs to signal non-present
> CPUs, and Linux has been ignoring such entries for quite some time
> already  without reported issues.

There is no requirement for the ACPI spec to state this simply because
these APIC IDs are invalid to address a processor at the architectural
level. ACPI does not care about architectural restrictions unless really
required, e.g. like the LAPIC vs. X2APIC exclusiveness.

Thanks,

        tglx
Jan Beulich Aug. 30, 2023, 7:20 a.m. UTC | #18
On 30.08.2023 00:54, Thomas Gleixner wrote:
> On Tue, Aug 29 2023 at 16:25, Roger Pau Monné wrote:
>> On Sun, Aug 27, 2023 at 05:44:15PM +0200, Thomas Gleixner wrote:
>>> The APIC/X2APIC description of MADT specifies flags:
>>>
>>> Enabled        	If this bit is set the processor is ready for use. If
>>> 		this bit is clear and the Online Capable bit is set,
>>> 		system hardware supports enabling this processor during
>>> 		OS runtime. If this bit is clear and the Online Capable
>>> 		bit is also clear, this processor is unusable, and OSPM
>>> 		shall ignore the contents of the Processor Local APIC
>>> 		Structure.
>>>
>>> Online Capable	The information conveyed by this bit depends on the
>>> 		value of the Enabled bit. If the Enabled bit is set,
>>> 		this bit is reserved and must be zero. Otherwise, if
>>> 		this this bit is set, system hardware supports enabling
>>> 		this processor during OS runtime.
>>
>> Sadly this flag is only present starting with MADT v5.
> 
> Correct. The difference between pre v5 MADT and v5+ is that the latter
> allows the OS to make more informed decisions, but the lack of this flag
> does not make the claim that randomly assigning APIC IDs after the
> initial parsing is a valid approach any more correct. Why?
> 
> Simply because the other relationships vs. processor UIDs and SRAT/SLIT
> are not magically going away due to the lack of that flag.
> 
>>> Otherwise you'd end up with a CPU hotplugged which is outside of the
>>> resource space allocated during init.
>>
>> It's my understating that ACPI UIDs 0xff and 0xfffffff for local ACPI
>> and x2APIC structures respectively are invalid, as that's the
>> broadcast value used by the local (x2)APIC NMI Structures.
> 
> Correct. These IDs are invalid independent of any flag value. 

What we apparently agree on is these special UID values to be invalid,
and that UIDs can't change. But that's not the same for the APIC IDs;
see below. (As a side note, Xen range-checks UID against its
implementation limit MAX_MADT_ENTRIES, so it's more than just the
all-ones values which we'd reject. Not really correct, I know. Looks
like Linux has done away with the simple x86_acpiid_to_apicid[]
translation mechanism. This being a statically sized array requires
this restriction in Xen, for the time being.)

>> I think Jan's point (if I understood correctly) is that Processor or
>> Device objects can have a _MAT method that returns updated MADT
>> entries, and some ACPI implementations might modify the original
>> entries on the MADT and return them from that method when CPU
>> hotplug takes place.

Just to mention it: It's not just "might". I've seen DSDT code doing
so on more than one occasion.

>>  The spec notes that "OSPM does not expect the
>> information provided in this table to be updated if the processor
>> information changes during the lifespan of an OS boot." so that the
>> MADT doesn't need to be updated when CPU hotplug happens, but I don't
>> see that sentence as preventing the MADT to be updated if CPU hotplug
>> takes place, it's just not required.
> 
> Right. But if you read carefully what I wrote then you will figure out
> that any randomly made up APIC ID post MADT enumeration cannot work.

I just went back and re-read your earlier reply. I can't see such
following from what you wrote.

>> I don't see anywhere in the spec that states that APIC IDs 0xff and
>> 0xffffffff are invalid and entries using those IDs should be ignored,
>> but I do think that any system that supports CPU hotplug better has
>> those IDs defined since boot.  Also it seems vendors have started
>> relying on using 0xff and 0xffffffff APIC IDs to signal non-present
>> CPUs, and Linux has been ignoring such entries for quite some time
>> already  without reported issues.
> 
> There is no requirement for the ACPI spec to state this simply because
> these APIC IDs are invalid to address a processor at the architectural
> level. ACPI does not care about architectural restrictions unless really
> required, e.g. like the LAPIC vs. X2APIC exclusiveness.

Correct, active processors can't use such APIC IDs. But placeholder
MADT entries can, so long as valid APIC IDs are put in place by
firmware (and announced via _MAT) at the time a socket is newly
populated. They'll be uniquely identified by their UID, so the OS very
well knows which originally parsed (from MADT) entries are affected.

As stated before, unless putting in place extra restrictions, I can't
even see how firmware would be able to up front determine APIC IDs for
unpopulated sockets: It simply can't know the topology of a package
that's not there yet. Requiring all packages to have identical
topology might be a restriction OSes put in place, but I'd be inclined
to call firmware buggy if it did (short of me being aware of there
being anything in the spec putting in place such a restriction).

Jan
Thomas Gleixner Aug. 30, 2023, 3:44 p.m. UTC | #19
Jan!

On Wed, Aug 30 2023 at 09:20, Jan Beulich wrote:
> On 30.08.2023 00:54, Thomas Gleixner wrote:
>> On Tue, Aug 29 2023 at 16:25, Roger Pau Monné wrote:
>> 
>> Correct. These IDs are invalid independent of any flag value. 
>
> What we apparently agree on is these special UID values to be invalid,
> and that UIDs can't change. But that's not the same for the APIC IDs;
> see below. (As a side note, Xen range-checks UID against its
> implementation limit MAX_MADT_ENTRIES, so it's more than just the
> all-ones values which we'd reject. Not really correct, I know. Looks
> like Linux has done away with the simple x86_acpiid_to_apicid[]
> translation mechanism. This being a statically sized array requires
> this restriction in Xen, for the time being.)

Linux ignores entries too once the the maximum number of CPUs is reached.

>>> I think Jan's point (if I understood correctly) is that Processor or
>>> Device objects can have a _MAT method that returns updated MADT
>>> entries, and some ACPI implementations might modify the original
>>> entries on the MADT and return them from that method when CPU
>>> hotplug takes place.
>
> Just to mention it: It's not just "might". I've seen DSDT code doing
> so on more than one occasion.

That does not make it more correct or better.

> As stated before, unless putting in place extra restrictions, I can't
> even see how firmware would be able to up front determine APIC IDs for
> unpopulated sockets: It simply can't know the topology of a package
> that's not there yet. Requiring all packages to have identical
> topology might be a restriction OSes put in place, but I'd be inclined
> to call firmware buggy if it did (short of me being aware of there
> being anything in the spec putting in place such a restriction).

The ACPI specification does not care about restrictions which are in the
realm of hardware. You simply cannot mix random CPUs in a system just as
you see fit.

But that aside. ACPI based hotplug is purely used by virtualization. The
efforts to support real physical hotplug have never advanced beyond the
proof of concept state as the required complexity turned out to be just
not worth the potential benefit.

Of course virtualization people might think that everything which is
imaginable and not explicitly forbidden by some specification is
something which should be supported. That's just a recipe for disaster
as it needlessly expands the complexity space for absolutely zero value.

In reality most OSes will require that all possible APIC IDs are
enumerated during initialization in order to size things correctly and
to make system wide decisions correctly during boot or they will either
fail to accept hot-added CPUs later on or end up in a situation where
after accepting a hot-added CPU it turns out that system wide boot time
decisions are wrong or data is sized incorrectly, which means it is
pretty much up a creek without a paddle.

So in order to avoid these hard to handle, hard to debug and diagnose
failure cases, it's sensible when OSes mandate enumeration requirements
which have been omitted from the specification for whatever reason.

Linux expects this today and in case the expectation is not met it has
issues due to the non-enforcement, which cause hard to diagnose
malfunction. Those issues might be fixable by some definition of
fixable, but the value of doing that is close to zero. In fact it'd be a
net negative because the increased complexity will just put a
maintainability burden on the code base which is completely
unjustifiable.

Coming back to the specification issues. As of v6.3 the Online Capable
flag was added with the following rationale (paraphrased):

     Operating systems need to size resources at boot time and therefore
     they count the APIC IDs which have the enabled flag cleared to size
     correctly for potential hotplug operations.

     But that has diametral effects on bare metal because the OS is not
     able to distinguish between hot-plugable APIC ID and truly disabled
     entries. That results in overallocation or suboptimal distribution
     of multi-queue devices.

The benefit (verbatim):

     The proposed “Online Capable” flag will allow OSPM to unequivocally
     discern the platform’s intention regarding processors that are not
     enabled at OS boot-time.

Now look at the outcome:

Enabled:

        If this bit is set the processor is ready for use. If this bit
        is clear and the Online Capable bit is set, system hardware
        supports enabling this processor during OS runtime. If this bit
        is clear and the Online Capable bit is also clear, this
        processor is unusable, and OSPM shall ignore the contents of the
        Processor Local APIC Structure.

So while I conceed that this does not expressis verbis mandate that
hot-pluggable CPUs must be enumerated it's not far fetched to interpret
it this way. The key is 'OSPM shall ignore the contents...'

As MADT is the only source of information from which an OS can deduce
sizing and also topology information during early boot, ignoring the
contents boils down to not allocating resources and if such an APIC ID
magically surfaces later on via hot-add, then the OS can rightfully
refuse to add it.

Having that information is simply a correctness requirement and there is
absolutely no justification to support made up ivory tower
configurations which try to explore the gaps and bluryness of the ACPI
specification.

Of course we can't rely on that flag yet if the table is not
implementing v6.3+, but from my testing there is no fallout from
refusing to hot-add CPUs which have not been enumerated in MADT during
early boot. The issue addressed by the Online Capable bit
vs. overallocation is obviosly still there when the enumerated
"disabled" APIC IDs can never be hot-added.

Feel free to disagree, but Linux will enforce that _all_ possible APIC
IDs are enumerated in MADT during early boot in the near future. This
makes a lot of things correct, simpler and more robust.

Thanks,

        tglx
Jan Beulich Sept. 1, 2023, 7:44 a.m. UTC | #20
On 07.08.2023 11:38, Simon Gaiser wrote:
> It seems some firmwares put dummy entries in the ACPI MADT table for non
> existing processors. On my NUC11TNHi5 those have the invalid APIC ID
> 0xff. Linux already has code to handle those cases both in
> acpi_parse_lapic [1] as well as in acpi_parse_x2apic [2]. So add the
> same check to Xen.
> 
> Note that on some older (2nd gen Core i) laptop of mine I also saw dummy
> entries with a valid APIC ID. Linux would still ignore those because
> they have !ACPI_MADT_ENABLED && !ACPI_MADT_ONLINE_CAPABLE. But in Xen
> this check is only active for madt_revision >= 5. But since this version
> check seems to be intentionally I leave that alone.
> 
> Link: https://git.kernel.org/pub/scm/linux/kernel/git/torvalds/linux.git/commit/?id=f3bf1dbe64b62a2058dd1944c00990df203e8e7a # [1]
> Link: https://git.kernel.org/pub/scm/linux/kernel/git/torvalds/linux.git/commit/?id=10daf10ab154e31237a8c07242be3063fb6a9bf4 # [2]
> Signed-off-by: Simon Gaiser <simon@invisiblethingslab.com>

This patch was committed with unaddressed review comments. The normal action
in such a case would be to revert, expecting a v2 to arrive. One alternative
here would be a timely incremental patch submission. Another alternative,
considering in particular Thomas's most recent reply, would be to properly
downgrade CPU hotplug support in SUPPORT.md (with a corresponding entry in
CHANGELOG.md).

I'll give it until the end of next week for either of the alternatives to be
carried out. If nothing appears by then, I'll assume I may rightfully revert.
(This timeline also allows putting this topic on the Community Call agenda,
should anyone think this is necessary.)

Regards, Jan
Stefano Stabellini Sept. 6, 2023, 8:49 p.m. UTC | #21
On Fri, 1 Sep 2023, Jan Beulich wrote:
> On 07.08.2023 11:38, Simon Gaiser wrote:
> > It seems some firmwares put dummy entries in the ACPI MADT table for non
> > existing processors. On my NUC11TNHi5 those have the invalid APIC ID
> > 0xff. Linux already has code to handle those cases both in
> > acpi_parse_lapic [1] as well as in acpi_parse_x2apic [2]. So add the
> > same check to Xen.
> > 
> > Note that on some older (2nd gen Core i) laptop of mine I also saw dummy
> > entries with a valid APIC ID. Linux would still ignore those because
> > they have !ACPI_MADT_ENABLED && !ACPI_MADT_ONLINE_CAPABLE. But in Xen
> > this check is only active for madt_revision >= 5. But since this version
> > check seems to be intentionally I leave that alone.
> > 
> > Link: https://git.kernel.org/pub/scm/linux/kernel/git/torvalds/linux.git/commit/?id=f3bf1dbe64b62a2058dd1944c00990df203e8e7a # [1]
> > Link: https://git.kernel.org/pub/scm/linux/kernel/git/torvalds/linux.git/commit/?id=10daf10ab154e31237a8c07242be3063fb6a9bf4 # [2]
> > Signed-off-by: Simon Gaiser <simon@invisiblethingslab.com>
> 
> This patch was committed with unaddressed review comments. The normal action
> in such a case would be to revert, expecting a v2 to arrive. One alternative
> here would be a timely incremental patch submission. Another alternative,
> considering in particular Thomas's most recent reply, would be to properly
> downgrade CPU hotplug support in SUPPORT.md (with a corresponding entry in
> CHANGELOG.md).

I am in favor of downgrading physical CPU hotplug support in
SUPPORT.md.

I noticed that there is no entry for physical CPU hotplug support in
SUPPORT.md today. Should we assume that it is not supported already as
it is not listed as supported?

Specifically, would it be a good idea to add a sentence at the top of
the file saying that anything not explicitly listed is not supported?
Jan Beulich Sept. 7, 2023, 6:50 a.m. UTC | #22
On 06.09.2023 22:49, Stefano Stabellini wrote:
> On Fri, 1 Sep 2023, Jan Beulich wrote:
>> On 07.08.2023 11:38, Simon Gaiser wrote:
>>> It seems some firmwares put dummy entries in the ACPI MADT table for non
>>> existing processors. On my NUC11TNHi5 those have the invalid APIC ID
>>> 0xff. Linux already has code to handle those cases both in
>>> acpi_parse_lapic [1] as well as in acpi_parse_x2apic [2]. So add the
>>> same check to Xen.
>>>
>>> Note that on some older (2nd gen Core i) laptop of mine I also saw dummy
>>> entries with a valid APIC ID. Linux would still ignore those because
>>> they have !ACPI_MADT_ENABLED && !ACPI_MADT_ONLINE_CAPABLE. But in Xen
>>> this check is only active for madt_revision >= 5. But since this version
>>> check seems to be intentionally I leave that alone.
>>>
>>> Link: https://git.kernel.org/pub/scm/linux/kernel/git/torvalds/linux.git/commit/?id=f3bf1dbe64b62a2058dd1944c00990df203e8e7a # [1]
>>> Link: https://git.kernel.org/pub/scm/linux/kernel/git/torvalds/linux.git/commit/?id=10daf10ab154e31237a8c07242be3063fb6a9bf4 # [2]
>>> Signed-off-by: Simon Gaiser <simon@invisiblethingslab.com>
>>
>> This patch was committed with unaddressed review comments. The normal action
>> in such a case would be to revert, expecting a v2 to arrive. One alternative
>> here would be a timely incremental patch submission. Another alternative,
>> considering in particular Thomas's most recent reply, would be to properly
>> downgrade CPU hotplug support in SUPPORT.md (with a corresponding entry in
>> CHANGELOG.md).
> 
> I am in favor of downgrading physical CPU hotplug support in
> SUPPORT.md.
> 
> I noticed that there is no entry for physical CPU hotplug support in
> SUPPORT.md today. Should we assume that it is not supported already as
> it is not listed as supported?

Hmm, I see

## Host hardware support

### Physical CPU Hotplug

    Status, x86: Supported

pretty close to the top of the file.

> Specifically, would it be a good idea to add a sentence at the top of
> the file saying that anything not explicitly listed is not supported?

Iirc that was the plan to do for 4.18, but then we need to be sure that
things don't unintentionally become unsupported. I've no clear idea how
this plan was meant to be carried out, though.

Jan
Stefano Stabellini Sept. 7, 2023, 9:30 p.m. UTC | #23
On Thu, 7 Sep 2023, Jan Beulich wrote:
> On 06.09.2023 22:49, Stefano Stabellini wrote:
> > On Fri, 1 Sep 2023, Jan Beulich wrote:
> >> On 07.08.2023 11:38, Simon Gaiser wrote:
> >>> It seems some firmwares put dummy entries in the ACPI MADT table for non
> >>> existing processors. On my NUC11TNHi5 those have the invalid APIC ID
> >>> 0xff. Linux already has code to handle those cases both in
> >>> acpi_parse_lapic [1] as well as in acpi_parse_x2apic [2]. So add the
> >>> same check to Xen.
> >>>
> >>> Note that on some older (2nd gen Core i) laptop of mine I also saw dummy
> >>> entries with a valid APIC ID. Linux would still ignore those because
> >>> they have !ACPI_MADT_ENABLED && !ACPI_MADT_ONLINE_CAPABLE. But in Xen
> >>> this check is only active for madt_revision >= 5. But since this version
> >>> check seems to be intentionally I leave that alone.
> >>>
> >>> Link: https://git.kernel.org/pub/scm/linux/kernel/git/torvalds/linux.git/commit/?id=f3bf1dbe64b62a2058dd1944c00990df203e8e7a # [1]
> >>> Link: https://git.kernel.org/pub/scm/linux/kernel/git/torvalds/linux.git/commit/?id=10daf10ab154e31237a8c07242be3063fb6a9bf4 # [2]
> >>> Signed-off-by: Simon Gaiser <simon@invisiblethingslab.com>
> >>
> >> This patch was committed with unaddressed review comments. The normal action
> >> in such a case would be to revert, expecting a v2 to arrive. One alternative
> >> here would be a timely incremental patch submission. Another alternative,
> >> considering in particular Thomas's most recent reply, would be to properly
> >> downgrade CPU hotplug support in SUPPORT.md (with a corresponding entry in
> >> CHANGELOG.md).
> > 
> > I am in favor of downgrading physical CPU hotplug support in
> > SUPPORT.md.
> > 
> > I noticed that there is no entry for physical CPU hotplug support in
> > SUPPORT.md today. Should we assume that it is not supported already as
> > it is not listed as supported?
> 
> Hmm, I see
> 
> ## Host hardware support
> 
> ### Physical CPU Hotplug
> 
>     Status, x86: Supported
> 
> pretty close to the top of the file.

Ops, it must have been the case-sensitive search that failed me


> > Specifically, would it be a good idea to add a sentence at the top of
> > the file saying that anything not explicitly listed is not supported?
> 
> Iirc that was the plan to do for 4.18, but then we need to be sure that
> things don't unintentionally become unsupported. I've no clear idea how
> this plan was meant to be carried out, though.

it would be interesting to discuss it again
Simon Gaiser Sept. 11, 2023, 10:17 a.m. UTC | #24
Andrew Cooper:
> On 07/08/2023 3:45 pm, Simon Gaiser wrote:
>> Andrew Cooper:
>>> On 07/08/2023 10:38 am, Simon Gaiser wrote:
>>>> It seems some firmwares put dummy entries in the ACPI MADT table for non
>>>> existing processors. On my NUC11TNHi5 those have the invalid APIC ID
>>>> 0xff. Linux already has code to handle those cases both in
>>>> acpi_parse_lapic [1] as well as in acpi_parse_x2apic [2]. So add the
>>>> same check to Xen.
>>>>
>>>> Note that on some older (2nd gen Core i) laptop of mine I also saw dummy
>>>> entries with a valid APIC ID. Linux would still ignore those because
>>>> they have !ACPI_MADT_ENABLED && !ACPI_MADT_ONLINE_CAPABLE. But in Xen
>>>> this check is only active for madt_revision >= 5. But since this version
>>>> check seems to be intentionally I leave that alone.
>>> I recall there being a discussion over this, ultimately with the version
>>> check being removed.  IIRC it was a defacto standard used for a long
>>> time prior to actually getting written into the ACPI spec, so does exist
>>> in practice in older MADTs.
>> So I noticed that the check in Linux is actually slightly different than
>> I thought. Since [3] it always considers the CPU usable if
>> ACPI_MADT_ENABLED is set. Otherwise it consider it only usable if
>> MADT revision >= 5 and ACPI_MADT_ONLINE_CAPABLE is set.
>>
>> So I checked what the ACPI spec says:
>>
>> Up to 6.2 Errata B [6] it only defines ACPI_MADT_ENABLE as:
>>
>>     If zero, this processor is unusable, and the operating system
>>     support will not attempt to use it.
>>
>> And the bit that later will be ACPI_MADT_ONLINE_CAPABLE is reserved with
>> "Must be zero".
>>
>> 6.3 [7] Then adds ACPI_MADT_ONLINE_CAPABLE and changes the meaning of
>> ACPI_MADT_ENABLE:
>>
>>     Enabled
>>         If this bit is set the processor is ready for use. If this bit
>>         is clear and the Online Capable bit is set, system hardware
>>         supports enabling this processor during OS runtime. If this bit
>>         is clear and the Online Capable bit is also clear, this
>>         processor is unusable, and OSPM shall ignore the contents of the
>>         Processor Local APIC Structure.
>>
>>     Online Capbable
>>         The information conveyed by this bit depends on the value of the
>>         Enabled bit. If the Enabled bit is set, this bit is reserved and
>>         must be zero. Otherwise, if this this bit is set, system
>>         hardware supports enabling this processor during OS runtime.
>>
>> So with confirming firmwares it should be safe change the simply ignore
>> the entry if !ACPI_MADT_ENABLED && !ACPI_MADT_ONLINE_CAPABLE
>>
>> We can also do it like Linux and ignore ACPI_MADT_ONLINE_CAPABLE
>> completely if revision < 5.
>>
>> Note that the revision was already increased to 5 before 6.3.
>>
>> ACPI spec version    MADT revision
>>                   
>> 6.2 [4]              4
>> 6.2 Errata A [5]     45 (typo I guess)
>> 6.2 Errata B         5
>> 6.3                  5
>>
>> [3]: https://git.kernel.org/torvalds/c/e2869bd7af608c343988429ceb1c2fe99644a01f
>> [4]: http://www.uefi.org/sites/default/files/resources/ACPI_6_2.pdf
>> [5]: http://www.uefi.org/sites/default/files/resources/ACPI%206_2_A_Sept29.pdf
>> [6]: https://uefi.org/sites/default/files/resources/ACPI_6_2_B_final_Jan30.pdf
>> [7]: https://uefi.org/sites/default/files/resources/ACPI_6_3_May16.pdf
> 
> Honestly, the reserved must be zero means there's no need for a version
> check at all.  That bit will not be set even in older MADT revisions.
> 
> That said, it's likely easier to retain the version check than to set up
> a quirks infrastructure for buggy older MADTs.

Yes, that's what I was thinking too. Send a patch:
https://lore.kernel.org/xen-devel/20230911101238.18005-1-simon@invisiblethingslab.com/

Simon
Simon Gaiser Sept. 11, 2023, 6:05 p.m. UTC | #25
Jan Beulich:
> On 07.08.2023 11:38, Simon Gaiser wrote:
>> It seems some firmwares put dummy entries in the ACPI MADT table for non
>> existing processors. On my NUC11TNHi5 those have the invalid APIC ID
>> 0xff. Linux already has code to handle those cases both in
>> acpi_parse_lapic [1] as well as in acpi_parse_x2apic [2]. So add the
>> same check to Xen.
>>
>> Note that on some older (2nd gen Core i) laptop of mine I also saw dummy
>> entries with a valid APIC ID. Linux would still ignore those because
>> they have !ACPI_MADT_ENABLED && !ACPI_MADT_ONLINE_CAPABLE. But in Xen
>> this check is only active for madt_revision >= 5. But since this version
>> check seems to be intentionally I leave that alone.
>>
>> Link: https://git.kernel.org/pub/scm/linux/kernel/git/torvalds/linux.git/commit/?id=f3bf1dbe64b62a2058dd1944c00990df203e8e7a # [1]
>> Link: https://git.kernel.org/pub/scm/linux/kernel/git/torvalds/linux.git/commit/?id=10daf10ab154e31237a8c07242be3063fb6a9bf4 # [2]
>> Signed-off-by: Simon Gaiser <simon@invisiblethingslab.com>
> 
> This patch was committed with unaddressed review comments. The normal action
> in such a case would be to revert, expecting a v2 to arrive. One alternative
> here would be a timely incremental patch submission. Another alternative,
> considering in particular Thomas's most recent reply, would be to properly
> downgrade CPU hotplug support in SUPPORT.md (with a corresponding entry in
> CHANGELOG.md).
> 
> I'll give it until the end of next week for either of the alternatives to be
> carried out. If nothing appears by then, I'll assume I may rightfully revert.
> (This timeline also allows putting this topic on the Community Call agenda,
> should anyone think this is necessary.)

To avoid misunderstandings, since you mentioned the above in your
response to the related patch I submitted today [3]:

My understanding is that your main concern is that those entries with
invalid APIC IDs shouldn't be ignored, since some firmwares later update
them with valid IDs (on hotplug). This fully conflicts with the
intention of the patch. Resolving the disagreement that followed is
outside of my control, as being only the submitter of the patch.

You also commented about not logging the ignored entries. Until it's
clear if the change in general is accepted in the end, I considered it
pointless to address that detail. If you disagree and want a follow up
for that, just let me know.

Simon

[3]: https://lore.kernel.org/xen-devel/20230911101238.18005-1-simon@invisiblethingslab.com/
Andrew Cooper Sept. 11, 2023, 6:24 p.m. UTC | #26
On 06/09/2023 9:49 pm, Stefano Stabellini wrote:
> On Fri, 1 Sep 2023, Jan Beulich wrote:
>> On 07.08.2023 11:38, Simon Gaiser wrote:
>>> It seems some firmwares put dummy entries in the ACPI MADT table for non
>>> existing processors. On my NUC11TNHi5 those have the invalid APIC ID
>>> 0xff. Linux already has code to handle those cases both in
>>> acpi_parse_lapic [1] as well as in acpi_parse_x2apic [2]. So add the
>>> same check to Xen.
>>>
>>> Note that on some older (2nd gen Core i) laptop of mine I also saw dummy
>>> entries with a valid APIC ID. Linux would still ignore those because
>>> they have !ACPI_MADT_ENABLED && !ACPI_MADT_ONLINE_CAPABLE. But in Xen
>>> this check is only active for madt_revision >= 5. But since this version
>>> check seems to be intentionally I leave that alone.
>>>
>>> Link: https://git.kernel.org/pub/scm/linux/kernel/git/torvalds/linux.git/commit/?id=f3bf1dbe64b62a2058dd1944c00990df203e8e7a # [1]
>>> Link: https://git.kernel.org/pub/scm/linux/kernel/git/torvalds/linux.git/commit/?id=10daf10ab154e31237a8c07242be3063fb6a9bf4 # [2]
>>> Signed-off-by: Simon Gaiser <simon@invisiblethingslab.com>
>> This patch was committed with unaddressed review comments.

Count the number of x86 maintainer tags on the patch, and see that they
were given after discussion, not before.

You're outvoted 2:1 by Xen x86 maintainers alone, and more than 2:1 if
you include the x86 maintainers from other projects who participated in
the discussion.


I'm not willing to let Xen keep on malfunctioning on millions of systems
just because you think an unfinished spec is more correct that practical
reality which invalidates it.

>>  The normal action
>> in such a case would be to revert, expecting a v2 to arrive. One alternative
>> here would be a timely incremental patch submission. Another alternative,
>> considering in particular Thomas's most recent reply, would be to properly
>> downgrade CPU hotplug support in SUPPORT.md (with a corresponding entry in
>> CHANGELOG.md).
> I am in favor of downgrading physical CPU hotplug support in
> SUPPORT.md.

SUPPORT.md does look bogus and wants correcting, but it is laughable to
claim that this ever worked, let alone that it's supported.

Intel got half way through specifying ACPI CPU Hotplug, and abandoned it
as a technology because they couldn't get it to work.  Hence the feature
has never been tested.

Furthermore, cursory testing that Thomas did for the Linux topology work
demonstrates that it is broken anyway for reasons unrelated to ACPI parsing.

Even furthermore, it's an area of the Xen / dom0 boundary which is
fundamentally broken for non-PV cases, and undocumented for the PV case,
hence why it's broken in Linux.


Physical CPU Hotplug does not pass the bar for being anything more than
experimental.  It's absolutely not tech-preview level because the only
demo it has had in an environment (admittedly virtual) which does
implement the spec in a usable way demonstrates that it doesn't function.

The fact no-one has noticed until now shows that the feature isn't used,
which comes back around full circle to the fact that Intel never made it
work and never shipped it.

~Andrew
Stefano Stabellini Sept. 11, 2023, 10:38 p.m. UTC | #27
On Mon, 11 Sep 2023, Andrew Cooper wrote:
> Physical CPU Hotplug does not pass the bar for being anything more than
> experimental.  It's absolutely not tech-preview level because the only
> demo it has had in an environment (admittedly virtual) which does
> implement the spec in a usable way demonstrates that it doesn't function.
> 
> The fact no-one has noticed until now shows that the feature isn't used,
> which comes back around full circle to the fact that Intel never made it
> work and never shipped it.

So we actually have agreement on how to move forward

---
SUPPORT: downgrade Physical CPU Hotplug to Experimental

The feature is not commonly used, and we don't have hardware to test it,
not in OSSTest, not in Gitlab, and not even ad-hoc manually by community
members.

Signed-off-by: Stefano Stabellini <stefano.stabellini@amd.com>

diff --git a/SUPPORT.md b/SUPPORT.md
index 3461f5cf2f..02e2f6eaa8 100644
--- a/SUPPORT.md
+++ b/SUPPORT.md
@@ -48,7 +48,7 @@ For the Cortex A77 r0p0 - r1p0, see Errata 1508412.
 
 ### Physical CPU Hotplug
 
-    Status, x86: Supported
+    Status, x86: Experimental
 
 ### Physical Memory
Thomas Gleixner Sept. 12, 2023, 7:56 a.m. UTC | #28
On Mon, Sep 11 2023 at 19:24, Andrew Cooper wrote:
> Furthermore, cursory testing that Thomas did for the Linux topology work
> demonstrates that it is broken anyway for reasons unrelated to ACPI parsing.
>
> Even furthermore, it's an area of the Xen / dom0 boundary which is
> fundamentally broken for non-PV cases, and undocumented for the PV case,
> hence why it's broken in Linux.
>
> Physical CPU Hotplug does not pass the bar for being anything more than
> experimental.  It's absolutely not tech-preview level because the only
> demo it has had in an environment (admittedly virtual) which does
> implement the spec in a usable way demonstrates that it doesn't function.
>
> The fact no-one has noticed until now shows that the feature isn't used,
> which comes back around full circle to the fact that Intel never made it
> work and never shipped it.

OTOH it _is_ used in virtualization. KVM supports it and it just
works. That's how I found out that XEN explodes in colourful ways :)

Thanks,

        tglx
Jan Beulich Sept. 12, 2023, 8:33 a.m. UTC | #29
On 11.09.2023 20:24, Andrew Cooper wrote:
> On 06/09/2023 9:49 pm, Stefano Stabellini wrote:
>> On Fri, 1 Sep 2023, Jan Beulich wrote:
>>> On 07.08.2023 11:38, Simon Gaiser wrote:
>>>> It seems some firmwares put dummy entries in the ACPI MADT table for non
>>>> existing processors. On my NUC11TNHi5 those have the invalid APIC ID
>>>> 0xff. Linux already has code to handle those cases both in
>>>> acpi_parse_lapic [1] as well as in acpi_parse_x2apic [2]. So add the
>>>> same check to Xen.
>>>>
>>>> Note that on some older (2nd gen Core i) laptop of mine I also saw dummy
>>>> entries with a valid APIC ID. Linux would still ignore those because
>>>> they have !ACPI_MADT_ENABLED && !ACPI_MADT_ONLINE_CAPABLE. But in Xen
>>>> this check is only active for madt_revision >= 5. But since this version
>>>> check seems to be intentionally I leave that alone.
>>>>
>>>> Link: https://git.kernel.org/pub/scm/linux/kernel/git/torvalds/linux.git/commit/?id=f3bf1dbe64b62a2058dd1944c00990df203e8e7a # [1]
>>>> Link: https://git.kernel.org/pub/scm/linux/kernel/git/torvalds/linux.git/commit/?id=10daf10ab154e31237a8c07242be3063fb6a9bf4 # [2]
>>>> Signed-off-by: Simon Gaiser <simon@invisiblethingslab.com>
>>> This patch was committed with unaddressed review comments.
> 
> Count the number of x86 maintainer tags on the patch, and see that they
> were given after discussion, not before.
> 
> You're outvoted 2:1 by Xen x86 maintainers alone, and more than 2:1 if
> you include the x86 maintainers from other projects who participated in
> the discussion.

I wasn't aware that ./MAINTAINERS having

4. There must be no "open" objections.

was possible to overrule by any number of acks.

>>>  The normal action
>>> in such a case would be to revert, expecting a v2 to arrive. One alternative
>>> here would be a timely incremental patch submission. Another alternative,
>>> considering in particular Thomas's most recent reply, would be to properly
>>> downgrade CPU hotplug support in SUPPORT.md (with a corresponding entry in
>>> CHANGELOG.md).
>> I am in favor of downgrading physical CPU hotplug support in
>> SUPPORT.md.
> 
> SUPPORT.md does look bogus and wants correcting, but it is laughable to
> claim that this ever worked, let alone that it's supported.
> 
> Intel got half way through specifying ACPI CPU Hotplug, and abandoned it
> as a technology because they couldn't get it to work.  Hence the feature
> has never been tested.
> 
> Furthermore, cursory testing that Thomas did for the Linux topology work
> demonstrates that it is broken anyway for reasons unrelated to ACPI parsing.
> 
> Even furthermore, it's an area of the Xen / dom0 boundary which is
> fundamentally broken for non-PV cases, and undocumented for the PV case,
> hence why it's broken in Linux.
> 
> 
> Physical CPU Hotplug does not pass the bar for being anything more than
> experimental.  It's absolutely not tech-preview level because the only
> demo it has had in an environment (admittedly virtual) which does
> implement the spec in a usable way demonstrates that it doesn't function.
> 
> The fact no-one has noticed until now shows that the feature isn't used,
> which comes back around full circle to the fact that Intel never made it
> work and never shipped it.

And note how I offered a compromise by someone (Simon, you, Roger, or yet
someone else) sending a patch against SUPPORT.md. Yet that hasn't happened.
I therefore continue to be of the opinion that I can rightfully revert the
patch, based on process not having been followed and alternative actions
not having been taken.

Jan
Jan Beulich Sept. 12, 2023, 8:36 a.m. UTC | #30
On 12.09.2023 10:33, Jan Beulich wrote:
> And note how I offered a compromise by someone (Simon, you, Roger, or yet
> someone else) sending a patch against SUPPORT.md. Yet that hasn't happened.
> I therefore continue to be of the opinion that I can rightfully revert the
> patch, based on process not having been followed and alternative actions
> not having been taken.

Hmm, I now see that Stefano has actually sent such a patch, just not on a
fresh thread. With that acked and committed, the primary reason to revert
will disappear. I'll mention the other aspect in a reply to Simon.

Jan
Jan Beulich Sept. 12, 2023, 8:41 a.m. UTC | #31
On 11.09.2023 20:05, Simon Gaiser wrote:
> Jan Beulich:
>> On 07.08.2023 11:38, Simon Gaiser wrote:
>>> It seems some firmwares put dummy entries in the ACPI MADT table for non
>>> existing processors. On my NUC11TNHi5 those have the invalid APIC ID
>>> 0xff. Linux already has code to handle those cases both in
>>> acpi_parse_lapic [1] as well as in acpi_parse_x2apic [2]. So add the
>>> same check to Xen.
>>>
>>> Note that on some older (2nd gen Core i) laptop of mine I also saw dummy
>>> entries with a valid APIC ID. Linux would still ignore those because
>>> they have !ACPI_MADT_ENABLED && !ACPI_MADT_ONLINE_CAPABLE. But in Xen
>>> this check is only active for madt_revision >= 5. But since this version
>>> check seems to be intentionally I leave that alone.
>>>
>>> Link: https://git.kernel.org/pub/scm/linux/kernel/git/torvalds/linux.git/commit/?id=f3bf1dbe64b62a2058dd1944c00990df203e8e7a # [1]
>>> Link: https://git.kernel.org/pub/scm/linux/kernel/git/torvalds/linux.git/commit/?id=10daf10ab154e31237a8c07242be3063fb6a9bf4 # [2]
>>> Signed-off-by: Simon Gaiser <simon@invisiblethingslab.com>
>>
>> This patch was committed with unaddressed review comments. The normal action
>> in such a case would be to revert, expecting a v2 to arrive. One alternative
>> here would be a timely incremental patch submission. Another alternative,
>> considering in particular Thomas's most recent reply, would be to properly
>> downgrade CPU hotplug support in SUPPORT.md (with a corresponding entry in
>> CHANGELOG.md).
>>
>> I'll give it until the end of next week for either of the alternatives to be
>> carried out. If nothing appears by then, I'll assume I may rightfully revert.
>> (This timeline also allows putting this topic on the Community Call agenda,
>> should anyone think this is necessary.)
> 
> To avoid misunderstandings, since you mentioned the above in your
> response to the related patch I submitted today [3]:
> 
> My understanding is that your main concern is that those entries with
> invalid APIC IDs shouldn't be ignored, since some firmwares later update
> them with valid IDs (on hotplug). This fully conflicts with the
> intention of the patch. Resolving the disagreement that followed is
> outside of my control, as being only the submitter of the patch.

Why would that be outside of your control? Your change broke a feature
officially stated as supported. Now that there is a patch to downgrade
support, that aspect will be resolved as soon as that patch gets ack-ed
and committed.

> You also commented about not logging the ignored entries. Until it's
> clear if the change in general is accepted in the end, I considered it
> pointless to address that detail. If you disagree and want a follow up
> for that, just let me know.

I take a different perspective here: The patch shouldn't have been
committed without this aspect addressed, either verbally or by sending
a v2. I continue to think that an incremental change is warranted to
make sure logging of entries, at least with "cpuinfo" in use, remains
consistent with what we had before. Otherwise debugging of possible
issues becomes yet more difficult.

Jan
Jan Beulich Sept. 12, 2023, 8:45 a.m. UTC | #32
On 12.09.2023 10:41, Jan Beulich wrote:
> On 11.09.2023 20:05, Simon Gaiser wrote:
>> You also commented about not logging the ignored entries. Until it's
>> clear if the change in general is accepted in the end, I considered it
>> pointless to address that detail. If you disagree and want a follow up
>> for that, just let me know.
> 
> I take a different perspective here: The patch shouldn't have been
> committed without this aspect addressed, either verbally or by sending
> a v2. I continue to think that an incremental change is warranted to
> make sure logging of entries, at least with "cpuinfo" in use, remains
> consistent with what we had before. Otherwise debugging of possible
> issues becomes yet more difficult.

Oh, and just to add to this: Because of excess verbosity resulting from
such ambiguous MADT entries, for many years I have been carrying a
private patch (reproduced below, just fyi). But since those entries are
still in line with the spec, it didn't seem appropriate to me to
propose this change for upstream inclusion.

Jan

--- unstable.orig/xen/arch/x86/mpparse.c
+++ unstable/xen/arch/x86/mpparse.c
@@ -809,8 +809,13 @@ int mp_register_lapic(u32 id, bool enabl
 	};
 	
 	if (MAX_APICS <= id) {
-		printk(KERN_WARNING "Processor #%d invalid (max %d)\n",
-			id, MAX_APICS);
+		static uint32_t max_warn = -1;
+
+		if (id <= max_warn) {
+			printk(KERN_WARNING "Processor #%d invalid (max %d)\n",
+			       id, MAX_APICS);
+			max_warn = id - 1;
+		}
 		return -EINVAL;
 	}
Roger Pau Monne Sept. 12, 2023, 9:38 a.m. UTC | #33
On Mon, Sep 11, 2023 at 03:38:05PM -0700, Stefano Stabellini wrote:
> On Mon, 11 Sep 2023, Andrew Cooper wrote:
> > Physical CPU Hotplug does not pass the bar for being anything more than
> > experimental.  It's absolutely not tech-preview level because the only
> > demo it has had in an environment (admittedly virtual) which does
> > implement the spec in a usable way demonstrates that it doesn't function.
> > 
> > The fact no-one has noticed until now shows that the feature isn't used,
> > which comes back around full circle to the fact that Intel never made it
> > work and never shipped it.
> 
> So we actually have agreement on how to move forward
> 
> ---
> SUPPORT: downgrade Physical CPU Hotplug to Experimental
> 
> The feature is not commonly used, and we don't have hardware to test it,
> not in OSSTest, not in Gitlab, and not even ad-hoc manually by community
> members.

We could use QEMU to test, so it's not impossible to test, just that
AFAICT there aren't (that m)any users of it.

> Signed-off-by: Stefano Stabellini <stefano.stabellini@amd.com>
> 
> diff --git a/SUPPORT.md b/SUPPORT.md
> index 3461f5cf2f..02e2f6eaa8 100644
> --- a/SUPPORT.md
> +++ b/SUPPORT.md
> @@ -48,7 +48,7 @@ For the Cortex A77 r0p0 - r1p0, see Errata 1508412.
>  
>  ### Physical CPU Hotplug

I think it would be clearer to rename this to "ACPI CPU Hotplug", as
to not be confused with the late CPU bringup done by the pvshim.

Thanks, Roger.
George Dunlap Sept. 13, 2023, 10:02 a.m. UTC | #34
On Tue, Sep 12, 2023 at 8:57 AM Thomas Gleixner <tglx@linutronix.de> wrote:
>
> On Mon, Sep 11 2023 at 19:24, Andrew Cooper wrote:
> > Furthermore, cursory testing that Thomas did for the Linux topology work
> > demonstrates that it is broken anyway for reasons unrelated to ACPI parsing.
> >
> > Even furthermore, it's an area of the Xen / dom0 boundary which is
> > fundamentally broken for non-PV cases, and undocumented for the PV case,
> > hence why it's broken in Linux.
> >
> > Physical CPU Hotplug does not pass the bar for being anything more than
> > experimental.  It's absolutely not tech-preview level because the only
> > demo it has had in an environment (admittedly virtual) which does
> > implement the spec in a usable way demonstrates that it doesn't function.
> >
> > The fact no-one has noticed until now shows that the feature isn't used,
> > which comes back around full circle to the fact that Intel never made it
> > work and never shipped it.
>
> OTOH it _is_ used in virtualization. KVM supports it and it just
> works. That's how I found out that XEN explodes in colourful ways :)

It should be pointed out that there's currently a start-up selling a
product that specifically runs Xen under cloud providers -- Exostellar
(was Exotanium) [1].  If cloud providers do use ACPI hotplug to allow
on-the-fly adjustments of the number of vcpus, Exostellar will
probably want support at some point.  (Perhaps it would be good to
rope them into testing / maintaining it.)

 -George

[1] https://exostellar.io/
Stefano Stabellini Sept. 13, 2023, 11:18 p.m. UTC | #35
On Wed, 13 Sep 2023, George Dunlap wrote:
> On Tue, Sep 12, 2023 at 8:57 AM Thomas Gleixner <tglx@linutronix.de> wrote:
> >
> > On Mon, Sep 11 2023 at 19:24, Andrew Cooper wrote:
> > > Furthermore, cursory testing that Thomas did for the Linux topology work
> > > demonstrates that it is broken anyway for reasons unrelated to ACPI parsing.
> > >
> > > Even furthermore, it's an area of the Xen / dom0 boundary which is
> > > fundamentally broken for non-PV cases, and undocumented for the PV case,
> > > hence why it's broken in Linux.
> > >
> > > Physical CPU Hotplug does not pass the bar for being anything more than
> > > experimental.  It's absolutely not tech-preview level because the only
> > > demo it has had in an environment (admittedly virtual) which does
> > > implement the spec in a usable way demonstrates that it doesn't function.
> > >
> > > The fact no-one has noticed until now shows that the feature isn't used,
> > > which comes back around full circle to the fact that Intel never made it
> > > work and never shipped it.
> >
> > OTOH it _is_ used in virtualization. KVM supports it and it just
> > works. That's how I found out that XEN explodes in colourful ways :)
> 
> It should be pointed out that there's currently a start-up selling a
> product that specifically runs Xen under cloud providers -- Exostellar
> (was Exotanium) [1].  If cloud providers do use ACPI hotplug to allow
> on-the-fly adjustments of the number of vcpus, Exostellar will
> probably want support at some point.  (Perhaps it would be good to
> rope them into testing / maintaining it.)

Supporting CPU hotplug in a nested virtualization setting is a different
proposition compared to supporting Physical CPU Hotplug. Typically
virtual firmware (hypervisor-provided firmware) has less unexpected
behaviors compared to baremetal firmware.

Could you make the distinction in SUPPORT.md?
George Dunlap Sept. 15, 2023, 10:41 a.m. UTC | #36
On Thu, Sep 14, 2023 at 12:18 AM Stefano Stabellini
<sstabellini@kernel.org> wrote:
>
> On Wed, 13 Sep 2023, George Dunlap wrote:
> > On Tue, Sep 12, 2023 at 8:57 AM Thomas Gleixner <tglx@linutronix.de> wrote:
> > >
> > > On Mon, Sep 11 2023 at 19:24, Andrew Cooper wrote:
> > > > Furthermore, cursory testing that Thomas did for the Linux topology work
> > > > demonstrates that it is broken anyway for reasons unrelated to ACPI parsing.
> > > >
> > > > Even furthermore, it's an area of the Xen / dom0 boundary which is
> > > > fundamentally broken for non-PV cases, and undocumented for the PV case,
> > > > hence why it's broken in Linux.
> > > >
> > > > Physical CPU Hotplug does not pass the bar for being anything more than
> > > > experimental.  It's absolutely not tech-preview level because the only
> > > > demo it has had in an environment (admittedly virtual) which does
> > > > implement the spec in a usable way demonstrates that it doesn't function.
> > > >
> > > > The fact no-one has noticed until now shows that the feature isn't used,
> > > > which comes back around full circle to the fact that Intel never made it
> > > > work and never shipped it.
> > >
> > > OTOH it _is_ used in virtualization. KVM supports it and it just
> > > works. That's how I found out that XEN explodes in colourful ways :)
> >
> > It should be pointed out that there's currently a start-up selling a
> > product that specifically runs Xen under cloud providers -- Exostellar
> > (was Exotanium) [1].  If cloud providers do use ACPI hotplug to allow
> > on-the-fly adjustments of the number of vcpus, Exostellar will
> > probably want support at some point.  (Perhaps it would be good to
> > rope them into testing / maintaining it.)
>
> Supporting CPU hotplug in a nested virtualization setting is a different
> proposition compared to supporting Physical CPU Hotplug. Typically
> virtual firmware (hypervisor-provided firmware) has less unexpected
> behaviors compared to baremetal firmware.
>
> Could you make the distinction in SUPPORT.md?

People say at the moment it doesn't actually work, even under QEMU; so
"Experimental" is probably the right status.

But if someone got it working, we might add "supported, with caveats"
and specify things in a comment.

 -George
diff mbox series

Patch

diff --git a/xen/arch/x86/acpi/boot.c b/xen/arch/x86/acpi/boot.c
index 54b72d716b..4a62822fa9 100644
--- a/xen/arch/x86/acpi/boot.c
+++ b/xen/arch/x86/acpi/boot.c
@@ -87,14 +87,17 @@  acpi_parse_x2apic(struct acpi_subtable_header *header, const unsigned long end)
 	if (BAD_MADT_ENTRY(processor, end))
 		return -EINVAL;
 
+	/* Ignore entries with invalid apicid */
+	if (processor->local_apic_id == 0xffffffff)
+		return 0;
+
 	/* Don't register processors that cannot be onlined. */
 	if (madt_revision >= 5 &&
 	    !(processor->lapic_flags & ACPI_MADT_ENABLED) &&
 	    !(processor->lapic_flags & ACPI_MADT_ONLINE_CAPABLE))
 		return 0;
 
-	if ((processor->lapic_flags & ACPI_MADT_ENABLED) ||
-	    processor->local_apic_id != 0xffffffff || opt_cpu_info) {
+	if ((processor->lapic_flags & ACPI_MADT_ENABLED) || opt_cpu_info) {
 		acpi_table_print_madt_entry(header);
 		log = true;
 	}
@@ -143,14 +146,17 @@  acpi_parse_lapic(struct acpi_subtable_header * header, const unsigned long end)
 	if (BAD_MADT_ENTRY(processor, end))
 		return -EINVAL;
 
+	/* Ignore entries with invalid apicid */
+	if (processor->id == 0xff)
+		return 0;
+
 	/* Don't register processors that cannot be onlined. */
 	if (madt_revision >= 5 &&
 	    !(processor->lapic_flags & ACPI_MADT_ENABLED) &&
 	    !(processor->lapic_flags & ACPI_MADT_ONLINE_CAPABLE))
 		return 0;
 
-	if ((processor->lapic_flags & ACPI_MADT_ENABLED) ||
-	    processor->id != 0xff || opt_cpu_info)
+	if ((processor->lapic_flags & ACPI_MADT_ENABLED) || opt_cpu_info)
 		acpi_table_print_madt_entry(header);
 
 	/* Record local apic id only when enabled */